NVM Duet: Unified Working Memory and Persistent Store Architecture



Paper Title: NVM Duet: Unified Working Memory and Persistent Store Architecture
Link: https://dl.acm.org/citation.cfm?id=2541957
Venue: ASPLOS 2014
Keyword: NVM; NVM Duet




This paper proposes NVM Duet, a software-hardware co-design that improves write performance on NVM through more flexible scheduling and by exploiting the physical properties of the memory cells. The discussion is based on byte-addressable Phase Change Memory (PCM) attached to the memory bus and accessed at cache line granularity via load and store instructions. The paper identifies two problems that prolong write latency in current NVM architectures. The first is that writes are not always scheduled optimally, given the hardware primitives used to implement software persistence barriers. At the time the paper was written, Intel had proposed using clflush (later joined by an optimized form, clwb), sfence, and pcommit (now deprecated) to force data back into the NVM. Such a persistence barrier serializes all store instructions issued before the barrier against those issued after it, i.e. the latter group must wait for the former to be fully committed to memory. This leads to suboptimal scheduling: bank-level parallelism could support multiple active writes, yet it is not always possible to find a write to feed an idle bank within the current epoch because of the persistence barrier. In practice, these barriers serialize not only the instructions we intend to serialize, but also memory instructions with no ordering requirement at all, e.g. instructions from another hardware thread or another core, or accesses to memory locations that need not be persistent, such as local scratch space and the stack.

The second problem is that most workloads do not require all stores to be persistent. Such “volatile” stores need not be retained for days or even months merely because their address space is mapped to the NVM. Writing all data with the same retention guarantee wastes write latency and power, since some of that data can be safely lost on a power-down without corrupting anything.
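
For concreteness, the kind of persistence barrier discussed above is built from cache-flush and fence instructions. Below is a minimal sketch using the post-pcommit clwb + sfence idiom on x86 (the function name is mine, not the paper's; compile with -mclwb, and substitute _mm_clflush on parts without CLWB):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

/* Write back every cache line covering [addr, addr+len) and order the
 * flushes before any store issued after this function returns. */
static void persist_barrier(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clwb((void *)p);   /* write the line back without evicting it */
    _mm_sfence();              /* later stores wait until the flushes drain */
}
```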

Slow writes on PCM-based platforms degrade performance in two ways. First, because writes on PCM are much slower than reads, the write queue fills up easily; once it is sufficiently full, the memory controller must prioritize draining writes over serving reads, which increases read latency. On the CPU side, reads are often on the critical path of instructions loading from memory, so long read latency easily becomes a bottleneck when the working set exceeds the LLC and cache misses are frequent. Second, if the write queue is full, LLC evictions stall, since no more entries can be inserted into the write queue. This in turn blocks cache miss handling once the LLC-side evict queue fills up, which eventually hurts overall performance.
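
This backpressure chain can be captured in a toy model (the queue names and depths below are illustrative, not taken from the paper):

```c
#include <stdbool.h>

/* Illustrative queue depths; real controllers differ. */
enum { WRITE_Q_CAP = 32, EVICT_Q_CAP = 8 };

struct queues { int write_q, evict_q; };

/* One step of the cascade: an eviction drains into the write queue only
 * if there is room, and a new cache miss can be handled only if the
 * evict queue still has a free slot for the victim line. */
static bool can_handle_miss(struct queues *q)
{
    if (q->evict_q > 0 && q->write_q < WRITE_Q_CAP) {
        q->evict_q--;          /* eviction proceeds into the write queue */
        q->write_q++;
    }
    return q->evict_q < EVICT_Q_CAP;  /* full evict queue stalls misses */
}
```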

The paper proposes adding a bit vector, called the AllocMap, to the memory controller to record which memory pages require data persistence and which do not. For a 16GB PCM chip, the AllocMap takes only 512KB in total, which fits easily in on-chip SRAM. Initially, all bits are set to “0”. When the operating system maps a page, the user specifies whether the page should be mapped as persistent memory, which requires the strong ordering indicated by memory barriers, or as working memory, which merely provides run-time storage for volatile data and may be reset or corrupted after a power loss. In the mmap() system call, the OS uses MMIO to communicate with the memory controller and sets the bit corresponding to the 4KB page to “1” if the user asks for it to be mapped as working memory. Note that the on-chip storage for the AllocMap does not need to be persistent. On a restart, the contents of the AllocMap are lost and the bit vector is reinitialized to all zeros. This does not affect correctness, however, because the AllocMap exists solely to optimize write performance, not to change the semantics or functionality of persistent pages.
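
The bookkeeping is simple enough to sketch in a few lines (the function names are hypothetical, and the paper implements this in controller hardware reached via MMIO rather than in software). One bit per 4KB page of a 16GB device works out to exactly 512KB:

```c
#include <stdint.h>

#define PCM_BYTES  (16ULL << 30)               /* 16 GiB device */
#define PAGE_SHIFT 12                          /* 4 KiB pages */
#define NPAGES     (PCM_BYTES >> PAGE_SHIFT)   /* 4,194,304 pages */

static uint8_t allocmap[NPAGES / 8];           /* 4M bits = 512 KiB */

/* Invoked from the mmap() path in the paper's design:
 * 1 = working (volatile) page, 0 = persistent (the power-up default). */
static void allocmap_set(uint64_t paddr, int working)
{
    uint64_t page = paddr >> PAGE_SHIFT;
    if (working) allocmap[page >> 3] |=  (uint8_t)(1u << (page & 7));
    else         allocmap[page >> 3] &= (uint8_t)~(1u << (page & 7));
}

/* Consulted by the controller when an LLC write-back arrives. */
static int allocmap_is_working(uint64_t paddr)
{
    uint64_t page = paddr >> PAGE_SHIFT;
    return (allocmap[page >> 3] >> (page & 7)) & 1;
}
```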

With the AllocMap on the memory controller, the processor no longer needs to enforce memory ordering using instructions designed for memory consistency such as sfence. Instead, the memory fence is issued directly to the memory controller, so the controller knows which writes precede which when it schedules them. When a write-back is issued by the LLC to the NVM controller, the controller reads the AllocMap bit of the page the cache line resides in and associates that bit with the write-back request before enqueuing it into the write queue. The write scheduler then schedules write-backs according to two rules. First, writes to working pages (those whose bit is “1”) can be scheduled across memory barriers, since they are considered volatile; volatile stores from later epochs can thus be moved into the current epoch whenever a memory bank is free and no operation for that bank remains in the current epoch. Second, scheduling priority is always given to stores to persistent pages. This lets the current epoch finish sooner, possibly pushing volatile stores into the next epoch and making it larger, which in turn enables better scheduling decisions because there are more candidates to choose from.
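
A simplified software rendering of the two rules might look like the sketch below; the queue-entry layout and function name are assumptions, since the paper implements the scheduler inside the memory controller:

```c
#include <stddef.h>
#include <stdint.h>

/* Each queued write-back carries the AllocMap bit captured at enqueue time. */
typedef struct {
    uint64_t addr;
    int      bank;
    int      epoch;        /* epoch in which the store was issued */
    int      is_working;   /* AllocMap bit: 1 = working (volatile) page */
} wq_entry;

/* Pick the next write for a bank that has just become free. */
static wq_entry *pick_write(wq_entry *q, size_t n, int bank, int cur_epoch)
{
    wq_entry *fallback = NULL;
    for (size_t i = 0; i < n; i++) {
        if (q[i].bank != bank) continue;
        /* Rule 2: persistent stores in the current epoch go first,
         * so the epoch (and its barrier) retires sooner. */
        if (!q[i].is_working && q[i].epoch == cur_epoch)
            return &q[i];
        /* Rule 1: volatile stores may cross the barrier, i.e. be
         * scheduled even though they belong to a later epoch. */
        if (q[i].is_working && fallback == NULL)
            fallback = &q[i];
    }
    return fallback;   /* NULL if nothing is eligible for this bank */
}
```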

The second optimization of NVM Duet is the ability to refresh memory cells only when strictly necessary. It rests on the observation that volatile stores need not be retained by PCM for a long time after power-down, and therefore tolerate a wider acceptable range of resistance. During normal operation, the resistance of each cell is monitored regularly, and a cell is refreshed (i.e. read out and written back) only if its resistance level is close to the threshold for its current value; otherwise the cell is left untouched. Experiments and simulations show that most cells do not need frequent refresh, as cell resistance drifts only by a small amount within a refresh interval. As a result, with the same hardware and power budget, more cells can be covered in each refresh cycle (most of them requiring no actual rewrite), resulting in a net reduction in power consumption and write latency.
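
A minimal sketch of the refresh decision, assuming each stored value maps to a resistance band with a fixed guard margin (the types, names, and thresholding rule here are illustrative, not the paper's circuit):

```c
#include <stdbool.h>

/* Resistance band that encodes the cell's current value. */
typedef struct { double r_lo, r_hi; } band;

/* Refresh only when the measured resistance has drifted to within
 * `guard` ohms of either edge of its band; otherwise skip the cell. */
static bool needs_refresh(double resistance, band b, double guard)
{
    return (resistance - b.r_lo < guard) || (b.r_hi - resistance < guard);
}
```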