ATOM: Atomic Durability in Non-Volatile Memory through Hardware Logging



- Hide Paper Summary
Paper Title: ATOM: Atomic Durability in Non-Volatile Memory through Hardware Logging
Link: https://ieeexplore.ieee.org/document/7920839/
Year: HPCA 2017
Keyword: NVM; Undo; ATOM



Back

This paper proposes ATOM, an undo-based design for achieving atomic persistency. Atomic persistency is a programming paradigm in which programmers specify an atomic region, and either all updates within the region are persisted, or none of them is persisted, similar to how atomicity is defined for transactions. Atomic persistency is often implemented in software using undo logging. Before a data item is written, code must be inserted to copy the old value to an undo log entry if the data item is written for the first time in the atomic region. Programmers must also manually issue persistence barriers consisting of cache line write back instructions and memory fences to flush log entries and dirty data items back to the NVM before the atomic region ends. To guarantee that modifications in the atomic region are able to be rolled back on crashes, the undo log entry must be written to the NVM immediately after they are created and before the update is made. Otherwise, the dirty cache line can be evicted back to the NVM at arbitrary time after it is written, and if this happens before the log entry reaches NVM, this dirty line cannot be undone during recovery. At the end of the atomic region, software issues cache line flush instructions to write back all dirty lines.

This paper identifies two potential performance problems in the software-only scheme. The first problem is that by writing the old value into the log and flushing the log entry into the NVM as in undo WAL, data might be read from the NVM just to write it back later unmodified, which wastes memory bandwidth. The second problem is that since software has no fine-grained control over the ordering of NVM write backs, to avoid write ordering violation, the log entry must be written back to the NVM right after they are created. During the write back operation, the processor must stall and wait for the operation to complete, hence putting the slow NVM write back operation on the critical path of execution. In fact, since log entry write back needs to be done for all store operations on the cache line if the line has not been written since the beginning of the atomic region, the processor is unable to overlap memory operations with on-chip computation, which effectively offsets the performance benefit of modern out-of-order superscalar processors.

ATOM solves the above two issues by extending both the cache controller and memory controller to make them aware of the existence of WAL and undo logging. Logging is handled at the hardware level, and therefore the full set of hardware states are available to support more informative decision making and better utilization of hardware resources. ATOM assumes that the system consists of cache coherent multicores, and NVM devices connected to the memory bus. Multiple memory controllers exist to handle memory requests from processors. Each memory controller runs independently based solely on its internal state, and hence does not constitute a performance bottleneck. DRAM caching is not assumed by this paper. The entire address space is mapped to the NVM as is the case with DRAM based system. ATOM performs undo logging in cache line granularity. Programmers issue instructions to enter and terminate atomic regions using two instructions that are similar to hardware transactions. On a failure, it is guaranteed that either modifications in the entire atomic region are preserved, or none of them are present after recovery.

We describe the operation of ATOM as follows. Each cache line in ATOM is extended with one extra bit, the “logged” bit. This bit is cleared whenever the processor enters and exits an atomic region, and set when the cache line is written for the first time. A newly loaded cache line may have this bit set as well depending on the responde from the memory controller which contains a hint of whether the logged bit is set. On store instructions, if the cache line already exists in the cache, the logged bit for the line is checked. If the bit is set, then store is committed as usual, because undo logging only saves the content of the cache line before the first write operation in the atomic region. Otherwise, the cache controller sends the original content of the line as well as relevant metadata to the memory controller as the undo log entry, sets the logged bit of the line, and commits the store instruction.

If, however, the store instruction misses the cache hierarchy and must send a request to the memory controller, logging needs not to be performed by the memory controller. To see this, imagine the read/write sequence of the cache line filling process. If no special hardware is added, the miss request is fulfilled by the memory controller, which reads the NVM and sends back the content of the cache line. On receiving the cache line, the cache controller attachs metadata to the line and sends it immediately back to the memory controller, without modifying the line itself. As pointed out at the beginning of this article, it is a waste of the memory bandwidth to send the content of the cache line twice in a round-trip manner. ATOM proposes the following optimization. During an atomic region, when the cache controller sends a request to the memory controller to fill a missing line, it indicates in the request that undo logging should be performed (called a “read exclusive request” in the paper). On receiving the request, the memory controller generates the log entry and directly forwards the entry to the store queue. The cache line read from NVM is used to generate the entry as well.

To enforce the write ordering between log entry and dirty cache line, the memory controller also locks the cache line when it receives a read exclusive request from the cache controller. The cache line is only unlocked after the corresponding log entry has been persisted (i.e. received an ACK from the memory controller store queue). Processor’s cache line eviction algorithm does not change, and can evict a line whenever it decides to. The memory controller compares the address of the evicted line against the locked address. If they match then the dirty cache line evicted from the processor will be held until the line is unlocked.

On the memory controller side, ATOM defines a set of rules guiding the generation and persistence of log entries. One observation made by the paper is that most NVM controllers support data write on cache line granularity (because processors only request and send data in this granularity). In the case of logging, the logged data itself occupies a line, while metadata must occupy an extra line although it only has one word (the address word). The direct consequence of this is that a log entry persistence operation will be translated into two writes issued to the NVM controller: one for the data and another for metadata. To reduce the overhead, ATOM proposes an optimization that group commits seven log entries as a batch, which consists of eight cache lines: Seven lines for holding data, and the last cache line holds the metadata. An internal buffer in the memory controller stores log entries, until seven are aggregated or a certain amount of time has passed, which is similar to database group commit. Note that under this optimization, at most seven cache lines might be locked at a given moment in time. Any evicted cache line from the processor should be checked against the log entry buffer to ensure that the corresponding log entry has been written back.

ATOM hardware runs on a chunk of NVM memory statically allocated by the operating system at startup time. This chunk of memory can be used as a centralized log to which all processors append their log entries. One potential problem of treating the logging area as one centralized log is that log entries from different processors may become intermingled together. During garbage collection (which can be performed as soon as an atomic regions commits successfully), the memory controller must scan the entire log region and remove entries belonging to the committed atomic regions, which induces both extra read bandwidth and fragmentation. To reduce both, ATOM proposes dividing the log area into separate buckets. Each bucket can only be used by at most one atomic region, such that if the atomic region commits, garbage collection only takes constant time since we just free the buckets. If one transaction uses multiple buckets, they should be linked together like a linked list, with the atomic region’s metadata pointing to both the head and the end. To track bucket usage, the memory controller stores contexts for atomic regions in the system. Before a processor starts an atomic region, it must request a free context from the memory controller. If no context is available the processor must stall. The context of an atomic region consists of a bitmap describing bucket usage, as well as pointers to the head and the end of the log. The log entry buffer described in the previous optimization is also included as part of the context. The context is updated accordingly as log entries are appended. The memory controller also has a set of global states, consisting of a counter of the current number of atomic regions, and a global bitmap describing the usage of all buckets. Bucket allocation and release are simple bitwise operations that can be done in a single cycle. On power failure, the memory controller writes back the global metadata and per-region metadata back to a well-known location on the NVM using a small battery as backup power (which is already in the NVM controller to support flushing the store queue).

On recovery, a special interrupt is raised by the memory controller, and processors must handle the interrupt in software. The software routine is coded as an ISR, which reads the well-known location on the NVM to restore memory controller metadata first, and then use the metadata to locate active atomic regions. After that, the recovery routine identifies uncommitted atomic regions, and for every dirty cache line, reapplies the undo image back to the address recorded in the metadata field. The system is guaranteed to be in a consistent state with all uncommitted atomic regions rolled back, and will resume execution after the recovery process.