Atlas: Leveraging Locks for Non-Volatile Memory Consistency



Paper Title: Atlas: Leveraging Locks for Non-Volatile Memory Consistency
Link: https://dl.acm.org/citation.cfm?id=2660224
Year: OOPSLA 2014
Keyword: NVM; Critical Section; Undo Logging




This paper introduces Atlas, a runtime library for ensuring persistency in multi-threaded applications. The paper assumes that the program is written to be data-race-free, using locks as the method of synchronization. It also assumes that the NVM device is attached to the memory bus, and that persistent memory can be allocated in user space via nv_malloc() calls, which map a “persistent region” of user virtual addresses onto the device. Stores to the persistent region are not immediately flushed to the NVM device. Instead, they may remain in the cache as long as the cache controller does not evict the cache block and the programmer does not issue a cache line flush instruction. In this setting, the library must ensure two invariants. First, each individual critical section (and possibly the result of merging several critical sections) should be atomic, i.e. either all stores in that critical section are persisted, or none of them is. This implies some form of logging, which has already been addressed in previous publications. Second, after recovery, the system state must be as if the execution had stopped at a point where no lock is held, and the memory image reflects all updates up to that point. In the case of multiple nested critical sections, extra care must be taken to avoid recovering to an inconsistent state, even if every individual critical section is recovered properly.

To see why, assume there are two threads, T1 and T2. T1 executes a critical section consisting of two lock acquires: lock(L1), lock(L2), Write(X), unlock(L1), unlock(L2), and T2 executes the following sequence: lock(L1), Read(X), Write(Y = X + 1), unlock(L1). We further assume that during the execution, T2 executes lock(L1) after T1 releases L1, and that the system crashes after T2 finishes unlock(L1) (with its writes persisted) but before T1 can execute unlock(L2). In this example, since T1's critical section has not yet finished executing, it must be rolled back during recovery to ensure the atomicity of the critical section. On the other hand, since T2's writes have been persisted on the NVM, if the recovery handler commits T2's critical section (i.e. makes it persistent and does not undo its changes), then T2's critical section effectively wrote a value out of thin air, since the critical section in which X was updated has been undone.
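The interleaving above can be made concrete with a small sketch. This is a minimal illustration, assuming L1 and L2 are pthread mutexes and X, Y are integers living in a persistent region; nothing here is Atlas's actual instrumentation.

```cpp
#include <pthread.h>

pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;
int X, Y;   // assume both live in a persistent region

// Thread T1: one FASE spanning two overlapping critical sections.
void* t1(void*) {
    pthread_mutex_lock(&L1);
    pthread_mutex_lock(&L2);
    X = 42;                      // store to NVM; may linger in the cache
    pthread_mutex_unlock(&L1);   // L1 released, but the FASE is not over
    /* crash window: T2 runs to completion and persists, then power fails */
    pthread_mutex_unlock(&L2);
    return nullptr;
}

// Thread T2: reads X after T1 released L1, then writes and persists Y.
void* t2(void*) {
    pthread_mutex_lock(&L1);
    Y = X + 1;                   // observes T1's not-yet-committed write
    pthread_mutex_unlock(&L1);
    return nullptr;
}

// If recovery rolls back T1's unfinished FASE but keeps T2's, Y holds a
// value derived from a write that no longer exists in the recovered image.
```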

The paper first defines a FASE (Failure-Atomic Section) as a maximal sequence of instructions executed by a thread during which at least one lock is held. This essentially flattens out nested locks, no matter whether they are perfectly nested (like transactions in a TM) or not (like the 2PL protocol or the hand-over-hand locking protocol used in B+Trees). It follows that state modifications within a FASE must be treated as a single logging unit. In addition, if a thread T2 synchronizes with another thread T1 at runtime (i.e. it is either unblocked by T1, or acquires a lock that was previously released by T1), then T2's critical section must logically commit after the corresponding critical section in T1. Note that due to nested critical sections, the release operation may fall in the middle of a FASE; in this case, T2's FASE must be logically serialized after T1's FASE. Persistent stores outside of critical sections are tracked as if they were made in the next critical section, i.e. they become part of the post-crash state only if the next critical section does. Note that within a thread, FASEs always commit in program order, even if they do not conflict on the same lock.
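As an illustration of a FASE whose critical sections are not properly nested, consider a hand-over-hand (lock-coupling) list traversal. This is my own sketch, not code from the paper: because the lock set is never empty from the first acquire to the last release, the whole traversal is a single FASE.

```cpp
#include <pthread.h>

struct Node {
    pthread_mutex_t m;
    int key;
    int val;
    Node* next;
};

// Hand-over-hand traversal: critical sections overlap but are not nested.
// At least one lock is held from the first acquire to the final release,
// so the entire walk forms one FASE and is logged as one atomic unit.
void increment(Node* head, int key) {
    pthread_mutex_lock(&head->m);
    Node* prev = head;
    Node* cur = head->next;
    while (cur != nullptr) {
        pthread_mutex_lock(&cur->m);       // acquire the successor first...
        if (cur->key == key) {
            cur->val++;                    // persistent store inside the FASE
            pthread_mutex_unlock(&cur->m);
            break;
        }
        pthread_mutex_unlock(&prev->m);    // ...then release the predecessor
        prev = cur;
        cur = cur->next;
    }
    pthread_mutex_unlock(&prev->m);        // FASE ends here: no lock held
}
```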

Atlas is implemented using undo logging and background log pruning. The entire NVM address space is treated as a memory block from which smaller blocks can be allocated using a special allocator interface. To aid crash recovery, each allocation associates an identifier with the address returned by the allocator, so that after a crash the recovery code can use the same name to locate the memory block to recover (this name-to-address table must itself be persisted atomically). In addition, blocks are assumed to consist of objects that are linked using pointers. Every memory block should have at least one entry point, through which all live objects can be reached. This is extremely important for garbage collection, a task that is necessary for NVM but not for DRAM, because allocated blocks are not freed automatically after a crash. The garbage collector simply traverses from the entry point, marks all reachable objects as live, and then reclaims the rest of the memory in the block (of course, the layout of objects must be exposed to the GC, e.g. via a per-class header field).
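A minimal sketch of how such a named persistent region and its entry point might be used. The type and function names here (region_t, nv_find_region, nv_set_root) are placeholders invented for illustration; only nv_malloc() is mentioned above, and even its exact signature is assumed.

```cpp
#include <cstddef>

// Hypothetical allocator interface: names and signatures are illustrative.
struct region_t;                                   // opaque persistent region
region_t* nv_find_region(const char* name);        // map a region by name (create if absent)
void*     nv_malloc(region_t* r, size_t size);     // allocate inside the region
void      nv_set_root(region_t* r, void* root);    // register the region's entry point

struct List {
    int   val;
    List* next;
};

void build_list() {
    // Map (or re-map after a crash) a persistent region by its name.
    region_t* r = nv_find_region("my_list");

    // Allocate an object inside the region. The allocator metadata and the
    // name-to-address table must themselves be updated failure-atomically.
    List* head = static_cast<List*>(nv_malloc(r, sizeof(List)));
    head->val  = 0;
    head->next = nullptr;

    // Register 'head' as the entry point: recovery and GC both start from
    // here, and anything unreachable from a root is eventually reclaimed.
    nv_set_root(r, head);
}
```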

On initialization of the program, the library allocates a log buffer on the NVM. The log buffer consists of n log head objects, where n is the maximum number of threads. Log entries are attached to the corresponding log head object in program order as a linked list. Each log head object also contains a pointer to the oldest log entry of the thread that has not yet been pruned, which is used as the starting point of recovery (note that Atlas uses undo logging, so applying log records rolls the modifications back). Log entries are allocated from a per-thread circular buffer. The allocator enforces that consecutive log entries are adjacent in the NVM address space: if entries i and j are adjacent in program order, their physical addresses fall in the same or adjacent cache blocks. This property is critical for appending log records to the end of the log atomically (we assume the “next” pointer of a log entry is its last field, and is always 8-byte aligned). When a new entry is added, there are two possibilities: either the new entry is aligned to a cache block boundary, or it is not. In the former case, the log manager first flushes the new log entry back to the NVM, then updates the previous entry’s “next” pointer, and then flushes the previous entry. This guarantees atomicity, because the content of the new entry is persisted before the pointer that publishes it. In the latter case, since the “next” pointer of the previous entry shares a cache block with the new entry, it is sufficient to flush that cache block. In both cases, the new entry is appended atomically.
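The append protocol might look roughly like the following sketch. The struct layout, the flush_range helper, and the use of _mm_clflush/_mm_sfence are assumptions for illustration; Atlas's actual log entry format and flush primitives may differ.

```cpp
#include <cstdint>
#include <cstddef>
#include <immintrin.h>

constexpr size_t CACHE_LINE = 64;

struct LogEntry {
    void*     addr;     // address of the logged store
    uint64_t  old_val;  // undo image (value before the store)
    uint64_t  type;     // store / lock acquire / lock release
    LogEntry* dep;      // for acquire entries: the prior release entry
    LogEntry* next;     // last field, 8-byte aligned
};

// Flush every cache line covering [p, p + len); a stand-in for the
// platform's persist primitive.
static void flush_range(const void* p, size_t len) {
    auto a = reinterpret_cast<uintptr_t>(p) & ~(CACHE_LINE - 1);
    for (; a < reinterpret_cast<uintptr_t>(p) + len; a += CACHE_LINE)
        _mm_clflush(reinterpret_cast<const void*>(a));
}

// Append a fully-filled entry 'e' after 'prev' such that a crash never
// exposes a linked-but-unpersisted entry.
void append(LogEntry* prev, LogEntry* e) {
    if (reinterpret_cast<uintptr_t>(e) % CACHE_LINE == 0) {
        // Case 1: e starts a fresh cache line. Persist its payload first,
        // then publish it by persisting prev->next; a crash in between
        // merely loses an unlinked entry.
        flush_range(e, sizeof(LogEntry));
        _mm_sfence();
        prev->next = e;
        flush_range(&prev->next, sizeof(LogEntry*));
        _mm_sfence();
    } else {
        // Case 2: e begins in the cache line that also holds prev->next,
        // so flushing that shared line persists the link together with
        // the payload bytes that live in it.
        prev->next = e;
        flush_range(e, sizeof(LogEntry));
        _mm_sfence();
    }
}
```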

At runtime, Atlas instruments every store instruction within critical sections, as well as lock acquire and lock release operations, so that they call into the library. A log entry is generated and added to the per-thread log for each of these three types of operations. To ensure correctness of recovery, the undo log entry must be persisted before the corresponding store. Atlas flushes each log entry immediately after generating it, but uses a different policy for stores, i.e. a store is not necessarily flushed as soon as it completes in the cache (we will discuss this later). To track lock acquire-release dependencies, the runtime also maintains a hash table that maps each lock to the log record of its most recent release. When a lock is released, the hash table is updated to reflect the change. When a lock is acquired, the acquiring thread queries the hash table and stores a pointer to the last release record of that lock into its lock acquire entry. This information is used when computing the persistent state.
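A sketch of what the instrumented operations might record, building on the LogEntry layout and append routine sketched above. The alloc_entry helper and the global last_release table are illustrative; Atlas's real instrumentation and synchronization details are more involved.

```cpp
#include <pthread.h>
#include <cstdint>
#include <mutex>
#include <unordered_map>

enum : uint64_t { LOG_STORE, LOG_ACQUIRE, LOG_RELEASE };

thread_local LogEntry* log_tail;                               // tail of this thread's log
std::unordered_map<pthread_mutex_t*, LogEntry*> last_release;  // lock -> newest release entry
std::mutex table_mutex;                                        // protects last_release

LogEntry* alloc_entry(uint64_t type);   // from the per-thread circular buffer

// Instrumented store inside a critical section: log the undo image first.
void nvm_store(uint64_t* addr, uint64_t val) {
    LogEntry* e = alloc_entry(LOG_STORE);
    e->addr = addr;
    e->old_val = *addr;            // undo image: value before the store
    append(log_tail, e);           // persisted before the store itself
    log_tail = e;
    *addr = val;                   // the store itself may stay in the cache
}

// Instrumented lock acquire: record a dependency on the previous release.
void nvm_lock(pthread_mutex_t* m) {
    pthread_mutex_lock(m);
    LogEntry* e = alloc_entry(LOG_ACQUIRE);
    {
        std::lock_guard<std::mutex> g(table_mutex);
        e->dep = last_release.count(m) ? last_release[m] : nullptr;
    }
    append(log_tail, e);
    log_tail = e;
}

// Instrumented lock release: publish this release for future acquirers.
void nvm_unlock(pthread_mutex_t* m) {
    LogEntry* e = alloc_entry(LOG_RELEASE);
    append(log_tail, e);
    log_tail = e;
    {
        std::lock_guard<std::mutex> g(table_mutex);
        last_release[m] = e;
    }
    pthread_mutex_unlock(m);
}
```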

Atlas also starts a background thread to maintain internal state. One of the most important roles this background thread plays is computing the current persistent image. Recall that the current persistent image consists only of FASEs that have committed and all persistent stores before them. A FASE is committed only when (1) all of its stores (and, since undo entries are persisted before the corresponding stores, its undo records as well) have been persisted to the NVM, and (2) all FASEs it depends on have committed (if any). To compute the persistent image, the background thread scans the log chain of each thread periodically. It identifies FASEs using a counter, which is incremented on every lock acquire and decremented on every lock release. Stores performed while the counter is non-zero are considered to be inside a FASE (i.e. nested critical sections are flattened out). In addition, stores that are not inside any FASE join the next FASE if there is one, which implies that these stores must not commit before that FASE does (otherwise, the preceding FASE could be undone while stores after it are committed, which is inconsistent). The background thread then builds a graph from the results of the log scan. Nodes of the graph correspond to FASEs (together with the outside-FASE stores preceding them), and edges are dependencies. The background thread marks a FASE as “committed” by checking the status of the stores within that FASE as well as the status of all its dependencies. If both conditions are satisfied, the FASE is considered committed, and all of its log records are recycled. The background thread then atomically updates the log head pointer to point to the next log entry that has not been committed, so that during recovery, all stores after this point are rolled back, whether or not they have physically reached the NVM.
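The pruning pass might be organized roughly as follows, building on the earlier sketches. The Fase structure, the split_into_fases/try_commit helpers, and store_is_durable are illustrative names, not Atlas's actual internals.

```cpp
#include <vector>

struct Fase {
    std::vector<LogEntry*> entries;   // stores, acquires, releases (plus preceding outside-FASE stores)
    std::vector<Fase*>     deps;      // FASEs this one must commit after
    bool committed = false;
};

bool store_is_durable(const LogEntry* e);   // has this store reached NVM? (illustrative)

// Split one thread's log into FASEs with a nesting counter: a FASE opens
// when the counter leaves zero and closes when it returns to zero. Entries
// seen while the counter is zero are folded into the next FASE.
std::vector<Fase*> split_into_fases(LogEntry* head) {
    std::vector<Fase*> out;
    Fase* cur = new Fase();
    int held = 0;
    for (LogEntry* e = head; e != nullptr; e = e->next) {
        cur->entries.push_back(e);
        if (e->type == LOG_ACQUIRE) held++;
        if (e->type == LOG_RELEASE && --held == 0) {   // FASE closed
            out.push_back(cur);
            cur = new Fase();
        }
    }
    delete cur;   // a still-open FASE (or trailing stores) is never committed here
    return out;
}

// A FASE commits once all of its stores are durable and every FASE it
// depends on has committed; its log entries can then be pruned.
bool try_commit(Fase* f) {
    if (f->committed) return true;
    for (LogEntry* e : f->entries)
        if (e->type == LOG_STORE && !store_is_durable(e)) return false;
    for (Fase* d : f->deps)
        if (!d->committed) return false;
    return f->committed = true;
}
```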

After the system crashes, the recovery handler is invoked. Because the background thread only scans the log periodically, some already-committed FASEs may not have been identified. The recovery handler therefore first runs the same log scanning algorithm to advance the persistent image as far as possible, discarding log entries that do not participate in the recovery process, and then starts undoing changes. The rollback process begins at the end of the log and iterates backwards, applying each undo image to its recorded address. Once this process finishes, the system state is equivalent to a point during an uninterrupted execution at which no lock is held by any thread.
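The rollback loop itself is straightforward. A minimal sketch, reusing the LogEntry layout and flush_range helper assumed earlier, and collecting entries into a vector since the log is singly linked:

```cpp
#include <vector>

// Undo every surviving log entry from newest to oldest, restoring the
// memory image to a point where no lock was held.
void rollback(LogEntry* head /* oldest surviving entry */) {
    std::vector<LogEntry*> entries;
    for (LogEntry* e = head; e != nullptr; e = e->next)
        entries.push_back(e);

    for (auto it = entries.rbegin(); it != entries.rend(); ++it) {
        LogEntry* e = *it;
        if (e->type == LOG_STORE) {
            *static_cast<uint64_t*>(e->addr) = e->old_val;   // apply the undo image
            flush_range(e->addr, sizeof(uint64_t));
        }
    }
    _mm_sfence();   // the restored image must be durable before the logs are discarded
}
```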

The paper also proposes several policies for persisting stores. The simplest is to flush each store right after its log entry is persisted. This scheme performs poorly on architectures where a cache line flush is a synchronous operation (e.g. x86 at the time the paper was written), although it guarantees that a FASE's updates are persistent as soon as the thread exits the FASE. At the other extreme, all stores may be recorded in a list that is flushed periodically by the background thread (or by a worker thread at the end of one or more FASEs). Although some stores might be evicted to the NVM before this happens, their persistence is only guaranteed once the background thread flushes the store list. This scheme introduces less overhead per store operation, but generates a burst of flush traffic at the end of a FASE or an epoch. Moreover, at recovery time, stores not yet flushed to the NVM are considered lost and hence must be rolled back. As a middle ground, the paper proposes a software-maintained, direct-mapped cache. The software cache is implemented as an array, indexed using a few bits of the store address. If the store address hits in the cache and the entry was inserted during the same FASE, no new undo image is generated. If the store misses, the address held in the current entry is flushed, and the entry is updated with the new address. Stores that have been persisted by such a cache line flush are reflected in the corresponding log entry (e.g. by setting a bit). The log scanning algorithm uses this flag to infer whether a FASE has been fully persisted.
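A sketch of the middle-ground policy: a per-thread, direct-mapped software cache that decides whether a store needs a new undo entry and which earlier store to flush on eviction. The table size, indexing scheme, and field names are assumptions, and flush_range is the helper assumed earlier.

```cpp
#include <cstdint>
#include <cstddef>

constexpr size_t SLOTS = 1024;

struct CacheSlot {
    uintptr_t addr    = 0;   // address whose undo image is already logged
    uint64_t  fase_id = 0;   // FASE in which that undo entry was created
};

thread_local CacheSlot soft_cache[SLOTS];

// Returns true if the caller must generate a fresh undo entry for 'addr'.
// On a miss, the evicted address is flushed; the evicted store's log entry
// would also be marked (e.g. a "persisted" bit) so the scanning pass can
// tell that it has already reached the NVM.
bool need_undo_entry(uintptr_t addr, uint64_t cur_fase) {
    CacheSlot& s = soft_cache[(addr >> 3) % SLOTS];   // index by a few address bits
    if (s.addr == addr && s.fase_id == cur_fase)
        return false;                                 // hit: already logged in this FASE
    if (s.addr != 0)
        flush_range(reinterpret_cast<void*>(s.addr), sizeof(uint64_t));  // evict: persist the old store
    s.addr = addr;                                    // install the new address
    s.fase_id = cur_fase;
    return true;                                      // miss: log a new undo image
}
```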