Failure-Atomic Persistent Memory Updates via JUSTDO Logging



Paper Title: Failure-Atomic Persistent Memory Updates via JUSTDO Logging
Link: https://dl.acm.org/citation.cfm?id=2872410
Year: ASPLOS 2016
Keyword: Logging; JUSTDO; NVM




This paper proposes JUSTDO logging, a novel logging and recovery method for non-volatile memory. In ordinary logging schemes, such as undo logging, redo logging, and shadow paging, the amount of logged data (i.e., log records and metadata) is proportional to the number of store operations. One log entry is generated for each store operation, containing the address, data, and other control information needed to roll back or redo the affected region. These schemes generally have three problems.

First, they enforce write ordering at the granularity of every store or every transaction commit. In the former case, a log entry is generated before the store is performed, and the entry must be fully flushed to the NVM device before the store can proceed, to guarantee recoverability. Although with new hardware primitives and possibly new architectures the cost of persisting can be overlapped with useful work by smart scheduling (e.g., hardware-level write ordering enforcement), it remains a major overhead for schemes that require per-store logging, e.g. undo logging. In the latter case, the log records are flushed to the NVM at the transaction commit point, which incurs a burst of traffic to the memory interconnect and the NVM. Even worse, the chance of overlapping this burst of write backs with other useful work is slim, because commit is always the last action a transaction takes, which puts the persistence of log records on the critical path. This is the case for redo logging. The second problem is that extra metadata, either on-chip or off-chip (e.g., on the memory controller), may be needed to track the state of stores. This metadata itself also requires persistence for correct crash recovery, which introduces extra traffic, storage, and complexity. The last problem is that recovery in all schemes except shadow paging is data-centric, meaning that the time complexity of recovery depends on the amount of data the current transaction or critical section has generated. The availability of the system may suffer from slow recovery. Shadow paging only needs simple recovery, because by nature this scheme has a "write-once" property for non-volatile data, and only overwrites old data when it is safe to do so. The downside, however, is that the amount of working data is multiplied due to multiversioning, and the system generally has to stall and perform garbage collection to reclaim older versions when storage runs out.
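To make the per-store ordering problem concrete, below is a minimal sketch of an undo-logged store on x86. The log layout, its size, and the function names are illustrative, not taken from any particular system; the point is that every store pays a flush and a fence before it may proceed.

```c
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

/* One undo record per store: target address plus the prior value. */
typedef struct {
    uint64_t *addr;
    uint64_t  old;
} undo_entry_t;

/* A toy append-only log; a real system would manage NVM space. */
static undo_entry_t undo_log[1024];
static int undo_tail;

/* Per-store protocol: the undo record must be durable in NVM *before*
 * the in-place store proceeds, hence the flush + fence on its entry. */
void undo_store(uint64_t *addr, uint64_t val)
{
    undo_entry_t *e = &undo_log[undo_tail++];
    e->addr = addr;
    e->old  = *addr;
    _mm_clflush(e);      /* push the record toward NVM          */
    _mm_sfence();        /* log entry durable before the store  */
    *addr = val;         /* now the in-place update may happen  */
}
```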

Justdo logging takes advantage of two important observations. First, most data-race-free applications use locks as the basic synchronization primitive. Lock acquisition and release can be interpreted as granting and giving up permission to access shared state. Applications only observe inconsistent shared data within critical sections (i.e., while holding at least one lock), and leave shared data in a consistent state before leaving the critical section. Crash recovery, therefore, can focus only on these (potentially nested) critical sections, called "Failure-Atomic SEctions" (FASEs), and restore the system to a state equivalent to a point in normal execution where no lock is held by any thread. The second observation is that, although data-centric recovery may simplify the recovery algorithm (the recovery handler just iterates over all log records and applies the data to the address indicated by each record), the same sequence of stores can be reproduced by running the same FASE with the same input (the paper fails to identify that if the FASE itself is non-deterministic, the sequence cannot be reconstructed, in which case re-executing with the same input will not help), which eliminates the need for value logging. As long as the recovery process can resume the interrupted execution of a FASE with the same environment (e.g., local stack frames, arguments, etc.), the same output is guaranteed to be generated, and the system reaches the same state as if the original FASE had never been interrupted by the crash.
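The notion of a FASE can be made concrete with an ordinary lock-based example. The account-transfer scenario below is hypothetical: the FASE spans from the first acquire to the last release, and shared state is only inconsistent while inside it.

```c
#include <pthread.h>

/* Locks and balances are assumed to live in NVM; the lock span below
 * is one FASE (nesting depth rises from 0 and returns to 0). */
static pthread_mutex_t a_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t b_lock = PTHREAD_MUTEX_INITIALIZER;
static long balance_a, balance_b;

void transfer(long amount)
{
    pthread_mutex_lock(&a_lock);    /* lock count 0 -> 1: FASE begins */
    pthread_mutex_lock(&b_lock);    /* 1 -> 2 (nested)                */
    balance_a -= amount;            /* shared state inconsistent here */
    balance_b += amount;            /* consistent again               */
    pthread_mutex_unlock(&b_lock);  /* 2 -> 1                         */
    pthread_mutex_unlock(&a_lock);  /* 1 -> 0: FASE ends              */
}
```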

This paper also identifies an important trend in the evolution of hardware platforms: in the near future, the cache hierarchy may also be included in the persistence domain, just as the write pending queue in the memory controller already is. This can be achieved with battery-backed caches, or by using residual/backup power to flush the cache back to the NVM on failure. Whichever technique is used, the introduction of persistent caches can greatly simplify persistent programming, since a store instruction can be considered persistent as soon as it has exited the store buffer. This only takes a relatively lightweight store fence (which drains the store buffer, exactly as needed), instead of a heavyweight fence-flush-fence sequence, which can take hundreds of cycles.
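The difference in cost can be sketched with x86 intrinsics (illustrative only; real ISAs also offer variants such as CLWB and CLFLUSHOPT):

```c
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

/* Volatile caches: durability needs a fence-flush-fence sequence,
 * which can cost hundreds of cycles per cache line. */
void persist_with_volatile_cache(void *line)
{
    _mm_sfence();        /* order the store before the flush   */
    _mm_clflush(line);   /* push the cache line toward NVM     */
    _mm_sfence();        /* wait until the writeback completes */
}

/* Persistent caches: a store is durable once it leaves the store
 * buffer, so draining the store buffer with one fence suffices. */
void persist_with_persistent_cache(void)
{
    _mm_sfence();
}
```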

Justdo logging overcomes the above problems using a combination of a persistent cache, execution-centric recovery, and FASE-based inference. Each thread maintains a small thread-local logging area, which consists of a special logging object (jd_obj in the paper) and other control data. The logging object consists of two log entries, a list of pending lock descriptors, and a list of owned lock descriptors. The two log entries are used alternately to log the most recent store instruction in the current FASE, including the address, the value, and the program counter immediately after the store. Two entries are needed because, while a log entry is being written, its content may be temporarily inconsistent. To avoid corrupting the log, the previously valid entry is always kept clean, and writes go to the other one (more details later). The two lock lists are responsible for recovering lock ownership after a crash. The paper assumes that all locks are mapped to an NVM region, and hence their state remains available after the crash (so the lock lists only serve as a fast path - if the lock implementation records its owner, we can always find the owner of a lock by scanning the lock table).
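A plausible layout of the logging object, reconstructed from the description above; field names and the list capacity are illustrative, not taken from the paper's code:

```c
#include <stdint.h>

#define JD_MAX_LOCKS 16   /* assumed list capacity, not from the paper */

/* One log entry: enough to redo the latest store and resume after it. */
typedef struct {
    uint64_t *addr;    /* destination of the logged store       */
    uint64_t  value;   /* value being stored                    */
    void     *pc;      /* program counter just after the store  */
    int       active;  /* set on the currently valid entry      */
} jd_entry_t;

/* Thread-local logging object (the paper's jd_obj), mapped into NVM. */
typedef struct {
    jd_entry_t entries[2];             /* written alternately   */
    void *pending[JD_MAX_LOCKS];       /* locks possibly held   */
    int   n_pending;
    void *owned[JD_MAX_LOCKS];         /* locks definitely held */
    int   n_owned;
} jd_obj_t;
```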

FASEs in Justdo logging are wrapped in special wrappers called justdo routines. A justdo routine consists of three parts. The first part is a header macro, JD_ROUTINE_ON_ENTRY. This macro checks the current mode of execution (normal/recovery) and, depending on the mode, either jumps to the point of interruption (recovery mode) or does nothing (normal mode). In the second part, the justdo routine copies the local state needed to execute the FASE into the logging object, which is mapped to persistent storage (i.e., reachable through the persistent cache). This is necessary because, for the FASE to resume execution, its local state must remain accessible after a crash. The third part is the FASE body, which consists of lock acquires, lock releases, and other operations. Note that locks do not have to be perfectly nested, unlike in some other FASE-based designs. If execution is interrupted in the middle of this part, Justdo logging guarantees that the FASE completes as if the interruption never happened.
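A hedged sketch of what such a routine might look like. JD_ROUTINE_ON_ENTRY is declared as a function here for brevity (the paper implements it as a macro that can jump within the routine), and the jd_* helper names and signatures are assumptions:

```c
#include <stdint.h>

/* Hypothetical library interface, sketching the shape described above. */
extern void JD_ROUTINE_ON_ENTRY(void);      /* part 1: mode check/resume */
extern void jd_persist_local(const void *p, unsigned long n);
extern void jd_lock(void *lock);
extern void jd_unlock(void *lock);
extern void jd_store(uint64_t *addr, uint64_t value);

extern void *counter_lock;   /* lock object resident in NVM */
extern uint64_t counter;     /* persistent shared state     */

/* A justdo routine wrapping one FASE. */
void add_to_counter(uint64_t delta)
{
    JD_ROUTINE_ON_ENTRY();                  /* jump to the resume point
                                               if we are in recovery    */
    jd_persist_local(&delta, sizeof delta); /* part 2: persist locals
                                               the FASE body will need  */

    jd_lock(counter_lock);                  /* part 3: the FASE body    */
    jd_store(&counter, counter + delta);
    jd_unlock(counter_lock);
}
```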

At runtime, Justdo logging works as follows. On every store within a FASE (detectable by incrementing/decrementing a counter on lock acquire/release), the library updates the log entry with the information about the store (described above). To make the update of a log entry atomic, the two entries (statically allocated in jd_obj) are used alternately. The currently active entry is marked by a flag in the entry object. When a new log entry is to be written, the library picks the inactive entry, updates its fields, and executes a store barrier to ensure these writes have reached the persistent cache. Then the library sets the flag and executes a store barrier again. This ensures that the store is only performed after the log entry is durable.

Lock acquire and release are also instrumented. On a locking operation, before the lock is physically acquired, the library adds the address of the lock to the pending list, which indicates that the lock might be acquired by the current thread. After the lock is physically acquired, and before any subsequent instruction executes, the library adds the lock address to the ownership list, which indicates that the lock is now owned by the current thread; if a lock is recorded there, the enclosing FASE must be executed to completion during recovery. The observation here is that, at any given point, if an address X is in the pending list but not in the ownership list, the lock may or may not have been acquired by the thread, but it is guaranteed that none of the FASE body has executed, and hence no inconsistency has been introduced. In this case, the FASE can be considered not executed, and no recovery needs to be done. Similarly, on a lock release, the lock is first removed from the ownership list, and then from the pending list. If the former entry is missing but the latter is present, we know that the FASE has already completed, and on recovery there is nothing to do except release the lock.
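The store and locking protocol might look roughly as follows. This sketch reuses the jd_entry_t/jd_obj_t layout sketched earlier, uses a GNU label-address in place of real compiler support for capturing the resume PC, and assumes LIFO lock ordering so that list removal is a simple pop:

```c
#include <stdint.h>
#include <pthread.h>
#include <emmintrin.h>   /* _mm_sfence: drains the store buffer */

/* jd_entry_t / jd_obj_t as defined in the earlier sketch. */
static __thread jd_obj_t *jd;   /* this thread's logging area in NVM */

void jd_store(uint64_t *addr, uint64_t value)
{
    int cur  = jd->entries[0].active ? 0 : 1;
    int next = 1 - cur;

    /* Fill the inactive entry; the active one stays intact. */
    jd->entries[next].addr  = addr;
    jd->entries[next].value = value;
    jd->entries[next].pc    = &&after;   /* GNU extension, standing in
                                            for compiler support       */
    _mm_sfence();                        /* entry fields durable first */

    jd->entries[next].active = 1;        /* flip the active flag       */
    jd->entries[cur].active  = 0;
    _mm_sfence();                        /* flip durable before store  */

    *addr = value;                       /* finally perform the store  */
after:
    return;
}

void jd_lock(pthread_mutex_t *l)
{
    jd->pending[jd->n_pending++] = l;    /* "may hold l"               */
    _mm_sfence();
    pthread_mutex_lock(l);               /* physical acquire           */
    jd->owned[jd->n_owned++] = l;        /* "definitely holds l"       */
    _mm_sfence();
}

void jd_unlock(pthread_mutex_t *l)
{
    jd->n_owned--;                       /* drop ownership first...    */
    _mm_sfence();
    pthread_mutex_unlock(l);
    jd->n_pending--;                     /* ...then the pending record */
    _mm_sfence();
}
```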

On recovery, the library starts the same number of recovery threads as existed before the crash, and assigns each thread to one pre-crash context. The recovery process is naturally multi-threaded: some locks may still be held, and threads may block each other during recovery, so execution must be multi-threaded to avoid deadlock. Each recovery thread scans its lock lists to infer lock ownership. First, it frees all locks in its pending list (including those also in the ownership list) by directly writing into the lock objects whose addresses appear in the list. Threads must pass a barrier at the end of this stage. Then, each thread scans its ownership list and re-acquires the locks it owned. Since a lock can be owned by at most one thread at a time, this process is data-race-free, and the system always reaches a deterministic state, though not necessarily the exact state from before the crash (e.g., some acquired locks might not be reflected in the ownership list, which is fine, because those threads had not performed any writes anyway). Recovery threads pass a second barrier at the end of this stage.

In the third stage, the system enters recovery mode, and the most recent justdo routine is re-executed. Recall that at the beginning of the routine, a procedure determines the current mode. If the system is in recovery mode, instead of proceeding to execute the rest of the routine, the procedure performs recovery by completing the store in the currently active log entry, and then jumping directly to the PC stored in that entry. The recovery thread runs the FASE to the end, and then exits. A counter is used to detect the end of a FASE: it records the number of locks currently held by the thread, incremented on acquire and decremented on release. The current FASE ends when the counter drops to zero.
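A sketch of one recovery thread's three stages, again using the jd_obj_t layout from above; force_release and jd_resume_at are hypothetical helpers, not the paper's actual interface:

```c
#include <stdint.h>
#include <pthread.h>

/* jd_entry_t / jd_obj_t as defined in the earlier sketch. */
extern void force_release(void *lock);   /* overwrite lock state in NVM */
extern void jd_resume_at(void *pc);      /* jump back into the FASE     */

void jd_recover(jd_obj_t *jd, pthread_barrier_t *bar)
{
    /* Stage 1: free every lock this thread may have held. */
    for (int i = 0; i < jd->n_pending; i++)
        force_release(jd->pending[i]);
    pthread_barrier_wait(bar);

    /* Stage 2: re-acquire exactly the locks this thread owned; each
     * lock has at most one owner, so this stage is race-free. */
    for (int i = 0; i < jd->n_owned; i++)
        pthread_mutex_lock((pthread_mutex_t *)jd->owned[i]);
    pthread_barrier_wait(bar);

    /* Stage 3: if a FASE was in flight, complete the store in the
     * active log entry and jump to the saved PC; the FASE then runs
     * until the thread's lock counter drops back to zero. */
    if (jd->n_owned > 0) {
        jd_entry_t *e = &jd->entries[jd->entries[0].active ? 0 : 1];
        *e->addr = e->value;   /* redo the most recent store */
        jd_resume_at(e->pc);   /* resume execution inside the FASE */
    }
}
```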

To ensure that the local state is always consistent with every store, justdo logging requires that every update to the local state be written back to memory. This incurs a non-trivial overhead, because compilers normally promote variables into registers to avoid excessive memory writes. With that optimization, recent updates to a variable could be lost in a crash, since register values are volatile, leaving the local state inconsistent when a recovery thread jumps to the PC of the most recent store. Disabling register promotion almost doubles execution time, as reported in the paper, but this cost is necessary for correctness, and is still lower than that of logging-based approaches. Recall that before entering a FASE, the justdo routine has already copied volatile local state into the justdo object. Inside the FASE, all local state updates are performed on this non-volatile copy, which is then used to resume execution during recovery.
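One way to defeat register promotion selectively is a volatile qualifier on FASE-local state. In this hypothetical sketch, the loop counter is assumed to point into the NVM-resident copy that the justdo routine made on entry, so every update of it reaches memory and stays consistent with the latest logged store:

```c
#include <stdint.h>

/* Hypothetical jd_* interface, as in the earlier sketches. */
extern void JD_ROUTINE_ON_ENTRY(void);
extern void jd_lock(void *l);
extern void jd_unlock(void *l);
extern void jd_store(uint64_t *addr, uint64_t value);
extern void *dst_lock;   /* lock object resident in NVM */

/* 'i' is volatile-qualified so the compiler cannot promote it into a
 * register: each increment of the loop counter is written to memory. */
void add_arrays(uint64_t *dst, const uint64_t *src, volatile int *i, int n)
{
    JD_ROUTINE_ON_ENTRY();

    jd_lock(dst_lock);
    for (; *i < n; (*i)++)
        jd_store(&dst[*i], dst[*i] + src[*i]);
    jd_unlock(dst_lock);
}
```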