ResPCT: Fast Checkpointing in Non-volatile Memory for Multi-threaded Applications



- Hide Paper Summary
Paper Title: ResPCT: Fast Checkpointing in Non-volatile Memory for Multi-threaded Applications
Link: https://dl.acm.org/doi/10.1145/3492321.3519590
Year: EuroSys 2022
Keyword: NVM; ResPCT; Undo Logging; Memory Snapshot; Epoch-Based Snapshot



Back

Highlights:

  1. In-cache line logging places the data item and the undo log entry in the same cache line, such that either these two are written back to the NVM atomically, or neither of them is written back. This approach saves the persist barrier between the log entry and the data item in conventional undo logging.

  2. In software, the global consistent state can be captured by stopping all threads at their local consistent points (i.e., points in the code where global states are not being mutated). The local consistent points can be manually marked by programmers with little effort, especially for lock-based data structures.

Comments:

  1. How does the recovery procedure know the full set of variables to look for? For pointer-based data structures, this process is easy because you can just traverse the entire structure to figure out all the shared states. But what if the states are not connected by pointers? Do we need a central repository that holds the addresses of all shared variables?

This paper presents ResPCT, a software epoch-based non-volatile memory snapshotting framework. ResPCT protects the integrity of shared data from system failures by periodically checkpointing the states of variables to the NVM. To maximize performance while minimizing programmer effort, ResPCT adopts in-cache line undo logging and leverages programmers’ notations to mark checkpoint boundaries. ResPCT achieves only marginal performance loss while providing the benefit of failure atomicity. Compared with prior works, ResPCT also achieves a significant reduction in runtime overhead due to its in-cache line logging model.

ResPCT aims to provide an epoch-based checkpointing model for lock-based data structures. In the epoch-based model, the multithreaded execution is divided into non-disjoint intervals, called epochs, and the state of shared data is persisted to the NVM at the end of an epoch and before the next one starts. On crash recovery, the system finds the latest persisted epoch and then restores the system state to the one recorded by that epoch. ResPCT assumes that the application periodically creates a checkpoint using an interface function it provides. ResPCT also assumes that the application programmer can explicitly mark the Restart Points (RP) to which the system state can potentially be restored. In the case of multithreaded applications, this model implies that the checkpoint is taken when all threads reach the next RP after the checkpoint request is made. For common use cases such as data structures, the RPs can be easily identified at some points in the code region that is not within a critical section (as the code within critical sections will mutate shared data and leave it in a temporarily inconsistent state). Currently, ResPCT only supports critical section-based use cases. Lock-free data structures, for example, are not supported because ResPCT’s internal logging API is not safe for concurrent invocations.

Prior works that leverage undo logging often exhibit low performance due to the inevitable usage of persist barriers. A persist barrier consists of one or more cache line flushes or write backs, followed by a memory fence, which maps to clflush/clwb and sfense on x86 architecture, respectively. Persist barriers are detrimental to performance because excessive cache line flushes will contend memory bandwidth with regular operations, and the store fence forces the processor pipeline to stall until the flushes are completed. Unfortunately, software implementations of undo logging often involve one persist barrier per memory write within an epoch, i.e., when a cache block is first time written in the epoch, the undo log entry is generated and then immediately flushed into the NVM log buffer using a persist barrier. The actual store can only be performed after the barrier since undo logging enforces the write ordering between log write and data write.

ResPCT gets rid of the excessive persist barrier of undo logging using a technique called in-cache line logging (InCLL). In ResPCT, persistent variables are wrapped by templates that generate two other implicit variables. The first implicit variable is the undo entry of the same type. The second implicit variable is an epoch number indicating the epoch that the undo entry belongs to. The template is carefully crafted such that the compiler will put the three variables on the same cache block. On x86 architecture (and many other architectures), since the basic unit of persistence is no less than a single cache line, such a data layout can guarantee that the three variables will always be persisted atomically.

ResPCT maintains a global epoch counter that indicates the current global epoch number. The epoch counter will be incremented by one when the global epoch advances by calling the checkpoint function. When a variable is being modified, the template wrapper function will first check whether the epoch of the variable equals the current global epoch. If true, then no logging needs to be done, as the variable has already been modified under the same epoch. Otherwise, the wrapper function will copy the old value of the variable into the undo entry and then update the variable’s epoch value to the current global epoch, hence generating the undo log entry. Since the undo entry and the data item are on the same cache block, they will either be written back to the NVM together, or the value on the NVM remains unchanged since the last modification. In the former case, recovery is performed by copying the undo log entry back to the data item. In the latter case, no recovery is needed as the updated value in the current epoch is never written back to the NVM.

Each thread in ResPCT also maintains a thread-local list that stores the addresses of variables that have been modified during the current epoch. When a log entry is generated, the software wrapper inserts the address of the variable into the list. The list is traversed during epoch commit in order to flush all modifications back to the NVM.

As mentioned earlier, the current epoch is committed to the NVM when a thread calls the checkpoint function, which sets a global flag timer to indicate that threads should checkpoint their modified states. In this case, all threads, excluding the caller, must reach a global stop at the next RP. To achieve this, each thread maintains a local flag variable to indicate whether the thread has entered the RP. When a thread calls the RP function, it first checks whether the global timer flag is set, i.e., whether a checkpoint is requested. If true, the thread will traverse its local list of modified variables and flushes them back by issuing a persist barrier. After the persist barrier returns, the thread sets its local flag to signal the initiator thread that the checkpoint has been completed. The initiator of the checkpoint, on the other hand, first flushes its local modifications as described earlier, and then keeps polling the local flags of all other threads until all threads have completed the local flush. At this point, the checkpoint has been globally completed and the initiator clears the timer flag after advancing the global epoch counter. The global epoch counter is also persisted to the NVM such that the recovery procedure can read it and decide the last committed epoch. On observing that the timer flag has been reset (in a polling loop), the rest of the threads will return from the RP function and proceed with their execution.

When the system crashes, some modifications after the last committed epoch will be lost due to not being written back to the NVM. However, some other modifications will be reflected in the post-crash NVM image, causing an inconsistent state which needs to be restored. The restoration procedure first determines the last committed epoch by reading the global epoch counter. It then iterates over all shared variables and compares the epoch of the variable with the last committed epoch value. If the two values mismatch, suggesting that the variable contains a newer value not belonging to the epoch, the recovery procedure will then roll back the modifications by copying the undo entry to the variable and then resetting its epoch value to the last committed epoch. After all the variables have been processed, the system state is now reverted to the point of the last committed epoch.

The paper also noted that, by blocking threads in the RP function until the global checkpoint is completed, deadlocks may arise in deadlock-free code as a result of other blocking function calls or condition variables. In these cases, one thread is blocked by a condition that can only be lifted by another thread proceeding to a certain point. However, if the latter thread is blocked in the RP function, it will never proceed to that point as the former thread is also blocked elsewhere and will never enter an RP function. ResPCT addresses this problem by trusting the programmer to insert function calls checkpoint_allow() that exclude the thread that calls blocking functions or wait in condition variables from the global consensus before it is blocked. The thread should also rejoin the global consensus by calling checkpoint_prevent() after it returns from the blocking call. Unfortunately, threads that call checkpoint_allow() must not perform any write to persistent variables between this call and the previous RP function call in all cases. Otherwise, if a checkpoint happens when the thread is blocked, after which the system crashes, these modifications may not be reverted as the global epoch counter had already been advanced to make these local modifications seem committed states. But in reality, these modifications are not guaranteed to be written back to the NVM due to lack of RP calls of the blocked thread.