Survive: Pointer-Based In-DRAM Incremental Checkpointing for Low-Cost Data Persistence and Rollback-Recovery
Link: https://ieeexplore.ieee.org/document/7801818
Year: IEEE Computer Architecture Letters 2017
Keyword: Checkpointing; NVM
This paper proposes Survive, a memory checkpointing architecture built on a hybrid NVM and DRAM system. The biggest difference between Survive and similar systems is that Survive still runs on DRAM just like a conventional DRAM-only system. NVM is deployed as a backend that is invisible to applications and even to system libraries. Only during recovery is NVM visible to a system service called the recovery handler. The paper claims two benefits of such a hybrid system. First, writes to NVM can be significantly reduced, because most write operations are filtered out by the DRAM. This reduces both write latency, which is often on the critical path at the end of a checkpoint, and pressure on the NVM's wear leveling mechanism, since fewer writes are actually performed. With careful optimizations such as reordering and coalescing, write amplification can also be alleviated. The second benefit is that hiding NVM behind DRAM allows a simpler hardware design. There is no need for complicated hardware to enforce proper write ordering, which is a necessity for logging. In addition, most existing components - coherence protocols, eviction policies, directories, etc. - can remain unchanged. This implies that no major hardware upgrade is required in order to deploy Survive.
Like many other proposals, Survive adopts an epoch-based recovery paradigm. Normal execution of processors is divided into epochs. At the end of each epoch, all processors are interrupted to establish a checkpoint, which in general consists of three steps. In the first step, processors stall and drain volatile state, including the store queue and instruction window. In the second step, processors evict all dirty cache lines in their local caches back to the lower-level persistent storage. In the last step, processors write their execution contexts, including the register file and some on-chip state, to the persistent storage as well. After these three steps, a marker is written at the end of the checkpoint to indicate its completion. Normal execution resumes after these three steps, which begins a new epoch. During normal execution, when a data item is written for the first time in an epoch, an undo log entry is generated which contains the before-image of the cache line. The undo log entry is written into the persistent storage before the updated cache line (some hardware is needed to enforce this write ordering). On recovery, the recovery handler first replays the undo log to restore the system state to the beginning of one of the recent checkpoints. Note that not all designs allow us to restore the system state to the most recent checkpoint, because if the persistence of the last checkpoint overlaps with the execution of the next few epochs, a crash may happen before the most recent checkpoint is fully persisted. In this case, the only option is to discard the incomplete checkpoint and use the most recent valid checkpoint for recovery. After restoring the system state to the beginning of a checkpoint, the recovery handler instructs all processors to load their execution contexts, after which the system can resume execution.
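To make the undo-logging rule concrete, below is a minimal software sketch of the write ordering during normal execution: the before-image of a line is logged on its first write in an epoch, strictly before the updated data is persisted. All identifiers (`PersistentStore`, `Epoch`, `write_line`) are illustrative and not from the paper.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using Addr = uint64_t;
using Line = std::vector<uint8_t>;   // stand-in for a cache line

struct PersistentStore {
    std::unordered_map<Addr, Line> data;          // checkpointed data image
    std::vector<std::pair<Addr, Line>> undoLog;   // undo entries of the current epoch
};

struct Epoch {
    std::unordered_set<Addr> loggedThisEpoch;     // addresses already logged

    // Write 'value' to 'addr': on the first write of this address in the
    // current epoch, persist the before-image strictly before the update.
    void write_line(PersistentStore& ps, Addr addr, const Line& value) {
        if (!loggedThisEpoch.count(addr)) {
            ps.undoLog.emplace_back(addr, ps.data[addr]);  // before-image first
            loggedThisEpoch.insert(addr);
        }
        ps.data[addr] = value;                             // then the new data
    }
};
```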
Survive performs logging in the memory controller at the granularity of DRAM rows, which are typically a few KB in size. For every past epoch that has not been garbage collected, the memory controller maintains a list of the row addresses logged during that epoch. The DRAM hardware address space is divided into two parts: the first part serves normal DRAM reads and writes as usual, and the second part is a logging area where log entries and checkpoints are stored. When a DRAM row is written by a cache line write-back from the LLC, the memory controller checks whether the DRAM row is already in the log of the current epoch. If not, a copy-on-write is performed, during which the before-image of the row is copied to the logging area, and the address of the row is added to the end of the row ID list. The COW operation can be implemented rather efficiently in DRAM using RowClone-FPM: two rows in the same sub-array can be copied with back-to-back ACTIVATE commands, without an intervening PRECHARGE. Compared with performing logging from the processor, this approach has both lower latency and higher throughput, because the row copy is performed within DRAM and does not consume memory bus bandwidth.
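Below is a small sketch of the per-row copy-on-write check the memory controller performs on an LLC write-back, under simplifying assumptions: the logging area is keyed by the original row address, and the in-DRAM RowClone-FPM copy is modeled as a plain row copy. The names (`Dram`, `EpochLog`, `on_llc_writeback`) and the 4 KB row size are illustrative.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

constexpr size_t kRowBytes = 4096;            // assumed DRAM row size
using Row     = std::array<uint8_t, kRowBytes>;
using RowAddr = uint64_t;

struct Dram {
    std::unordered_map<RowAddr, Row> mainArea;     // normal reads and writes
    std::unordered_map<RowAddr, Row> loggingArea;  // undo images and checkpoints
};

struct EpochLog {
    std::unordered_set<RowAddr> logged;     // rows already logged this epoch
    std::vector<RowAddr> rowIdList;         // order in which rows were logged
};

// Invoked when a dirty cache line is written back from the LLC to row 'addr'.
void on_llc_writeback(Dram& dram, EpochLog& log, RowAddr addr,
                      size_t lineOffset, const std::vector<uint8_t>& line) {
    if (!log.logged.count(addr)) {
        // Copy-on-write: preserve the before-image in the logging area
        // (done by RowClone-FPM inside the DRAM in the real design).
        dram.loggingArea[addr] = dram.mainArea[addr];
        log.logged.insert(addr);
        log.rowIdList.push_back(addr);
    }
    // Apply the write-back to the main-area row.
    Row& row = dram.mainArea[addr];
    std::copy(line.begin(), line.end(), row.begin() + lineOffset);
}
```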
Survive uses a different rollback model than most other designs. Instead of allowing data to be written back to the NVM during normal execution, which may leave the NVM inconsistent, Survive disallows dirty write-backs to the NVM. This is achievable in Survive but difficult in other designs, because Survive builds upon a conventional DRAM-based system, in which the DRAM is the last level of the volatile memory hierarchy. Since the NVM always holds a consistent snapshot, replaying undo log entries becomes unnecessary on crash recovery, and the recovery handler simply skips this step. In other words, Survive is a hybrid of redo and undo logging: undo logging is used for DRAM-level rollback, while redo logging is used to keep the NVM a consistent image as of the end of some past epoch.
To keep the NVM image consistent while making progress, Survive gradually migrates data from DRAM to NVM. Two observations make this possible. First, for epochs i and j, where i < j, if an undo log entry is generated at epoch j and no log entry on the same address is generated between i and j, then that undo log entry actually holds the live value of the address for the epochs between i and j. In other words, the state of any address at the end of epoch i is exactly the value of the undo log entry generated in epoch j, if such a j exists. If no such epoch j exists, the cache line has not been modified between epoch i and the current epoch, and the current value of the line can be directly copied, because we know the current epoch has not modified it. The second observation is that the set of addresses of undo log entries generated by epoch i is also the write set of epoch i. In other words, by iterating over the set of addresses in epoch i's undo log, we are able to obtain both the write set and the redo image of epoch i. In the following we demonstrate how these two observations motivate data migration in Survive.
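The first observation can be captured in a few lines: the value of a row at the end of epoch i is either the undo image recorded by the earliest epoch j > i that logged the row, or the current main-area value if no later epoch logged it. This is a sketch under the assumption that undo logs are indexed per epoch; `value_at_end_of_epoch` and the data layout are illustrative.

```cpp
#include <cstdint>
#include <map>

using RowAddr  = uint64_t;
using RowImage = uint64_t;   // stand-in for a full DRAM row

// undoLogs: epoch number -> (row address -> before-image logged by that epoch)
RowImage value_at_end_of_epoch(
        const std::map<int, std::map<RowAddr, RowImage>>& undoLogs,
        const std::map<RowAddr, RowImage>& mainArea,
        int i, RowAddr addr) {
    // Find the earliest epoch j > i that generated an undo entry for 'addr'.
    for (auto it = undoLogs.upper_bound(i); it != undoLogs.end(); ++it) {
        auto found = it->second.find(addr);
        if (found != it->second.end())
            return found->second;    // before-image of epoch j = state at end of epoch i
    }
    return mainArea.at(addr);        // never modified after epoch i: current value is correct
}
```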
In order to track versions, the memory controller reserves a chunk of memory as tags for DRAM rows. Every row in the main area and the logging area has a tag. The tag is a row address pointer whose meaning differs between the main area and the logging area. For main area rows, the tag points to the most recent undo log entry in the logging area. This simplifies rollback recovery, because the memory controller can simply read the tag and copy the undo image back to the corresponding row in the main area. For logging area rows, the tag serves as a pointer to the next (i.e., more recent) undo log entry. If the undo log entry is already the most recent entry, then the tag points back to the corresponding row in the main area. When a new log entry is generated, it is inserted into the version chain as follows. First, the tag of the main area row is altered to point to the newly generated entry. Next, the entry's tag is modified to point to the main area row. Finally, the tag of the previous topmost entry is modified to point to the newly generated entry.
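A minimal sketch of this tag-based version chain and the insertion of a new undo entry is given below, assuming the tags are modeled as a map from row address to row address; `Tags` and `insert_undo_entry` are illustrative names. Note that the previous topmost entry is read out before the main-area tag is overwritten.

```cpp
#include <cstdint>
#include <unordered_map>

using RowAddr = uint64_t;

struct Tags {
    // tag[r] is the row-address pointer of row r.
    // Main-area row:    points to its most recent undo entry (if any).
    // Logging-area row: points to the next (more recent) undo entry, or back
    //                   to the main-area row if it is already the most recent.
    std::unordered_map<RowAddr, RowAddr> tag;
};

// Splice a freshly generated undo entry 'newEntry' into the version chain
// of main-area row 'mainRow'.
void insert_undo_entry(Tags& t, RowAddr mainRow, RowAddr newEntry) {
    bool hadPrev    = t.tag.count(mainRow) != 0;
    RowAddr prevTop = hadPrev ? t.tag[mainRow] : 0;  // previous most recent entry
    t.tag[mainRow]  = newEntry;    // main-area row now points at the new entry
    t.tag[newEntry] = mainRow;     // new entry is the most recent, so it points back
    if (hadPrev)
        t.tag[prevTop] = newEntry; // old topmost entry now points to the new one
}
```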
With version chains, data migration proceeds as follows. A background thread in the memory controller periodically scans the list of epochs that have not been migrated. If the list is non-empty, the thread iterates over the undo log entries of an unmigrated epoch, follows each entry's tag to the next (more recent) version, and writes that image into the NVM. If the tag points to the main area, the memory controller must ensure that no write can modify the main area row while it is being read, because otherwise dirty data written by the current epoch might be copied during the migration of a completed epoch. The background thread keeps scanning the list until all epochs have been processed, at which point the state of the NVM is exactly the state at the beginning of the current epoch.
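The per-epoch migration pass might look like the sketch below, which assumes each epoch's log records the pair (main-area row, logging-area entry) for every logged row; `MigrationState`, `LogRecord`, and `migrate_epoch` are illustrative names rather than the paper's.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using RowAddr  = uint64_t;
using RowImage = uint64_t;   // stand-in for a full DRAM row

struct MigrationState {
    std::unordered_map<RowAddr, RowImage> mainArea;     // current DRAM data
    std::unordered_map<RowAddr, RowImage> loggingArea;  // undo images
    std::unordered_map<RowAddr, RowAddr>  tag;          // version-chain pointers
    std::unordered_map<RowAddr, RowImage> nvm;          // consistent NVM image
};

struct LogRecord { RowAddr mainRow; RowAddr logEntry; };

// Migrate one completed epoch: 'epochLog' is that epoch's undo log, which by
// the second observation is also its write set.
void migrate_epoch(MigrationState& s, const std::vector<LogRecord>& epochLog) {
    for (const LogRecord& rec : epochLog) {
        RowAddr next = s.tag.at(rec.logEntry);     // next (more recent) version
        RowImage image = (next == rec.mainRow)
            ? s.mainArea.at(rec.mainRow)           // no later version: current value
            : s.loggingArea.at(next);              // undo image of a later epoch
        // In hardware, writes to the main-area row must be blocked while it is
        // read here, so data from the still-running epoch is never migrated.
        s.nvm[rec.mainRow] = image;                // NVM now holds the end-of-epoch state
    }
}
```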
Data migration for a single epoch, however, is not itself atomic. If the system crashes during an active migration, the NVM will be left in an inconsistent state. To deal with this problem, the paper proposes that an extra non-volatile buffer be added to the NVM. Data migration first writes the after-images into the non-volatile buffer; the in-place row copies are performed only after all rows of the epoch have been safely committed into the buffer. Even if the system crashes during a migration, the migration can still be resumed by the recovery handler, because all of the epoch's redo entries have already been written into the non-volatile buffer.
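A sketch of this two-phase (stage, then apply) migration is given below, assuming the buffer simply records address/image pairs plus a commit flag; `NvBuffer` and `stage_then_apply` are illustrative names, and the persistence ordering between the two phases is abstracted away.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using RowAddr  = uint64_t;
using RowImage = uint64_t;   // stand-in for a full DRAM row

struct NvBuffer {
    std::vector<std::pair<RowAddr, RowImage>> staged;  // persisted before the apply phase
    bool committed = false;                            // set once the whole epoch is staged
};

void stage_then_apply(std::unordered_map<RowAddr, RowImage>& nvm,
                      NvBuffer& buf,
                      const std::vector<std::pair<RowAddr, RowImage>>& redoImages) {
    // Phase 1: persist every redo image of the epoch into the staging buffer.
    for (const auto& rec : redoImages)
        buf.staged.push_back(rec);
    buf.committed = true;              // from here on, the migration is recoverable

    // Phase 2: copy the staged images to their home locations in NVM.
    // If the system crashes here, the recovery handler replays buf.staged.
    for (const auto& [addr, image] : buf.staged)
        nvm[addr] = image;
    buf.staged.clear();
    buf.committed = false;
}
```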