ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems



Paper Title: ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems
Link: https://dl.acm.org/citation.cfm?doid=2830772.2830802
Year: MICRO 2015
Keyword: NVM; Redo; Epoch; Checkpointing




This paper proposes ThyNVM, a hardware checkpointing mechanism that uses multi-granularity multiversioning. Before ThyNVM, there were two main ways of achieving continuous checkpointing: logging and shadow paging. Both approaches have inherent problems that limit their efficiency. For logging, the granularity at which data is copied into the log plays an important role in system performance. If log entries are generated at a fine granularity, the metadata overhead can be several times larger than the logged data itself, since each log entry must store a transaction ID, a log sequence number, and other auxiliary information. If, on the other hand, the granularity is large, the overhead of copying data into log entries dominates execution time. The problem is aggravated if data items are only written sparsely, in which case space is also wasted, because most of the data copied into the log entry is unnecessary. Similar problems can be observed with shadow paging: on every memory update, if the page containing the updated location has not been buffered in DRAM, a page copy is made to transfer the page from NVM to DRAM, and all subsequent writes to that page are buffered by the shadow page until the next checkpoint, at which point the dirty page is written back. This approach, however, is sub-optimal if the page is only written a few times before the end of the current epoch: memory bandwidth as well as idle cycles are wasted on the two data transfers between the shadow page and NVM.

No existing scheme works well under all workloads and all write patterns; a scheme that is particularly good for some workloads may be poor on others. This paper makes two observations. First, if a location is written sparsely (i.e. memory writes have little spatial locality around that location), the modification should be written directly into the NVM. This avoids wasting memory bandwidth on unnecessary data transfers as well as wasting NVM storage on excessive log metadata. Since only a few writes are performed directly on NVM, the increase in write latency is expected to be minimal. Second, if a location is modified frequently and with good spatial locality, it should be buffered in DRAM at page granularity to prevent repeated writes to the NVM and to reduce the latency of writes that hit the buffered page.

ThyNVM adopts a multiversioning scheme that supports flexible granularity in order to leverage the above two observations. The system is assumed to operate directly on NVM as main memory. DRAM is also installed as a buffer for data in the active epoch, as we will describe below. System execution is divided into epochs, which are the basic units of checkpointing and recovery. At the end of each epoch, system execution is interrupted, and the system state is dumped and stored on the NVM. On failure, it is guaranteed that the system state can be restored to a previous epoch, after which normal execution resumes as if no failure had ever happened. The memory controller is aware of the existence of both NVM and DRAM, and can access each using the device ID and the physical address on the device. The memory controller provides the abstraction of a unified address space to upper levels in the memory hierarchy, so that the semantics of existing programs compiled for DRAM do not change during normal execution (except that some operations may take longer to complete if they operate on NVM). At system startup time, the operating system probes the mapping information of the physical address space and divides the address space into regions for different purposes, as we will describe later.
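
To make the address-space abstraction concrete, here is a minimal C sketch of how a memory controller might resolve a unified physical address to a (device, device-local address) pair. The region layout, sizes, and names below are my own illustrative assumptions, not the paper's actual parameters.

```c
#include <stdint.h>

/* Hypothetical split of the unified physical address space.
 * The memory controller hides the DRAM/NVM distinction from the CPU;
 * these boundaries are illustrative, not the paper's actual layout. */
#define DRAM_BUFFER_BASE  0x0ULL            /* DRAM buffer for active-epoch pages */
#define DRAM_BUFFER_SIZE  (4ULL << 30)      /* e.g., 4 GiB of DRAM */
#define NVM_BASE          (DRAM_BUFFER_BASE + DRAM_BUFFER_SIZE)

typedef enum { DEV_DRAM, DEV_NVM } device_t;

typedef struct {
    device_t dev;       /* which device backs this unified address */
    uint64_t dev_addr;  /* address local to that device */
} device_addr_t;

/* Resolve a unified physical address to (device, device-local address). */
static device_addr_t resolve(uint64_t unified_pa) {
    device_addr_t da;
    if (unified_pa < NVM_BASE) {
        da.dev = DEV_DRAM;
        da.dev_addr = unified_pa - DRAM_BUFFER_BASE;
    } else {
        da.dev = DEV_NVM;
        da.dev_addr = unified_pa - NVM_BASE;
    }
    return da;
}
```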

ThyNVM operates as follows. The address space is divided into two types: one for cache-line-sized multiversioning, and another for page-sized shadow paging. Processors do not generate log entries as they execute store instructions. Instead, new versions are created by remapping the physical address translated by the TLB to a new address. To achieve this, two mapping tables are maintained in the memory controller: the Block Translation Table (BTT) and the Page Translation Table (PTT). The translation tables are associative structures that can be queried using the physical address of a memory instruction. A query may return multiple entries, because multiple versions can exist if the same address is stored to in consecutive epochs. The mapped addresses and epochs of memory locations are stored in BTT and PTT entries (a given physical address can exist in at most one of the two tables for a given epoch). If no entry is found for the current epoch and the memory operation is a store, a new entry is created in the corresponding table based on the type of the address.
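
A rough sketch of what a BTT/PTT entry and its associative lookup might look like, assuming a simple linear scan over a fixed-size table; all field and function names are illustrative, not taken from the paper.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative translation-table entry; field names are assumptions.
 * One table (BTT) maps cache-line-sized blocks, the other (PTT) maps pages. */
typedef struct {
    uint64_t orig_addr;    /* physical ("home") address issued by the core */
    uint64_t mapped_addr;  /* remapped address holding this version */
    uint32_t epoch;        /* epoch in which this version was created */
    uint32_t store_count;  /* used later for granularity adaptation */
    bool     valid;
} xlat_entry_t;

typedef struct {
    xlat_entry_t *entries;
    size_t        num_entries;
} xlat_table_t;            /* stands in for both the BTT and the PTT */

/* Associative lookup: return the most recent version of `addr`
 * that is not newer than `epoch`, or NULL if none exists. */
static xlat_entry_t *xlat_lookup(xlat_table_t *t, uint64_t addr, uint32_t epoch) {
    xlat_entry_t *best = NULL;
    for (size_t i = 0; i < t->num_entries; i++) {
        xlat_entry_t *e = &t->entries[i];
        if (e->valid && e->orig_addr == addr && e->epoch <= epoch &&
            (best == NULL || e->epoch > best->epoch))
            best = e;
    }
    return best;
}
```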

If the physical address is of the cache-line-sized multiversioning type, the memory operation is performed directly on the remapped NVM address (such locations are always mapped to NVM). The processor bypasses the cache and stalls until the store is guaranteed to be persistent. This way, no dirty data needs to be flushed at the end of the epoch. The BTT, which stores the block mapping information, is updated accordingly. Note that within one epoch, modifications can either overwrite earlier data at the same remapped address, or be appended to the end of the current epoch's area in a log-structured manner. The latter may work better with the NVM's wear-leveling algorithm, and sequential writes are almost always faster than random overwrites.
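
Below is a small sketch of the log-structured variant of block updates described above, assuming 64-byte blocks and a simple append cursor; the structure and names are hypothetical.

```c
#include <stdint.h>

#define BLOCK_SIZE 64   /* cache-line-sized blocks; the exact size is an assumption */

/* Illustrative log-structured block area in NVM: new versions of blocks
 * are appended rather than overwritten in place, which turns random
 * overwrites into sequential writes and cooperates with wear leveling. */
typedef struct {
    uint8_t *base;        /* start of the block checkpoint area in NVM */
    uint64_t next_offset; /* append cursor */
    uint64_t capacity;
} block_log_t;

/* Append one block's new version; returns the remapped NVM offset that
 * would be recorded in the BTT entry for this (address, epoch) pair. */
static int64_t block_log_append(block_log_t *log, const uint8_t data[BLOCK_SIZE]) {
    if (log->next_offset + BLOCK_SIZE > log->capacity)
        return -1;  /* area full: a real design would reclaim space or stall */
    uint64_t off = log->next_offset;
    for (int i = 0; i < BLOCK_SIZE; i++)
        log->base[off + i] = data[i];   /* stand-in for a persistent NVM write */
    log->next_offset += BLOCK_SIZE;
    return (int64_t)off;
}
```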

If the physical address is of the shadow paging type, a store operation first triggers a page read from the NVM if the page has not yet been allocated in the DRAM buffer. A page frame is then allocated in the DRAM buffer and filled with the page read from the NVM. A new entry is also allocated in the PTT if one does not already exist. The store operation is performed on the DRAM buffer, and the page is not written back to the NVM until the epoch ends.
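
The shadow-paging store path might look roughly like the following sketch, assuming 4 KB pages and a caller-provided DRAM frame; names and signatures are illustrative only.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE 4096   /* page granularity; the exact size is an assumption */

/* Illustrative shadow-page state kept alongside a PTT entry. */
typedef struct {
    uint64_t page_addr;   /* home address of the page in NVM */
    uint8_t *dram_frame;  /* buffered copy in the DRAM buffer */
    bool     dirty;
    bool     valid;
} shadow_page_t;

/* First store to a page in the active epoch: pull the page into DRAM,
 * then apply the store to the buffered copy. Nothing reaches NVM until
 * the epoch ends and the dirty page is written back. */
static void shadow_store(shadow_page_t *sp, const uint8_t *nvm_page,
                         uint8_t *dram_frame, uint64_t page_addr,
                         uint64_t offset, uint8_t value) {
    if (!sp->valid) {
        memcpy(dram_frame, nvm_page, PAGE_SIZE);  /* NVM -> DRAM page copy */
        sp->page_addr  = page_addr;
        sp->dram_frame = dram_frame;
        sp->valid      = true;
        sp->dirty      = false;
    }
    sp->dram_frame[offset] = value;  /* the store hits the DRAM buffer */
    sp->dirty = true;
}
```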

At the end of an epoch, the memory controller issues notifications to all processors in the system to initiate state write-back. On receiving the notification, processors save their execution contexts into a special NVM area, after flushing the store queue to guarantee that all memory operations have committed. The pipeline needs to stall briefly, but since checkpoints are relatively infrequent compared with instruction execution, the performance impact is negligible. After this point, processors enter the next epoch and resume normal execution. In the meantime, the memory controller initiates write-backs of the dirty pages mapped by the PTT in the previous epoch. For each dirty page, the memory controller reserves a page-sized frame in the checkpoint area. The checkpointing process completes after all dirty pages in the PTT have been persisted, at which point the memory controller also writes the PTT and BTT at the end of the log.
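
The end-of-epoch sequence can be summarized with the following sketch; every function here is a stub standing in for a hardware action, and the names are placeholders rather than an interface from the paper.

```c
#include <stddef.h>
#include <stdio.h>

/* Stub hardware actions; in this sketch they only trace what would happen. */
static void save_cpu_contexts_to_nvm(void)       { puts("save CPU contexts + drain store queues"); }
static void resume_execution_in_next_epoch(void) { puts("resume execution in the next epoch"); }
static size_t num_dirty_pages(void)              { return 3; }  /* arbitrary example */
static void write_back_dirty_page(size_t i)      { printf("write back dirty page %zu\n", i); }
static void persist_btt_and_ptt(void)            { puts("persist BTT/PTT: checkpoint committed"); }

/* End-of-epoch sequence: capture CPU state first, then overlap dirty-page
 * write-back with the next epoch's execution, then commit the tables. */
static void end_of_epoch(void) {
    save_cpu_contexts_to_nvm();        /* brief pipeline stall */
    resume_execution_in_next_epoch();  /* the next epoch starts immediately */
    for (size_t i = 0; i < num_dirty_pages(); i++)
        write_back_dirty_page(i);      /* proceeds in the background */
    persist_btt_and_ptt();             /* tables written last, at the log tail */
}

int main(void) { end_of_epoch(); return 0; }
```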

ThyNVM overlaps the persistence of the previous checkpoint with the execution of the next epoch, thus reducing unnecessary stalls caused by the relatively costly NVM write-back operation. If the execution of the next epoch is longer than the persistence of the previous epoch, all write-back operations can be fully overlapped, and checkpoints appear to have almost zero overhead (otherwise the new epoch cannot start until the persistence of the previous epoch completes). ThyNVM maintains state for three epochs: the active epoch, which is the currently executing epoch, whose shadow pages are kept in the DRAM buffer; the last epoch, whose shadow pages and two tables are being written back to the NVM; and the penultimate epoch, whose state has been fully persisted. Entries in the BTT and PTT must also be in one of these three states, and are discarded once they are three epochs away from the active epoch. Note that the paper does not discuss how blocks and pages from older epochs are handled. Presumably, since metadata is not maintained for older items, for processors to locate them correctly they must be written back to their “home address”, i.e. the physical address directly mapped without any version translation.
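
A tiny sketch of the three-epoch bookkeeping, assuming monotonically increasing epoch IDs; the role names are descriptive rather than the paper's exact terminology.

```c
#include <stdint.h>

/* The three epoch roles tracked at any moment. */
typedef enum {
    EPOCH_ACTIVE,       /* currently executing; shadow pages live in DRAM */
    EPOCH_LAST,         /* being written back (checkpoint in flight)      */
    EPOCH_PENULTIMATE   /* fully persisted; the safe recovery point       */
} epoch_role_t;

/* Given the current active epoch ID, classify the epoch ID stored in a
 * BTT/PTT entry. Entries older than active-2 are no longer tracked and
 * their metadata can be discarded. */
static int classify_epoch(uint32_t entry_epoch, uint32_t active_epoch,
                          epoch_role_t *role_out) {
    if (entry_epoch == active_epoch)     { *role_out = EPOCH_ACTIVE;      return 0; }
    if (entry_epoch + 1 == active_epoch) { *role_out = EPOCH_LAST;        return 0; }
    if (entry_epoch + 2 == active_epoch) { *role_out = EPOCH_PENULTIMATE; return 0; }
    return -1;  /* stale: this version's metadata would be dropped */
}
```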

Although multiversioning allows multiple epochs to co-exist in the execution context, store operations from a newer epoch must not directly overwrite data from older epochs. In ThyNVM, two write ordering constraints must be observed. The first constraint is that a store operation from a newer epoch must not overwrite dirty data generated by an older epoch before that data has been fully written back to the NVM. This avoids updates from the active epoch polluting a previous checkpoint. If the memory controller detects that a store operation from the active epoch would overwrite an older page that has not yet been persisted, it makes a copy of the older page and adds an entry to the PTT, after which the store operation is allowed to proceed. On the other hand, if the store operation writes to a page that has already been persisted in the last epoch, no page duplication is needed, and the PTT entry is updated to reflect the fact that the page has been written by the current epoch. The second write ordering constraint is that if the active epoch is to write a block of the penultimate epoch using a block update, the last epoch must already have been fully persisted. Overwriting blocks in the penultimate epoch is part of the implicit garbage collection (GC) process, since ThyNVM only maintains metadata for the three most recent epochs. A block in the penultimate epoch can only be overwritten if that epoch is no longer needed for recovery. This condition holds only after the last epoch has been fully persisted, because otherwise the most recent consistent checkpoint is still the penultimate epoch, which must be protected from overwriting. If a store operation from the active epoch writes to a block in the penultimate epoch, and the last epoch has not finished writing back, the processor can either stall to wait for the previous epoch, or allocate a block-sized slot in the DRAM buffer to temporarily buffer the block data. This temporary slot is written back to the NVM before the current epoch commits, as is the case for shadow pages.
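
The two ordering rules can be condensed into a single check on each store, as in the sketch below; the per-version state and the returned actions are my own simplification of the mechanism described above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Minimal per-page / per-block version state needed to enforce the two
 * ordering rules; field names are illustrative. */
typedef struct {
    uint32_t epoch;       /* epoch that produced this version */
    bool     persisted;   /* has this version reached NVM?    */
} version_t;

typedef enum { STORE_PROCEED, STORE_COPY_FIRST, STORE_STALL_OR_BUFFER } store_action_t;

/* Rule 1: a store from the active epoch must not clobber a dirty page of
 * the last epoch before that page is persisted -> copy the page first.
 * Rule 2: a store that would overwrite a block of the penultimate epoch
 * must wait until the last epoch is fully persisted (or be buffered),
 * since until then the penultimate epoch is still the recovery point. */
static store_action_t check_store(const version_t *target, uint32_t active_epoch,
                                  bool last_epoch_fully_persisted, bool is_page) {
    if (is_page && target->epoch + 1 == active_epoch && !target->persisted)
        return STORE_COPY_FIRST;                       /* rule 1 */
    if (!is_page && target->epoch + 2 == active_epoch && !last_epoch_fully_persisted)
        return STORE_STALL_OR_BUFFER;                  /* rule 2 */
    return STORE_PROCEED;
}
```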

ThyNVM monitors the store frequency of pages and blocks at run time to determine how each physical address is handled. At startup time, all addresses default to one of the two schemes. At run time, the memory controller uses a field in the BTT and PTT to count the number of store operations on each block or page. If a cache-line-sized multiversioning block receives too many stores in one epoch, then in the next epoch the address transitions to shadow paging. To perform the transition, the memory controller allocates a page frame in the DRAM buffer, and copies all dirty data from the previous epoch (and potentially from older epochs or the NVM home address) into the DRAM buffer slot. A new entry is also allocated in the PTT. Similarly, if a shadow-page-type address only observes a few stores, the memory controller removes the PTT entry for that page and falls back to direct NVM writes for future stores.
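
A minimal sketch of the granularity adaptation logic, assuming per-address write counters and made-up thresholds; the numbers here are purely illustrative, and the paper tunes this decision differently.

```c
#include <stdint.h>

/* Thresholds invented for the example only. */
#define PROMOTE_THRESHOLD 8   /* many stores in an epoch -> switch to page mode  */
#define DEMOTE_THRESHOLD  2   /* few stores in an epoch  -> switch to block mode */

typedef enum { MODE_BLOCK_REMAP, MODE_SHADOW_PAGE } granularity_t;

typedef struct {
    granularity_t mode;
    uint32_t      stores_this_epoch;  /* counter kept in the BTT/PTT entry */
} addr_policy_t;

/* Called at an epoch boundary: decide the next epoch's granularity from
 * this epoch's write count, then reset the counter. */
static void adapt_granularity(addr_policy_t *p) {
    if (p->mode == MODE_BLOCK_REMAP && p->stores_this_epoch >= PROMOTE_THRESHOLD)
        p->mode = MODE_SHADOW_PAGE;   /* hot, dense writes: buffer the page in DRAM */
    else if (p->mode == MODE_SHADOW_PAGE && p->stores_this_epoch <= DEMOTE_THRESHOLD)
        p->mode = MODE_BLOCK_REMAP;   /* sparse writes: write blocks straight to NVM */
    p->stores_this_epoch = 0;
}
```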

ThyNVM divides the address space into three parts: one for the DRAM buffer, another dedicated to storing epochs, and the last for storing epochs as well as data that does not have a version in the two epoch areas. Since ThyNVM only maintains NVM images of the last epoch and the penultimate epoch, the two NVM epoch areas are used in an alternating manner: if one epoch area is used for the penultimate epoch, the memory controller allocates space for the last epoch in the other area. Area usage information is also stored on the NVM at a known location, and is read on recovery to locate the last consistent epoch to restore. As mentioned above, the paper does not explain how data migrates from the penultimate epoch to its home address if it has not been updated in the last and active epochs. If this is not done, such data would be silently discarded and impossible to retrieve after the active epoch overwrites the penultimate epoch. On a memory load, the physical address of the load is used to query both the BTT and the PTT. The most recent version is returned if the address exists in the tables (if the address is of shadow page type, multiple queries must be made to retrieve the locations of all cache lines in the page). Otherwise, data is loaded from the home address.
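
The load path described above might be sketched as follows, with stubbed-out table probes; the probe signatures are assumptions made for the example.

```c
#include <stdint.h>
#include <stdbool.h>

/* A probe returns whether a mapping exists for `addr` and, if so,
 * the remapped address and the epoch of that version. */
typedef struct { bool hit; uint64_t mapped_addr; uint32_t epoch; } probe_t;

/* Stub probes: always miss in this sketch; a real controller would do an
 * associative search over the BTT/PTT entries. */
static probe_t btt_probe(uint64_t addr) { (void)addr; return (probe_t){ false, 0, 0 }; }
static probe_t ptt_probe(uint64_t addr) { (void)addr; return (probe_t){ false, 0, 0 }; }

/* A load consults both tables; the newest mapping wins. If neither table
 * knows the address, the load goes to its home address in NVM. */
static uint64_t resolve_load(uint64_t addr) {
    probe_t b = btt_probe(addr);
    probe_t p = ptt_probe(addr);
    if (b.hit && (!p.hit || b.epoch >= p.epoch)) return b.mapped_addr;
    if (p.hit)                                   return p.mapped_addr;
    return addr;   /* unmapped: use the home address directly */
}
```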

On recovery, the memory controller reads the last persisted epoch from one of the two epoch areas. It then restores the BTT and PTT of that epoch, as well as the shadow pages. No extra undo or redo is required, because all necessary information is present in the BTT and PTT. Recovery is therefore very fast, which is crucial for achieving high availability.
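
A short sketch of how recovery might pick the checkpoint area to restore from, assuming each area records its epoch ID and a completion flag; this metadata layout is hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative on-NVM metadata describing one of the two alternating
 * checkpoint areas; field names are assumptions. */
typedef struct {
    uint32_t epoch_id;   /* epoch stored in this area */
    bool     complete;   /* set only once the BTT/PTT (written last) are persisted */
} checkpoint_area_t;

/* Recovery: pick the newest *complete* checkpoint of the two areas.
 * Restoring then amounts to reloading its BTT/PTT and shadow pages;
 * no log undo/redo pass is needed. */
static int pick_recovery_area(const checkpoint_area_t area[2]) {
    int best = -1;
    for (int i = 0; i < 2; i++)
        if (area[i].complete && (best < 0 || area[i].epoch_id > area[best].epoch_id))
            best = i;
    return best;   /* -1 only if no complete checkpoint exists */
}
```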