Efficient Hardware-assisted Logging with Asynchronous and Direct-Update for Persistent Memory



Paper Title: Efficient Hardware-assisted Logging with Asynchronous and Direct-Update for Persistent Memory
Link:
Year: HPCA 2018
Keyword: Redo Logging; Durability; NVM; Redu




This paper proposes Redu, a transaction system that supports atomic durability using redo logs. Prior designs of durable transactional systems must use one of three logging schemes: (1) undo logging, where the before-image of the data item being modified is recorded in the log; (2) redo logging, where the after-image of the data item being modified is recorded in the log. In this case, the actual data item on NVM need not be modified before commit, because the log alone is sufficient for reproducing the modifications; (3) undo + redo logging, where both the before- and after-image are saved in the log. This scheme allows more flexible handling of cache lines, because the recovery routine can use the log to handle both uncommitted and committed transactions.
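For concreteness, here is a minimal C sketch of what a per-store log record could look like under each scheme, assuming cache-line-granularity logging; the struct and field names are illustrative, not taken from the paper.

```c
#include <stdint.h>

#define LINE_SIZE 64   /* assume 64-byte cache lines */

/* Undo logging: only the before-image is recorded. */
typedef struct {
    uint64_t addr;               /* address of the modified data item */
    uint8_t  before[LINE_SIZE];  /* old contents, used to roll back   */
} undo_record_t;

/* Redo logging: only the after-image is recorded. */
typedef struct {
    uint64_t addr;
    uint8_t  after[LINE_SIZE];   /* new contents, replayed on recovery */
} redo_record_t;

/* Undo + redo logging: both images, doubling log bandwidth per store. */
typedef struct {
    uint64_t addr;
    uint8_t  before[LINE_SIZE];
    uint8_t  after[LINE_SIZE];
} undo_redo_record_t;
```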

This paper claims that none of the three schemes above is sufficient for building a system with both high throughput and low latency. For undo logging, since only the before-image is saved to the log, two types of write ordering must be observed. The first is that the log write containing the before-image must be persisted before the updated data item. Otherwise, if a crash happens right after the data item reaches the NVM (before the log entry has made it), there is no way to recover from the failure, since no undo image is present to roll back the change. The second is that, on commit, all dirty updates in the cache must be persisted to the NVM before the application can be notified of the commit. Otherwise, on a power loss, those committed dirty lines would be lost. Enforcing these two write orderings hurts the latency of both write and commit operations. For a write operation, the log entry must circumvent the cache and be flushed to the NVM before the store can proceed, which is usually done with a cache line write back instruction followed by a store fence. Similarly, on commit, all dirty cache lines are written back to the NVM using write back instructions and fences, which is even worse, because this work sits on the critical path of the commit sequence. For redo logging, without substantially changing the memory architecture, it must be enforced that dirty cache lines remain in the cache and are not written back to the NVM before commit. If this does not hold, a power failure at any point before the transaction commits will corrupt data, because only the after-image is saved in the log (and if the log has not reached the NVM at that point, the corruption cannot even be detected). Log writing is usually not considered a bottleneck for redo logging, because persistence of log entries is only required at commit. For the undo + redo scheme, neither the write ordering nor the “no-steal” property is problematic, but the bandwidth requirement of writing log entries to the NVM doubles, which becomes a new bottleneck, especially on write-intensive workloads.
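To make the two orderings concrete, the following C sketch shows how software-only undo logging typically enforces them on x86 with clwb and sfence. write_undo_entry, mark_committed, dirty_lines, and n_dirty are hypothetical helpers standing in for the transaction runtime; a real implementation would also persist a commit record and manage the log area.

```c
#include <immintrin.h>   /* _mm_clwb, _mm_sfence (compile with -mclwb) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers standing in for the transaction runtime. */
extern void *write_undo_entry(uint64_t *addr, uint64_t before);
extern void  mark_committed(void);
extern void **dirty_lines;
extern size_t n_dirty;

/* Ordering #1: the before-image must be durable before the in-place store. */
static void tx_store(uint64_t *addr, uint64_t new_val) {
    void *entry = write_undo_entry(addr, *addr);  /* log the before-image   */
    _mm_clwb(entry);                              /* flush the log entry    */
    _mm_sfence();                                 /* order it before ...    */
    *addr = new_val;                              /* ... the in-place update */
}

/* Ordering #2: all dirty data must be durable before the commit is acked. */
static void tx_commit(void) {
    for (size_t i = 0; i < n_dirty; i++)
        _mm_clwb(dirty_lines[i]);
    _mm_sfence();
    mark_committed();   /* only now may the application be notified */
}
```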

Prior works try to address the aforementioned problems of each scheme by adding extra components and ordering guarantees to the hardware. For undo logging, instead of using cache line write back instructions plus memory fences to enforce the ordering, previous work proposes adding a dedicated log entry queue. The log entry queue is filled with log entries that bypass the cache hierarchy. The observation is that the combination of write backs and fences is too restrictive: correctness only requires that the log write reach the NVM before the corresponding in-place update does, whereas write backs and fences completely serialize the log write and the data item update on the processor side. Since cacheable dirty lines must traverse the entire cache hierarchy before reaching the NVM, while log entries are sent to the NVM directly, the fence and cache flush become unnecessary as long as the controller takes advantage of this timing difference. For redo logging, a common proposal is to add a DRAM cache that acts as a victim buffer. Evicted cache lines are stored temporarily in the DRAM buffer, and on a system failure the volatile DRAM loses its contents, which effectively rolls back all uncommitted changes. Another potential problem for redo logging is that, if the system designer decides to use the log to update the NVM, the extra log reading traffic may saturate the bandwidth and hurt the performance of normal operations. The solution is to use the log only for recovery; during normal operation, it is the responsibility of the DRAM buffer controller to update data on the NVM.
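As a rough illustration of the relaxed ordering, the sketch below shows the kind of check a log-queue-based design could perform at the memory controller: before a dirty data line is written to the NVM, any matching log entry still sitting in the queue is drained first. All names here are illustrative; the actual hardware proposals differ in the details.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t line_addr;   /* data line this undo entry protects */
    bool     persisted;   /* has the entry reached the NVM yet? */
} log_q_entry_t;

/* Hypothetical hooks into the log queue and the NVM write path. */
extern void drain_log_entry(log_q_entry_t *e);
extern void nvm_write_line(uint64_t line_addr, const void *data);

/* Called when a dirty data line is about to be written to the NVM. */
void before_data_writeback(uint64_t line_addr, const void *data,
                           log_q_entry_t *queue, size_t qlen) {
    for (size_t i = 0; i < qlen; i++)
        if (queue[i].line_addr == line_addr && !queue[i].persisted)
            drain_log_entry(&queue[i]);   /* the undo image goes first */
    nvm_write_line(line_addr, data);      /* now the data may follow   */
}
```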

Redu follows the approach of adding a DRAM buffer as a victim cache, which allows dirty cache lines to be evicted from the cache before a transaction commits. Two tables are maintained in the DRAM cache itself to track block status and transaction status, respectively. The first is the offset table, which maps dirty cache lines evicted from the cache hierarchy to an offset in the DRAM cache. When a cache line is evicted to the DRAM cache, its physical address is used to map it to an entry of the offset table. Each entry of the offset table is 64 bytes in size (i.e., it can be fetched by the processor in one cache line fill) and has 8 fields of 8 bytes each. Each field consists of 48 bits for the physical address of the line, 15 bits for the transaction ID (which virtualizes transactions across processes), and one valid bit. After the line is mapped to an entry, the DRAM cache controller checks whether any of the physical address fields match the address of the evicted line. If a match is found, the corresponding data is updated. Otherwise, a new field is allocated, evicting an existing one if necessary, and the data is written. Note that the offset is not stored explicitly; it is implied by the entry index and the field index. Lines evicted from the DRAM buffer are written back to the NVM if they belong to committed transactions. If a line belongs to an uncommitted transaction, that transaction must be aborted, because Redu does not perform undo logging. To avoid frequent transaction aborts, the cache controller prioritizes evicting committed lines. The second table is the transaction table, which maintains the status of running and committed transactions. For running transactions, the log pointers (the range of log entries) are updated whenever a log entry is written. For committed transactions, the status no longer changes, but the entry must remain until all of the transaction's dirty lines in the DRAM cache have been flushed back to the NVM. Each entry in the transaction table also tracks the number of that transaction's blocks in the DRAM cache, updated as cache lines are evicted. An entry can only be deleted after this count drops to zero and the transaction has committed. We say that a transaction “retires” when it has committed and all of its dirty lines have been flushed back to the NVM.
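A C sketch of the two table layouts as described above: the bit widths follow the text (48-bit physical address, 15-bit transaction ID, and a valid bit per 8-byte field, 8 fields per 64-byte entry), while the field names and the exact transaction-table layout are my own reconstruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One 8-byte field of an offset-table entry. */
typedef struct {
    uint64_t phys_addr : 48;   /* physical address of the dirty line        */
    uint64_t txn_id    : 15;   /* virtualizes transactions across processes */
    uint64_t valid     : 1;
} offset_field_t;

/* One 64-byte offset-table entry, fetched in a single cache line fill.
 * The DRAM-cache offset of the data is implied by (entry index, field index). */
typedef struct {
    offset_field_t field[8];
} offset_entry_t;

static_assert(sizeof(offset_entry_t) == 64, "entry must be one cache line");

/* Transaction-table entry (reconstructed layout). */
typedef struct {
    uint64_t log_begin;        /* first log entry written by the transaction  */
    uint64_t log_end;          /* last log entry written so far                */
    uint32_t blocks_in_dram;   /* dirty lines still held in the DRAM cache     */
    bool     committed;        /* retires once committed and blocks drop to 0  */
} txn_entry_t;
```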

Two policies can be used to manage dirty lines in the DRAM cache. The first policy defers writing back cache lines until it becomes necessary, i.e., until no free entry can be found when a new cache line is to be inserted. This policy delays the write back of cache lines, and can therefore exploit the case where a cache line is updated multiple times, merging the updates in the DRAM cache. The second policy writes back a dirty line as soon as it arrives. This policy eagerly flushes lines back to the NVM, and only buffers dirty lines for a short period of time. The advantage of the eager approach is that a smaller buffer can be used. Also, since only a small number of lines are in the cache at any given moment, the chance that a cache miss must be served by the DRAM cache is low, which lets the processor use a predictor to determine, with high accuracy, whether the DRAM cache should be checked.
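The two policies can be contrasted with a short sketch of the DRAM-cache insertion path; dram_insert, nvm_writeback, find_committed_victim, and dram_cache_full are hypothetical primitives, not part of the paper's design.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t addr; uint8_t data[64]; } line_t;

/* Hypothetical DRAM-cache and NVM primitives. */
extern bool    dram_cache_full(void);
extern line_t *find_committed_victim(void);   /* prefer committed lines */
extern void    dram_insert(const line_t *l);
extern void    nvm_writeback(const line_t *l);

/* Lazy policy: buffer lines so repeated updates coalesce; write back
 * only when space must be reclaimed. */
void lazy_insert(const line_t *incoming) {
    if (dram_cache_full())
        nvm_writeback(find_committed_victim());
    dram_insert(incoming);
}

/* Eager policy: push the line to the NVM right away, so the DRAM cache
 * stays small and misses rarely need to consult it. */
void eager_insert(const line_t *incoming) {
    dram_insert(incoming);
    nvm_writeback(incoming);
}
```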

When a cache miss occurs, the cache controller should check the DRAM cache first in order to load the most up-to-date data, which increases the latency of cache misses. To avoid the performance impact of checking the DRAM cache, the paper proposes using bloom filters for approximate membership checks. For the non-eager policy, an ordinary bloom filter is used to test membership, with both false positives and false negatives possible. If the bloom filter returns positive, the DRAM cache is accessed first; otherwise, the DRAM cache and the NVM are accessed in parallel. For the eager policy, since only a small number of entries reside in the DRAM cache, a counting bloom filter can be used to improve prediction accuracy. Again, since both false positives and false negatives are possible, the cache controller must always check the DRAM cache to avoid missing up-to-date data.
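A minimal sketch of how such a predictor could sit on the miss path, assuming the filter only chooses between probing the DRAM cache before, or in parallel with, the NVM (so a misprediction costs latency, never correctness). The hash functions, filter size, and handler names are arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

#define FILTER_BITS 4096

typedef struct { uint8_t bits[FILTER_BITS / 8]; } bloom_t;

/* Two simple multiplicative hashes mapping an address to a bit index. */
static inline uint32_t h1(uint64_t a) { return (uint32_t)((a * 0x9E3779B97F4A7C15ull) >> 52); }
static inline uint32_t h2(uint64_t a) { return (uint32_t)((a * 0xC2B2AE3D27D4EB4Full) >> 52); }

static void bloom_insert(bloom_t *f, uint64_t line_addr) {
    f->bits[h1(line_addr) / 8] |= (uint8_t)(1u << (h1(line_addr) % 8));
    f->bits[h2(line_addr) / 8] |= (uint8_t)(1u << (h2(line_addr) % 8));
}

static bool bloom_maybe_present(const bloom_t *f, uint64_t line_addr) {
    return ((f->bits[h1(line_addr) / 8] >> (h1(line_addr) % 8)) & 1) &&
           ((f->bits[h2(line_addr) / 8] >> (h2(line_addr) % 8)) & 1);
}

/* Hypothetical miss-path handlers. */
extern void probe_dram_then_nvm(uint64_t line_addr);
extern void probe_dram_and_nvm_in_parallel(uint64_t line_addr);

void on_llc_miss(const bloom_t *f, uint64_t line_addr) {
    if (bloom_maybe_present(f, line_addr))
        probe_dram_then_nvm(line_addr);            /* likely a DRAM-cache hit */
    else
        probe_dram_and_nvm_in_parallel(line_addr); /* overlap the two probes  */
}
```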

The log buffer is allocated in the NVM and has limited capacity. To free stale log entries in a timely manner and avoid log buffer overflow, transactions must be retired continually to make space for later transactions. To ensure transaction retirement, the cache controller flushes all dirty transactional blocks back to the DRAM buffer on transaction commit. Since the DRAM cache has faster write access than the NVM, this does not constitute a performance bottleneck. The cache controller then monitors the number of outstanding lines using the transaction table; after this number drops to zero, the transaction can be retired. A retired transaction is deleted from the transaction table, and its log entries are freed. To simplify log management, a transaction’s log entries are removed only when they are at the head of the log and the transaction has retired. This avoids creating holes in the middle of the log, which would complicate the recovery process.
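The retirement rule can be summarized by a small reclamation loop, sketched below under the assumption of a circular log with a head pointer; owner_of and txn_table_remove are hypothetical lookups into the transaction table.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t log_begin, log_end;   /* range of log entries owned       */
    uint32_t blocks_in_dram;       /* dirty lines left in DRAM cache   */
    bool     committed;
} txn_entry_t;

typedef struct { uint64_t head, tail; } pm_log_t;

/* Hypothetical transaction-table lookups. */
extern txn_entry_t *owner_of(const pm_log_t *log, uint64_t pos);
extern void txn_table_remove(txn_entry_t *t);

/* Free log entries only from the head, and only for retired transactions,
 * so no holes are ever created in the middle of the log. */
void try_reclaim_log(pm_log_t *log) {
    while (log->head != log->tail) {
        txn_entry_t *t = owner_of(log, log->head);
        if (t == NULL || !t->committed || t->blocks_in_dram != 0)
            return;                    /* head transaction not retired yet */
        log->head = t->log_end + 1;    /* reclaim its whole entry range    */
        txn_table_remove(t);
    }
}
```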