MorLog: Morphable Hardware Logging for Atomic Persistence in Non-Volatile Main Memory



- Hide Paper Summary
Paper Title: MorLog: Morphable Hardware Logging for Atomic Persistence in Non-Volatile Main Memory
Link: https://www.iscaconf.org/isca2020/papers/466100a610.pdf
Year: ISCA 2020
Keyword: Logging; Undo Log; Redo Log; MorLog



Back

This paper proposes MorLog, a hardware logging scheme for providing atomic persistence to transactional NVM systems. Conventional undo and redo logging suffers from performance overhead from several aspects. First, undo logging requires enforing the write ordering between log entries and data on the same address, which would require eagerly flushing undo log entries before data is written in the cache, given that the cache controller may write back the cache line any monent after it becomes dirty. In addition, undo logging requires that dirty data be persisted on the NVM before a transaction commits, which can cause long commit latency. On the other hand, redo logging, though not need requiring flushing log entries before the commit point, should prevent dirty blocks from being written back before the transaction commits. This is to avoid write backs of uncommitted data corrupting the consistent before-transaction image on the NVM. Both schemes, if used naively in their canonical forms, require non-negligible changes to the cache controller and eviction logic. Second, if not treated with care, it is easy to generate redundant log entries that are not useful for recovery, but will take extra space on the NVM. For example, in undo logging, the actual useful log entry is the one taken on the first write since transaction begin. Later writes do not have to be logged, since the pre-image they record is merely dirty data generated within the same transaction. Similarly, in redo logging, only the last write to an address should be logged, if the address is updated multiple times. Previous redo log entries will be overwritten during recovery by the last entry, rendering them totally useless and a waste of space. The last issue is log storage overhead and logging granularity. For hardware logging, the most natural granularity of log genetation is a cache line, which will be included as a fixed sized pre- or post-image field in the log entry. Cache lines, however, may only contain a few bytes of dirty data, making the writing of the clean bytes unnecessary. In addition, cache lines often contain replicative or redundant data, making compression an attractive measure for reducing log storage, the opportunity of which was never explored before this paper.

Based on the above observation, this paper makes the following contributions. First, it implements mixed redo-undo logging by simultaneously generating both undo and redo log entries on an L1 write, which get rid of two write orderings. First, the cache no longer needs to be flushed on transaction commit, since dirty data has been persisted in the form of redo log entries. Second, the cache no longer needs to hold dirty blocks without evicting them before transaction commits, since undo log entries help restoring the pre-image of data even if the NVM image has been pollutedd by a premature write back from the cache hierarchy. Two write ordering restrictions remain, however, to maintain the correctness of the algorithm. First, redo log entries still need to be written back on commit point to persist the transaction’s working set. Second, undo log entries still have to be written before dirty block reaches the NVM. The paper loosens the second ordering restriction a little by taking advantage of the minimum number of cycles it will take a block to traverse through the cache hierarchy to reach NVM.

The second contribution of the paper is the eager-undo and lazy-redo log flushing scheme. As discussed in previous sections, only the first undo and the last redo log entries are actually useful for crash recovery. To leverage such differences in the characteristics of logging, the paper adds two log buffers on-chip, one for mixed logging, and the other for redo only. Both undo and redo entries are generated into the mixed logging buffer on the first write to a clean L1 cache line. These entries are kept within the buffer for an extra N cycles to exploit possible chances of log coalescing, in which subsequent writes may modify the same cache line, and these writes could just be incorporated into an existing entry for the same line. The undo log entry is flushed back to the NVM after N cycles, where N is the minimum number of cycles it takes for a cache block from L1 to traverse thehierarchy and finally reach the NVM, ensuring that the undo entry can always be written before dirty data. On transaction commit, both buffers are flushed, if they are not already empty (the mixed buffer is likely to be empty or only contains a small number of entries, since it eagerly writes back entries once generated).

When a mixed buffer entry is written back, only the undo entry is sent to the memory controller, bypassing the cache hierarchy to avoid complicated timing issues. The redo log entry, however, are transferred to the redo buffer for delay write backs. This design decision is made based on the observation that redo log enties do not form write ordering with dirty data, and that redo entries are better postponed to the end of the transaction for maximum chances of log coalescing. The redo buffer is flushed when transaction commits as in normal redo-based systems. When a dirty block is written back from the LLC, the LLC controller will also notify the on-chip redo buffer to drop the corresponding redo entry, since data in the redo entry has already been persisted by the write back.

To help tracking the status of cache lines, the paper proposes adding three extra fields to L1 cache tags: TID, TxID and state. TID and TxID uniquely identify a transaction instance in the system. The state field consists of two bits, which defines four possible states: Clean, dirty, URLog and ULog. These four states are orthogonal to the coherence states used for cache coherence, and should be maintained separately. Clean state means that the block has not been written yet since the last write back/eviction. Dirty state means that the block has been written, and none of the undo and redo entries are persisted. A dirty state block transits to the ULog state when only the undo log entry is persisted. This happens after N cycles of the first update to the block, at which time the mixed buffer just writes back the undo entry. Writing a ULog state block should never generate an undo entry again, since undo entry is only useful when it is generated by the first write during the transaction. Redo entries are incorporated with an existing entry in the redo buffer, since the state indicates that a redo entry already exists. An ULog line transits to URLog when the redo entry is also persisted, perhaps because of resource hazard in the redo log buffer. Similar to ULong lines, URlog lines should also never generate undo log entries when written, but redo entries are inserted into the redo buffer, because the entry has already been written. The state should also transit back to ULog to reflect the fact that the redo entry exists in the buffer. The difference between ULog and URLog state is that when the transaction commits, cache blocks whose TxID and TID match the transaction should have their corresponding redo entry writing back to the NVM, if the block is in ULog state, while URLog lines do not require any extra write backs. Both ULog and URLog state lines transit lazily to Dirty state when they are written by a transaction with different IDs then the one that creates them. Note that the paper gives a rather obsecure description of these states, especially the difference between ULog state and URLog state lines, and whether these lines are scanned by tag walk on commit. I doubt whether it is necessary to have split URLog and ULog line, since they make no difference even if combined. Anyway we just need a notion of “undo entries already persisted”. Whether redo entries are persisted is of no importance, since: (1) we could always perform an associative search when writing an URLog line to determine if it is URLog state or ULog state; (2) We could always walk the redo buffer on commit, which is even easier than walking the cache tag to find ULog state lines in order to flish the redo entries.

The paper further proposes a delay flushing mechanism for avoiding flushing all redo log entries on the commit point. Although this mechanism may roll back committed transactions on a restoration, it increases transaction throughput by moving log flushing to the backend, while still guaranteeing correctness. The mechanism works as follows. First, on transaction commit, the commit record is written without waiting for redo log entry flush. Then the processor can start the next transaction. In the meantime, a background cache tag walker searches the cache and counts the number of blocks in ULog state with the TxID and TID matching the transaction that just committed. The count is then persisted to the log buffer as a special entry. Redo entries are allowed to be written back via normal redo buffer eviction. On recovery, the recovery handler scans the log after it has seen a commit record for possible redo entries and the special counter entry. If the counter entry is found, and the number of redo entries that appear after the commit record match the value in the counter entry, the transaction is still considered as committed, since all its dirty data can be restored from the log. Otherwise, the transaction fails, and should be rolled back using undo logs. Note that in this case, the eviction of a dirty block should not directly cause the redo entry to be dropped. Either the redo entry is written back as usual, or a dummy entry is written to notify the recovery handler that dirty data has been persisted.

The paper then proposes a compression scheme for reducing the size of the log, called Differencial Log Data Compression (DLDC) and Selective Log Data Encoding (SLDE). DLDC simply takes advantage of the fact that sparsely written lines need not be logged at cache line granularity. In fact, only logging the affected bytes can guarantee correctness, with less log storage to be consumed. The paper proposes adding a bit vector for each L1 entry, the bit of which will be set when a write updates its value. The bit vector can be per-byte or per-word, which determines the logging granularity. Both log buffer entries are also extended with bit vectors, which will be set when log entries are generated or merged. Logging area only keeps delta data without copying the entire cache line. Note that the paper suggests using value-based delta detection instead of using write ranges. Value-based detection exposes more cases than simplying using write range, at the cost of requring data comparison before and after the write. Fortunately, within the current cache organization, the before-image of a cache line before it is updated is already read out from the array, which provides a great chance to perform such comparison with incoming data (which is often a 64 bit word). When flushing log buffers, the L1 controller checks whether all bits are clear. If true, the entry is sliently discarded, since it contains no valid data to log.

Delta-encoded lines are compressed with a pattern-matching algorithm to further reduce its size, leveraging special constructs such as leading or trailin zeros/ones. The paper notes, however, that if the majority of bytes are dirty, compression may unfortunately result in more bytes being used than uncompressed log entries. In this case, the paper suggests that SLDE be used, in which another FPC encoding engine compresses the log entry in parallel with DLDC compression. The final form of the compressed log entry is determined by the compression ratio. An extra field is also attached to the log entry to indicate the compression algorithm.