DHTM: Durable Hardware Transactional Memory



Paper Title: DHTM: Durable Hardware Transactional Memory
Link: https://ieeexplore.ieee.org/document/8416847
Year: ISCA 2018
Keyword: NVM; HTM; DHTM; Redo Logging




This paper proposes DHTM, a hardware transactional memory scheme that supports durable transactions. The paper begins by identifying efficient logging as the key challenge of adding durability support to HTM. Software logging does not work with HTM, since current commercial HTM implementations immediately abort when a cache line flush instruction evicts dirty data to the NVM. Naive hardware logging, such as LogTM, does not work well for NVM either, since its abort and commit latencies are much longer on NVM than in a DRAM-based implementation. For example, in undo-logging-based LogTM, transactions update data in place after persisting the undo log to a log buffer stored on NVM. The write ordering is enforced by always persisting the undo log entry immediately before data is updated in place in the cache. On transaction commit, the log can be safely discarded only after all dirty data written by the transaction has been persisted to the NVM. This process cannot be overlapped with normal execution, since the transaction is not logically committed until the dirty data is written back, resulting in longer latency on the critical path. Similarly, on transaction abort, the undo log is walked by the cache controller to restore the memory locations updated by the transaction to their previous state. The abort is not logically complete until all log data has been read and applied to the memory locations.
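
The undo-logging write ordering described above can be made concrete with a small sketch. This is a software illustration of the LogTM-style protocol as summarized here, not the actual hardware; pwb() and pfence() are hypothetical stand-ins for NVM write-back and persist-fence primitives.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE 64

/* Stand-ins for the hardware persist primitives: write a line back to NVM,
 * and order all preceding persists. */
static void pwb(const void *addr) { (void)addr; }
static void pfence(void)          { }

typedef struct {
    uintptr_t addr;             /* address of the logged cache line   */
    uint8_t   old_data[LINE];   /* pre-image needed to undo the write */
} undo_entry_t;

/* Before a line is updated in place, its old value must be durable in the
 * undo log; this is the write ordering enforced by LogTM. */
static void tx_store(undo_entry_t *log, size_t *log_len,
                     uint8_t *line, const uint8_t *new_data)
{
    undo_entry_t *e = &log[(*log_len)++];
    e->addr = (uintptr_t)line;
    memcpy(e->old_data, line, LINE);
    pwb(e);                       /* persist the undo entry first ...  */
    pfence();
    memcpy(line, new_data, LINE); /* ... then update the data in place */
}

/* Commit is only complete after every dirty line reaches NVM, which is why
 * the write back latency sits on the critical path. */
static void tx_commit(undo_entry_t *log, size_t log_len)
{
    for (size_t i = 0; i < log_len; i++)
        pwb((const void *)log[i].addr);
    pfence();
    /* only now can the undo log be discarded */
}
```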

DHTM is based on commercially available Intel Restricted Transactional Memory (RTM), which supports transactions of limited size in both space and time. RTM tracks the read and write set of a transaction using special bits in the L1 tags. To guarantee isolation, a speculatively read block must not be modified by another processor before the transaction commits, and a speculatively written block must not be read or written by another processor. RTM detects conflicts using the coherence protocol, which maps naturally to reader-writer locks in the above abstraction. RTM provides only very limited support when a speculatively accessed cache line is evicted from the L1 cache, since lower level caches are not equipped with the special bits for tracking speculative lines, and cache walks are more expensive to perform in lower level caches. When a speculatively written cache line overflows, the transaction is aborted immediately, since the lower level caches do not track the write set. When a speculatively read cache line overflows, however, the hardware may insert the block address into a per-L1 signature, which is tested against incoming coherence messages for conflict detection. A conflict is detected if either the conflicting address is speculatively accessed in the local L1 cache, or the signature test is positive.
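
As a rough illustration of the overflowed read-set signature, here is a minimal Bloom-filter-style sketch. The hash functions and sizes are invented for illustration; neither RTM nor DHTM specifies the signature organization at this level of detail.

```c
#include <stdbool.h>
#include <stdint.h>

#define SIG_BITS 1024

/* A per-L1 signature: a bit vector summarizing overflowed read-set addresses. */
typedef struct { uint64_t bits[SIG_BITS / 64]; } signature_t;

/* Two illustrative hash functions over the cache line address. */
static uint32_t h1(uintptr_t a) { return (uint32_t)((a >> 6) * 0x9E3779B1u) % SIG_BITS; }
static uint32_t h2(uintptr_t a) { return (uint32_t)((a >> 6) * 0x85EBCA77u) % SIG_BITS; }

static void sig_set(signature_t *s, uint32_t i)       { s->bits[i / 64] |= 1ull << (i % 64); }
static bool sig_test(const signature_t *s, uint32_t i) { return (s->bits[i / 64] >> (i % 64)) & 1; }

/* Called when a speculatively read line overflows from the L1. */
static void sig_insert(signature_t *s, uintptr_t line_addr)
{
    sig_set(s, h1(line_addr));
    sig_set(s, h2(line_addr));
}

/* Tested against the address of every incoming coherence request; a positive
 * answer (possibly a false positive) is treated as a conflict. */
static bool sig_may_contain(const signature_t *s, uintptr_t line_addr)
{
    return sig_test(s, h1(line_addr)) && sig_test(s, h2(line_addr));
}
```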

DHTM employs redo write-ahead logging to deliver durability guarantees. Each transaction is allocated a transaction-local logging area on the NVM before it starts. The OS can enumerate all logging areas after a crash for recovery. The write ordering of redo logging dictates that log records must be written back to the NVM before any dirty data item can be. DHTM therefore maintains the invariant that speculative data is never evicted from the LLC. If such an eviction is about to happen, the transaction is aborted and control falls back to the software handler, as in non-durable RTM transactions. To solve the read-redirect issue (if we only write the redo log, then in order to read its own data, a transaction must walk the log instead of reading the in-place data), DHTM requires that each write to a cache line be duplicated into two stores. The first store is to the cache line itself, while the second store is to the logging area. This way, a transaction can always read its own dirty data by simply performing a read in the cache. We next introduce the hardware extensions and the details of operations in DHTM.
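
The duplicated-store idea can be sketched as follows. This is an illustrative software model, assuming one redo-log entry per cache line; it is not the paper's hardware datapath.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE 64

/* A redo-log entry holds the new data for one cache line (illustrative). */
typedef struct {
    uintptr_t line_addr;
    uint8_t   data[LINE];
} redo_entry_t;

/* One transactional store becomes two stores: one to the cache line itself,
 * so the transaction can read its own writes directly from the cache, and
 * one to the redo-log entry, which is what eventually reaches the NVM log. */
static void dhtm_store(uint8_t *line, size_t offset,
                       const void *src, size_t len, redo_entry_t *entry)
{
    memcpy(line + offset, src, len);         /* store 1: in-place cache data */
    memcpy(entry->data + offset, src, len);  /* store 2: redo-log copy       */
    entry->line_addr = (uintptr_t)line;
}
```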

To reduce memory traffic from the processor to the NVM and to exploit the fact that a cache line is often written multiple times before the transaction commits, log entries are not immediately persisted to the NVM, but are buffered on-chip for a while to take advantage of write coalescing. It works as follows. The L1 cache controller is extended with a log buffer which stores log entries that have not yet been persisted. An entry in the log buffer consists of a cache block address and the dirty data that has been written at that address. When a store instruction updates data in the cache, the corresponding entry in the log buffer is updated in parallel with the cache access. If the entry does not exist, an existing entry is evicted from the buffer and written to the NVM (although not mentioned by the paper, we can use a bitmap to indicate which bytes are valid in an entry if the entry contains “holes”), and the freed slot is then allocated to the currently updated cache line. This way, we avoid the excessive metadata overhead of word-level logging. We also avoid large write amplification by performing delta-logging, i.e., not persisting the unmodified bytes of a cache line. Log entries are persisted lazily, so that an entry in the log buffer can absorb as many writes as possible before it is evicted, reducing unnecessary NVM writes.
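
A minimal sketch of such a log buffer is shown below, including the valid-byte bitmap suggested above (which the paper itself does not specify). persist_entry() is a hypothetical placeholder for writing an evicted entry to the per-transaction NVM log, and the victim choice is deliberately simplistic.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE      64
#define LOG_BUF_N 8

typedef struct {
    uintptr_t line_addr;
    uint8_t   data[LINE];
    uint64_t  valid;        /* bit i set => byte i of the line was written */
    int       in_use;
} log_buf_entry_t;

static log_buf_entry_t log_buf[LOG_BUF_N];

static void persist_entry(const log_buf_entry_t *e) { (void)e; /* NVM write */ }

/* Called in parallel with the L1 data write for every transactional store. */
static void log_buf_update(uintptr_t line_addr, size_t off,
                           const void *src, size_t len)
{
    log_buf_entry_t *e = NULL;
    log_buf_entry_t *victim = &log_buf[0];   /* simplistic eviction policy */

    for (int i = 0; i < LOG_BUF_N; i++) {
        if (log_buf[i].in_use && log_buf[i].line_addr == line_addr) {
            e = &log_buf[i];
            break;
        }
        if (!log_buf[i].in_use)
            victim = &log_buf[i];
    }
    if (e == NULL) {                         /* miss: evict and reuse a slot */
        if (victim->in_use)
            persist_entry(victim);           /* lazily persist coalesced data */
        memset(victim, 0, sizeof(*victim));
        victim->in_use = 1;
        victim->line_addr = line_addr;
        e = victim;
    }
    memcpy(e->data + off, src, len);         /* coalesce repeated writes     */
    for (size_t i = 0; i < len; i++)
        e->valid |= 1ull << (off + i);       /* only these bytes need logging */
}
```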

We first assume that no speculative dirty data overflows from the L1 cache. Transactional loads and stores proceed as in RTM transactions, except that store instructions also write into the log buffer. Conflict detection is also unchanged. When the transaction is to be committed, we first commit the transaction on-core, just as RTM does. Then we flush the log buffer to the NVM, followed by a commit record, the completion of which marks the logical point at which the transaction commits. The post-commit state can be recovered by replaying the log if the system crashes after this point. To ensure that data is also updated in place for later reads, we walk the L1 cache and write back the dirty lines written by the transaction. This process can be performed in the background, and the processor that just committed the transaction can resume execution of non-transactional code. No new transaction can be started on the processor, however, before the background write back completes, since we rely on the L1 cache tags to identify dirty data for the background write back. In the meantime, if a data conflict occurs, e.g., another core attempts to access a cache line written by the just-committed transaction, we first schedule a write back of the line and then yield ownership to the requesting processor. The requesting processor writes a data dependency record in its log header to indicate that the log of its own transaction can only be replayed after the just-committed transaction's log. This is necessary, since if the system crashes after the logical commit of the second transaction but before the completion of the write back, both transactions may have a committed log but an incomplete data write back. They also have a data conflict, since the second transaction may have overwritten the write set of the first transaction. In this case, the log of the first transaction must be replayed before that of the second, such that the final memory image contains the most up-to-date version of the cache line. After the write back completes, a log truncation record is written to indicate that the log has been logically truncated and that no log replay is necessary after a crash. No data dependency tracking is needed after the write back finishes, since we know the first transaction's log will not be replayed.
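
The commit path described above can be summarized as a short sequence of steps. The sketch below is a software-level outline with hypothetical function names (persist_commit_record() and friends); in DHTM the actual sequence is driven by the cache controller, not by software.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t txid_t;

/* Placeholder stubs for the hardware actions. */
static void rtm_in_cache_commit(void)             { /* clear speculative L1 bits */ }
static void flush_log_buffer_to_nvm(void)         { /* drain remaining entries   */ }
static void persist_commit_record(txid_t tx)      { (void)tx; }
static void start_background_writeback(txid_t tx) { (void)tx; }
static bool background_writeback_done(txid_t tx)  { (void)tx; return true; }
static void persist_truncation_record(txid_t tx)  { (void)tx; }

static void dhtm_commit(txid_t tx)
{
    rtm_in_cache_commit();           /* 1. commit on-core, as RTM does          */
    flush_log_buffer_to_nvm();       /* 2. make the redo log complete in NVM    */
    persist_commit_record(tx);       /* 3. logical (durable) commit point       */
    start_background_writeback(tx);  /* 4. walk L1 tags, write dirty lines back */
    /* The core may run non-transactional code here, but must not start a new
     * transaction until the write back finishes, because the L1 tags are still
     * needed to identify this transaction's dirty lines. */
    while (!background_writeback_done(tx))
        ;                            /* placeholder for the hardware condition  */
    persist_truncation_record(tx);   /* 5. log logically truncated, no replay   */
}
```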

The abort process is similar to the one in RTM, except that we also write an abort record to the end of the log to indicate that the log has been invalidated, and it should not be replayed on recovery. The log buffer is also cleared.
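
Putting the commit, abort, and truncation records together, a recovery routine could decide whether a per-transaction log needs replay roughly as sketched below. The record layout is invented for illustration; the paper only requires that the OS can enumerate the logging areas after a crash.

```c
#include <stdbool.h>
#include <stddef.h>

/* Record kinds that can appear in a per-transaction log, per the summary:
 * redo entries, a commit record, an abort record, a truncation record, and
 * data-dependency records. */
typedef enum { REC_REDO, REC_COMMIT, REC_ABORT, REC_TRUNCATE, REC_DEPEND } rec_kind_t;

typedef struct { rec_kind_t kind; /* plus address/data/dependency fields */ } log_record_t;

/* A log is replayed only if it contains a commit record and neither an abort
 * record nor a truncation record; dependency records (not handled here)
 * additionally order the replay of one committed log after another. */
static bool log_needs_replay(const log_record_t *log, size_t n)
{
    bool committed = false;
    for (size_t i = 0; i < n; i++) {
        if (log[i].kind == REC_ABORT || log[i].kind == REC_TRUNCATE)
            return false;
        if (log[i].kind == REC_COMMIT)
            committed = true;
    }
    return committed;
}
```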

If dirty cache lines are allowed to be evicted to the LLC (but not out of the LLC), extra tracking of the write set is needed to avoid walking the potentially large LLC on every transaction commit and abort. DHTM assumes a directory-based cache coherence protocol and employs the “sticky state” from LogTM to track dirty cache lines that have been written back to the LLC. The sticky state essentially virtualizes the L1 cache, making it appear larger than it actually is, by retaining ownership when a dirty cache line is evicted from the L1 cache to the LLC. We describe it as follows. Transactional loads and stores are not affected. When a dirty cache line is evicted from the L1 cache, instead of aborting the current transaction, we simply write back the dirty data to the LLC and preserve ownership of the line by leaving the directory bit and the state (which should be modified) unchanged. To avoid a costly LLC tag walk on commit and abort, the cache controller also writes the address of the evicted block to a special overflow area in the per-transaction log. Note that only the address is written, since the data is still stored by the LLC. Conflict detection works as usual, since the write set is tracked by directory states rather than per-cache-line status bits. On transaction commit, we perform the in-cache commit as described above, after which dirty blocks are written back. Dirty blocks in the L1 cache can be identified by a relatively cheap cache walk. For dirty blocks residing in the LLC, we walk the overflow area, and for each block address we probe the LLC and schedule a write back if the line is found (it may not be found if the line was later fetched back into the L1 and the cache is not inclusive). Aborts follow a similar two-phase process. In the first phase, the in-cache abort is performed by clearing all transaction-related status bits. In the second phase, we walk the overflow area and invalidate all dirty lines found in the LLC.
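
The two-phase commit and abort walks over the overflow area might look like the following sketch; llc_probe(), llc_writeback(), and llc_invalidate() are hypothetical hooks standing in for the underlying cache operations.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative hooks into the LLC; real hardware performs these directly. */
static bool llc_probe(uintptr_t addr)      { (void)addr; return true; }
static void llc_writeback(uintptr_t addr)  { (void)addr; }
static void llc_invalidate(uintptr_t addr) { (void)addr; }

/* Commit, phase 2: besides the cheap L1 tag walk, visit every address in the
 * overflow area and write the line back if the LLC still holds it (it may
 * have been refetched into the L1 if the hierarchy is not inclusive). */
static void commit_overflowed(const uintptr_t *overflow, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (llc_probe(overflow[i]))
            llc_writeback(overflow[i]);
}

/* Abort, phase 2: the same walk, but overflowed dirty lines are invalidated,
 * discarding the speculative data; the pre-transaction copy still lives in
 * memory, since redo logging never updates NVM in place before commit. */
static void abort_overflowed(const uintptr_t *overflow, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (llc_probe(overflow[i]))
            llc_invalidate(overflow[i]);
}
```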

One tricky part of this protocol is when a cache line is evicted from the L1 and then reloaded back into the L1. If care is not taken, this line may be loaded in a clean state, i.e., with the transactionally-written bit not set. Later on, when the transaction commits, the cache tag walker will skip the write back for this line, which causes a correctness problem, since the line's content is not crash-safe. To avoid reloading a dirty line in a non-transactional state, the cache controller must check whether the upper level cache requesting the line fetch is the current owner of the line. If so, the directory knows that the line was transactionally written and that ownership has been preserved with the sticky bit, and it indicates to the requestor that the line should be placed in modified state with the transactional bit set. A similar check is performed when a line is evicted from the LLC. The directory must check whether the line is owned by an upper level cache. If it is, the upper level cache is notified of the eviction. The transaction is aborted if the evicted line is part of an active transaction (the line may or may not still reside in the L1 in this case). The eviction is also cancelled in this case, since the line contains dirty data that belongs to an aborted transaction.
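
The directory-side checks described here can be summarized with a small sketch, assuming a directory entry with an owner field and a sticky bit; the structure and predicate names are illustrative only.

```c
#include <stdbool.h>

/* Illustrative directory entry: which core owns the line, and whether that
 * ownership is "sticky", i.e., retained after an L1 -> LLC eviction. */
typedef struct {
    int  owner;     /* owning core id, or -1 if unowned */
    bool sticky;
} dir_entry_t;

/* On an L1 fetch: if the requester is already the (sticky) owner, the line
 * must be refilled in modified state with its transactionally-written bit
 * set, so the commit-time tag walk will not skip it. */
static bool refill_as_transactionally_written(const dir_entry_t *d, int requester)
{
    return d->sticky && d->owner == requester;
}

/* On an LLC eviction: if some L1 still owns the line, that core is notified;
 * if the line belongs to an active transaction, the transaction is aborted
 * and the eviction of the now-stale dirty line is cancelled. */
static bool eviction_needs_owner_notification(const dir_entry_t *d)
{
    return d->owner >= 0;
}
```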