DudeTM: Building Durable Transactions with Decoupling for Persistent Memory
- Hide Paper Summary
Link: https://dl.acm.org/citation.cfm?id=3037714
Year: ASPLOS 2017
Keyword: NVM; HTM; Transaction
This paper proposes DudeTM, a persistent transactional memory design which guarantees not only atomicity, consistency, isolation, but also durability. DudeTM is built upon Non-Volatile Memory (NVM), where read latency is close or equal to that of DRAM, and write operations are orders of magnitudes slower than reads. Another prominent feature of NVM is that unlike hard disks where data can only be accessed by block I/O, NVM supports byte addressable, DRAM style accessing interface, making it an ideal replacement for DRAM modules when persistency is desired. Transactions on NVM are said to be persistent, if the atomicity property of transactions can still be guaranteed in case of failure: Either all operations or no operation in the transaction is executed. In practice, it is often interpreted as the ability for the transaction processing system to guarantee that transactions committed before the crash can be recovered, while uncommitted transactions must be rolled back.
Two classical ways of realizing persistent transactions are undo and redo logging. In undo logging, transactions update data items on the persistent storage in-place, after they save the original value of the item somewhere else on the persistent storage. This is to ensure that even if the system crashes halfway during the execution when only part of the transaction’s output is written, it is still possible for the recovery handler to find out the fact that the transaction has not completed before the crash, and hence restore the original values from the undo log. The cost of undo logging is that the processor must issue a memory fence between the store instruction that saves the old value and the store instruction that updates the item in-place. This is to ensure that the log will reach persistent storage first, and then the new value. Otherwise, the recovery handler may not be able to restore the old value if the system crashes before the log entry reaches persistent storage and after the new value has been written. With redo logging, transactions do not write dirty items in-place. Instead, dirty items are kept in a transactional-local buffer as a redo log. On every transactional read operation, the buffer must be checked to make sure that the value generated by the transaction can be seen by the transaction itself. This requires a mapping structure, and the structure will be queryed for every read operation within the transaction. On transaction commit, the redo log is flushed to the NVM first for persistency, and then data items are updated in-place such that following reads can access the updated value. On recovery, the handler reads the redo log from the persistent storage. If the redo log indicates that the transaction has committed, then values in the redo log will be replayed in the order consistent with the serial order for each transaction.
Neither undo logging nor redo logging is ideal for a persistent transaction processing system. The former penalizes write operations while the latter penalizes reads. In contrast, DudeTM achieves both cheap reads and writes using shadow pages. Every virtual page in DudeTM is backed by two physical frames: One from the persistent NVM as the main storage, another from the volatile DRAM as shadow memory. The overall architecture is quite similar to a disk-based virtual memory system, where disk pages are persistent NVM, and physical DRAM is the shadow page. For transactional writes, if the corresponding page has not been allocated a shadow page, then the shadow page will be allocated from DRAM, and the write operation is directly conducted on the shadow page. No store fence is required because write operations will not propagate to the NVM. For transactional reads, no extra redirection needs to be done. The MMU translates the read virtual address into the shadow page if it exists, and hence we do not pay the extra overhead of software redirection.
DudeTM decouples the TM core from the persistency engine. Virtually any existing TM design can be used as the transactional core of DudeTM with the slightest modification. The only constraint that DudeTM imposes on the TM implementation is that transactions must have a timestamp which is consistent with their logical ordering, which already exists in some designs, and can be added without much effort in others. The execution of DudeTM consists of three stages. In the first stage, transactions are executed on the shadow memory using the transactional core. Transactional reads and writes follow the rule we discussed in the previous paragraph. Note that logging transactional writes is not a requirement of the TM core; The TM core can use any concurrency control method and version management scheme as long as all modifications are on the shadow memory. The first stage ends when the TM core commits.
In the second stage, DudeTM collects the write set of the transaction as a redo log, and flushes them to a special logging area in NVM. The logging area is a chunk of memory allocated at startup time. Its size is fixed upon allocation. For each transaction, the log is also tagged with the serialization order of the transaction (in some designs this is the transaction commit timestamp). After the log flushing operation is acknowledged by the NVM, the transaction commits, and the application program can be notified of a success transaction. Note that the second stage can only be considered as completed after all transactions with a smaller ID have completed the second stage. This is to avoid committing a transaction before its logical predecessor is able to commit. Otherwise, if a crash happens, there is no way to recover the predecessor transaction, while the application treats the current transaction as committed. If the transaction decides to abort, either because of conflicts or it aborts proactively, all log entries as well as dirty data items are discarded.
In the third stage, a background thread (or an idle transaction thread) merges the log from the second stage to NVM. This stage can be asynchronous with the previous two stages (stage one and two must be synchronous). The background thread scans the logging area for a committed transactions with the minimum transaction ID. Once this is found, the thread then applies the updates recorded in the log to the NVM. The log can be discarded after this stage completes. This stage is not strictly required for completeness: transactions are already committed and considered as persistent after the second stage. By replaying the log on the NVM in the background, we allow more efficient access of data by later transactions.
On recovery, the handler restores the persistent NVM to a consistent state as follows. The recovery handler first scans the log area of each transaction on the NVM. Similar to the third stage above, it searches for the minimum log that has not been replayed. If the entry is found, it is re-applied to the NVM. Note that although this may apply a log that has actually already been applied before the crash, the correctness of the recovery process is not affected. If a log is incomplete, then that log as well as all logs for later transactions must be discarded, and they are considered as aborted. The application should be notified of the fact that some transactions did not complete.
The shadow memory is backed by DRAM whose size is typically smaller than the NVM. If a shadow page needs to be brought in, but there is no physical DRAM available, an existing shadow page must be evicted. Evicting a shadow page is different from evicting a page in virtual memory: Dirty pages do not need to be written back, since the synchronization between shadow memory and NVM is performed only using logs. When a shadow page is first brought into memory, we must ensure that all updates on the corresponding NVM page has been applied. To achieve this, we maintain a last modified field in the page table. The field is set to the maximum transaction ID that has updated the page. When the shadow page is to be loaded, the OS checks whether the last modified ID is less than the most recent committed ID (the latter is maintained somewhere in the NVM and updated when a transaction commits). If it is the case, then we are certain that all updates on the NVM page have been merged, and the shadow page is just a direct copy of the NVM page. Otherwise, the shadow page loader must wait until the most recent committed ID is not less than the last modified ID.