SPHT: Scalable Persistent Hardware Transactions



Paper Title: SPHT: Scalable Persistent Hardware Transactions
Link: https://www.usenix.org/system/files/fast21-castro.pdf
Year: FAST 2021
Keyword: NVM; SPHT; HTM




Highlight:

  1. Transactions operate only on a shadowed working set in volatile memory and write redo logs; the NVM image is updated exclusively by log replay threads. This is a form of “shadowing + redo” in which no write ordering is required, since data evicted from the cache cannot pollute the NVM image anyway. It is particularly well suited to HTM execution, because no cache line flush is needed inside the hardware transaction (flushes would abort it). There are many variants of this idea. One is object-level shadowing, where an object is copied to volatile DRAM the first time it is opened and stays in DRAM even after the transaction commits. Another is a hardware extension in which a DRAM buffer is placed between the LLC and the NVM: memory changes are absorbed by the DRAM buffer while logs are persisted directly to the NVM. This paper implements shadowing + redo using mmap’s copy-on-write option, such that when an NVM page is first written by a transaction, it is CoW’ed to volatile memory.

  2. Assign commit timestamps to transactions, establishing a total order among them, and enforce this order during commit by making each transaction wait for smaller-numbered transactions to finish writing their logs before it returns. This ensures that if the current transaction is restored, then all smaller-numbered transactions are also restored, preserving dependencies (although conservatively).

  3. Use log chaining to avoid merge-sorting the logs from several log buffers. Although the per-entry selection cost is constant in the size of the log, that constant is proportional to the number of threads (i.e., the number of log buffers) and becomes non-negligible when that number is large.

Comments:

  1. I don’t get why you need a global commit mark (pmarker in the paper). I understand that it represents the current global progress, i.e., the point up to which recovery is guaranteed (everything before this point is recoverable, and anything beyond is discarded). But it seems unnecessary. Just let each transaction write a commit mark at the end of its redo log after the loop-wait completes (i.e., after smaller-numbered transactions have written their commit marks), and only advertise itself as fully committed after writing this mark. This way we do not need a global pmarker at all; instead, each transaction commits in a distributed manner. The paper also does not mention how pmarker is used during recovery.

  2. At first sight, the “skip-CAS” path seems to have two problems: (1) starvation may occur, i.e., every thread sees a larger-timestamped thread and skips the CAS, so no one actually updates the mark; (2) a crash might occur in the gap between a transaction skipping the CAS and the mark being updated by a larger transaction, leaving the transaction not immediately durable. On closer inspection, neither happens. For (1), starvation is only a concern with an unbounded number of threads; in practice the number of threads is finite, so the wait chain cannot grow indefinitely. For (2), the pseudocode shows that a transaction that skips the CAS must still wait for the marker to change before it can return to the caller, which guarantees immediate durability.

  3. I could not quite get why log sorting is an issue. Shouldn’t it just be a merge that always picks the smallest log object among the K log buffers (where K is the number of worker threads)? The paper may simply be referring to the overhead of selecting the minimum among K candidates every time a log object is replayed.

  4. The paper mentions that the WBINVD instruction is used to drain the cache after log replay. This instruction is privileged and requires a mode switch.

This paper presents SPHT, a durable hardware transactional memory design that achieves high scalability. The paper recognizes three major difficulties in designing a durable TM. The first difficulty is atomicity of transactions in volatile memory, which must be enforced by some volatile TM subsystem. Two options are available: STM and HTM. The issue, however, is that STM incurs non-negligible instrumentation and metadata overhead, which harms overall performance, especially when transactions are short. HTM, on the other hand, is largely free of runtime overheads for ensuring atomicity. Its flexibility, however, is inferior to STM’s, as most commercial HTM implementations today do not allow a cache line to be flushed out of the cache during a transaction; doing so results in an immediate abort. This restriction, unfortunately, makes it challenging to implement write-ahead logging (WAL) for durability at the same time, since WAL enforces write orderings between log entries and data blocks using cache line flush instructions. The paper mentions that prior designs either adopt shadow paging to prevent dirty transaction data from being written back to the NVM image, or perform complicated non-destructive logging to avoid enforcing any write ordering.

The second difficulty is scalability of commit. The paper notices that durable transactions must guarantee that the durability order (i.e., the order in which transactions logically commit on the NVM) is consistent with the memory consistency order (i.e., the order in which transactions logically commit on volatile memory via load-store ordering). Otherwise, during recovery, the effect of a transaction that had been acknowledged by the TM may not be recovered, causing the lost update anomaly, since the transaction whose updates are lost may have already incurred some external effect (e.g., a message sent to the user) that will not be undone by the crash.

The last difficulty is the scalability of log replay, which occurs during recovery and log truncation. The latter is far more common during normal operation and hence more performance-critical. The paper points out that prior proposals all use a single background thread for log replay, which does not scale with the number of worker threads, as the number of log entries to replay grows proportionally with the number of worker threads.

SPHT addresses the above problems using a combination of techniques. First, SPHT relies on commercial HTM (Intel TSX/RTM) for atomic transactions in volatile memory. Instead of writing directly to the NVM image, SPHT transactions perform writes only on a shadowed working set allocated in volatile memory, which is shared by all transactions and treated as the runtime working set. The NVM image, on the other hand, is not directly accessed by any of the transactions; rather, transactions generate redo log entries for each write and persist these entries to the NVM only after the HTM transaction commits. The SPHT log replay threads scan these redo log entries in the background and apply the logged writes to the NVM image instead.
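To make the flow concrete, below is a minimal sketch (in C, using Intel RTM intrinsics) of a single-write transaction under this scheme. The helpers `redo_log_add` and `redo_log_persist` are hypothetical names I introduce for the per-thread DRAM redo buffer and its post-commit flush to the NVM log; the paper's actual interfaces differ.

```c
#include <immintrin.h>   /* Intel RTM intrinsics: _xbegin, _xend (compile with -mrtm) */
#include <stdint.h>

/* Hypothetical helpers: redo entries are buffered in DRAM during the HTM
 * section and copied/flushed to the per-thread NVM log only afterwards.   */
void redo_log_add(void *addr, uint64_t val);   /* record a write in DRAM   */
void redo_log_persist(void);                   /* clwb + sfence the entries */

int txn_write(uint64_t *shadow_addr, uint64_t val)
{
    unsigned status = _xbegin();
    if (status != _XBEGIN_STARTED)
        return -1;                  /* abort path: retry or fall back       */

    *shadow_addr = val;             /* update the DRAM shadow copy only     */
    redo_log_add(shadow_addr, val); /* buffer the redo entry in DRAM; no
                                     * flush here, a flush would abort HTM  */
    _xend();                        /* hardware commit                      */

    redo_log_persist();             /* now persist the redo entries to NVM  */
    return 0;
}
```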

The actual implementation uses mmap() to create shadow copies of the working set. SPHT assumes that NVM is used in the form of a persistent heap, mapped as a DAX file into the virtual address space of the process. The manager thread then performs a further mmap() with the MAP_PRIVATE flag, such that the first write to a page of the DAX region creates a copy-on-write instance of the page in volatile memory. All later modifications to the same virtual address are then conducted on a physical page backed by volatile memory as well, which does not affect the consistency of the NVM image. The manager and log replay threads mentioned earlier, however, access the NVM image through the DAX-mapped virtual addresses, such that their changes, when flushed, are sent to the NVM device. (OK, I am not an expert on mmap and DAX related system calls in general. Correct me if I am wrong.)
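A minimal sketch of this mapping setup, as I understand it: the file path and size are placeholders, and the MAP_SHARED mapping for the replayers' DAX view is my assumption about how the NVM image is reached (error handling omitted).

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1UL << 30;                      /* 1 GiB heap, placeholder  */
    int fd = open("/mnt/pmem/heap", O_RDWR);     /* DAX file, placeholder    */

    /* Worker view: MAP_PRIVATE, so the first write to any page triggers
     * copy-on-write into volatile DRAM; the NVM image is never dirtied.    */
    void *shadow = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE, fd, 0);

    /* Replayer view (my assumption): MAP_SHARED over the same file, so
     * writes followed by clwb/sfence reach the NVM device when replaying.  */
    void *image = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);

    /* ... run transactions against 'shadow', replay redo logs into 'image' ... */

    munmap(shadow, len);
    munmap(image, len);
    close(fd);
    return 0;
}
```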

The benefit of using shadow copies of the NVM working set is that no write ordering is required between data blocks and log entries. This is because dirty blocks, even when evicted, are only written to the shadowed location in volatile memory without polluting the NVM image. It therefore suffices to replicate what has happened on the shadowed working set to the NVM image in the form of redo logs.

To address the second difficulty, SPHT proposes that threads determine the most recent persisted log in a distributed manner, avoiding centralized commit protocols in which only one thread is allowed to commit at any time. The latter is particularly bad for NVM, since overall commit throughput is then bounded by a single thread’s throughput of flushing log entries and writing the commit mark. SPHT assigns each transaction a commit timestamp before its commit point, and assumes that the logical ordering between transactions is consistent with the total order implied by these timestamps. SPHT then forces transactions to complete their commit sequences in timestamp order, which guarantees that a transaction will not acknowledge its commit back to the caller before it is logically persisted on the NVM. This property, called immediate durability in the paper, is of extreme importance to practical durable TM designs. Designs failing to observe this property may acknowledge a transaction as committed to the user but still lose the committed state, causing incorrect recovery.

SPHT achieves immediate durability as follows. When the transaction completes, the HTM is first committed using the hardware primitive. The logical timestamp is allocated to the transaction before the hardware commit using system-wide hardware counters. After the hardware commit, the thread starts writing back its redo log entries to the per-thread log buffer on the NVM. The log object consists of a header containing the logical timestamp, a body, and an end mark used to determine whether the log is fully written. The thread also publishes its commit timestamp and whether it has already fully persisted the redo log (using a bit flag) in shared volatile memory. After the log is fully persisted, indicating that the transaction has been logically committed, the thread waits for all transactions with a smaller commit timestamp to also fully commit before it returns control to the caller. The wait process is essentially a loop scanning the logical timestamps and commit status of other transactions. If all transactions with a smaller timestamp have fully persisted, the current transaction also commits, since it is now guaranteed that the transaction can be fully restored with all of its potential dependencies (i.e., those with a smaller timestamp) also fully restored.
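The commit sequence described above might look roughly like the following sketch. The slot layout, the names (`commit_slot_t`, `persist_log`), and the simplification of ignoring slot reuse are all mine; the paper's actual structures differ.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-thread commit status published in shared volatile memory. */
typedef struct {
    _Atomic uint64_t ts;          /* commit timestamp of the running txn  */
    _Atomic int      persisted;   /* 1 once the redo log is fully on NVM  */
} commit_slot_t;

extern commit_slot_t slots[];     /* one slot per worker thread            */
extern int           nthreads;
void persist_log(int thread_id);  /* hypothetical: clwb + sfence the log   */

void commit_sequence(int me, uint64_t my_ts)
{
    persist_log(me);                           /* redo log entries now on NVM */
    atomic_store(&slots[me].persisted, 1);     /* publish: our log is durable */

    /* Immediate durability: do not return to the caller until every
     * transaction with a smaller timestamp has also persisted its log.   */
    for (int i = 0; i < nthreads; i++) {
        if (i == me) continue;
        while (atomic_load(&slots[i].ts) < my_ts &&
               !atomic_load(&slots[i].persisted))
            ;                                  /* spin until it catches up   */
    }
    /* it is now safe to acknowledge the commit to the caller */
}
```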

To indicate global progress, SPHT also maintains a commit mark that stores the logical timestamp of the last committed transaction. The commit mark is read by recovery and log truncation threads to determine the range of logs that should be replayed on the NVM image. The mark is updated by a committing transaction after it fully persists its own log and after the wait. The update is performed with Compare-and-Swap (CAS) to avoid data races between threads. The paper also proposes an optimization: when a transaction is in the wait loop and observes that a larger-numbered transaction exists (which must also be waiting), the current transaction simply skips the CAS and expects the transaction with the larger commit timestamp to perform the update. The current transaction, however, still needs to wait until the mark actually changes before it completes the commit sequence. This closes the small window in which a transaction could return to the caller before the global mark has been updated.
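The marker update with the skip-CAS shortcut could be expressed as below; `pmarker`, `larger_waiter_exists`, and the omission of persisting the marker itself are simplifications of mine, not the paper's code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

extern _Atomic uint64_t pmarker;            /* global persistency marker (logical ts) */
bool larger_waiter_exists(uint64_t my_ts);  /* hypothetical scan of the commit slots  */

void advance_marker(uint64_t my_ts)
{
    /* If a waiting transaction with a larger timestamp is visible, let it
     * perform the update on our behalf (skip-CAS), but still wait until the
     * marker covers us before returning, so durability stays immediate.     */
    if (larger_waiter_exists(my_ts)) {
        while (atomic_load(&pmarker) < my_ts)
            ;                               /* spin until the marker advances */
        return;
    }

    /* Otherwise advance the marker ourselves with a CAS loop. */
    uint64_t cur = atomic_load(&pmarker);
    while (cur < my_ts &&
           !atomic_compare_exchange_weak(&pmarker, &cur, my_ts))
        ;                                   /* 'cur' is reloaded on failure   */
}
```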

To solve the last difficulty, SPHT proposes to chain log objects together such that log replay threads can locate the next log object in constant time. Conventionally, with K log buffers, the replay thread must select the minimum-timestamped log object among the K buffers, which adds a factor of K per replayed object. In SPHT, each log object contains a pointer to the next log object in the total order enforced by the commit timestamps. At commit time, a transaction locates the previous log object by scanning the status of other transactions: if a transaction whose commit timestamp is exactly one less than its own has just committed, the current transaction patches that transaction’s log with the address of its own log object.
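A sketch of the chaining step, assuming each log object header ends with a `next` pointer and that the log address of timestamp `ts - 1` can be looked up from the shared status array; the layout and the helpers `find_log_with_ts` and `persist` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-NVM log object header; the paper's exact layout differs. */
typedef struct log_obj {
    uint64_t        ts;           /* commit timestamp                       */
    uint64_t        body_len;     /* length of the redo entries that follow */
    struct log_obj *next;         /* next log object in timestamp order     */
} log_obj_t;

log_obj_t *find_log_with_ts(uint64_t ts);       /* hypothetical lookup       */
void       persist(const void *p, size_t n);    /* hypothetical clwb+sfence  */

/* Link our freshly persisted log object behind the one with ts - 1, so that
 * replay can follow 'next' pointers in constant time per step instead of
 * selecting the minimum among K per-thread buffers.                         */
void chain_log(log_obj_t *mine)
{
    log_obj_t *prev = find_log_with_ts(mine->ts - 1);
    if (prev) {
        prev->next = mine;                       /* patch predecessor's pointer */
        persist(&prev->next, sizeof(prev->next));
    }
}
```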

To increase the throughput of log replay, multiple threads are used, each of which traverses the full set of log objects. Every replay thread is assigned an address range and replays only the log entries whose addresses fall into that range. The paper proposes interleaving the address space between the replay threads at 4KB granularity. Certain optimizations are also applied, such as NUMA-aware placement, in which each replay thread is pinned to a NUMA node so that it writes only memory local to that node.
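The 4KB interleaving could be expressed as a simple ownership test that each replay thread applies to every redo entry it scans. This is a sketch under my own hypothetical log layout (`redo_entry_t`, `log_obj_t`, `apply_to_nvm_image`), not the paper's code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                     /* 4 KB interleaving granularity      */

typedef struct { uint64_t addr, val; } redo_entry_t;     /* hypothetical entry  */

typedef struct log_obj {                  /* hypothetical chained log object    */
    uint64_t        ts;
    size_t          nentries;
    struct log_obj *next;
    redo_entry_t    entries[];            /* redo entries follow the header     */
} log_obj_t;

void apply_to_nvm_image(const redo_entry_t *e);           /* hypothetical write */

/* Replay thread 'tid' (out of 'nreplayers') owns an address iff the address's
 * 4 KB page number maps to it round-robin.                                     */
static bool owns(uint64_t addr, int tid, int nreplayers)
{
    return ((addr >> PAGE_SHIFT) % (uint64_t)nreplayers) == (uint64_t)tid;
}

/* Every replay thread walks the full chained log, but applies only the
 * entries whose target addresses fall into its own partition.                  */
void replay(const log_obj_t *head, int tid, int nreplayers)
{
    for (const log_obj_t *o = head; o != NULL; o = o->next)
        for (size_t i = 0; i < o->nentries; i++)
            if (owns(o->entries[i].addr, tid, nreplayers))
                apply_to_nvm_image(&o->entries[i]);
}
```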