Crafty: Efficient, HTM-Compatible Persistent Transactions



Paper Title: Crafty: Efficient, HTM-Compatible Persistent Transactions
Link: https://dl.acm.org/doi/10.1145/3385412.3385991
Year: PLDI 2020
Keyword: NVM; HTM; Crafty; TSX/RTM




Highlights:

  1. Using two executions to avoid in-transaction write ordering. The first, speculative execution derives undo log entries without changing the memory image (all speculative changes to the shared state are discarded). The second execution applies the memory updates observed during the first execution. This way, write ordering only needs to happen between the first and the second execution. Think of it as having an oracle that magically knows which memory locations will be updated. This is similar to deterministic database transactions, where conflicts are estimated by a speculative "dry run" of the transaction, which is then validated by an actual execution. If the two mismatch, the speculative execution is considered invalid, and the transaction is reattempted.

  2. Both executions are wrapped in TSX/RTM for atomicity. The atomicity between the two executions is enforced using a commit timestamp and, if that check fails, value validation. This is similar to NOrec (a minimalist STM design).

  3. At a high level, this just treats the entire transaction as a multi-location memory operation that happens atomically. We first generate an undo log entry for each memory location to be updated, persist these entries, then perform the memory operation, persist the dirty blocks, and finally truncate the log. The caveat in this generalized model is that we must guarantee that the multi-location memory operation (i.e., the writes generated by the transaction body) always performs the same writes to the same locations in both the logging phase and the execution phase.

Comments:

  1. The paper seems to miss a store fence after the second-stage execution: the COMMITTED mark must only be generated and persisted after all dirty blocks have been persisted, and this write ordering requires a persist barrier to enforce. Otherwise, it is possible that the COMMITTED mark reaches the NVM before the data blocks do, and if the system crashes at this point, the persisted image is corrupted, yet the COMMITTED record will mislead the recovery handler into thinking that the transaction is fully committed. Maybe the authors assume that clwbs are persisted in the order they are issued (they are not, in practice)? See the sketch after this list for the ordering I would expect.

  2. Re-execution requires that the transaction body be functional (its output depends only on its input, not on hidden state variables) and that the arguments must not be altered (or must be restorable to the initial state seen by the first-stage execution). Otherwise, second-stage re-execution is not possible.

  3. I do not quite get how "Discarding entries and bounding rollback severity" works. From a high-level perspective, shouldn't it just be maintaining a low-water mark, i.e., the timestamp of the earliest COMMITTED mark in the system (the paper uses "tsLowerBound")? Log truncation would work by deleting logs whose timestamp equals tsLowerBound and advancing tsLowerBound to that of the next earliest COMMITTED mark. It is like merge sort, and also very similar to the log truncation algorithm of redo-based designs (truncating the log by timestamp, earliest first). Of course, we need to make sure that when we delete a log, all of its dirty blocks have already been persisted (if the barrier between data flushes and the COMMITTED mark is inserted correctly, as in (1), this should always be true). Is it my own problem, or do the authors over-complicate things here?
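
To make comment (1) concrete, here is a minimal sketch of the ordering I would expect at the end of the second stage, assuming x86 clwb/sfence intrinsics; the function and variable names are mine, not Crafty's:

```c
#include <immintrin.h>   // _mm_clwb, _mm_sfence (compile with -mclwb)
#include <stddef.h>
#include <stdint.h>

// Hypothetical names for illustration; this is not Crafty's actual API.
static void persist_then_commit(void **dirty_lines, size_t n_dirty,
                                uint64_t *committed_mark, uint64_t commit_ts) {
    // 1. Write back every cache line dirtied by the transaction.
    for (size_t i = 0; i < n_dirty; i++)
        _mm_clwb(dirty_lines[i]);

    // 2. Persist barrier: the data must become durable before the
    //    COMMITTED mark can possibly become durable.
    _mm_sfence();

    // 3. Only now append and persist the COMMITTED mark.
    *committed_mark = commit_ts;
    _mm_clwb(committed_mark);
    _mm_sfence();
}
```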

This paper proposes Crafty, a novel persistent transaction design built on HTM and re-execution. The paper begins by observing that all previous designs have non-negligible run-time costs that are rooted deeply in their methodology. It investigates three typical mechanisms: undo logging, redo logging, and shadow paging. Undo logging requires write ordering between the log entry and the data block being written. As a result, each write operation must be preceded by log generation and persistence, which has a large overhead since the log flush uses what is called a persist barrier. The barrier, which typically consists of several cache flush instructions followed by a store fence (clwb and sfence on x86), stalls the pipeline for at least the time of a round trip between the cache hierarchy and the NVM controller, usually a few hundred cycles, causing performance degradation.
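
For reference, a per-store undo-logged write in a conventional design looks roughly like the sketch below (the log layout and names are hypothetical). The sfence after the log flush is exactly the persist barrier whose latency the paper complains about:

```c
#include <immintrin.h>   // _mm_clwb, _mm_sfence
#include <stdint.h>
#include <stddef.h>

// Minimal in-memory undo log for illustration (assumed to live in NVM).
typedef struct { void *addr; uint64_t old_val; } undo_entry_t;
static undo_entry_t undo_log[1024];
static size_t undo_len = 0;

// Undo-logged 64-bit store: the log entry must be durable *before*
// the in-place update can become durable.
static void undo_logged_store(uint64_t *addr, uint64_t new_val) {
    undo_entry_t *e = &undo_log[undo_len++];
    e->addr = addr;
    e->old_val = *addr;          // record the before-value
    _mm_clwb(e);                 // write the entry back toward NVM
    _mm_sfence();                // persist barrier: entry before data
    *addr = new_val;             // the in-place update may now proceed
    _mm_clwb(addr);              // data itself can be flushed lazily at commit
}
```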

Redo logging, on the other hand, does not require per-operation write ordering. As long as dirty data is written in place after the transaction commits (which can be implemented as write ordering between all dirty blocks and the commit mark, but in practice is usually implemented by shadowing the objects in volatile memory), no extra write ordering is enforced. The paper points out, however, that redo logging requires read operations to also check the redo log in order to access the most up-to-date value written by the current or an earlier committed transaction, incurring a non-negligible overhead for reads. Since reads are more common than writes in most workloads, the overall performance may still suffer (again, this can be avoided by shadowing objects in volatile memory).
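
A rough sketch of the read-side cost being described: with redo logging, every transactional read must first search the redo log before falling back to the home location. The structures here are invented for illustration:

```c
#include <stdint.h>
#include <stddef.h>

// Hypothetical per-transaction redo log, for illustration only.
typedef struct { uint64_t *addr; uint64_t new_val; } redo_entry_t;
typedef struct { redo_entry_t entries[256]; size_t len; } redo_log_t;

// Transactional read: the redo log is searched (newest first) so that
// a read observes the transaction's own earlier writes.
static uint64_t redo_logged_load(redo_log_t *log, uint64_t *addr) {
    for (size_t i = log->len; i-- > 0; )
        if (log->entries[i].addr == addr)
            return log->entries[i].new_val;   // hit: value written earlier
    return *addr;                             // miss: read the home location
}
```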

Shadow paging differs from logging approaches by not maintaining fixed home locations for data items. Instead, a global mapping table remaps objects from their old locations to a new copy, preserving the consistency of the old snapshot, which can be restored if a transaction fails to commit due to a crash. Although shadow paging eliminates explicit write ordering, the paper notes that it must keep the durability order consistent with the memory consistency order. In other words, the logical ordering of transactions in the memory consistency order must also be observed in the persistence order (which, in the context of shadowing, is the order in which the mapping table is updated), because otherwise recovery may produce a state that cannot be reached by any serial execution of committed transactions. This can result in non-scalable designs, since in shadow paging, transactions are committed by atomically updating the centralized mapping table. In addition, in some designs, a global ordering must be established using a single timestamp counter, which is difficult to scale.
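
For completeness, a toy sketch of how a shadow-paging commit could look: the new copy becomes current by atomically swinging a slot in a centralized object table, which is exactly the point of serialization the paper criticizes. All names here are hypothetical:

```c
#include <stdatomic.h>
#include <stddef.h>

// Hypothetical object table for a shadow-paging design, for illustration only.
// Each slot points to the current persistent copy of an object.
typedef struct { _Atomic(void *) slot[1024]; } object_table_t;

// Commit of a single-object update: the new shadow copy becomes visible by
// atomically replacing the mapping; the old copy remains the crash snapshot.
static void shadow_commit(object_table_t *tbl, size_t id, void *new_copy) {
    atomic_store_explicit(&tbl->slot[id], new_copy, memory_order_release);
    // In a real design the table update itself must also be persisted, and
    // the persistence order across transactions must match this commit order.
}
```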

The paper then observes that Hardware Transactional Memory is a potentially powerful mechanism for implementing durable transactions. Existing HTM implementations, however, disallow writing back cache lines from the hierarchy to the NVM inside a transaction: doing so causes an immediate abort because of the way speculative state is maintained in the hierarchy. This restriction makes it impossible to generate and persist log entries during a hardware transaction, and as a result, naive logging will not work.

Crafty, at a high level, adopts HTM for atomicity while avoiding the logging dilemma through re-execution of the transaction body. The first, "speculative" execution only generates undo log entries, which are persisted after the transaction, while the actual memory updates are discarded. This eliminates the write ordering requirement, since the memory updates corresponding to the undo log entries have not yet been committed. In the second execution, the memory updates are performed and committed at the end. Both executions use HTM to ensure atomicity. From a different perspective, the HTM transaction body can be viewed as a monolithic memory operation that completes atomically and whose updates can be determined in advance. Durability is achieved by first generating and persisting the undo log entries using the before-values of these memory updates, and then actually executing the monolithic memory operation, assuming its semantics remain consistent with the undo logs (i.e., the memory locations to be updated and the data to be written).
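
Before diving into the details, here is how I picture the overall control flow of one Crafty transaction attempt. This is a heavily simplified sketch using RTM intrinsics; the helper functions are placeholders standing in for the runtime steps described below, not Crafty's actual API:

```c
#include <x86intrin.h>   // RTM intrinsics (GCC/Clang, compile with -mrtm -mclwb)

// Placeholder helpers standing in for the runtime steps described below.
extern void run_body_logging_only(void);    // stage 1 body: log, then revert
extern void persist_undo_log(void);         // clwb the undo log entries
extern int  replay_or_validate(void);       // stage 2: redo replay, else value validation
extern void persist_dirty_and_commit(void); // clwb dirty blocks + COMMITTED mark

// Overall shape of one transaction attempt (greatly simplified).
static int crafty_txn_attempt(void) {
    if (_xbegin() != _XBEGIN_STARTED)       // stage 1 runs inside HTM
        return 0;                           // abort -> caller retries
    run_body_logging_only();                // build undo/redo logs, revert updates
    _xend();                                // commit with shared memory unchanged

    persist_undo_log();                     // logs reach NVM before stage 2 begins

    if (!replay_or_validate())              // stage 2 (its own HTM region inside)
        return 0;                           // stale logs -> caller retries

    persist_dirty_and_commit();             // clwb data, then COMMITTED record
    return 1;
}
```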

We next describe Crafty's operations in detail. During the first, speculative execution, store operations are instrumented such that each store, before it takes effect on the address space, first generates an undo log entry and a volatile redo log entry in separate private log buffers. The stores themselves are still executed, so that later reads to the same address observe the data written by an earlier store. The trick here is that write ordering between log entries and dirty blocks need not be enforced, because the HTM prevents dirty blocks from being evicted. After executing the transaction body, the runtime reverts all memory locations modified by the transaction by applying the undo log entries in reverse order (new to old). This is crucial, since Crafty does not enforce any write ordering between log entries and block data: a dirty block evicted after the transaction commits could otherwise pollute the memory image and cause corruption that the not-yet-persisted undo log cannot repair.
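
A minimal sketch of what the stage-1 instrumentation might look like (buffer layout and names are mine): stores populate both logs and still take effect, and the revert at the end walks the undo log newest-first:

```c
#include <stdint.h>
#include <stddef.h>

// Hypothetical per-thread log buffers, for illustration only.
typedef struct { uint64_t *addr; uint64_t val; } log_entry_t;
static log_entry_t undo_log[1024], redo_log[1024];
static size_t log_len = 0;

// Stage-1 instrumented store: record old and new values, then store,
// so later reads in the same body see the new value.  No flushes here;
// the enclosing hardware transaction keeps dirty lines from escaping.
static void stage1_store(uint64_t *addr, uint64_t new_val) {
    undo_log[log_len] = (log_entry_t){ addr, *addr };     // before-value
    redo_log[log_len] = (log_entry_t){ addr, new_val };   // after-value
    log_len++;
    *addr = new_val;
}

// End of stage 1: undo everything, newest entry first, so the hardware
// transaction commits without changing the shared memory image.
static void stage1_revert(void) {
    for (size_t i = log_len; i-- > 0; )
        *undo_log[i].addr = undo_log[i].val;
}
```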

After reverting its memory modifications, the transaction obtains a timestamp from a global, monotonically increasing timestamp counter (e.g., the globally synchronized RDTSC). The timestamp, together with a LOGGED entry, is appended to the end of the undo log. The hardware transaction then commits, leaving the shared memory image unchanged but with undo and redo log entries in their corresponding log buffers. The undo log is then flushed back to the NVM using clwb. The paper notes that no store fence is needed after the flush, since the next-stage execution under a hardware transaction enforces fence semantics before the transaction can begin (i.e., XBEGIN on x86).
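
Continuing that sketch, the end of stage 1 could look roughly like this (the LOGGED sentinel and log layout are my own invention; assumes GCC/Clang on x86 with -mrtm -mclwb):

```c
#include <x86intrin.h>   // _xend, _mm_clwb, __rdtsc
#include <stdint.h>
#include <stddef.h>

// Continuing the stage-1 sketch; these stand for the same per-thread buffers.
typedef struct { uint64_t *addr; uint64_t val; } log_entry_t;
extern log_entry_t undo_log[];
extern size_t log_len;
#define LOGGED_MARK ((uint64_t *)0x1)   // sentinel "addr" marking a LOGGED record

// Called at the end of the stage-1 hardware transaction, after the revert.
static void stage1_finish(void) {
    uint64_t ts = __rdtsc();                          // globally synchronized TSC
    undo_log[log_len++] = (log_entry_t){ LOGGED_MARK, ts };
    _xend();                                          // commit: memory unchanged

    // Flush the undo log (including the LOGGED record) toward NVM.  Per the
    // paper's argument, no sfence is needed: the next XBEGIN orders these.
    for (size_t i = 0; i < log_len; i++)
        _mm_clwb(&undo_log[i]);
}
```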

In the second-stage execution, we simply re-apply the memory modifications of the transaction, again wrapped within a hardware transaction. Since the second stage can only begin after all undo log entries have been persisted, this establishes a write ordering between all log entries and all memory updates.

The second-stage execution can take two forms: simple replay of the redo log generated in the previous stage, or re-execution with value validation. The first form is used if no other transaction has committed between the commit of the first stage and the start of the second stage. This is detected with a global commit timestamp variable that stores the timestamp of the last successfully committed transaction. At the beginning of the second-stage transaction, the value of this variable is read and compared with the timestamp in the LOGGED entry. If the former is smaller, indicating that no transaction has successfully committed between the first-stage and the second-stage execution, the second stage simply traverses the redo log entries in the order they were created and applies them. At the end, the commit timestamp variable is updated to the current timestamp counter's value, after which the second-stage transaction commits. After it commits, the runtime issues clwb instructions to write back all dirty blocks written during the transaction, and appends a COMMITTED mark at the end of the undo log to indicate that the transaction has successfully committed. Again, no sfence is issued, since these writes will eventually be persisted on the next transaction's XBEGIN.
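
A sketch of this fast path as I understand it (all identifiers here are mine; assumes GCC/Clang x86 intrinsics). Note that reading last_commit_ts inside the transaction is what puts it in the HTM read set:

```c
#include <x86intrin.h>   // _xbegin/_xend/_xabort, _mm_clwb, __rdtsc (-mrtm -mclwb)
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t *addr; uint64_t val; } log_entry_t;
extern log_entry_t redo_log[];            // built during stage 1
extern size_t log_len;
extern volatile uint64_t last_commit_ts;  // global commit timestamp variable
extern void append_committed_mark(void);  // appends + clwbs a COMMITTED record

// Fast path of the second stage (names are mine, not the paper's).
// Returns 1 on success; on 0 the caller falls back to value validation.
static int stage2_replay(uint64_t logged_ts) {
    if (_xbegin() != _XBEGIN_STARTED)
        return 0;                              // HTM abort -> validation path
    if (last_commit_ts >= logged_ts)           // someone committed in between
        _xabort(0xff);                         //   -> validation path
    for (size_t i = 0; i < log_len; i++)       // replay writes in log order
        *redo_log[i].addr = redo_log[i].val;
    last_commit_ts = __rdtsc();                // publish our commit timestamp
    _xend();

    for (size_t i = 0; i < log_len; i++)       // write back dirty blocks
        _mm_clwb(redo_log[i].addr);
    append_committed_mark();                   // (see comment (1) above about
    return 1;                                  //  the missing sfence here)
}
```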

Note that, since the second-stage execution adds the commit timestamp variable to its read set by reading it at the beginning, transactions that attempt to commit concurrently will abort due to conflicts on this variable. This essentially serializes second-stage executions, as the paper also notes.

If the second-stage execution aborts, either due to a conflict on shared data or on the commit timestamp variable, or because the commit timestamp check fails, the runtime performs value validation to check whether the speculative first-stage execution is still valid. Here, a "valid" speculative execution is one that still generates the same values in the same order and commits at the same position when re-executed on the most up-to-date memory image. Note that Crafty does not care whether the control flow is actually identical between the first speculative run and the validation run; they are considered identical as long as they change the memory image in precisely the same way. This is slightly more relaxed than checking exact read-write conflicts for serializability, since conflicting writes may have been performed between the two stages, yet those writes do not change the observed values as a net result. It may also be that the two executions take different code paths, but the data generated by the two paths is identical anyway.

As mentioned above, validation is just a re-execution of the transaction body on the same input. The runtime checks the values generated by the re-execution and declares success if, at the end of the execution, all values and addresses match and the LOGGED record coincides with the commit point at the end of the re-execution. The paper assumes that transaction bodies are functional (i.e., the output depends only on the input) and that input values are always available in their initial state (i.e., transactions must not alter their inputs, or if they do, there must be a way to undo it), so that re-execution is always viable. Memory locations are also updated as in a normal transaction during the re-execution. If re-execution succeeds, the runtime flushes all dirty blocks as well as the COMMITTED log entry back to the NVM, and the commit timestamp variable is updated to the current global timestamp.
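
A sketch of how the validation-mode store instrumentation could work (names invented): each re-executed store is compared, in order, against the redo log produced by the first stage:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t *addr; uint64_t val; } log_entry_t;
extern log_entry_t redo_log[];   // writes recorded by the first stage
extern size_t log_len;

// Validation-mode store (names are mine): the re-execution must produce the
// same addresses and values, in the same order, as the first-stage run.
static int validation_ok = 1;
static size_t validation_pos = 0;

static void validate_store(uint64_t *addr, uint64_t new_val) {
    if (validation_pos >= log_len ||
        redo_log[validation_pos].addr != addr ||
        redo_log[validation_pos].val  != new_val)
        validation_ok = 0;            // mismatch: first-stage logs are stale
    validation_pos++;
    *addr = new_val;                  // updates are applied as in a normal run
}

// At the end of the body: valid only if every logged write was matched
// and no extra writes were produced.
static int validation_succeeded(void) {
    return validation_ok && validation_pos == log_len;
}
```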

If re-execution fails, either because of an address/data mismatch or because the hardware transaction aborts due to a data conflict, the entire transaction fails and the runtime simply re-attempts the first-stage execution. No state rollback is needed, since the hardware transaction abort has already rolled back all second-stage memory updates.

At crash recovery time, the handler scans the log buffers of all threads and rolls back transactions that may have been caught half-way through committing. For each buffer, the handler checks the last mark (not necessarily the last entry; just scan forward). If the mark is COMMITTED, no recovery is required, since either the entire transaction has committed or the transaction was still running its first stage, in which case nothing needs to be undone. If the mark is LOGGED, however, the transaction may have completed the first and second stages and been half-way through writing back its dirty blocks. In this case, the undo entries between the LOGGED mark and the previous COMMITTED mark (or the log head, if there is none) are replayed, rolling back the transaction. At the system level, recovery must proceed in reverse-chronological order: the handler always selects the LOGGED entry with the largest timestamp to replay next, to avoid violating the logical execution order.
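
To make the recovery rule concrete, here is a per-thread sketch under an invented log layout (the real on-NVM format is the paper's, not this):

```c
#include <stdint.h>
#include <stddef.h>

// Per-thread persistent log as I imagine it for this sketch.
typedef enum { ENTRY_UNDO, ENTRY_LOGGED, ENTRY_COMMITTED } entry_kind_t;
typedef struct { entry_kind_t kind; uint64_t *addr; uint64_t val; uint64_t ts; } rec_entry_t;
typedef struct { rec_entry_t *e; size_t len; } thread_log_t;

// Roll back one thread's log if its last mark is LOGGED: replay the undo
// entries after the previous COMMITTED mark, newest first.
static void recover_thread(thread_log_t *log) {
    // Find the last mark (entries after a COMMITTED mark belong to a
    // transaction that was still in stage 1 and needs no rollback).
    size_t last_mark = log->len;
    while (last_mark-- > 0) {
        if (log->e[last_mark].kind != ENTRY_UNDO)
            break;
    }
    if (last_mark == (size_t)-1 ||                   // no mark at all, or
        log->e[last_mark].kind == ENTRY_COMMITTED)   // fully committed
        return;                                      // -> nothing to roll back

    // Last mark is LOGGED: undo everything before it, newest first, stopping
    // at the previous transaction's COMMITTED mark (or the log head).
    for (size_t i = last_mark; i-- > 0; ) {
        if (log->e[i].kind == ENTRY_COMMITTED)
            break;
        *log->e[i].addr = log->e[i].val;             // restore the old value
    }
    // (System-wide, threads' LOGGED marks are processed largest-timestamp first.)
}
```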