Pisces: A Scalable and Efficient Persistent Transactional Memory

- Hide Paper Summary

Paper Title: Pisces: A Scalable and Efficient Persistent Transactional Memory
Link: https://www.usenix.org/system/files/atc19-gu_0.pdf
Year: USENIX ATC 2019
Keyword: NVM; Transaction; Pisces

Back

This paper presents Pisces, a software transactional memory design for Non-Volatile DIMM. Pisces uses a limited form of multiversion concurrency control (MVCC) protocol, dual version concurrency control (DVCC), to implement snapshot-isolation (SI) transactional semantics. The paper points out that SI is feasible for NVM compared with serializability in terms of blocked readers. For serializable semantics, readers must wait for a dirty data item to be fully persisted and the transaction to be committed, before the read transaction does. On recovery, the logs of transactions must be replayed on the order consistent with data dependencies (also transaction’s serialization order) to avoid incorrectly overwriting a value. For SI, however, only write-write conflicts are detected, and transactions are ordered by the order that they overwrite data items. This prevents reading transactions from being blocked by a pending writing transaction it reads from, which decreases transaction commit latency.

The paper assumes a baseline MVCC model as follows. Transactions maintain a private write set for storing uncommitted data. Data items are maintained as a linked list of versions, each having a timestamp indicating the commit time of the transaction that creates it. Only committed version is stored in the version chain. Uncommitted versions is buffered privately as described above. A global timestamp counter maintains the current logical time. On transaction begin, the transaction reads the value of the global timestamp counter as the begin timestamp, bts. The bts is used when accessing data items for read and writes. On transactional reads, the version chain of the data item is traversed until the oldest version whose timestamp is smaller than or equal to the bts is reached. This version is then accessed directly from the version chain to fulfill the read. For writes, the version chain is traversed in the same manner as in reads. The version is duplicated in the private buffer of the writing thread, before it can be updated by the write. If the data item is already in the buffer, then writes will be performed on the data item without traversing the version chain. On commit, the transaction first locks all modified data items by acquiring locks on the version chain. The lock blocks both concurrent commits and accesses, the reason of which will be explained below. After all data items are locked, the transaction validates itself by checking whether any other transaction has committed on the locked items since it has started. This is done by comparing whether the latest commit timestamp is larger than bts. If true, the transaction has detected a write-write conflict, which leads to an abort. If not, the transaction first acquires a commit timestamp, cts, by atomically incrementing the global counter, and uses the new value as cts. To ensure failure-atomicity of updates, the transaction then writes a redo log to the NVM for persistence. The redo log consists of dirty data items it has created in the write buffer. After the redo log is persisted, the transaction updates version chain by copying updated versions with its cts as version timestamp. In the last step, the transaction releases all locks in the write set, and completes. On recovery, redo logs are replayed in the order of the commit timestamp to reflect the correct write ordering.

The paper identifies two issues with the above simple MVCC model. First, traversing version chains has a large overhead, both because the reader thread must examine a large number of version timestamps, but also because the pointer chasing and the resulting cache misses. The second issue is that readers are still blocked while a transaction commits on the data item it intends to read. This is because the committing transaction acquires the commit timestamp before it persists the write set. After the cts is acquired, all later transactions shuold be able to read committed data of the committing transaction. This, however, may not always be possible, since the operation of updating version chains for the entire write set is most likely not atomic. If reader threads are not blocked, then it is possible that the thread accesses an incorrect version simply because the committing transaction has not updated the version chain to add its own version yet. In the above transaction model, the reader transaction must wait for the committing transaction to fully add its write set to the version chain before it can proceed to read.

Pisces solves the above two issues by adapting the following. First, only two versions are allowed, one current version, another newer version. The current version is accessed in-place as the data item is normally stored. The newer version exists in a transactional-local log, which is exposed to readers after the writing transaction commits. At most one pointer chasing is required to find the correct version to read. To avoid having threads reading even older and non-existing versions, when we update the current version in-place, we must wait for such threads to exit before the current version is updated. To solve the second problem, we no longer associate a commit timestamp with every data item. Instead, versions (in fact, only the newer version) only contain a pointer to the transaction that has created it. When the commit timestamp is updated, we only update the commit timestamp field in the transaction descriptor, which can be done atomically. The timestamp of the newer version can then be read by following the pointer to the transaction descriptor in which the cts can be found, if exists.

We describe the data structure of versions and transaction descriptors as follows. In Pisces, at most two versions are maintained for each data item, one current version, which represents the last stable (committed) image of the object, and a newer version, which represents a version created by a later thread. Both versions can be accessed for read, but only the current version can be accessed for write. The current version contains the object itself, and a pointer to the newer version object. The pointer is set to NULL if there is no newer version. The pointer is also used as lock for blocking concurrent writers. Pisces eagerly locks data items to be updated during the read phase, which eliminates the need for validation, at the cost of slightly lower parallelism. The newer version object contains a pointer to the transaction’s descriptor for accessing the commit timestamp of the version, as we have discussed above. In addition, the object contains a “persist timestamp”, pts, to indicate log replay order during crash recovery. A pointer of the original current object is also included in the newer version object. The newer object is mapped to NVM logging area, while the current object is mapped to the process’s persistent heap as usual.

Each transaction has a descriptor which stores the metadata of the transaction, including its begin ts, commit ts, and a status variable. The descriptor is stored in shared DRAM area, which will be accessed by other transactions during the current transaction.

Pisces operates as follows. Transaction begin does not change as in the baseline MVCC model. In addition, the transaction also initializes the desdriptor by setting bts to the bts acquired from the global counter, and setting cts to positive infinity. This is to ensure that objects created by this transaction is invisible to concurrent reading transactions before cts is set to the actual commit timestamp. On transactional reads, the instrumented read instruction first checks whether the pointer of the current object is NULL. If true, then there is no second version to read, and we access the current version. Otherwise, we find the new version object, as well as the transaction descriptor. We compare the writer transaction’s cts against the current cts. If the latter is larger, then we read the newer version, since the writer transaction has committed, and it is serialized before the current transaction in terms of SI. If the latter is smaller, however, the writer has not committed yet, and we still read the current version.

On transactional writes, we check the pointer in the current version as well. If the pointer is non-empty, the transaction will abort immediately, since there is an uncommitted writer, incurring an eager write-write conflict. Otherwise, the transaction allocates a log entry on the NVM as the newer version, and attempts to install the pointer using atomic CAS. If the CAS succeeds, the object is successfully locked by the current transaction, granting full access permission until transaction commit. Otherwise, the transaction should abort. The current version is then duplicated in the newer version, and all later reads and updates are also made to the newer version. The transaction also maintains a set of pointers to the newer versions as its write set.

On transaction commit, no validation is performed, since objects are locked as soon as they are updated. The transaction can commit without any further action, if the write set is empty. This is a huge advantage compared with serializable transactional memory, since read-only transactions constitute a non-negligible part of total transactions. The committing transaction obtains the cts via an atomic fetch-increment of the global counter. The transaction logically commits by setting the cts in the transaction descriptor using the commit timestamp. Note that there is a short window after the global counter is updated and before cts in the descriptor is set, in which new transactions could start, read the object, finding the object being invisible since the cts has not been set yet. In this case the new version will be ignored, since the cts is still infinity. This is, however, incorrect, since the timestamp of the new transaction equals the commit time of the current transaction, suggesting that the new version should be visible to the new transaction. To prevent the race condition from happening, the paper proposes using a lock variable “inCritical” to serialize this window with concurrent reading transactions. “inCritical” is set to true before the transaction updates the global counter and cts in its descriptor, and to false after this is done (a fence may also be needed, as implied by the pseudocode in the paper).

After the logical commit, the transaction then flushes the write set into the log using cache line flushes and a persistence barrier. The cts in the descriptor is also written at the end of the log to indicate that the log object is completed (otherwise if system crashes while we are flushing the log, we may not be able to distinguish the integrity of the log). At this point, the transaction is also logically committed with regard to failures.

To conclude the commit process, the transaction then writes back new version to the current version after older transactions that can still access the current version exit. These older transactions have a begin timestamp smaller than the current commit timestamp. Write back is performed by copying the new version back to the current version, and clearing the pointer in the current version. This also unblocks pending writers on the object, if any.

The newer version should not be removed after transaction commit, since there can potentially be threads started after the logical commit and before the completion of the commit process accessing the newer version. In theory, a log entry can no longer be accessed after the pointer is removed from the current version. In practice, Pisces treats all log objects as inaccessible when the next writer on the same object commits. The newer version can be removed after the commit of the next writing transaction, since it also waits for earlier threads (started before its cts) to exit before commit concludes.

After all log entries of a transaction is marked as invalid, the entire transaction’s log can be reclaimed for GC. This, however, must be done in the order indicated by their persistence timestamp, since otherwise we may miss a later update with larger pts, while we apply an earlier update. The paper proposes an epoch-based approach for reclaiming transaction logs. We do not cover it here, since it is just commonly used epoch-based reclamation.

On recovery, transaction logs are located by inspecting each transaction’s local data. Logs are applied in the order indicated by their pts. Earlier log entries are applied first, followed by later entries. The system can resume execution after the log replay process.