Stretching Transactional Memory



Paper Summary
Paper Title: Stretching Transactional Memory
Link: https://dl.acm.org/citation.cfm?id=1542494
Year: PLDI 2009
Keyword: STM; SwissTM; TL2

STM Summary
Conflict Detection: Eager for W/W; Lazy for R/W
Conflict Resolution: Eager for W/W; Lazy for R/W
Version Management: Lazy
R/W Bookkeeping: Log




Classical state-of-the-art Software Transactional Memory (STM) designs, such as TL2, use lazy conflict detection and resolution, which allows a higher degree of parallelism: multiple transactions may pre-write the same data item, and transactions are only serialized at commit time. TL2-style STMs rely on timestamps to synchronize transactional reads and writes. A global timestamp counter, which can be implemented as either a simple integer counter or a more complicated per-thread counter pool, provides timestamp values at transaction begin and commit time. At transaction begin, a begin timestamp (bt) is obtained by reading the current value of the counter. At commit time, the counter is atomically fetch-and-incremented, and the post-increment value is taken as the commit timestamp (ct).

Data items are associated with timestamps and locks. The timestamp records the commit timestamp of the most recent transaction that wrote the item, and the lock is a one-bit flag indicating that the item is under update. Both are stored in the same machine word and can therefore be fetched and updated atomically. Transactions maintain a read set and a write set during the read phase. The read set records the addresses of values read, together with the timestamps observed when they were fetched. A read is guaranteed to be consistent, regardless of concurrent writers, by sampling the write timestamp of the data item before and after the read: the two samples are compared, and the read is consistent if both samples are unlocked and their write timestamps agree. If the version of the data item is greater than bt, the transaction aborts, because the item was generated after the snapshot was taken at time bt. Writes are buffered in a thread-local write log, which records the address and the updated value of each written data item.

At commit time, the commit protocol first locks all data items in the write set. If any lock conflict occurs, the transaction must abort to avoid deadlock. After acquiring all locks, the transaction performs read validation: the validation routine compares the most up-to-date write timestamps of data items in the read set against the recorded ones, and aborts the transaction if any of them has changed, which indicates a Write-After-Read (WAR) anti-dependency. After successful validation, the transaction obtains its ct and enters the write phase, in which dirty items are written back. The timestamps of data items are updated to the commit timestamp of the transaction, and the locks on updated items are released; since timestamp and lock share one word, both can be done with a single store.
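To make the protocol concrete, below is a minimal C sketch of TL2-style versioned locks, consistent reads, and the commit protocol described above. All names (vlock_t, tx_t, lock_of, the fixed-size sets) are illustrative assumptions rather than TL2's actual code, and releasing already-acquired locks on the abort paths inside commit is elided.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Versioned lock: write timestamp in the upper bits, lock flag in bit 0,
 * so timestamp and lock can be read and updated with one atomic access. */
typedef _Atomic uint64_t vlock_t;
#define LOCKED(v)  ((v) & 1u)
#define TS(v)      ((v) >> 1)

static _Atomic uint64_t global_clock;          /* global timestamp counter */

typedef struct { uintptr_t addr; uint64_t ts;  } rset_entry_t;
typedef struct { uintptr_t addr; uint64_t val; } wset_entry_t;

typedef struct {
    uint64_t bt;                               /* begin timestamp */
    size_t nr, nw;
    rset_entry_t rset[1024];
    wset_entry_t wset[1024];
} tx_t;

/* Maps a data address to its versioned lock (e.g., a hash into a lock
 * table); supplied by the runtime. */
extern vlock_t *lock_of(uintptr_t addr);

/* Consistent read: sample the versioned lock before and after the load;
 * both samples must be unlocked, equal, and no newer than bt. */
bool tx_read(tx_t *tx, uint64_t *addr, uint64_t *out)
{
    vlock_t *l = lock_of((uintptr_t)addr);
    uint64_t v1 = atomic_load(l);
    uint64_t val = *(volatile uint64_t *)addr;
    uint64_t v2 = atomic_load(l);
    if (LOCKED(v1) || v1 != v2 || TS(v1) > tx->bt)
        return false;                     /* abort: inconsistent or too new */
    tx->rset[tx->nr++] = (rset_entry_t){ (uintptr_t)addr, TS(v1) };
    *out = val;
    return true;
}

static bool in_wset(const tx_t *tx, uintptr_t addr)
{
    for (size_t i = 0; i < tx->nw; i++)
        if (tx->wset[i].addr == addr) return true;
    return false;
}

bool tx_commit(tx_t *tx)
{
    /* 1. Lock every item in the write set; abort on any conflict to
     *    avoid deadlock. */
    for (size_t i = 0; i < tx->nw; i++) {
        vlock_t *l = lock_of(tx->wset[i].addr);
        uint64_t v = atomic_load(l);
        if (LOCKED(v) || !atomic_compare_exchange_strong(l, &v, v | 1u))
            return false;
    }
    /* 2. Commit timestamp: the post-increment value of the counter. */
    uint64_t ct = atomic_fetch_add(&global_clock, 1) + 1;
    /* 3. Validate the read set: any entry that changed or is locked by
     *    another transaction signals a WAR anti-dependency, so abort. */
    for (size_t i = 0; i < tx->nr; i++) {
        uint64_t v = atomic_load(lock_of(tx->rset[i].addr));
        if (TS(v) != tx->rset[i].ts ||
            (LOCKED(v) && !in_wset(tx, tx->rset[i].addr)))
            return false;
    }
    /* 4. Write back, then publish version ct and release the lock with
     *    a single store, since both live in the same word. */
    for (size_t i = 0; i < tx->nw; i++) {
        *(uint64_t *)tx->wset[i].addr = tx->wset[i].val;
        atomic_store(lock_of(tx->wset[i].addr), ct << 1);
    }
    return true;
}
```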

TL2 is sub-optimal in cases where spurious aborts occur and degrade throughput. This is because TL2 fixes the begin timestamp at transaction begin time and sticks to it for validation, which is overly restrictive. Imagine the scenario where a newly written data item is read, and its timestamp is greater than bt. TL2 will abort the transaction, because the data item was generated by a transaction that logically happens after the current transaction fixed its read snapshot. The abort in this case can be avoided by promoting bt to the current logical timestamp after a successful validation of the read set. The reasoning is that, as long as the contents of the read set still match the most up-to-date values of the data items, it makes no difference whether the reads occur at the current logical time or at their original physical time. One extra benefit is that read-only transactions do not have to perform validation after read operations if the value of bt equals the current timestamp value. In some publications this technique is called “extensible timestamps”, and it has been adopted by NOrec and LSA-STM. By extending the begin timestamp whenever possible, a transaction suffers fewer spurious aborts and can sustain higher throughput.
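A minimal sketch of this bt promotion, reusing tx_t, TS, LOCKED, lock_of, and global_clock from the previous sketch; the helper name tx_extend_bt is made up for illustration.

```c
/* Try to promote bt to the current logical time. This is safe only if
 * every entry in the read set still matches the globally visible
 * version: then all past reads could equally well have happened "now". */
bool tx_extend_bt(tx_t *tx)
{
    uint64_t now = atomic_load(&global_clock); /* candidate new bt */
    for (size_t i = 0; i < tx->nr; i++) {
        uint64_t v = atomic_load(lock_of(tx->rset[i].addr));
        if (LOCKED(v) || TS(v) != tx->rset[i].ts)
            return false;                      /* stale read: must abort */
    }
    tx->bt = now;                              /* reads valid as of `now` */
    return true;
}
```

With this helper, tx_read can, upon seeing TS(v1) > tx->bt, first try tx_extend_bt and retry the two-sample read instead of aborting immediately; only if the extension fails does the transaction really have to abort.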

This paper argues that TL2, even with the optimization described above, is still sub-optimal for a few reasons. First, while lazy detection and resolution of read-write conflicts provides a good degree of parallelism, the paper points out that lazy detection of “pre-write”-“pre-write” conflicts hurts performance. The paper recognizes that in most cases, data items in the write set are also in the read set, i.e., only few transactions blindly write a data item without reading it first. A “pre-write”-“pre-write” conflict therefore almost always indicates a read-write conflict that will surface during the validation of whichever transaction commits later. In this case, allowing both transactions to proceed until commit time wastes processor cycles, because at most one of them can commit successfully. Second, the contention management scheme of TL2, called “timid”, which always aborts the transaction that observes a locked item, favors short transactions. For long transactions, the timid policy causes wasted work if the transaction could in fact have committed by waiting for the lock, and the situation is aggravated if the transaction that aborted the current one later aborts itself. On the other end of the spectrum, the “Greedy” policy assigns each transaction a timestamp when it first starts, and the timestamp is kept across restarts. The Greedy contention manager chooses the younger transaction, i.e., the one with the larger timestamp, as the victim. This scheme favors large transactions; however, the overhead of reading and incrementing the global counter can become a performance bottleneck for small ones. Starvation is impossible with Greedy, because an aborted transaction keeps its timestamp and therefore eventually becomes the oldest, at which point it can no longer be chosen as the victim. Finally, on every transactional read operation, TL2 needs to search its write set first, in order to forward dirty values instead of reading from global storage. If not implemented properly, the write set can become a major source of performance loss, as searching it usually requires checking every, or at least several, items the transaction has written, and thus incurs an overhead that grows with the write set size.
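For contrast, here are the two baseline contention policies in the same sketch style. The types and policy functions are illustrative only; the real managers involve more machinery (waiting, back-off, and so on).

```c
#include <stdint.h>

typedef struct { uint64_t start_ts; } base_tx_t; /* per-tx metadata */

/* "Timid" (TL2): whoever observes a locked item aborts itself. */
base_tx_t *timid_victim(base_tx_t *attacker, base_tx_t *holder)
{
    (void)holder;                  /* the lock holder always wins */
    return attacker;
}

/* "Greedy": start_ts is drawn from a global counter at first start and
 * kept across restarts, so a transaction's priority only grows with
 * age. The younger transaction (larger timestamp) is the victim, which
 * rules out starvation. */
base_tx_t *greedy_victim(base_tx_t *a, base_tx_t *b)
{
    return (a->start_ts > b->start_ts) ? a : b;
}
```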

This paper proposes SwissTM, aiming to solve the above problems of existing TL2-based STM designs. SwissTM distinguishes itself from TL2 by the following features.

First, SwissTM eagerly acquires a write lock when a data item is pre-written during the read phase. Write locks do not block concurrent readers; instead, a concurrent transaction that pre-writes the same data item invokes the contention manager (described below). The invariant of the write lock is that, once it is acquired, the owner transaction is guaranteed exclusive write access to the item until it commits or aborts. This detects “pre-write”-“pre-write” conflicts efficiently, while still allowing read-write conflicts to be resolved lazily at commit time: if a reader transaction observes the write lock, it is still possible for the reader to commit, as long as it enters its validation phase before the writer transaction commits. “Pre-write”-“pre-write” conflicts, as analyzed above, have no such property, and are better aborted eagerly when detected.

One extra bonus of acquiring the write lock upon pre-write is that, since the data item is guaranteed not to appear in any other transaction's write set, the owner can store the address of its write set entry in the remaining bits of the lock word (the lock itself only occupies one bit). On a transactional read, the owner transaction finds its write set entry in constant time with a single pointer dereference, while other transactions can still access the original data item. By contrast, in eager versioning designs where the writer replaces the old data item with the updated value in place, reader transactions would be unable to proceed, because the last committed value is nowhere to be found. At commit time, the transaction acquires a second type of lock, the read lock. Read locks function exactly like the commit-time locks of TL2, blocking reading transactions from accessing the data item. Since the write lock already guarantees that a data item is pre-written by only one transaction, the read lock can be acquired by a plain store into its lock bit instead of a Compare-and-Swap (CAS).
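The following sketch shows the per-item metadata implied by this design: a write lock whose spare bits carry a pointer to the owner's write-log entry, plus a separate read lock (version word) taken only at commit. The names, the low-bit encoding, and the "is this entry mine" range test are assumptions made for illustration, not SwissTM's actual layout, and the version validation a reader would perform against the read lock is elided.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t addr; uint64_t val; } wlog_entry_t;

typedef struct {
    _Atomic uintptr_t w_lock;  /* 0 when free; (entry | 1) when pre-written */
    _Atomic uint64_t  r_lock;  /* version of the last committed write       */
} item_lock_t;

#define R_LOCKED UINT64_MAX    /* sentinel: item is being written back */

/* Pre-write: eagerly claim the item with a CAS; on failure, the caller
 * invokes the contention manager against the current owner. */
bool acquire_w_lock(item_lock_t *l, wlog_entry_t *entry)
{
    uintptr_t expected = 0;
    return atomic_compare_exchange_strong(&l->w_lock, &expected,
                                          (uintptr_t)entry | 1u);
}

/* Read: the owner reaches its buffered value with one pointer
 * dereference; every other transaction keeps reading the last committed
 * value, which is still in place because versioning is lazy. */
uint64_t read_item(item_lock_t *l, const uint64_t *addr,
                   const wlog_entry_t *my_log, size_t my_log_len)
{
    uintptr_t w = atomic_load(&l->w_lock);
    const wlog_entry_t *e = (const wlog_entry_t *)(w & ~(uintptr_t)1);
    if (w != 0 && e >= my_log && e < my_log + my_log_len)
        return e->val;         /* locked by us: forward the dirty value */
    return *addr;              /* committed value, untouched until commit */
}

/* Commit: w_lock already excludes competing writers, so the read lock
 * can be taken with a plain store rather than a CAS. */
void acquire_r_lock(item_lock_t *l)
{
    atomic_store(&l->r_lock, R_LOCKED);
}
```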
The second feature of SwissTM is its two-phase contention manager. Classical schemes such as timid and Greedy favor either short or large transactions. As an alternative, SwissTM uses a scheme that is friendly to large transactions, by avoiding aborting them unless strictly necessary, and that does not hurt small transactions, because small transactions never fetch-and-increment the global counter the way they would under Greedy. The contention manager is only invoked on write-write conflicts. At transaction begin, each transaction is assigned a weight of positive infinity, meaning it is still “small”. On every successful pre-write lock acquisition, the size of the write set is checked, and when it reaches a certain threshold, the transaction atomically fetch-and-increments a global counter and saves the value as its weight. The weight is set at most once during the read phase, and the earlier a transaction reaches the threshold, the smaller its weight. When a “pre-write”-“pre-write” conflict occurs, the contention manager peeks at the current holder of the write lock and compares the weights of the two transactions: the one with the larger weight, i.e., the one that has performed less work or is still “small”, is aborted. This prevents starvation, as Greedy does, without making short transactions pay the overhead of incrementing a global counter.

Finally, SwissTM locks data items at 4-word (16-byte) granularity. Compared with TL2's word-level locking, using 4-word chunks as the locking unit gave the best empirical results while keeping the bookkeeping overhead reasonable.
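A sketch of the two-phase policy described above. The threshold constant and names are illustrative, restart handling is elided, and the tie-break between two still-“small” transactions (the attacker aborting itself, timid-style) is an assumption rather than something stated in this summary.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define WEIGHT_THRESHOLD 10          /* illustrative; tuned empirically */
#define WEIGHT_INF UINT64_MAX        /* "still a small transaction"     */

static _Atomic uint64_t cm_counter;  /* touched only by large transactions */

typedef struct {
    uint64_t weight;                 /* set to WEIGHT_INF at tx begin */
    size_t   wset_size;
} cm_tx_t;

/* Called after each successful pre-write lock acquisition: the global
 * counter is incremented at most once per transaction, and only once
 * the write set reaches the threshold, so short transactions never pay
 * for it. Earlier arrival at the threshold means a smaller weight. */
void cm_on_write(cm_tx_t *tx)
{
    if (++tx->wset_size == WEIGHT_THRESHOLD)
        tx->weight = atomic_fetch_add(&cm_counter, 1);
}

/* Write/write conflict: the transaction with the larger weight (less
 * work done so far, or still "small") is the victim. If both weights
 * are still infinite, the attacker aborts itself, which degenerates to
 * timid behavior for short transactions. */
cm_tx_t *cm_victim(cm_tx_t *attacker, cm_tx_t *holder)
{
    if (attacker->weight >= holder->weight)
        return attacker;
    return holder;
}
```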