OmniOrder: Directory-Based Conflict Serialization of Transactions



Paper Summary
Paper Title: OmniOrder: Directory-Based Conflict Serialization of Transactions
Link: https://ieeexplore.ieee.org/document/6853223/
Year: ISCA 2014
Keyword: HTM; OmniOrder; Sequential Consistency

HTM Summary
Conflict Detection: Eager
Conflict Resolution: Eager
Version Management: Eager (Uncommitted Read)
R/W Bookkeeping: On-chip buffer

Conflict-based Hardware Transactional Memory (HTM) suffers from high abort rates when conflicts are frequent. This is caused by the simple conflict detection mechanism, which treats any cache coherence message that hits a transactional cache line as a potential violation. The only exception is a read-shared request that hits a line in Shared or Exclusive state, since readers cannot conflict with other readers. When a conflict is detected, a processor aborts and restarts, discarding all speculative state in its private cache, because it keeps no information about how the cache line will be used by the other processor. This is sub-optimal, because not all dependencies imply a violation at the end of execution.

This paper proposes OmniOrder, an extension on top of existing commercial HTM such as Intel TSX that supports more efficient dependency reasoning. Instead of keeping no information about the global usage of a cache line, each processor is extended with a few structures that track the modification history and the dependencies of speculatively modified cache lines. One highlight of the OmniOrder design is its native support for an unmodified directory-based cache coherence protocol. This feature is crucial for a practical HTM design, for two reasons. First, a directory-based protocol is a must in today's high-performance architectures, so the inability to support one is a serious deficiency. In contrast, some HTM designs require a broadcast coherence protocol, which makes them non-scalable and impossible to port to large-scale systems. Second, an unmodified coherence protocol implies that only incremental changes are needed on the existing microarchitecture. Designing and verifying a correct coherence protocol is difficult, and what makes it worse is that the actual number of states in the state machine is far larger than the number of steady states: transient states that handle race conditions must be added so that multiple operations can proceed in parallel. All these factors discourage the invention of a new coherence protocol. The next paragraphs cover in detail the hardware changes required to implement OmniOrder.

In OmniOrder, each processor is extended with two buffers: an L0 cache that holds the speculative values the processor has written to cache lines, and a Speculative Version Buffer (SVB) that holds the global modification history of the cache lines currently in its private cache. Both can be organized like a cache, using a few address bits to select a set and then tag comparison to locate the way. Four bit vectors are also added to each processor. The first two are the successor vector and the predecessor vector, which record the read/write dependencies the current processor participates in as source and destination, respectively. The third is the squash vector, which records Read-after-Write dependencies in which the processor is the destination. The name suggests its function: when the processor reads uncommitted data from another processor and the latter aborts, the current processor must also abort to preserve isolation. The width of these three vectors is the number of processors in the system. The last vector is the directory successor vector, whose width is the number of directories. A bit is set if speculative data generated by this processor is maintained and tracked by the corresponding directory; we cover below how directories act as "proxies" for processors that have shared their uncommitted data. The contents of L0 and the SVB must never be written back to memory, and are not visible to non-transactional instructions. If an overflow is unavoidable, the current transaction must abort, which implies that OmniOrder is not unbounded.
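
A minimal sketch of this per-processor state, in C++, might look as follows. The system sizes (16 processors, 4 directories, 64-byte lines) and all type and field names are illustrative assumptions, not values or terminology from the paper.

```cpp
#include <bitset>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr int NUM_PROCS = 16;  // illustrative system size
constexpr int NUM_DIRS  = 4;

struct CacheLine { uint8_t bytes[64]; };

// One SVB entry: the ordered modification history of a speculatively
// written line, one delta per uncommitted writing transaction.
struct SVBEntry {
    struct Delta { int writer; CacheLine diff; };
    std::vector<Delta> history;  // oldest delta first
};

struct ProcessorTxState {
    // Speculative values this processor has written; never written back
    // to memory and invisible to non-transactional instructions.
    std::unordered_map<uint64_t, CacheLine> l0;
    // Global modification history of lines currently in the private cache.
    std::unordered_map<uint64_t, SVBEntry> svb;

    std::bitset<NUM_PROCS> successors;     // dependencies we are the source of
    std::bitset<NUM_PROCS> predecessors;   // dependencies we are the destination of
    std::bitset<NUM_PROCS> squash;         // RAW sources whose abort forces ours
    std::bitset<NUM_DIRS>  dirSuccessors;  // directories proxying our speculative data
};
```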

The directory is also enhanced with an SVB and an array of bit vectors. The SVB in the directory should be large enough to hold the SVB entries of all processors. Each element of the bit-vector array stores the successors of a particular processor: a bit in the vector is set if the corresponding processor is a successor of the processor that the vector represents. As with the per-processor SVB, if the directory SVB overflows, the transaction must abort.
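
Continuing the sketch above (reusing SVBEntry and NUM_PROCS), the directory-side state could be modeled as follows, again with invented names:

```cpp
// Directory-side state. successorOf[p] holds the successor vector of
// processor p for the speculative data this directory proxies.
struct DirectoryTxState {
    std::unordered_map<uint64_t, SVBEntry> svb;     // proxied SVB entries
    std::bitset<NUM_PROCS> successorOf[NUM_PROCS];  // per-processor successor vectors
};
```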

OmniOrder maintains a few invariants. First, speculative data is never stored in ordinary caches; it always lives in L0 and the SVB, and is not visible to non-transactional reads. Second, the coherence of an SVB entry implicitly follows the coherence of its corresponding cache line: if the line exists in multiple caches in Shared state, then multiple copies of the SVB entry also exist and are consistent with each other; similarly, if the line is in M state, then the SVB entry of the line is also the most up-to-date, and no other copy exists.

On a transactional load, a cache coherence request is sent as usual. The paper assumes an MSI baseline protocol, so the Exclusive state is not under consideration. If the request hits a local cache line, the data in L0 is forwarded to the processor. Otherwise, the processor checks the SVB entry for uncommitted data. If the request hits a speculative and modified line, with uncommitted data in both L0 and the SVB, the source processor first combines its L0 data with the SVB entry, and then writes both back to the directory. The directory stores the combined history in its own SVB, marks the requesting processor as a successor in the responding processor's bit vector, and provides the requesting processor with the SVB entry and the line. The responding processor also sets, in its directory successor vector, the bit of the directory it sent the SVB entry to. From then on, the directory acts as a proxy for the SVB entry and the cache line.

If the request hits a speculative and shared line, it must be the case that the shared copies were produced by one of the processors currently holding the line, and that an earlier read hit the line in M state. In this case the directory must already hold an SVB entry for the line as a result of that previous read, so it provides the line as well as the SVB entry, and marks the requesting processor as a successor in the last writer's bit vector; it infers the identity of the last writer from the modification history in the SVB. In either case, the requesting processor sets the bit of the responding processor in its predecessor vector. If the processor reads uncommitted data written by another processor, it also marks that processor in its squash vector. This is necessary to guarantee isolation, as otherwise transactions might consume data written by an aborted transaction.
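
As a rough illustration of the load path in the last two paragraphs, building on the structs above: Reply, issueGetS, and their fields are invented stubs for the coherence machinery, not the paper's interface.

```cpp
// Reply to a GetS request; carries an SVB entry if the line holds
// uncommitted data somewhere in the system. All names are illustrative.
struct Reply {
    bool hasSVB = false;
    int  lastWriter = -1;  // inferred from the SVB modification history
    SVBEntry  svbEntry;
    CacheLine line;
};
Reply issueGetS(uint64_t addr, int requester);  // coherence stub

CacheLine txLoad(ProcessorTxState& self, int selfId, uint64_t addr) {
    // Local speculative hit: our own value lives in L0 and is forwarded
    // directly. (A plain non-speculative local hit is omitted here.)
    if (auto it = self.l0.find(addr); it != self.l0.end())
        return it->second;

    Reply r = issueGetS(addr, selfId);  // ordinary MSI read
    if (r.hasSVB) {
        // The line carries uncommitted data. The responder (or the
        // directory acting as its proxy) has already marked us in the
        // last writer's successor vector; we record the reverse edge
        // and arm the squash vector in case the writer aborts (RAW).
        self.predecessors.set(r.lastWriter);
        self.squash.set(r.lastWriter);
        self.svb[addr] = r.svbEntry;  // keep a coherent copy of the history
    }
    return r.line;
}
```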

On a transactional store, exclusive permission is first obtained through the coherence protocol. If the store hits a local M state line, nothing else is needed except buffering the modification in L0; note that processors never write directly into the SVB. If the store hits a remote M state line, the coherence message from the responding processor contains both the line and the SVB entry, and the requesting and responding processors mark each other as predecessor and successor accordingly. The directory is not involved except for forwarding the coherence message.

If the store hits a line in local S state, or a remote speculative and shared S state, coherence requires that these copies be invalidated first. As described above, when this happens it must be the case that a speculatively written cache line was read by another processor, and the directory has become the proxy for the line. In this case, the directory invalidates all S state copies (except the one in the requesting processor, if any), then sends the requesting processor the SVB entry as well as the line. The requesting processor discards them if it already holds the line in S state; otherwise it stores them in its SVB and cache, respectively. All processors that were holding the line in S state set the requesting processor's bit in their successor vectors, and the directory sends the requesting processor the identities of all sharers, which the requester sets as its predecessors. Note that the write case differs from the read case in that the directory does not set the vector itself; instead, every sharer sets its successor vector individually, because the write in a WAR dependency can only commit after all readers commit, whereas in a RAW dependency all readers can commit as soon as the writer commits.
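
A matching sketch of the store path, under the same illustrative assumptions (StoreReply and issueGetM are invented stubs):

```cpp
// Reply to a GetM request. At most one speculative case applies; a hit
// on a local M line takes neither branch and only buffers into L0.
struct StoreReply {
    bool hadRemoteM   = false;  // hit a remote speculative M copy (WAW)
    int  remoteWriter = -1;
    bool hadSharers   = false;  // S copies invalidated via the directory (WAR)
    std::bitset<NUM_PROCS> sharers;
    bool hasSVB = false;
    SVBEntry  svbEntry;
    CacheLine line;
};
StoreReply issueGetM(uint64_t addr, int requester);  // coherence stub

void txStore(ProcessorTxState& self, int selfId, uint64_t addr,
             const CacheLine& value) {
    StoreReply r = issueGetM(addr, selfId);  // obtain exclusive permission
    if (r.hadRemoteM) {
        // WAW on uncommitted data: the owner sent line + SVB entry and
        // marked us as its successor; we record it as our predecessor.
        self.predecessors.set(r.remoteWriter);
        self.svb[addr] = r.svbEntry;
    } else if (r.hadSharers) {
        // WAR: each invalidated sharer sets our bit in its own successor
        // vector; we set all of them as predecessors. Keep the directory's
        // SVB entry only if we did not already hold the line in S state.
        self.predecessors |= r.sharers;
        if (r.hasSVB && !self.svb.count(addr))
            self.svb[addr] = r.svbEntry;
    }
    // The modification itself is buffered only in L0; a processor never
    // writes into its own SVB.
    self.l0[addr] = value;
}
```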

On a commit instruction, the processor first waits for its predecessor vector to clear; in the meantime, it stalls its instruction stream while continuing to react to coherence and commit/abort requests. Once the vector is clear, the committing processor sends a commit signal to all its successor processors and successor directories, and each directory forwards the signal to all processors in its successor vector. On receiving a commit signal, a processor inspects its SVB and merges the modifications of the committing processor into the non-speculative cache line, then clears the committing processor from its predecessor vector. If a commit signal arrives out of order, it is buffered until the committing processor's modification is the earliest in the SVB entry. Multiple commit signals for the same transaction need no special processing, although the paper does not mention how to distinguish a commit from a new transaction on the same processor from an old commit that was delivered late. After the commit finishes, the committing processor clears its bit vectors and L0 cache; the SVB is not cleared, as it may hold uncommitted data from other processors.
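
The commit protocol could be sketched like this, again with invented helper names (serviceIncomingMessages, broadcastCommit, mergeOldest):

```cpp
void serviceIncomingMessages(ProcessorTxState& self);         // stub
void broadcastCommit(int committer,
                     const std::bitset<NUM_PROCS>& succs,
                     const std::bitset<NUM_DIRS>& dirSuccs);  // stub
void mergeOldest(uint64_t addr, SVBEntry& entry);  // apply + pop oldest delta

void txCommit(ProcessorTxState& self, int selfId) {
    // Stall until every predecessor has committed, while still serving
    // coherence requests and commit/abort signals.
    while (self.predecessors.any())
        serviceIncomingMessages(self);

    // Notify successor processors directly; successor directories act as
    // proxies and forward the signal to the successors recorded there.
    broadcastCommit(selfId, self.successors, self.dirSuccessors);

    // Local cleanup: vectors and L0 are cleared, but NOT the SVB, which
    // may still hold uncommitted data from other processors.
    self.l0.clear();
    self.successors.reset();
    self.squash.reset();
    self.dirSuccessors.reset();
}

void onCommitSignal(ProcessorTxState& self, int committer) {
    for (auto& [addr, entry] : self.svb) {
        // Merge only if the committer's delta is the earliest in the
        // history; an out-of-order signal is buffered instead (not shown).
        if (!entry.history.empty() &&
            entry.history.front().writer == committer)
            mergeOldest(addr, entry);
    }
    self.predecessors.reset(committer);
}
```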

On an abort instruction, or on a potential dependency violation, the transaction must abort. An abort does not wait for any condition; it takes place immediately. The aborting processor propagates the signal, as in the commit case, to all its successor processors and directories, and then clears its local L0 cache. Note that the local SVB never needs to be cleared, as a processor never writes into its own SVB. On receiving an abort signal, processors and directories remove the aborting processor's modifications from their SVBs, and clear its bit from their successor and predecessor vectors. If the aborting processor is in the squash vector of a receiving processor, the latter also forces an abort, because it has a RAW dependency on data that no longer exists. The aborting processor must keep serving coherence requests and propagating commit/abort messages until all its predecessors are cleared.
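
And the abort path, under the same assumptions (std::erase_if on a vector requires C++20):

```cpp
void broadcastAbort(int aborter,
                    const std::bitset<NUM_PROCS>& succs,
                    const std::bitset<NUM_DIRS>& dirSuccs);  // stub

void txAbort(ProcessorTxState& self, int selfId) {
    // Unlike commit, abort never waits on predecessors.
    broadcastAbort(selfId, self.successors, self.dirSuccessors);
    self.l0.clear();  // discard our speculative writes
    // No local SVB cleanup: we never wrote into our own SVB.
    self.successors.reset();
    self.squash.reset();
    self.dirSuccessors.reset();
    // The processor keeps serving coherence and forwarding commit/abort
    // messages until its predecessor vector drains (not shown).
}

void onAbortSignal(ProcessorTxState& self, int selfId, int aborter) {
    // Drop the aborter's deltas from every SVB entry we hold.
    for (auto& [addr, entry] : self.svb)
        std::erase_if(entry.history,
                      [&](const SVBEntry::Delta& d) { return d.writer == aborter; });
    self.successors.reset(aborter);
    self.predecessors.reset(aborter);
    // RAW from the aborter: the value we read no longer exists.
    if (self.squash.test(aborter))
        txAbort(self, selfId);
}
```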

OmniOrder uses a simple heuristic for detecting potential dependency cycles. Since each processor only keeps local dependency information, obtaining precise knowledge of dependency cycles would require global coordination. OmniOrder instead exploits the simple observation that, for a cycle to form, every node on the cycle must have at least one inbound and one outbound dependency. Based on this observation, OmniOrder monitors the predecessor vector whenever it is modified: if a processor gains a new predecessor while either of its two successor vectors is non-empty, it concludes that a dangerous structure has formed that may lead to a dependency cycle, and aborts the current transaction. To avoid a transaction repeatedly self-aborting due to conflicts, after the first several conflict-induced aborts it falls back to normal HTM execution, where no dependency is tracked. The same fallback also applies when the hardware bookkeeping structures overflow.
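
A sketch of this check as a hook on predecessor-vector updates, reusing txAbort from the abort sketch; the reading that both an inbound and an outbound edge must be present is this summary's interpretation of the heuristic, not the paper's exact condition.

```cpp
// Conservative cycle heuristic: any node on a dependency cycle has at
// least one inbound and one outbound edge, so a new inbound edge
// combined with any recorded outbound edge triggers an abort, even
// though it may be a false positive.
bool isDangerous(const ProcessorTxState& self) {
    return self.predecessors.any() &&
           (self.successors.any() || self.dirSuccessors.any());
}

void onNewPredecessor(ProcessorTxState& self, int selfId, int pred) {
    self.predecessors.set(pred);
    if (isDangerous(self))
        txAbort(self, selfId);  // falls back to plain HTM after repeated aborts
}
```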