Reduced Hardware NORec: A Safe and Scalable Hybrid TRansactional Memory



- Hide Paper Summary
Paper Title: Reduced Hardware NORec: A Safe and Scalable Hybrid TRansactional Memory
Link: https://dl.acm.org/citation.cfm?id=2694393
Year: ASPLOS 2015
Keyword: HTM; STM; NORec; Hybrid TM

- Hide HTM Summary
Conflict Detection: Hybrid
Conflict Resolution: Hybrid
Version Management: Hybrid
R/W Bookkeeping: Hybrid



Back

**To summarize: NORec family transactional memory has two invariants:

  1. Any update to shared data must be reflected on incrementing of the global counter;
  2. Writes must be serialized, i.e. write stages must not overlap.**

This paper presents hardware NORec, a hybrid transactional memory design based on the lightweight NORec. The paper identifies that some aspects in STM design will cause slowdown of execution, making single thread performance of STM several times slower than non-instrumented execution. For example, NORec requires every transactional load and store be instrumented, in order to maintain a read and write set for the transaction. This causes notable instruction explosion and metadata overhead, which decreases both the IPC due to excessive cache misses (both icache and dcache since instruction traces are also larger), and instruction throughput. What is more, the conflict resolution protocol in NORec relies on a single atomic global counter serving as the serialization point for concurrent transaction. A reader transaction must check the counter on every transactional load to maintain opacity. If the check fails, meaning that the transaction may access inconsistent data, it must abort and retry.

Hardware NORec is based on NORec, a previously proposed pure software TM scheme. NORec eliminates the requirement for having a direct-mapped or hashed metadata field for every accessing unit (i.e. the unit of version management and conflict detection) by using a global counter for serialization, and by using value validation. The global counter is consists of two fields. The first one bit field on the highest bit is a lock bit, which indicates whether some transaction is currently committing. The second field is a version field, which indicates the current read version, which is also the most up-to-date committed version. The counter is stored in globally accessible memory as a properly aligned machine word, such that the value of the counter can be loaded and atomically swapped using special instructions in the ISA. NORec works as follows. First, on transaction begin, the transaction reads the global counter and saves its value into a local variable as the local counter. Then, for every read operation, the instrumented code loads the value from memory, and checks whether the current value of the clock equals the local counter, and that whether the global clock is unlocked. If true, then the read succeeds, because at the time of the read (actually slightly longer, since the check is performed after the load), no transaction has committed yet. Otherwise, the transaction must either abort, or perform value-based validation. The value read by load instructions must also be stored in a read log in order to be later used by value validation. On transaction writes, the value is inserted into a local write log, which is not published to shared memory until transaction commit. On transaction commit, the transaction first locks the global clock by atomically swapping and setting the locked bit. Then a value validation is performed, if (1) the transaction is not read-only; and (2) the value of the global clock does not equal the value of the local counter (i.e. some transactions have committed). The value validation process iterates over the read log, and compares whether the value in the read log equals the current value in shared memory. Validiation succeeds if all values are idientical. The transaction finally commits by writing back all entries in the write set to the shared memory. The global lock is released atomically with the clock value incremented by one. The last step can be done by a single store instruction since on most platforms, an aligned store instruction is atomic. Note that all value validations except the last one can be non-atomic, i.e. they can tolerate concurrent committing transactions as long as the value does not change, since a non-atomic value validation only validates the transaction at the logical time of the first read issued as the validation check. For the last validation, concurrent committing transactions must be blocked, since we must guarantee that the transaction is valid during the entire write stage. Otherwise, a committing transaction may overwrite an item X read by the transaction, which is later written by the same transaction, causing a dependency cycle as the current transaction is serialized before the committing transaction by not observing its write, but in the meantime also serialized after that transaction by overwriting its write.

The hardware NORec optimizes over NORec in two aspects. First, if the read phase of transactions can be made atomic via regular HTM (e.g. TSX), then what we need is to simply perform read as an atomic unit, and lock the version counter, and then commit the read phase. This way the read is guaranteed to be atomic with writes, in addition to the fact that all reads performed within the hardware transaction area are also considered as atomic at the commit point, which eliminates read validation entirely. Second, if the write phase of transactions can also be made atomic, then we do not have to maintain a write log (read logs are already eliminated by making read atomic, since reads no longer require validation), since the hardware will buffer writes. We now describe the three types of transactions in hardware NORec with different degrees of atomicity provided by hardware.

The first type is pure hardware transaction, which contains both reads and writes. In the pure hardware transaction,
all reads and writes within the transaction can be considered as happened at the time of the transaction commit, given that no conflict is detected. To synchronize properly with software NORec, the commit should not happen when a software transaction is currently committing, because otherwise it is possible that the write set of the hardware NORec transaction overlaps with both the read and write set of the software transaction, causing a dependency cycle. To achieve this, the hardware transaction first checks whether the global clock is locked before it commits. If positive, the hardware transaction will abort. Otherwise, the transaction increments the global clock (unlocked), and then commits. As an optimization, the hardware transaction should also check whether any software transaction is currently running. If negative (which also adds the counter of software transaction into the read set, such that if any software transaction joins, the hardware transaction will abort), then no increment to the global clock is needed, since no software transaction will observe these changes anyway. The global clock increment before transaction begin is necessary to ensure that software transaction can detect a clock change and then validate its read set.

The second type of transaction is the non-atomic read but atomic write type. In such a transaction, the read phase is entirely performed in software by instrumentation. The transaction begins as in NORec by reading the current value of the global clock and saving it as the local clock. Every read operation must therefore first access memory and then check whether the two clocks still match. In the case of a change, the current transaction will abort without causing value validation, since in this scheme no read log is maintained. In this stage, both committed hardware and software transaction will cause the check to fail. The transaction switches to write mode when the first write is issued. In this case, the transaction acquires the global clock using a compare-and-swap with value of the local clock as the old, and the local clock with the bit set as new. This CAS serves two purposes atomically. First, it checks that the read image has not changed at the point of the CAS by comparing the value of the two clocks. Second, it ensures that no changes to the entire address space can be made after the CAS by setting the lock bit. Then, all loads and stores can be executed directly on the memory address without having to check the clock and to insert entries into the log. The transaction commits by committing the hardware transaction, which publishes all writes to global memory.

The last type of transaction is atomic read and atomic write, provisioned by two independent hardware transactions. It further optimizes over the previous type on read abort rate, since in the non-atomic read type, every read instruction still has to check the value of the global clock to ensure atomicity in software. To achieve this, a hardware transaction is started at the beginning of the transaction. All loads within the hardware transaction are protected, and is guaranteed to happen atomically at the current point in time before the transaction commit. When the transaction issues the first store instruction, we first read the global clock, and then commit the read-only hardware transaction. Reading the global clock before transaction commit guarantees that the clock is also atomic with all loads in the transaction, i.e. they appear to happen in the same logical time. Otherwise, a transaction may commit between transaction commit and the read of the global clock, making reads invalid, which cannot be detected. After that, the transaction switches mode, and begins the write transaction by atomically locking and validating the global clock. The write mode transaction works exactly the same as the second type, which has been described in the previous paragraph.