Performance Improvement via Always-Abort HTM



- Hide Paper Summary
Paper Title: Performance Improvement via Always-Abort HTM
Link: https://ieeexplore.ieee.org/document/8091221/
Year: PACT 2017
Keyword: HTM; Thread-Level Speculation



Back

Hardware transactional memory can improve the performance of lock-based systems by speculating on the code path that it has not yet been granted to execute. The speculation has an effect of warming up the cache and branch predictor, achieving similar effects as hardware prefetching, with significantly more flexibility. This paper proses an enhanced implementation of TATAS spin lock and ticket lock using a variant of current commercial HTM implementations, where the hardware transaction always aborts. Instead of having threads spinning on the lock, wasting cycles and introducing coherence traffic, the thread begins an always-abort transaction, and performs speculation as if it were executing a hardware transaction. The paper claims that by using always-abort speculation, the performance of lock-based systems can be boosted by at most 2.5 times.

The always-abort hardware transaction memory (AAHTM) design resembles that of Intel TSX. AAHTM_BEGIN starts an always-abort transaction. AAHTM_ABORT and AAHTM_END both abort the current transaction. AAHTM_TEST checks whether the processor is running an always-abort transaction. To distinguish the AA-mode from normal TSX execution, an extra bit is added into one of the control registers. AAHTM_TEST tests the flag and stores the result into a register. Other causes of aborts for TSX, such as cache set overflow, unsupported instructions or exceptions, would abort an AA-transaction as well.

The implementation of TAS spin lock takes advantage of AA-HTM as follows. Instead of spinning on the lock variable, the worker thread begins an AA-transaction if the lock acquisition fails. The AA-transaction runs the critical section speculatively. When the thread compeletes running the critical section, or when the transaction aborts, the thread retries acquiring the lock. The same is repeated if lock acquisition fails again. For a ticket lock, as worker threads enters the critical section in an FIFO manner consistent with the order they are queued on the lock, the thread does not have to immediately start the transaction. Instead, it may spin on the lock variable for a while, and only begins the transaction when the difference between the two variables of the ticket lock are below a threshold. This trick avoids the speculative thread running too early and thus stealing cache lines from the thread currently inside the critical section.

Two optimizations are applicable to the TAS spin lock with AA-HTM. The first optimization reduces the wastage of memory bandwidth when a cache line is prefetched but not used. This happens if a transaction aborts early but the thread fails to acquire the lock, allowing another thread to enter the critical section and invalidates the cache line it prefetches. To solve the problem, the TAS spin lock is coded in a way that promotes the priority of waiting threads that have finished the AA-transaction. The spin lock has a counter whose value represents the number of threads that have completed running the AA-transaction but are still queued to wait. When a thread attempts to acquire the lock, it first checks the value of the counter. The thread does not acqiure the lock as long as the counter is non-zero.

The second optimization uses an alternative code path for AA-HTM. The alternative path preserves the branching and data access patterns of the critical section, but eliminates instructions and system calls that can cause the AA-HTM to fail. Since AA-HTM never changes visible global states after it aborts, using a functionally incorrect but somehow simpler code path improves the opportunity that AA-HTM can run further and hence brings more benefit. The alternative path can be manually crafted, or auto-generated by the compiler.

The authors of the paper also noted that, although AA-HTM can be simulated effeciently using current commercial HTM implementations, especially TSX, certain corner cases may produce unpredictable behavior. Using TSX, AA-HTM is implemented as a hardware transaction that always ends up aborting itself. This, however, may not be guaranteed if the HTM reads an inconsistent value during execution. If an inconsistent value is used as the target address of a virtual function or indirect jump, then the control flow may be directed to a piece of memory where an XCOMMIT instruction is accidently executed, materializing all speculative states. Fortunately, this “lazy subscription” problem did not occur in any benchmark used for evaluation.