Hardware Transactional Memory for GPU Architectures
Link: https://dl.acm.org/citation.cfm?id=2155655
Year: MICRO 2011
Keyword: HTM; GPU; KiloTM
This paper proposes KiloTM, a hardware transactional memory design for GPGPUs. The paper identifies several benefits of introducing HTM to GPGPUs. First, GPGPU workloads are heavily threaded. Interactions between threads are difficult to manage, which poses a great challenge for debugging and program verification. Second, the SIMT execution model used by GPGPUs can cause deadlock under fine-grained locking: threads waiting for a lock that has already been taken by another thread can be permanently blocked, because SIMT introduces implicit control dependencies between threads. GPGPUs do not attempt to solve this problem; rather, they offload the responsibility of finding synchronization bugs to programmers. In this situation, HTM can be of great help, since it is non-blocking and enables fast prototyping of concurrent data structures at the cost of slightly slower execution.
Although numerous HTM designs have been proposed, and a few have already been implemented on commercial CPUs, designing a working HTM for GPGPUs is still a challenging task due to the different data access patterns of GPU programs. The paper lists several challenges. First, while traditional CPU-based HTM can maintain transactional state in the private L1 cache at cache line granularity, this can result in massive false sharing for GPU workloads, since transactions from different threads tend to access adjacent words within the same cache line. Second, while CPU-based HTM can detect conflicts using the existing coherence protocol, GPUs typically do not have coherent L1 caches. In fact, the paper points out that the private cache of a SIMT core is only used for private data; accesses to global data bypass it. Broadcasting all accesses on the bus is not acceptable either, since every thread would send a message to all others on each memory instruction, incurring O(n^2) traffic for n threads. The last challenge is commit overhead. For massively threaded GPU applications, transaction commit can easily become a bottleneck if transactions are validated serially. Even if some form of serialization is inevitable to ensure correct ordering, a properly designed commit algorithm should reduce this extra serialization to a minimum.
KiloTM involves three major hardware changes. The first change extends the SIMT execution model to allow transactions to be retried after they abort. In a classical stack-based SIMT execution model, any diverging instruction, such as a branch, pushes two records onto the SIMT stack, each representing an execution path controlled by a bit mask. In KiloTM, the ISA is extended with transaction begin and transaction commit instructions, both of which are diverging. On transaction begin, two records, R and T, are pushed onto the stack, with the PC pointing to the instruction after transaction begin and the RPC undefined. The R record is pushed first and records transactions that failed in the current round; its bit mask is initialized to empty. The T record is pushed after the R record with an all-set bit mask, and the SIMT core immediately starts executing the transaction body using the T record. On commit, transactions are validated: threads whose transactions commit successfully are removed from the bit mask, while those that abort set their corresponding bits in the bit mask of the R record. The transaction commit instruction pops the T record off the stack and pushes a new T record initialized from the R record, after which the bit mask of the R record is cleared. This allows transactions that failed validation to be retried until they succeed, at which point both the T and R records have empty bit masks. In this case, both records are popped from the stack, and execution continues with the next entry on the stack.
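The stack manipulation can be summarized with a short sketch. Below is a minimal Python model under simplifying assumptions (lane masks as Python sets, a validate_lane callback standing in for commit-time validation); the names are illustrative and not the paper's hardware interface.

```python
class Record:
    def __init__(self, pc, mask):
        self.pc = pc          # where this path resumes execution
        self.mask = mask      # set of active thread lanes

def tx_begin(stack, body_pc, active_mask):
    # R record (pushed first): accumulates lanes that abort this round.
    stack.append(Record(pc=body_pc, mask=set()))
    # T record: all currently active lanes start the transaction body.
    stack.append(Record(pc=body_pc, mask=set(active_mask)))

def tx_commit(stack, validate_lane):
    t = stack.pop()           # current T record
    r = stack[-1]             # R record sits directly below T
    for lane in t.mask:
        if not validate_lane(lane):
            r.mask.add(lane)  # failed validation -> retry next round
    if r.mask:
        # Retry aborted lanes: new T record initialized from R, R cleared.
        stack.append(Record(pc=r.pc, mask=set(r.mask)))
        r.mask.clear()
    else:
        stack.pop()           # pop R too; warp resumes after tx_commit
```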
The second major hardware change of KiloTM is version management. Instead of maintaining read and write sets using bits in the cache tags, KiloTM stores reads and writes as (addr, value) tuples in per-thread logs backed by DRAM. To leverage caching and the private scratchpad of the GPGPU, log entries generated by the same instruction are mapped to one consecutive chunk of memory, such that the write coalescing hardware can efficiently stream these entries back to DRAM in one batch. While the transaction executes, no modification is made to shared memory. In order for a transaction to read its own dirty values, a per-thread bloom filter is added to detect read-after-write within the same transaction; the write log is walked to forward dirty data to the read instruction when necessary.
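As a rough illustration, here is a Python sketch of per-thread version management; the log layout, BLOOM_BITS, and the hash function are illustrative assumptions, and real hardware coalesces log writes as described above.

```python
BLOOM_BITS = 64  # illustrative filter width

class TxThread:
    def __init__(self, memory):
        self.memory = memory      # shared memory (unmodified until commit)
        self.read_log = []        # list of (addr, value) tuples
        self.write_log = []       # list of (addr, value) tuples
        self.bloom = 0            # per-thread filter of written addresses

    def tx_write(self, addr, value):
        # Buffer the write in the log; shared memory is untouched.
        self.write_log.append((addr, value))
        self.bloom |= 1 << (hash(addr) % BLOOM_BITS)

    def tx_read(self, addr):
        # Bloom filter hit => possible read-after-write; walk the write
        # log backwards to forward the newest dirty value.
        if self.bloom & (1 << (hash(addr) % BLOOM_BITS)):
            for a, v in reversed(self.write_log):
                if a == addr:
                    return v      # forwarded from own write log
        value = self.memory[addr]
        self.read_log.append((addr, value))
        return value
```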
The last major change is the addition of validation hardware. KiloTM adopts backward OCC with value-based validation. Transactions are assigned commit timestamps (cts) by a centralized counter when they attempt to validate at the end of the transaction. In general, a transaction can commit successfully if its read set is not written by any overlapping transaction during its execution. In a naive serial commit scheme, transactions are validated by the validation hardware one at a time. A validating transaction walks its read set and compares the value in the read log against the current value in shared memory. If all values match, the read set is considered consistent, and the transaction commits. This scheme, however, introduces intolerable delay, as transactions are serialized at the commit point. To increase parallelism, two optimizations are introduced into KiloTM. In the first optimization, the address space is partitioned into several parts, each with a piece of the validation hardware processing requests within its address range. This shortens the delay of processing a single validation, but still does not increase parallelism. The second optimization allows transactions to be validated in parallel, in a pipelined manner. One potential problem is that if in-flight transactions conflict with each other (i.e., write-after-read), the conflict cannot be detected, since the write set is only published to shared memory at the last stage (if the transaction commits). Such conflicts are called "hazards", and a separate hazard detection stage is added to the pipeline after the initial value validation. To conduct hazard detection, transactions keep two extra pointers when they start validation. The first pointer is RCID, which points to the oldest transaction that is still in the pipeline when the current transaction begins validation. Writes to shared memory by transactions older than RCID are already visible and can always be detected by the value validation stage, and hence are ignored by hazard detection. The second pointer is YCID, which identifies the youngest transaction that may have a WAR conflict with the current transaction; it is set by the hazard detection stage. If the YCID pointer is not null after this stage, the transaction waits until the transaction whose cts equals YCID commits or aborts, and then re-validates. This gives the transaction a second chance to check for conflicts, which reduces spurious aborts in the presence of false positives from the hazard detection stage. The second validation is non-speculative and must be precise (by contrast, the first validation is speculative and can even be conducted with a partial read set). Transactions commit at the last stage if the second validation succeeds, during which the write set is written back to DRAM.
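Putting the stages together, the following simplified Python sketch traces one transaction through the commit pipeline; lwh_lookup, wait_until_done, and oldest_inflight are schematic stand-ins for the hazard detection hardware and pipeline bookkeeping, not the paper's actual interfaces.

```python
def validate(tx, memory):
    # Value-based validation: every logged read must still match memory.
    return all(memory[addr] == value for addr, value in tx.read_log)

def commit_pipeline(tx, memory, lwh_lookup, wait_until_done, oldest_inflight):
    tx.rcid = oldest_inflight          # oldest tx still in the pipeline
    tx.ycid = None                     # youngest potential WAR hazard

    if not validate(tx, memory):       # stage 1: speculative validation
        return False                   # abort

    for addr, _ in tx.read_log:        # stage 2: hazard detection
        writer = lwh_lookup(addr)      # approximate cts of last writer
        if writer is not None and writer > tx.rcid and \
                (tx.ycid is None or writer > tx.ycid):
            tx.ycid = writer           # track the youngest hazard

    if tx.ycid is not None:
        wait_until_done(tx.ycid)       # wait for the hazard tx to finish
        if not validate(tx, memory):   # second, precise validation
            return False               # abort

    for addr, value in tx.write_log:   # final stage: publish write set
        memory[addr] = value
    return True                        # committed
```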
The hazard detection stage uses a lookup table called the Last Write History (LWH) table. Conceptually, the LWH table outputs the cts of the last writer of a given memory address, though in reality it does not have to give precise information. Transactions in the hazard detection stage query the LWH table with the addresses in their read sets and compute the YCID. If the output of the LWH table is older than RCID, the value is ignored, as explained above. If the output is younger than the RCID and also younger than the current YCID (YCID is initialized to null, i.e., older than everything), then the YCID is updated to the output value, meaning that the read address may have been updated by an in-flight transaction, and that the current transaction should wait until that updating transaction finishes before the second validation. To this end, the LWH table only needs to approximate the last writer, such that the output cts for an address is younger than or equal to the actual cts of the transaction that last updated the address. The LWH table consists of two components. The primary component is a mapping table implemented as a set-associative cache, which stores precise information about the last writer. New entries are added to this structure when a transaction completes the hazard detection stage. As new entries are added, existing entries are evicted into a secondary component called the recency bloom filter. Evicted addresses from the primary component are hashed into one of the slots of the bloom filter, and the slot is updated if the cts stored in it is older than the evicted cts. Although some addresses will alias, this scheme ensures that the bloom filter always returns either the precise cts or a younger one.
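A possible software analogue of the LWH table, assuming an LRU mapping in place of the set-associative cache and purely illustrative sizes:

```python
from collections import OrderedDict

class LastWriteHistory:
    def __init__(self, capacity=256, filter_slots=1024):
        self.table = OrderedDict()       # precise addr -> cts, LRU order
        self.capacity = capacity
        self.filter = [0] * filter_slots # recency bloom filter of cts

    def record_write(self, addr, cts):
        # Called as a transaction's writes pass hazard detection.
        self.table[addr] = cts
        self.table.move_to_end(addr)
        if len(self.table) > self.capacity:
            old_addr, old_cts = self.table.popitem(last=False)
            slot = hash(old_addr) % len(self.filter)
            # Keep the youngest cts per slot so lookups stay conservative:
            # aliasing can only make the last writer appear younger.
            self.filter[slot] = max(self.filter[slot], old_cts)

    def lookup(self, addr):
        if addr in self.table:
            return self.table[addr]      # precise answer
        # Approximate answer, guaranteed >= the actual last-writer cts
        # (0 means no recent writer was recorded for this slot).
        return self.filter[hash(addr) % len(self.filter)]
```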
The paper also points out that transactions may access inconsistent data, since reads are not validated until the commit point. Although most such transactions will simply abort when they reach validation, in some corner cases a transaction may be trapped in undefined behavior, such as an infinite loop. To prevent such doomed transactions from blocking the entire warp from making progress, a watchdog timer is configured to periodically interrupt transaction execution and trigger validation. Transactions trapped in infinite loops thus get a chance to validate and abort early.
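The watchdog mechanism can be sketched as follows, with TIMEOUT_CYCLES and the step/validate_early callbacks as illustrative assumptions rather than the paper's interface.

```python
TIMEOUT_CYCLES = 10_000  # illustrative watchdog threshold

def run_transaction(tx, step, validate_early):
    # Execute tx one instruction at a time; force an early validation
    # whenever the watchdog expires, so a doomed transaction (e.g., one
    # looping on inconsistent reads) can detect its conflict and abort.
    cycles = 0
    while not tx.done:
        step(tx)                      # execute one instruction
        cycles += 1
        if cycles >= TIMEOUT_CYCLES:
            if not validate_early(tx):
                return False          # abort; retry via the SIMT stack
            cycles = 0                # reads still consistent; keep going
    return True
```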