HeTM: Transactional Memory for Heterogeneous Systems
Link: N/A
Year: PACT 2019
Keyword: HTM; GPU
This paper presents HeTM, a unified transactional memory capable of running on both CPU and GPU. The paper begins by observing that some modern workloads can benefit greatly from the high parallelism of GPGPUs while still requiring transactional semantics, i.e., individual threads must appear to execute in some total order. The paper proposes a scheme for incorporating TM semantics into the GPGPU's SIMT execution model, enabling concurrent execution of transactions on both devices while maintaining global transactional semantics through hierarchical conflict detection. We describe the scheme as follows.
The paper assumes that non-heterogeneous TM implementations exist on both the CPU and the GPU. The paper does not mandate a particular implementation; rather, for a TM implementation to work correctly under HeTM, it must provide the abstraction that transactions on each device are assigned commit timestamps consistent with the logical ordering. In addition, all transactions, including those that eventually abort, must not depend on other transactions that aborted, i.e., aborted state must not affect the outcome of execution. The latter property is critical to HeTM, since transactions on both devices execute locally before they eventually synchronize. If a transaction's outcome depended on uncommitted state, the result of local execution might not be equivalent to the result of an abstract, globally serial history.
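As a rough illustration, the contract HeTM expects from each device-local TM library might look like the following (all names and signatures here are hypothetical, not HeTM's actual API):

```cuda
// Hypothetical sketch of the per-device TM contract HeTM relies on.
// Two properties matter: (1) commits receive timestamps consistent with
// the device-local serialization order, and (2) no transaction, even one
// that later aborts, ever observes state produced by an aborted transaction.
struct TxContext;                                  // opaque per-transaction state

TxContext* tx_begin();                             // start a speculative transaction
int  tx_read (TxContext* tx, const int* addr);     // instrumented read
void tx_write(TxContext* tx, int* addr, int val);  // instrumented write

// Returns a commit timestamp consistent with the logical ordering of
// transactions on this device, or a negative value if the transaction
// conflicted and was rolled back.
long tx_commit(TxContext* tx);
```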
HeTM provides a transactional interface for programmers to wrap the transaction body and its arguments. Instead of the traditional execution model in which the body is executed immediately after the transaction-begin instruction, HeTM adopts a queuing execution model for better scheduling on the GPU. The queuing model wraps the transaction body and arguments into a transaction instance, which is then pushed into a transaction queue. For a heterogeneous system consisting of a CPU and a GPU, three queues are provided: the CPU queue, the GPU queue, and the shared queue. An instance is pushed into one of the first two queues if the programmer specifies device affinity when starting the transaction, and into the shared queue otherwise. A scheduler selects transactions from the three queues and dispatches them to the corresponding device based on the type of the queue and/or the queue length. For the CPU queue, the scheduler selects one instance at a time and dispatches it to an idle core for execution. For the GPU queue, the scheduler waits until there is a sufficient number of instances, and then dispatches the batch to a transaction kernel running on the GPU. The transaction kernel is the GPU version of the transaction body written by the programmer; note that HeTM requires programmers to provide both implementations of the transaction body if transactions are to be executed on both devices. Transactions in the shared queue can be dispatched to either device.
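A minimal host-side sketch of this queuing model, using illustrative names (TxInstance, submit) that are not part of HeTM's published interface:

```cuda
#include <queue>

// One transaction instance: the programmer supplies a CPU body and a
// host-side launcher for the GPU kernel version of the same body.
struct TxInstance {
    void (*cpu_body)(void* args);     // CPU implementation
    void (*gpu_launch)(void* args);   // launches the GPU transaction kernel
    void* args;
};

// The three queues: device-affine transactions go to the CPU or GPU queue;
// the rest go to the shared queue and may run on either device.
std::queue<TxInstance> cpu_queue, gpu_queue, shared_queue;

enum class Affinity { CPU, GPU, Any };

void submit(const TxInstance& tx, Affinity a) {
    if      (a == Affinity::CPU) cpu_queue.push(tx);
    else if (a == Affinity::GPU) gpu_queue.push(tx);
    else                         shared_queue.push(tx);
}
// The scheduler (not shown) pops single instances from cpu_queue for idle
// cores, and drains gpu_queue in batches once enough instances accumulate.
```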
HeTM assumes that transactional metadata is only maintained for a subset of the available addresses. This subset of addresses is called a "Speculative Transaction Memory Region" (STMR). Each device maintains one copy of the STMR in its private memory, such that transactions on each device can execute independently, without data communication, until they synchronize. Note that this assumption causes trouble for pointer-based data structures, since the meaning of a pointer changes if the two STMRs are not mapped at the same base address. As a solution, the paper proposes that pointer operations be instrumented by the compiler, which transforms pointer values into addresses relative to the STMR base. HeTM also assumes that transactional metadata is not co-located with data in the STMR, i.e., transactional metadata can be modified locally without causing conflicts; this is typically true for most HTMs and STMs. HeTM additionally maintains three data structures, one on the CPU side and two on the GPU side. On the CPU side, transactions maintain an append-only write-set log consisting of (address, data, timestamp) tuples committed by CPU transactions. The timestamp is needed to identify the last write to an address that may be written multiple times during an interval. On the GPU side, threads collaboratively maintain two bitmaps: an RS bitmap in which a bit is set if the corresponding word is read by a transaction, and a WS bitmap maintained similarly, which records the addresses to be written. The two bitmaps need not be precise (i.e., they can be Bloom filters), but all aliased addresses must be recoverable quickly given the offset of a bit.
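Concretely, the metadata described above might be laid out as follows (field names and widths are assumptions for illustration; the paper does not fix a layout):

```cuda
#include <cstddef>
#include <cstdint>

// CPU-side append-only write-set log entry. Addresses are stored as word
// offsets relative to the STMR base, so they are meaningful on both devices.
struct WriteLogEntry {
    uint32_t offset;      // word offset into the STMR (relative address)
    uint32_t data;        // value written
    uint32_t timestamp;   // commit timestamp; disambiguates repeated writes
};

// GPU-side bitmaps, one bit per STMR word. They may be imprecise (i.e.,
// Bloom-filter-like), as long as the aliased addresses can be recovered
// quickly from a bit's offset.
struct GpuTxMetadata {
    uint32_t* rs_bitmap;  // bit set if the word was read by some transaction
    uint32_t* ws_bitmap;  // bit set if the word is written by some transaction
    size_t    num_words;  // number of words covered by the STMR
};
```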
HeTM transactions execute locally once they are dispatched. The scheduler monitors execution statistics on both devices and pauses execution periodically for inter-device conflict detection; intra-device conflict detection is performed by the TM library and is out of the scope of this paper. At a high level, the goal of synchronization is to check whether all GPU transactions can be serialized after all CPU transactions. To this end, we must guarantee that data items modified by CPU transactions are neither read nor written by the GPU (write-write conflicts are also disallowed because CPU data is copied to the GPU at the end of the interval, which could silently undo GPU writes). To achieve this goal, at the end of the execution interval, the scheduler sends the CPU write-set log to the GPU, where it is tested against the GPU's own read set (which is a superset of the GPU write set). The paper argues that this algorithm can be made efficient based on two observations. First, most transactions feature a significantly smaller write set than read set, so sending the write-set log over the slower system bus transfers only the most essential part of the CPU transactions. Second, the set intersection test is intrinsically parallel and can be implemented fairly easily as a "validation kernel" on the GPU.
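A minimal sketch of such a validation kernel, reusing the WriteLogEntry layout assumed above (one thread per log entry; the launch configuration is omitted):

```cuda
// Each thread checks one CPU write-log entry against the GPU read-set
// bitmap. A hit means some GPU transaction read a word the CPU modified,
// so the GPU batch cannot be serialized after the CPU transactions.
__global__ void validate_kernel(const WriteLogEntry* log, size_t log_len,
                                const uint32_t* rs_bitmap, int* conflict) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= log_len) return;
    uint32_t off = log[i].offset;
    if (rs_bitmap[off / 32] & (1u << (off % 32)))
        atomicExch(conflict, 1);      // flag the inter-device conflict
}
```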
If validation succeeds, the GPU iterates over the write-set log again and applies dirty data from the CPU to its own STMR. Note that if an address is written multiple times on the CPU side, there will be multiple entries in the write-set log; in this case, the GPU uses the timestamps to determine which write overrides which. In the meantime, the GPU also collects its local dirty data set (which is guaranteed to be disjoint from the CPU dirty data set) and sends it to the CPU. On receiving the updated data from the GPU, the CPU applies it to its own STMR, after which both devices have an identical view of the STMR.
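One way to implement this apply step in parallel (an assumption of this summary, not the paper's stated method) is to pack each (timestamp, value) pair into a 64-bit word so that a single atomicMax retains the highest-timestamp write:

```cuda
// Apply the CPU write-set log on the GPU. Multiple entries may target the
// same word; packing the timestamp into the high bits lets atomicMax
// atomically keep the entry with the latest timestamp. A second pass
// (not shown) unpacks the low 32 bits of each touched word into the STMR.
__global__ void apply_log_kernel(const WriteLogEntry* log, size_t log_len,
                                 unsigned long long* staging) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= log_len) return;
    unsigned long long packed =
        ((unsigned long long)log[i].timestamp << 32) | log[i].data;
    atomicMax(&staging[log[i].offset], packed);   // latest CPU write wins
}
```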
If, on the other hand, validation fails, all GPU transactions within the current interval are rolled back. The CPU write-set log is still applied as described above; the GPU work is then undone by fetching CPU data according to the write-set bitmap. Both STMRs are consistent after the rollback completes.
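A sketch of this undo step under the same assumed layout, with the CPU's STMR copy assumed reachable from the GPU (e.g., via unified or pinned memory):

```cuda
// For every word whose WS bit is set, refetch the CPU's (authoritative)
// copy of that word, discarding the aborted GPU writes.
__global__ void rollback_kernel(const uint32_t* ws_bitmap, size_t num_words,
                                uint32_t* gpu_stmr, const uint32_t* cpu_stmr) {
    size_t w = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (w >= num_words) return;
    if (ws_bitmap[w / 32] & (1u << (w % 32)))
        gpu_stmr[w] = cpu_stmr[w];    // restore the pre-interval value
}
```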
As a special optimization, the paper proposes a double-buffered STMR on the GPU side: a shadow copy is maintained in addition to the main copy. When the GPU executes transactions, only the main copy is modified, while the shadow copy serves as a reference pre-image from before the current interval. When GPU validation fails, the shadow copy is used for recovery by directly applying the CPU write-set log onto it. At the beginning of validation, a newer shadow copy is created while validation is in progress; once it is ready, the next round can start immediately on the GPU without waiting for the validation outcome, essentially overlapping the execution of the next round with the validation of the current round.
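The host-side flow of this optimization might look roughly as follows (stream usage and all helper names are assumptions of this summary):

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helpers standing in for pieces described in the text.
void launch_validation(cudaStream_t s);     // runs the validation kernel
void start_next_interval(cudaStream_t s);   // dispatches the next GPU batch
bool conflict_detected();                   // reads back the conflict flag

void end_of_interval(uint32_t* main_stmr, uint32_t* new_shadow, size_t bytes,
                     cudaStream_t exec_stream, cudaStream_t val_stream) {
    launch_validation(val_stream);

    // Create the newer shadow copy while validation is in flight, then
    // start the next interval without waiting for the verdict. The old
    // shadow still holds the pre-interval image in case recovery is needed.
    cudaMemcpyAsync(new_shadow, main_stmr, bytes,
                    cudaMemcpyDeviceToDevice, exec_stream);
    start_next_interval(exec_stream);

    cudaStreamSynchronize(val_stream);
    if (conflict_detected()) {
        // Validation failed: recover by applying the CPU write-set log onto
        // the old shadow (the pre-interval image); the speculatively started
        // interval is discarded as well.
    }
}
```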