EcoTM: Conflict-Aware Economical Unbounded Hardware Transactional Memory



- Hide Paper Summary
Paper Title: EcoTM: Conflict-Aware Economical Unbounded Hardware Transactional Memory
Link: https://www.sciencedirect.com/science/article/pii/S1877050913003335
Year: ICCS 2013
Keyword: EcoTM; HTM; Directory

- Hide HTM Summary
Conflict Detection: Eager
Conflict Resolution: Lazy
Version Management: Lazy
R/W Bookkeeping: L1/Sepcial Hardware



Back

This paper proposes EcoTM, an HTM design that performs conflict detection using a hierarchy of conflict maps. In a classical Eager-Lazy HTM, conflicts are detected on a distributed manner, usually as a side-effect of cache coherence: Transactionally accessed cache lines are marked by the local L1/L2 cache, and conflicting accesses are detected as a coherence message is issued by another processor. This scheme, simple as it is, faces several challenges that must be addressed by a more practical HTM design. First, transactional states will be lost when speculative cache lines are evicted. A simple solution is to always declare conflict after an overflow, with high probablity of false positives. Second, conflicts are resolved eagerly. A consequence of this is that the transaction that aborts another one might itself be aborted, making the first abort unnecessary. Finally, a substantial amount of metadata has to be stored in the cache, if unbounded transactions are to be supported. The amount of metadata is propotional to the largest of transactions supported, while in practice only a small fraction of them will actually be used for conflict detection.

EcoTM solves the above problems by using a hierarchy of metadata storage, assuming that actual conflicting cache lines are only a small portion of the total number of transactional cache lines. Under this assumption, not all metadata are treated equally. Instead, only metadata that is most recently used will be cached in a small structure which allows fast accessing, while the majority of them stays in a distibuted manner and can be requested only at certain cases.

The first level of the hierarchy is the coherence directory. A cache line can be in the following four states at any given moment: (1) Non-transactional; (2) Speculatively read by a processor but not written into; (3) Speculatively written by a processor but neither read nor written by another; and (4) Conflict has been detected because there were conflicting accesses. EcoTM encodes the extra transactional state of a cache using two bits, called Quick Conflict Filter (QCF), which only adds a constant number of storage per directory entry. Any transactional read/write request from the processor will set QCF into a corresponding state. We discuss the details below.

QCF provides an easy way of detecting whether conflicts have occurred. It is not sufficient to precisely identify which processors are involved in the conflict. The second level in the hierarchy is called Conflict Checker (CC). The CC is organized as a small, fully associative lookup table, in which cache line addresses are used as the tag. Each entry stores two bits per processor in the system: one bit for speculative reads and one for writes. Everytime a conflicting coherence request hits the directory, the corresponding bit in the CC will also be set. Note that for transactional accesses, as long as they do not create conflicts, the CC is not involved. When a conflict occurs, the directory knows the identify of both processors of the conflict, as the directory itself remembers the current owner of the cache line. In practice, since the number of conflicts are relatively small compared with the total number of speculative cache lines, there is no contention on the CC side, as it is only contacted when a conflicting access is detected.

The CC is only an approximation of the full set of conflicting lines, as it has limited capacity (64 entries in the paper). The capacity of CC is also restricted by the fully associative nature of the structure. When the CC overflows, it simply discards a selected entry. The next time a conflict occurs on the same address, the CC needs to multi-cast a request to all owners of the cache line. The corresponding bit will be set after responses from processors are received.

The directory also maintains a Conflict Map, which is a matrix of size (N - 1)2 where N is the number of processors in the system. If bit i of processor j is set, it means that processor j has requested a cache line in a conflicting mode that has been speculatively accessed by processor i. The conflict map entries are set as CC processes coherence requests forwarded from the directory.

On transaction commit, the processor sends a commit request to the conflict map. The conflict map aborts transactions on processors that have the bit set by sending an abort request. Transactions running on other processors must abort immediately as they receive this request. The bits in QCF and CC will also be cleared.

When the private cache overflows, the processor sets a bit to indicate that future requests for the status of any cache should be responded with an affirmative. If speculative states overflow, however, the transaction must immediately abort. An improvement over this simple scheme is to allow speculative states to be overflowed to an external log. The log is maintained in virtual address space by the OS. The hardware writes the address of speculative lines and optionally data (only for dirty lines) into the log on overflows. On receiving a query from the CC, the processor searches the log to determine the status of a cache line.

In many designs where the directory cannot cover the entire address space, it is possible that directory entries are evicted due to limited space (e.g. the directory is maintained by the L3). The QCF must not be discarded in this case, because real conflicts might be missed, and there is no way to reconstruct the QCF using the system’s knowledge. The directory overflow must therefore be handled by storing them into DRAM cells, either taking advantage of the unused ECC bits, or extending DRAM rows to allow a few extra bits. The QCF will be loaded back into the directory the next time an entry is allocated.