The Dirty-Block Index



Paper Title: The Dirty-Block Index
Link: https://dl.acm.org/citation.cfm?id=2665697
Year: ISCA 2014
Keyword: Dirty Block Index; Cache; Write Back




Several cache optimizations rely on dirty bits being fast and cheap to access. This is usually not the case in today’s designs, where the dirty bit lives in the tag array: to reach it, an index must be generated and used to activate all the tags in the set. This paper proposes an alternative way of storing dirty bits. Instead of encoding the dirty bit in the cache coherence state (e.g. the “M” state in the MESI protocol) or as a separate bit per line in each set, the paper introduces the Dirty-Block Index (DBI), a scheme that decouples the storage of dirty bits from the tag store. The DBI stores dirty bits as an array of tagged entries. Each entry holds a bit vector for one DRAM row, where each bit represents the dirty status of one cache block in that row, and the entry’s tag identifies the DRAM row. The whole DBI is organized as a small fully associative cache, so the dirty status of a cache line can be located using its row address and the block’s offset within the row.
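
To make the structure concrete, here is a minimal C++ sketch of the DBI and its lookup path. The geometry (64 blocks per row, 64 entries) and all identifiers are illustrative assumptions, not the configuration evaluated in the paper.

```cpp
#include <array>
#include <cstdint>

// Illustrative geometry: 64 cache blocks per DRAM row, 64 DBI entries.
constexpr unsigned kBlocksPerRow = 64;
constexpr unsigned kNumEntries   = 64;

struct DbiEntry {
    bool     valid = false;
    uint64_t rowTag = 0;                      // identifies one DRAM row
    std::array<bool, kBlocksPerRow> dirty{};  // one dirty bit per block
};

class DirtyBlockIndex {
public:
    // A block address decomposes into (row tag, offset within the row).
    bool isDirty(uint64_t blockAddr) const {
        const uint64_t rowTag = blockAddr / kBlocksPerRow;
        const unsigned offset = blockAddr % kBlocksPerRow;
        for (const DbiEntry& e : entries_)    // fully associative search
            if (e.valid && e.rowTag == rowTag)
                return e.dirty[offset];
        return false;  // no entry for the row: the block must be clean
    }

private:
    std::array<DbiEntry, kNumEntries> entries_{};
};
```

In hardware, the fully associative search would be a single CAM match on the row tag rather than a loop.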

Since the DBI decouples dirty bits from cache tags, the two structures must be kept consistent. The invariant is that a cache line is dirty if and only if the DBI has an entry for the line’s row and the corresponding bit is set. To maintain this invariant, when a dirty cache line is evicted (and written back), the corresponding bit in the DBI is cleared; if all bits in an entry become clear, the entry itself is freed. Conversely, when a cache line becomes dirty, or a dirty line is written back from a higher-level cache, the DBI sets the bit in the corresponding entry. If no entry exists yet, the cache controller allocates one; if the DBI is full, the controller evicts an entry according to the DBI replacement policy. Evicting a DBI entry is always accompanied by a bulk write-back, in which all dirty lines tracked by the entry are written to DRAM so that they remain consistent with the discarded dirty bits. The write-back need not be on the critical path of the eviction, since the controller can buffer the affected cache lines and write them back in the background. Although this bulk write-back seems heavyweight, it completes in considerably less time than writing back the same number of lines from different rows. This is because most DRAMs have a row buffer that accelerates accesses exhibiting locality: when an entire DRAM row is written back, every access after the first hits the row buffer, so the operation finishes quickly.
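
The controller-side bookkeeping can be sketched as below. The types repeat the earlier illustrative geometry so the fragment stands alone; writeBackRow() and selectVictim() are hypothetical hooks, not interfaces from the paper.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

constexpr unsigned kBlocksPerRow = 64;
constexpr unsigned kNumEntries   = 64;

struct DbiEntry {
    bool     valid = false;
    uint64_t rowTag = 0;
    std::array<bool, kBlocksPerRow> dirty{};
};

std::array<DbiEntry, kNumEntries> dbi{};

// Hypothetical hooks into the rest of the controller.
void writeBackRow(const DbiEntry& e);  // buffer the entry's dirty lines and
                                       // write them back in the background
DbiEntry* selectVictim();              // DBI replacement policy (e.g. LRW)

DbiEntry* find(uint64_t rowTag) {
    for (auto& e : dbi)
        if (e.valid && e.rowTag == rowTag) return &e;
    return nullptr;
}

// A line became dirty, or a dirty line was written back from a
// higher-level cache: set its bit, allocating an entry if needed.
void markDirty(uint64_t blockAddr) {
    const uint64_t row = blockAddr / kBlocksPerRow;
    DbiEntry* e = find(row);
    if (!e) {
        for (auto& f : dbi)            // look for a free entry first
            if (!f.valid) { e = &f; break; }
        if (!e) {                      // DBI full: evict an entry; the bulk
            e = selectVictim();        // write-back preserves the invariant
            writeBackRow(*e);
        }
        e->valid = true;
        e->rowTag = row;
        e->dirty.fill(false);
    }
    e->dirty[blockAddr % kBlocksPerRow] = true;
}

// A dirty line was evicted from the cache (and written back): clear its
// bit, and free the entry once no dirty blocks remain in the row.
void markClean(uint64_t blockAddr) {
    if (DbiEntry* e = find(blockAddr / kBlocksPerRow)) {
        e->dirty[blockAddr % kBlocksPerRow] = false;
        if (std::none_of(e->dirty.begin(), e->dirty.end(),
                         [](bool b) { return b; }))
            e->valid = false;
    }
}
```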

Several design decisions must be made when architecting a DBI. The first is its size, characterized by the maximum number of cache lines it can track. Let α denote the ratio between the number of lines the DBI can track and the total number of lines in the cache. The smaller α is, the lower the DBI’s access latency; but if α becomes too small, the write working set of most applications exceeds the DBI’s capacity, and thrashing can occur in which cache lines are written back shortly after being written, only to be dirtied again. The second decision is the entry granularity. The description above implicitly assumes that one entry tracks an entire DRAM row; in fact, an entry can cover less than a row, which is another instance of the trade-off between the row locality the DBI can capture and the write working set it can support. The third decision is the replacement policy. As with cache replacement, there are numerous choices favoring different access patterns and assumptions; the paper claims that a simple Least Recently Written (LRW) policy is sufficient to achieve good performance.
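
A back-of-the-envelope sizing helper makes the first two knobs tangible. Every parameter here (cache size, entry granularity, tag width, the value of α) is an assumption for illustration, not one of the paper’s design points.

```cpp
#include <cstdio>

int main() {
    // Assumed parameters, chosen only for illustration.
    const unsigned cacheBytes     = 2 * 1024 * 1024; // 2 MB last-level cache
    const unsigned blockBytes     = 64;
    const unsigned blocksPerEntry = 64;   // entry granularity (<= DRAM row)
    const double   alpha          = 0.5;  // DBI capacity / cache capacity

    const unsigned cacheLines = cacheBytes / blockBytes;          // 32768
    const unsigned trackable  = (unsigned)(alpha * cacheLines);   // 16384
    const unsigned entries    = trackable / blocksPerEntry;       // 256

    // Rough storage per entry: a row tag (width assumed), a valid bit,
    // and one dirty bit per tracked block.
    const unsigned tagBits   = 28;        // assumed; depends on address width
    const unsigned entryBits = tagBits + 1 + blocksPerEntry;
    printf("DBI: %u entries, ~%u bytes total\n",
           entries, entries * entryBits / 8);   // ~3 KB for a 2 MB cache
}
```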

The DBI enables optimizations that are impossible or hard to implement in an ordinary cache design. One of them is Aggressive Write-Back (AWB), in which dirty cache lines belonging to the same DRAM row are written back together. As explained above, this exploits row buffer locality and is therefore normally faster than writing the lines back as individual requests. On a dirty line eviction, the cache controller looks up the other dirty lines in the same DRAM row and writes them back along with the evicted line; the lookup is fast because the DBI stores dirty bits at DRAM row granularity, so it takes a single entry read. Note that lines written back under AWB need not be evicted from the cache, except the one that triggered the write-back; it is sufficient to simply mark them clean.
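
A sketch of AWB on top of the earlier structures. The find(), writeBack(), and markCleanInCache() hooks are hypothetical glue into the rest of the controller.

```cpp
#include <array>
#include <cstdint>

constexpr unsigned kBlocksPerRow = 64;

struct DbiEntry {
    bool valid;
    uint64_t rowTag;
    std::array<bool, kBlocksPerRow> dirty;
};

// Hypothetical hooks into the rest of the cache controller.
DbiEntry* find(uint64_t rowTag);             // DBI lookup (earlier sketch)
void writeBack(uint64_t blockAddr);          // issue one write to DRAM
void markCleanInCache(uint64_t blockAddr);   // clear dirty state in tag store

// On evicting a dirty block, write back every dirty block of the same
// DRAM row; all accesses after the first hit the DRAM row buffer. Blocks
// other than the evicted one stay resident and are merely marked clean.
void aggressiveWriteBack(uint64_t evictedBlockAddr) {
    const uint64_t row = evictedBlockAddr / kBlocksPerRow;
    DbiEntry* e = find(row);
    if (!e) return;                          // no dirty blocks in this row
    for (unsigned off = 0; off < kBlocksPerRow; ++off) {
        if (!e->dirty[off]) continue;
        const uint64_t addr = row * kBlocksPerRow + off;
        writeBack(addr);
        if (addr != evictedBlockAddr)
            markCleanInCache(addr);
        e->dirty[off] = false;
    }
    e->valid = false;                        // every bit cleared: free entry
}
```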

Another optimization, Cache Lookup Bypass (CLB), allows the cache controller to skip the tag store lookup and fetch a line directly from DRAM when a predictor is confident the access will miss. If the prediction is correct, load latency is reduced, because the cache lookup is taken off the critical path of the load. This does not work for dirty lines, however: bypassing the cache for a dirty line would return stale data from DRAM. Prior solutions are either too complicated or have to live with the latency of activating the tag store just to check the dirty bit. With the DBI, the dirty check becomes trivial: the cache controller simply looks up the DBI with the row tag.
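
The bypass decision then reduces to two cheap checks, as in this sketch; the predictor and access functions are hypothetical stand-ins, and the uint64_t return value stands in for the line data.

```cpp
#include <cstdint>

// Hypothetical hooks.
bool predictorSaysMiss(uint64_t blockAddr);   // miss predictor
bool dbiIsDirty(uint64_t blockAddr);          // the DBI lookup from above
uint64_t fetchFromDram(uint64_t blockAddr);
uint64_t normalCacheAccess(uint64_t blockAddr);

// Cache Lookup Bypass: skip the tag-store lookup only when the predictor
// is confident the access will miss AND the DBI confirms the block is not
// dirty in the cache. A dirty block must take the normal path, since DRAM
// would return stale data.
uint64_t load(uint64_t blockAddr) {
    if (predictorSaysMiss(blockAddr) && !dbiIsDirty(blockAddr))
        return fetchFromDram(blockAddr);   // tag lookup off critical path
    return normalCacheAccess(blockAddr);
}
```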

The last optimization reduces the ECC storage overhead of the cache. A conventional cache keeps an ECC field for every line, dirty or not. This is overkill, because expensive error correction is only required for dirty lines; clean lines can simply be re-fetched from a lower level of the hierarchy if they are found to be corrupted, so an error-detecting code (EDC) suffices for them. With the DBI, ECC is maintained only for dirty lines and is stored alongside the DBI, while the cache itself uses a simpler EDC that detects but cannot correct errors. Both the area and the energy consumption of the cache are reduced by this optimization.
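
For a rough feel of the savings, the arithmetic below assumes SECDED ECC at 8 check bits per 64-bit word and single-bit parity EDC at 1 bit per word (standard code sizes), plus an assumed α; none of these figures are taken from the paper.

```cpp
#include <cstdio>

int main() {
    // SECDED needs 8 check bits per 64-bit word; parity needs 1 bit.
    const double eccOverhead = 8.0 / 64.0;   // 12.5% of data storage
    const double edcOverhead = 1.0 / 64.0;   // ~1.6%
    const double alpha       = 0.25;         // assumed fraction of lines
                                             // the DBI (and ECC) covers

    const double conventional = eccOverhead;                 // ECC for all
    const double withDbi      = edcOverhead + alpha * eccOverhead;
    printf("conventional: %.1f%%  with DBI: %.1f%%\n",
           conventional * 100, withDbi * 100);  // 12.5% vs ~4.7%
}
```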