DICE: Compressing DRAM Caches for Bandwidth and Capacity
Link: https://dl.acm.org/citation.cfm?id=3080243
Year: 2017
Keyword: DRAM Cache; Cache Compression
Dynamic Indexing Cache Compression (DICE) is a technique designed to increase the effective bandwidth of a DRAM cache. Traditional cache compression schemes focus mainly on reducing the storage footprint of a cache line by taking advantage of data patterns. This can indeed increase the hit rate of smaller L1 and L2 caches, because more lines can be stored without paying for a larger SRAM array. For DRAM caches, however, compressing for capacity brings far less benefit. First, DRAM caches are typically a few hundred megabytes or even gigabytes in size, so enlarging their effective capacity through compression has only a marginal effect on the hit rate. Second, a DRAM cache is more flexible than an SRAM cache because it stores both tags and data lines in the DRAM array, which makes it easy for the cache controller to alter the storage format of compressed lines; in SRAM this would require adding extra tag arrays. Furthermore, DRAM caches are usually direct-mapped, because probing multiple tags is costly. All of these subtleties suggest a design perspective different from the one used for SRAM caches.
Instead of compressing for capacity, this paper argues that compressing for bandwidth is what benefits DRAM cache performance. If two cache lines can be delivered in one bus read transaction, and both lines are useful to the computation, then the effective bandwidth of the DRAM cache is doubled. Deciding which two cache lines should be compressed into the same 64-byte slot, however, requires some care. The classical scheme of using the lower bits of the line address (Traditional Set Indexing, TSI) maps spatially adjacent lines into different sets, while the lines that do share a set are far apart in memory and rarely useful together. Even though the row buffer of a DRAM cache spans several KB (2KB in the paper), each access still transfers only a single 64-byte slot, so under TSI the spatial locality of the access stream cannot be converted into bandwidth savings.
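As a concrete illustration, the C sketch below computes the TSI set index for a line address and its neighbor; the set count is an assumption of mine, not a parameter from the paper:

```c
#include <stdint.h>
#include <stdio.h>

#define SET_BITS 22   /* illustrative: 2^22 direct-mapped 64-byte sets */

/* TSI: the set index is simply the low SET_BITS bits of the line address. */
static uint64_t tsi_index(uint64_t line_addr) {
    return line_addr & ((1ULL << SET_BITS) - 1);
}

int main(void) {
    uint64_t a = 0x40ab1200ULL;   /* an arbitrary line address */
    /* Adjacent lines always land in different sets, i.e. different 64-byte
       slots, so a single transfer can never return both of them even when
       both lines compress well. */
    printf("set(a)   = 0x%llx\n", (unsigned long long)tsi_index(a));
    printf("set(a+1) = 0x%llx\n", (unsigned long long)tsi_index(a + 1));
    return 0;
}
```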
Using only the top bits of the line address (Naive Spatial Indexing, NSI) to map lines into sets is also a bad idea. This scheme maps all cache lines from a large contiguous range into one set, so the conflict rate becomes very high: the spatial locality of accesses concentrates most requests on one or a few sets. A conflict occurs when the 64-byte slot does not have enough space to hold a new compressed cache line mapped to the set; the current content of the set must then be written back to main memory before the new line can be stored in the DRAM cache.
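A minimal sketch of one possible reading of NSI (taking the topmost index bits; the address width and set count are arbitrary choices of mine) shows how a sequential sweep keeps hitting a single set:

```c
#include <stdint.h>
#include <stdio.h>

#define SET_BITS  22   /* illustrative: 2^22 sets                    */
#define ADDR_BITS 34   /* illustrative: 2^34 lines of backing memory */

/* NSI (one possible reading): take the topmost SET_BITS bits of the line
   address, so a contiguous run of 2^(ADDR_BITS-SET_BITS) lines shares a set. */
static uint64_t nsi_index(uint64_t c) {
    return (c >> (ADDR_BITS - SET_BITS)) & ((1ULL << SET_BITS) - 1);
}

int main(void) {
    /* A sequential sweep over 4096 consecutive lines indexes into a single
       set, so all of them compete for one 64-byte slot. */
    uint64_t base = 0x123456000ULL;
    uint64_t first = nsi_index(base), same = 1;
    for (uint64_t k = 1; k < 4096; k++)
        same &= (nsi_index(base + k) == first);
    printf("all 4096 lines in one set: %s\n", same ? "yes" : "no");
    return 0;
}
```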
The paper proposes a new mapping scheme, Bandwidth Aware Indexing (BAI), that maps pairs of consecutive cache lines into the same set without collapsing larger ranges the way NSI does. Assume the DRAM cache has a capacity of 2^i lines. The set index for a line address c is computed by concatenating c[i-1:1] and c[i], with c[i] placed at the lowest bit of the index. Under BAI, two consecutive lines (an even/odd pair) map to the same set, while the other lines that can map to that set are far away from each other, exactly as in TSI. A useful property of BAI is that every cache line is mapped either to the same set as under TSI or to the set adjacent to it, since the two indices differ only in their lowest bit. This allows DICE to dynamically choose between the two candidate sets for any cache line and still locate it with a single read of the row.
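The bit manipulation and the two properties described above can be checked with a small C program; the set count is again illustrative:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SET_BITS 22   /* illustrative: 2^22 direct-mapped sets */

/* TSI: set index is the low i bits of the line address, c[i-1:0]. */
static uint64_t tsi_index(uint64_t c) {
    return c & ((1ULL << SET_BITS) - 1);
}

/* BAI: concatenate c[i-1:1] with c[i], placing c[i] at the lowest bit. */
static uint64_t bai_index(uint64_t c) {
    uint64_t upper = (c >> 1) & ((1ULL << (SET_BITS - 1)) - 1); /* c[i-1:1] */
    uint64_t low   = (c >> SET_BITS) & 1ULL;                    /* c[i]     */
    return (upper << 1) | low;
}

int main(void) {
    for (uint64_t c = 0; c < (1ULL << 24); c += 7919) {  /* sampled addresses */
        /* Even/odd neighbours (c and c^1) share one BAI set. */
        assert(bai_index(c) == bai_index(c ^ 1));
        /* The BAI set equals the TSI set or the set right next to it. */
        uint64_t d = tsi_index(c) > bai_index(c)
                   ? tsi_index(c) - bai_index(c)
                   : bai_index(c) - tsi_index(c);
        assert(d <= 1);
    }
    printf("BAI properties hold on the sampled addresses\n");
    return 0;
}
```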
Neither TSI nor BAI works well for all workloads. When cache lines are compressible, BAI provides a clear advantage over TSI because two compressed neighboring lines are likely to share the same 64-byte slot. When cache lines are not compressible, however, BAI utilizes the DRAM cache very poorly, because adjacent cache lines evict each other from the single slot they both map to. TSI, on the other hand, performs better in this case, because adjacent lines are mapped to different sets.
The observations above suggest a hybrid approach in which the cache controller dynamically picks either the TSI set or the BAI set for storing a compressed cache line, depending on whether there is enough space. As stated earlier, BAI maintains the invariant that a line maps either to the same set as under TSI or to the set one slot away. To determine which of the two possible locations a cache line resides in, the cache controller reads the entire row and probes both locations by comparing the tags stored in the row against the tag derived from the address. On a match, all compressed cache lines stored in the set are read out in a single transfer, decompressed, and delivered to the higher-level caches. The read bandwidth of the DRAM cache is effectively doubled whenever more than one cache line is stored in the set.
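The probe path can be sketched roughly as follows; the slot layout, row geometry, and function names are my own simplification, not the paper's exact metadata format:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64

/* One 64-byte slot may hold a single uncompressed line or two compressed
   lines; tags live alongside the data in the same DRAM row. */
typedef struct {
    uint64_t tag[2];          /* tags of up to two resident lines */
    uint8_t  valid[2];
    uint8_t  data[LINE_SIZE]; /* compressed payload(s) or one raw line */
} slot_t;

/* Probe one slot for a tag; on a hit, hand back the whole slot so every
   resident line can be decompressed from the single transfer. */
static bool probe_slot(const slot_t *s, uint64_t tag, const slot_t **hit) {
    for (int w = 0; w < 2; w++)
        if (s->valid[w] && s->tag[w] == tag) { *hit = s; return true; }
    return false;
}

/* Hybrid lookup: after the row is read, check the BAI slot and then the TSI
   slot (they are at most one slot apart, so typically in the same row). */
static bool dram_cache_lookup(const slot_t *row, uint64_t tsi_slot,
                              uint64_t bai_slot, uint64_t tag,
                              const slot_t **hit) {
    return probe_slot(&row[bai_slot], tag, hit) ||
           probe_slot(&row[tsi_slot], tag, hit);
}

int main(void) {
    slot_t row[32] = {0};                       /* one row of slots (count illustrative) */
    row[5].valid[0] = 1; row[5].tag[0] = 0xABC; /* pretend a fill happened */
    const slot_t *hit = 0;
    bool found = dram_cache_lookup(row, 4, 5, 0xABC, &hit);
    printf("hit: %d\n", (int)found);
    return 0;
}
```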
To reduce the latency of tag comparison, the paper also proposes a Cache Index Predictor (CIP) that predicts which of the two candidate sets should be probed first on a read. The prediction is based on the observation that cache lines within the same page usually have similar compressibility. A slice of the page address is used to index the predictor table, and each entry records which set (BAI or TSI) hit the last time a line from that page was accessed; the entry is updated with the outcome of every lookup. For write operations, the set is chosen based on the compressibility of the line being installed.
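A rough sketch of such a predictor, with a table size and hash of my own choosing, might look like this:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CIP_ENTRIES 256   /* illustrative table size */

typedef enum { PRED_TSI = 0, PRED_BAI = 1 } cip_pred_t;

/* Last-hit index scheme, recorded per page-address slice. */
static uint8_t cip_table[CIP_ENTRIES];

static unsigned cip_hash(uint64_t page_addr) {
    return (unsigned)(page_addr ^ (page_addr >> 8)) % CIP_ENTRIES;
}

/* Reads: probe the set that hit last time a line from this page was accessed. */
static cip_pred_t cip_predict_read(uint64_t page_addr) {
    return (cip_pred_t)cip_table[cip_hash(page_addr)];
}

/* After the lookup resolves, remember which candidate actually hit. */
static void cip_update(uint64_t page_addr, cip_pred_t hit_in) {
    cip_table[cip_hash(page_addr)] = (uint8_t)hit_in;
}

/* Writes/fills: the choice follows the line's compressibility instead. */
static cip_pred_t cip_predict_write(bool line_is_compressible) {
    return line_is_compressible ? PRED_BAI : PRED_TSI;
}

int main(void) {
    uint64_t page = 0x7f12a;       /* hypothetical page number */
    cip_update(page, PRED_BAI);    /* last access to this page hit the BAI set */
    printf("next read probes %s first\n",
           cip_predict_read(page) == PRED_BAI ? "BAI" : "TSI");
    printf("compressible fill goes to the %s set\n",
           cip_predict_write(true) == PRED_BAI ? "BAI" : "TSI");
    return 0;
}
```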