2DCC: Cache Compression in Two Dimensions
- Hide Paper Summary
Link: https://past.date-conference.com/proceedings-archive/2020/html/0897.html
Year: DATE 2020
Keyword: Compression; Cache Compression; 2DCC; Deduplication; 2d Compression
Highlights:
-
Observed that the hash table can tolerate both false neg and false pos and therefore the update protocol for the hash table can be relaxed. This simplifies design as the hash table can be updated by its own controller circuit.
-
Combines deduplication with compression to further save storage
Questions
-
At the bottom of page 2 and beginning of page 3, it is said “Because the common case is that incoming lines are unique”. So that implies that deduplication actually only occur for a minority of the case. Does this also imply that deduplication only has marginal improvement?
-
The hash table and full block comparison requires full data array compression, which may overload the data banks and decompression circuit with excessive reads and decompression requests. If deduplication is common (although implied the opposite - let’s assume it is common) then each cache read and fill request will require two, instead of one, accesses to data banks, doubling the effective bandwidth requirement. The same applies for decompression circuit. One solution is to add extra ports and decompression logic at the cost of larger silicon and extra power consumption
This paper proposes Two-Dimentional Cache Compression, 2DCC, a cache compression design leveraging both intra-line and inter-line redundancy. This paper begins by pointing out that both intra-line and inter-line redundancy exist in the working set of typical workloads, as cache lines, as a whole, tend to contain similar to identical content in addition to the more classical redundancy between individual words within a line. Conventional cache compression schemes, however, only takes advantage of intra-line dependency, since it is most natural to process individual lines rather than combing several lines together, as cache lines are the most fundamental unit of data communication between components in the memory hierarchy. On the other hand, prior LLC deduplication schemes proposes LLC deduplication using a hash table to perform fast detection of duplicated blocks. This scheme does not benefit from intra-line redundancy, which is also suboptimal.
2DCC combines conventional cache line compression and deduplication as follows. First, compression is still performed on individual cache lines using BDI, and stored in a decoupled data array. Tags are over-provisioned, and is organized such that one tag can map to an arbitrary data slot in the data array. Second, hash values of the most recent few lines are maintained in a separate hash table, which is also organized as a set-associative lookup table. Cache lines written back from upper levels and fetched from lower levels are hashed and compared with table entries. On a value and data match, the incoming cache line will not be stored. Instead, only a tag entry is allocated, and the tag points to the data slot that has already been in the cache. Lastly, the paper also proposes a simple but yet effective replacement policy for the fully associative data array. Cache lines are evicted not based on recency of usage, but simply based on random sampling and sharer count, as we will see below.
The tag array is still organized as a classical set-associative array indiced using lower bits of the requested block address. The tag array is over-provisioned, such that either more sets are present, or each set has more entries (i.e. higher associativity) than before. In addition to the conventional information such as valid/dirty bit, coherence states and replacement status, each tag also contains a global data array pointer, and two global tag array pointers. The data array pointer indicates the associated data slot for the tag entry, which must be accessed in a serialized manner. The two tag arrays form a doubly linked list among tags that share the same data slot. When the data slot is evicted, all tag entries sharing that slot must be invalidated as well following the doubly linked list. When a single tag is evicted due to other reasons, the doubly linked list helps removing the tag entry from the list in constant time with a few pointer updates.
The data array, on the contrary, is a fully-associative array. Each entry in the array can be addressed by any tag entry. The tag array is also segmented, meaning a physical slot is divided into eight smaller 8-byte segments, each being able to store a full or partial compressed block. To simplify storage management, 2DCC mandates that a compressed block must only be stored in a consecutive range of segments in the same physical slot, and that segments are always made available by evicting existing segments (the paper does not mention compaction). The fully decoupled data array resembles previous proposals such as the V-Way Cache, but the segmented design makes a huge difference between V-Way cache and 2DCC which we describe as follows. In V-Way cache, a data slot is always acquired by either allocating a free block via a free bit vector, or evicting an existing data slot to make free space. In 2DCC, however, since one physical slot can store more than one compressed lines, the search for available slots are significantly more complicated, since the occupancy status of a block is no longer representable by a single free/used bit, but a bitmap showing the occupancy of each segment. Tags are replaced when a new address is to be inserted, but there is no free tag entry in the current set. Tag replacement policy follows whatever the conventional policy is without any change.
To simplify searching for physical slots, 2DCC’s replacement algorithm still observes the division of sets, only when a new compressed line is brought in and seeking for a physical slot with enough segments to store it. The replacement algorithm has three levels, each corresponding two one stage of searching. In the first level, the V-Way cache stype bit mask is searched for a entirely vacant slot. If this could not be found, then four sets will be randomly selected. Sets are small partitions of the tag array as in the conventional cache (set size is not defined, though). Then at the second level, each block in the four sets are examined, and the first block that contains sufficient segments is selected as the slot for holding the compressed line. At the third level, one of the blocks in the four sets are selected as the victim, and compressed blocks are evicted. The criterion for selection is to minimize the number of invalidated tags, which needs to consider both block sharing and block compression. Note that the paper also points out that recency- or frequency-based replacement algorithms are not suitable for data array replacement, since tag replacement has already taken either or both into consideration.
To help tracking potentially multiple tags of a single block, each segment is also associated with a status field indicating the state of the segment (free, occupied, head of block, etc.), and back pointer to the head of the linked list. A reference count field is also maintained for the eviction algorithm to determine when segments can be freed and/or whether to invalidate the segment on data slot replacement. Although not mentioned in the paper, each segment should also track the size of compressed data, since data slot replacement requires this field to compute replacement. All additional metadata is only maintained for the head segment of a compressed block to reduce update complexity.
Deduplication is performed with a separate hash table which tracks the hash value of the most recent blocks. The hash table is organized as a set-associative lookup table, using hash values as the lookup key. Each entry contains a hardware pointer to the corresponding data slot and the segment. On a hash table hit, the compressed block in the segments is fetched and decompressed, and then compared with the incoming block. After a full block comparison, if the two blocks are entirely identical, then the incoming block is discarded, and a new tag entry is allocated which points to the hash table entry. The reference count of the header segment is also incremented. When new hash values are inserted, an existing hash value is evicted if no free entry can be found. The replacement of the hash table entries is entirely decoupled from cache replacement. In other words, the content of the hash table only needs to be kept roughly consistent with the cache, without affecting correctness. In fact, the hash table can tolerage both false negative and false positive, allowing a more flexible update protocol. For example, when a dirty line is written back, the hash table need not be updated instantly, and multiple updates can be conducted in a batch to amortize the cost. False positives can be ruled out by full block comparison, which is always required with hash tables and therefore does not pose an extra overhead. False negatives only cause duplicated blocks to be stored in the data array, resulting in slightly decreased effective size compared with the optimal case. The paper recommends using an 1024-entry hash table with 16 ways.
The read and update protocol is similar to the one of V-Way cache and other segmented cache designs except for the following case. If a line is updated by dirty value, and the header segment’s reference count is greather than one, then a copy-on-write style update begins by first evicting segments using the data array replacement algorithm as discussed above. The reference count is also decremented, and the tag is removed from the doubly linked list before inserted into the new block’s linked list as the sole element. No tag entry is evicted in this case since the entry already exists and shared a block with some other tags.