Last-Level Cache Deduplication
- Hide Paper Summary
Link: https://dl.acm.org/doi/10.1145/2597652.2597655
Year: ICS 2014
Keyword: Cache; LLC; Deduplication
Highlight:
- Hardware assisted hash table is efficient for dedup or searching since multiple buckets could be read at the same time. It is essentially just a reduced form of set-associative cache.
This paper proposes deduplicated last-level cache for increasing effective cache size. Previously, researchers have been proposing techniques such as cache compression on block basis to reduce the size of a cache line, increasing effective cache size. This paper adopts a different approach by identifying the potential of cache deduplication, a technique for detecting iddentical cache lines and avoiding storing multiple physical line with the same content. The paper begins by making the claim that in scientific computation workloads, cache lines are likely to contain duplicated data for several reasons. First, some workloads tend to copy the same piece of data around for the purpose of initialization or interfacing with libraries requiring input/output buffer (e.g. network protocol or file system operation). Applications that extensively use memcpy() also creates duplicated lines on different physical locations. The second cause for duplication is symmetry of the program input. More than often, scientific applications begin their computation with highly regular data consisting of duplicated patterns. Such pattern can also be exploited to reduce cache consumption. The paper also noted that most of these duplicated lines are not pure zeros. Simply detecting zero lines will not work well in the majority of the cases.
The paper then identifies a few challenges of designing a cache deduplication scheme. First, the deduplication hardware should be carefully designed, such that duplications can be detected efficiently, without incurring high cycle penalty, power consumption, or area overhead. Second, The deduplication granularity should be carefully selected to reach a balance between the probablity that duplications are detected, and the hardware overhead of fine-grained deduplication. The last challenge is that the cache layout should be modified to allow many-to-one mapping between tags and data slots. In addition, there shold be more tags than data slots in order to support a large effective cache size.
We next introduce the cache organization as follows. As discussed above. the deduplicated cache decouples tag array and data array from the one-to-one static mapping scheme, adding one extra level of indirection in tag address to allow more than one tags being mapped to one data slot. To achieve this, each tag array is extended with an extra “data pointer” (dptr) field, which points to the data slot that stores the content implied by the address tag. To help bulk invalidation of tags when the shared data slot is invalidated, tags are also linked together with other tags that share the same data slot using a doubly linked list. Two more pointers, called “tag pointers” (tptr), are added to each tag for finding the successor and predecessor of nodes in the linked list. Conventional per-line metadata, such as coherence states and status bits, are not changed.
Although the paper does not mention over-provisioning of the tag array, it is necessary to provide more tags than data slots in order to exhibit a larger effective cache size. The paper suggests later in evaluation section that doubling the number of ways or the number of sets both work. In the prior case, more tags are accessed in parallel which can increase latency and energy consumption. In the latter case, the indexing function must be changed to use one more bits from the input address, while one less bit is stored in the tag.
The data array is entirely decoupled from the tag array, and becomes fully associative, enabling any tag in the tag array to address any data slot in the data array. This is mandatory in deduplicated cache, since duplicated cache lines are likely not in the same set. Data slots are extended with three extra fields. The first field is a reference counter tracking the number of tags pointing to the data slot. The second field is a “tag pointer” (tptr) field. This field is used differently based on the status of the data slot. If the data slot is free, indicated by a value zero in the reference counter, the tptr field points to the next free data slot, forming a free list, the head of which is stored in a special register in the LLC controller. When an empry data slot is needed, one is acquired from the free list, if it is not empty. If the data slot is not free, then the tptr field points to the head of the tag doubly linked list that shares the data slot. When the data slot is to be invalidated, the LLC controller follows this link to find all tags that should be invalidated. The last new field is a one bit deduplication status field. If this field is set, the slot has already been checked for deduplication. Otherwise, the check has not been performed, and the LLC controller should deduplicate the block when it is idle.
A hash table is also added to help identify duplicated blocks. The hash table is organized as a 16-way set-associative cache with unknown number of sets (it is configurable), and it functions as a content-addressable memory (CAM). Each way consists of two fields: full hash value and pointer to the data slot. Each set in the hash table is used as a bucket for resolving conflicts. All ways within a set is checked in parallel when the hash table is queried. The hash table query uses the content of the line to be deduplicated. The line is first hashed into a 15 bit value with XOR gates. Then a bucket is selected using partial bits in the hash value, after which all bucket entries are checked against the full hash value. A hit is signaled if any of the full hash values stored in the bucket matches with the input value. A hash matching itself, however, is not sufficient. To determine whether the hit really indicates a duplicated line, the data pointer is used to find the content and a full cache line comparison is performed. Otherwise, if the cache line misses in the hash table, it will be inserted into the table. In rare cases, all buckets will be full on such insertions. The paper suggests that one of the entries in the bucket with data slot reference counter being one could be evicted (and so do the tag and data). If, however, all data slots have reference counter larger than one, the insertion is aborted, but the cache line is not evicted. The paper claims that although this strategy loses some chances of deduplication, the overall impact would be minimum, since hash collisions only occur infrequently.
Read operations on the deduplicated LLC is almost the same as a conventional LLC. The only difference is that tag read and data read are serialized, since the controller should first find a matching tag, and then access the data array with the data pointer (some caches already implement this to reduce energy consumption). On a cache miss, the controller either evicts a tag or a data slot to make space for the fill request. Note that since tag and data storage are decoupled, their evictions also use different strategies. A tag is evicted if there is no free tag in the current set. When evicting a tag, the LLC controller needs to unlink it from the doubly linked list, and decrement the reference counter of the data slot it addresses. If the reference counter drops to zero, the data slot is also evicted. If there is empty tag, but no free data slot exists, which is possible since there are more tags than data slots, one data slot is found and evicted. All tags associated with that data slot are also found and evicted by following the tag pointer and linked list pointer.
While tag eviction uses the conventional algorithm, data eviction could not use the same per-set eviction algorithm, since the data array is fully associative. The paper proposes that the LLC controller selects four random locations in the data array when it is full, and evicts the one with only a single reference count. If all four are referenced by multiple tags, then the one with the minimum count is evicted. The paper noted that for most the time, one random probe is sufficient for finding the eviction candidate. In the rare case where multiple probes are needed, the paper argues that it must be because only few cache lines are deduplicated, implying high chance of fining a candidate in the first few probes, since most of the blocks are not reused.
Write operations are handled differently from reads, since write could change the content of the line. Write misses are handled just like read misses, except that the new line is acquired from lower level with write permission. Write hits, however, could not directly change the content of the line. Instead, the reference counter is checked. If the value of the counter is larger than one, indicating a deduplicated line, the LLC controller should treat this as a line fill, potentially evicting a data block, and duplicate the line before it is updated.