Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching



- Hide Paper Summary
Paper Title: Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching
Link: https://dl.acm.org/doi/10.1145/2540708.2540715
Year: MICRO 2013
Keyword: Compression; Cache Tags; DCC



Back

This paper proposes Decoupled Compressed Cache (DCC), a tag mapping framework for compressed LLC featuring low area cost and high space utilization. The paper points out that previous segmented tag mapping schemes have three major issues. The first is internal fragmentation, which happens when compressed blocks could only be aligned to a certain segment boundary, as in segmented cache designs. Blocks could not take advantage of every byte of the physical segment allocated to it, wasting the rest storage of the segment. The second issue is external fragmentation, leading to suboptimal utilization of the segments if compressed blocks are required to be stored in consecutive range of blocks. Even if the total number of bytes in free segments are sufficient, these segments may not form a consecutive range that is large enough to hold the block, requiring compaction of existing blocks. On modern cache architecture, segments are usually implemented as data array banks. Compaction a set would require reading and writing all the banks, causing extra power consumption and head dissipation. The last issue is tagging overhead. In conventional segmented cache schemes, the maximum number of compressed blocks that can be stored within a set is bounded by the number of tags per-set. Over-provisioning of tags are required if more compressed blocks are expected to be stored in the set. More tags per-set, however, have a nagative impact on power and access latency, since the larger the tag array, the more power and cycles it takes to access the array on every access.

This paper takes advantage of two observations. First, LLC still preserves the spatial locality of access. Given a super block whose size is four regular 64 byte blocks and is aligned to four-block boundary, in most cases, more than 1 regular blocks within the super block are cached by the LLC. The second observation is that external fragmentation can be easily solved by making the segment array fully-associative. As a result, a compressed block does not have to be stored in a consecutive range of segments. Instead, the block can reside in several non-continuous segments, which will be concatenated together by the block read circult on cache accesses.

We next describe DCC in details. Similar to segmented caches, the DCC allows arbitrary mapping between tags and segments within a cache set. The physical storage of the set is divided into 16 byte segments. Tags are not over-privisioned. For a n-way set-associative cache, there are still n address tags. These tags, however, now store addresses of super blocks, rather than regular blocks. As discussed above, a super block consists of four regular blocks, which is also aligned to the four-block boundary. The lower two bits of the block address is used as the block offset instead of forming the set index. Set index bits are extracted after the block offset.

Recall that DCC allows fully associative mapping between tags and segments. To this end, the tag no longer have a pointer to the beginning segment in which a compressed block is stored. Instead, each segment is described by a back pointer field consisting of two fields. The first field, tag index, indicates the super block tag the segment is allocated to. The second field, block offset, indicates the offset of the regular block of which compressed data is stored in the segment. Block data is still required to be stored in-order, i.e. if a compressed block is stored in more than one segments, the cache controller must follow a certain order to combine these segments into a full block. The paper suggests that these back pointers be stored in a separate bank, which could be access in parallel with super block tags during cache accesses. Valid bits are maintained per-segment to aid segment allocation. Coherence states and compression states are maintained per-block within the super block tag. The paper assumes a super block size of four, and therefore, each super block tag contains four copies of coherence states and compression type fields.

On cache accesses, the cache controller first computes the set index by extaring bits from the requested address, and then read out all super block tags within the set. The requested address is then compared with tag addresses. In the meantime, the block offset is also extracted from the requested address, and the back pointers of the same set are also accessed in parallel with address tags. The cache controller also compares the block offset field with block offset from the address tag. A cache hit is signaled, if (1) one of the n super block tags match the requested address; (2) The extracted block offset matches at least one block offset field in back pointers; (3) The tag ID of the matching back pointer entry in condition (2) is the same as the matching tag in condition (1). These three conditions can be checked using simple combination logic after a comparator array, which does not add significant latency to the cache access logic. On a cache hit, the controller circuit reads out all segments that match both the tag ID and the block offset ID in the previous step, and concatenates these segments together to form the compressed line. The compressed line is then sent to the decoding logic for decompression.

Three types of cache misses could happen during an access. The first type of miss is signaled when there is a super block tag match, but none of the segments match condition (2) and (3). In this case, the super block tag is preserved, but one of the four blocks are evicted. The cache controller selects the block to be evicted using some replacement algorithms, and then invalidate all segments (with possible write backs) with tag ID and block offset matching the block to be invalidated. The second type miss occurs when none of the tags match the requested address. Condition (2) and (3) does not matter in this case. The cache controller will evict one of the n super blocks, together with all segments with a matching tag ID (block offset is ignored in this case). There should be two independent replacement algorithms, one within a super block, and the other for super blocks as a whole. The paper, however, does not give any detail on the algorithm.

The last type of cache miss is unique to compressed architecture, which only happens during a cache write. If the size of the compressed block increases after a cache write, such that one or more segments are needed to store the compressed block, the controller also needs to signal a miss, and evict one of the existing segments in the current super block. Similarly, if the size of the compressed block, in the number of segments, becomes smaller, the controller should deallocate the last segment used by the same block by marking it as invalid (e.g. setting the invalid bit).

As the baseline DCC design reduces external fragmentation and tagging overhead by allowing fully associative segments and super tags, the paper also proposes Co-DCC to further reduce internal fragmentation caused by compressed blocks not fully utilizing the trailing bytes of a segment, if the compressed block size is not a multiple of segment size. Co-DCC works by allowing blocks within a super block to be stored compactly without havig to align each compressed block to the beginning of a segment. By loosening this constraint, now each segment could contain potentially four blocks from the same super block (different super blocks could not share a segment to simplify cache lookup), and the super block tag should store the starting byte offset of the block in the segment for each block. Co-DCC extends the per-block state tags with a 7 bit “begin offset” field to indicate the byte address of the compressed block in the first segment. If the compressed block could not be accommodated by the single segment in which it starts, it is also assumed that in the following segments (not necessarily the same one, since segments are fully associative), the remaining data of the block starts at offset zero, i.e. there must not be no “hole” inside compressed block data. In addition, to track where the last block of the super block ends, the super block tag is also extended with an “end” field, which stores the last byte of the current super block. To support fully-associative tag mapping and storing multiple blocks within one segment, now each segment back pointer consists of the tag ID field, which is unchaned from the baseline DCC, and a bit vector field, “sharers”, which tracks the blocks that are stored in the segment. On a cache lookup, a block can be read from the segments satisfying the following conditions: (1) Tag ID matchs the tag of the requested address (assuming the tag also hits); (2) the segment back pointer has the block set in the sharer list. The compressed block data begins in the first segment with the sharer bit set, at byte offset indicated by “begin” field, and ends in the last segment with the sharer bit set. The actual size of the block should be indicated by the compression type field. Note that the “end” field is not necessary in reading the blocks from a super block. When allocating new space from a partially filled segment, however, the “end” field is used to identify the beginning offset of the next block, which will then be adjusted.