Yet Another Compressed Cache: A Los-Cost Yet Effective Compressed Cache



- Hide Paper Summary
Paper Title: Yet Another Compressed Cache: A Los-Cost Yet Effective Compressed Cache
Link: https://dl.acm.org/doi/10.1145/2976740
Year: TACO 2016
Keyword: Compression; Cache Compression; YACC



Back

Highlight:

  1. Using super blocks to implicitly generate addresses on lookups without adding more tags, since the address of cache lines within a super block is easily derived from the base super block address.

  2. Using different associativity for 4:1 case and 2:1 or uncompressed case. 4:1 blocks are statically mapped since each block can be statically allocated a 16B slot. 2:1 and uncompressed case requires fully associative mapping between block ID and physical slot offset since not all blocks in the SB can be allocated physical storage. Using different layouts for different SB cases saves some bits for the tag.

  3. Using flexible layout for metadata field rather than fixed location for each field, no matter whether it is used in the current configuration or not.

Questions

  1. I understand the motivation of this article as being a compromised design between complicated SCC/DCC and uncompressed cache. But without these motivations YACC is just a really common super block compressed cache with little novelty involved.

This article proposes Yet Another Compressed Cache (YACC), a compressed Last-Level Cache (LLC) design that improves over the previous overly complicated DCC and SCC designs. The paper begins by pointing out that both DCC and SCC designs are overly complicated and thus difficult to be adopted into commercial systems for several reasons. First, DCC decouples the static mapping between tag array and data array entries as in the conventional cache, relying on forward and backward pointers to read data associated with a tag entry and/or locate the tag for given compressed slot. This not only increases area and power consumption, but also incurrs extra latency on the critical access path. Second, SCC takes advantage of different compression ratio of blocks within the same supre block (SB). Ways in the LLC are further divided into smaller way-groups, with each way-group only eligible for storing compressed blocks in the SB of a certain size class, called Compression Factor (CF). By dispatching blocks with different CFs to different way groups, the data slot layout within a way group for that SB is fixed, which eliminates complicated mechanism for placing compressed blocks in the same physical slot, since all slots sharing a physical slot must be of the same CF. In addition, to further eliminate the overhead of having to identify blocks in an SB for a physical slot, SCC only maps adjacent blocks of the same CF to the same physical tag. Non-adjacent blocks, even with identical CFs, will be mapped to different sets in the way group. Finally, to offset the fact that way group design reduces effective associativity for a certain address, SCC also adopts the skewed cache design which uses different hash functions in different ways to scatter addresses that would have conflicted in a set to distinct sets. The article argues that, with all three levels of skewness in SCC, the decoder array is much more complicated than in a conventional cache, increasing design and verification cost. Besides, as with all skewed cache designs, conventional replacement algorithms cannot be applied, since skewed caches do not honor the concept of sets, set-based replacement algorithms such as LRU or RRIP will not work, further complicating the design since a new algorithm has to be implemented.

YACC essentially de-skews SCC while still adopting the super block abstraction for its simplicity and low metadata cost, as in DCC. What differs from DCC is that the tag array is still statically mapped to data slots to further reduce metadatas overhead, since both forward and backward poiners can be got rid of. YACC tags physical slots with super block addresses, which consists of four aligned regular 64 byte cache lines. Individual cache lines are still compressed individually for fast compression and decompression. Although the paparticle does not elaborate the compression algorithm it uses, it is suggested that a variant of CPACK optimized for zero words are used. To simplify address mapping within a physical slot, YACC mandates that blocks stored in the same physical slot must be compressed into the same size class. Three size classes are considered in YACC, namely 16 byte blocks (4:1 compression ratio), 32 byte blocks (2:1) and 64 byte blocks (uncompressed). Blocks of different size classes within the same super block must be stored separately in distinct physical slots. To this end, YACC allows multiple tag entries in the same set to have the same super block tag, each holding distinct cache lines in the super block. Individual cache lines, however, are only allowed to be stored in at most one location as in a conventional cache.

Addresses in memory requests are mapped to cache sets and ways as follows. Since YACC maps all cache lines of a super block into the same set, the index bit field that consitute the set index must “move up” by two bits, leaving the lowest two bits of the cache line address as block offset. The remaining high bits are used as the super block tag for tag matching. Note that this address mapping scheme, although greatly simplifies cache line addressing under the notion of super blocks, map adjacent blocks in the super block into the same set. For workloads with high spatial locality but low overall compressibility, this address mapping scheme will negatively impact performance, since adjacent cache lines will be mapped to the same set, each occupying one data slot, causing excessive set conflict misses (in conventional address mapping schemes they will be mapped to adjacent sets since the low two bits are used).

We next describe the data and metadata layout. As discussed above, in addition to the super block tag in the tag entry, the size class of compressed lines in the physical slot is also stored. Furthermore, the per-line coherence state (3 bits) are also needed for each compressed lines. For size class of 16 bytes, the 64 byte physical slot is divided into four equal sized partitions, one for each compressed cache line in the supber block. The mapping between block ID and physical slot offset is static and straightforward, since all four blocks can be accommodated. For size class of 32 bytes, the block ID must also be stored in the tag, since only two out of four compressed lines can be stored. The same is true for 64 byte uncompressed lines. The block ID field takes two bits per line, and four bits in total are needed in the worst case.

To efficiently track block metadata under different conditions, YACC uses different layouts for each configuration. A total of 13 bits are prepared, with the intepretation of bits depending on the size class of the current slot. In the case of 16 byte compressed lines, the first bit stores the size class, which is set to “0”. In the case of 32 and 64 byte lines, the first bit is set to “1”, and the second bit distinguishes the former from the latter. For the 16 byte line case, the next 12 bits store coherence information for four compressed lines, with three bits per line. For 32 byte line case, the next 10 bits store the two bit block ID and three bit coherence information for each of the two lines, totalling to 12 bits. For 64 byte uncompressed lines, the next 5 bits store the block ID and coherence information of the line, which uses only 7 bits out of the 13-bit field.

YACC operates as follows. On a lookup request, the cache controller locates the set using middle bits from the requested address as described above, and “expands” each of the tag to a cache line address, by concatenating the super block tag with block ID. As discussed above, block IDs are either explicitly or implicitly stored, which can be easily derived. Address check is still performed on the cache line address level between the requested address and the expanded addresses. Read requests are satisfyied by decompressing the hit block first before sending it upwards. On a write back request, the controller first compresses the block. If the write back address hits the cache, or there is a free slot in the same super block of the same size class, the compressed line is written into the slot if it still fits. If the compressed line no longer fits into the slot, or addresses do not match (a write miss), a new entry is allocated by evicting lines from an existing data slot. Note that at most four lines need to be uncompressed and written back to the lower level in the worse case. To avoid eviction becomes the bottleneck, the paper proposes that each way have a separate decompressor and compressor for handling evictions, write backs and read requests in parallel.