C-Pack: A High-Performance Microprocessor Cache Compression Algorithm



Paper Title: C-Pack: A High-Performance Microprocessor Cache Compression Algorithm
Link: https://dl.acm.org/doi/10.1109/TVLSI.2009.2020989
Year: TVLSI 2010
Keyword: Compression; Cache Tags; C-Pack




Highlight:

  1. C-Pack combines pattern matching compression with dictionary compression
  2. The block-based dictionary is simpler to implement in hardware than standard LZW compression, since the C-Pack dictionary is always a function of words that have already been compressed, rather than requiring a one-byte lookahead.

This paper proposes C-Pack, a cache compression architecture featuring realistic design assumptions and validation steps that actually fit into a processor's design flow. The paper points out that existing cache compression algorithms either make unrealistic assumptions about design complexity or compression/decompression latency, making them difficult to implement and deploy. In addition, some compression algorithms are designed to operate on large blocks in order to find a sufficient number of patterns, which may not work well with 64-byte cache lines and often requires an extra level of indirection or caching.

The paper consists of two parts. The first part discusses the hardware-implemented compression algorithm, which is a hybrid of pattern-based encoding and dictionary-based compression (like LZW). The second part of the paper discusses the compressed cache organization, which uses a restricted tag mapping scheme to achieve low-cost cache management.

The compression algorithm combines pattern matching and dictionary encoding, which are discussed as follows. Logically, compression is performed on 32-bit words, which are processed one by one. The compression engine maintains a dynamically generated dictionary to translate an incoming word into a smaller code. The dictionary is implemented as a fixed-size CAM array of 32-bit entries, with a comparator at each entry. The size of the dictionary affects several aspects of the design. First, a smaller dictionary may not be sufficient to encode all patterns, which decreases compression efficiency, since some entries will be evicted before they can be used to encode an incoming word. Second, if the dictionary is fully associative, then a dictionary larger than the cache line capacity makes little sense, since in the worst case none of the dictionary entries serves as a pattern for encoding and the entire cache line is inserted into the dictionary. If, however, a fully associative dictionary is not an option, suggesting a set-associative design, dictionaries larger than 64 bytes can be useful, since code words may not be evenly distributed across all sets, justifying some degree of redundancy. Lastly, the number of bits in the encoded dictionary index grows logarithmically with the dictionary size: a dictionary of 2^k entries must be addressed using k bits. A larger dictionary may capture more patterns on one hand, but on the other hand it also decreases the compression ratio by requiring more index bits. The paper suggests that the optimal dictionary size is 64 bytes (16 entries), which can be indexed using 4 bits.
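
As a rough illustration of this organization, the sketch below models a 16-entry, fully associative dictionary with a 4-bit index in plain C; the type and field names are made up here, and the paper's actual design is a CAM with one comparator per entry rather than a software array.

```c
/* Illustrative software model of the dictionary described above (the real
 * design is a CAM with a comparator per entry, not an array scan). */
#include <stdint.h>

#define DICT_ENTRIES    16   /* 16 entries x 4 bytes = 64 bytes, one cache line */
#define DICT_INDEX_BITS 4    /* log2(16): bits needed to encode a dictionary index */

typedef struct {
    uint32_t entry[DICT_ENTRIES];
    int      count;          /* number of valid entries inserted so far */
    int      next;           /* FIFO pointer: next slot to fill or overwrite */
} cpack_dict_t;

/* Insert a word; once the dictionary is full, the oldest entry is
 * overwritten (FIFO), a deterministic policy as discussed below. */
static void dict_insert(cpack_dict_t *d, uint32_t word) {
    d->entry[d->next] = word;
    d->next = (d->next + 1) % DICT_ENTRIES;
    if (d->count < DICT_ENTRIES) d->count++;
}
```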

When an entry is to be inserted into the dictionary and the dictionary is full, an existing entry must be evicted, as in an associative cache. The eviction candidate can be selected using common policies such as LRU or FIFO; any deterministic policy will work. Non-deterministic policies, such as random replacement, cannot be used, since decompression must replicate the exact process of constructing the dictionary as decoded words are inserted into it.

The compression algorithm is described as follows. First, the input cache line is divided into 32-bit words, which are fed into the compression circuit. The encoded output consists of a stream of code words. Each code word is prefixed with a two- or four-bit code word type. Two-bit types are used for patterns and/or dictionary matches with higher frequencies, while four-bit types are used for cases with lower frequencies; there is no three-bit type field, which reduces encoding complexity. Note that the actual hardware processes two words at a time, but the operation is equivalent to processing input words one by one. For each input word, the compression circuit checks, in parallel, for both fixed patterns, such as zeros, and for dictionary matches. C-Pack supports two fixed patterns: an all-zero word and a small one-byte integer with leading zeros, with code word types 00 and 1101, respectively. If the encoding circuit detects such a pattern, it outputs the code word with the type field as a prefix. For the all-zero case, the type field alone is sufficient for decoding. For the zero-extended small integer case, the type field is followed by the lowest byte; this byte is padded with three zero bytes in the higher positions during decompression.
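
A hedged sketch of the fixed-pattern check is given below; the function and its flat (prefix, payload) output are illustrative, since the real hardware detects both patterns with dedicated logic and packs the fields directly into the output bit stream.

```c
/* Illustrative check for the two fixed patterns described above.
 * Returns the payload width in bits, or -1 if the word is not a fixed
 * pattern; the caller would pack prefix and payload into the output stream. */
#include <stdint.h>

static int encode_fixed_pattern(uint32_t word, unsigned *prefix,
                                int *prefix_bits, uint32_t *payload) {
    if (word == 0) {                    /* all-zero word: 2-bit type "00", no payload */
        *prefix = 0x0;  *prefix_bits = 2;  *payload = 0;
        return 0;
    }
    if ((word & 0xFFFFFF00u) == 0) {    /* zero-extended byte: 4-bit type "1101" + low byte */
        *prefix = 0xD;  *prefix_bits = 4;  *payload = word & 0xFFu;
        return 8;
    }
    return -1;                          /* fall through to dictionary matching */
}
```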

Dictionary matching is slightly more complicated. The dictionary is not explicitly stored with the compressed line. Instead, the compression algorithm builds the dictionary in a deterministic manner as it streams through the input words, and the same dictionary is reconstructed at decompression time using exactly the same procedure. The algorithm guarantees that the content of the dictionary after encoding the word at index k is a function of word k and all previous input words, i.e., as long as we have recovered the words up to index k, the content of the dictionary can always be rebuilt, since all arguments to the dictionary generation function are known.
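
This invariant can be phrased as a small sketch: given the words decoded so far, the dictionary can be regenerated by replaying the same insertions the compressor performed. The helper names below are hypothetical, and the fixed-pattern test mirrors the earlier sketch.

```c
/* Sketch of the invariant: the dictionary after word k is a pure function of
 * words 0..k, so the decoder can rebuild it from already-decoded words. */
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t entry[16]; int count, next; } dict_t;

static bool is_fixed_pattern(uint32_t w) {      /* all-zero or zero-extended byte */
    return (w & 0xFFFFFF00u) == 0;
}

static void dict_insert(dict_t *d, uint32_t w) {
    d->entry[d->next] = w;
    d->next = (d->next + 1) % 16;
    if (d->count < 16) d->count++;
}

/* Rebuild the dictionary state as it was after encoding word k. */
static void rebuild_dict(dict_t *d, const uint32_t *words, int k) {
    d->count = d->next = 0;
    for (int i = 0; i <= k; i++)
        if (!is_fixed_pattern(words[i]))        /* fixed-pattern words are never inserted */
            dict_insert(d, words[i]);
}
```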

The dictionary is constructed as follows. When an input word is not encoded using a fixed pattern, the dictionary comparison logic signals either a match, a partial match, or no match to the hardware encoder. The dictionary comparator consists of a comparator for each valid entry in the dictionary (or each valid entry in the set, if the dictionary is set-associative). Each comparator checks the input word against the entry for a full or partial match. A full match means all four bytes of the input word match the entry, and the output is the type prefix 10 followed by the four-bit dictionary index. Partial matches are further classified into two-byte and three-byte prefix matches, where “prefix match” means that only the higher bytes match, not the lower ones. In the former case, the type field 1100 is output, followed by the two lower (uncompressible) bytes and the dictionary index. In the latter case, 1110 is output, followed by the lowest byte of the input word and the dictionary index. If no match is found, all four bytes are output with the type prefix 01. In all cases above, the input word is inserted into the dictionary, potentially evicting an existing entry if the dictionary is already full. If the dictionary is fully associative and at least as large as the cache line, no eviction is ever possible, since at most 16 words (the entire 64-byte line) will be inserted.
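
The classification rules can be summarized with the helper below; the enum names are invented for illustration, but the byte masks and the code-word types in the comments follow the description above.

```c
/* Illustrative classification of an input word against one dictionary entry.
 * "Prefix" refers to the high-order bytes, as in the text above. */
#include <stdint.h>

typedef enum {
    MATCH_NONE,   /* type 01:   output all four bytes */
    MATCH_2BYTE,  /* type 1100: output two low bytes + 4-bit index */
    MATCH_3BYTE,  /* type 1110: output lowest byte + 4-bit index */
    MATCH_FULL    /* type 10:   output 4-bit index only */
} match_kind_t;

static match_kind_t classify_match(uint32_t word, uint32_t entry) {
    if (word == entry)
        return MATCH_FULL;
    if ((word & 0xFFFFFF00u) == (entry & 0xFFFFFF00u))
        return MATCH_3BYTE;                 /* upper three bytes match */
    if ((word & 0xFFFF0000u) == (entry & 0xFFFF0000u))
        return MATCH_2BYTE;                 /* upper two bytes match */
    return MATCH_NONE;
}
```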

The actual hardware processes two input words at a time, checking both words for fixed patterns and dictionary matches in parallel. A data hazard may occur, however, if the second word would match an entry inserted by the first word, or if the second word matches an entry evicted by inserting the first word. In both cases, the result would differ from encoding the two words serially. To solve this issue, when encoding the second word, the hardware also compares it against the first word and disregards the entry about to be evicted from the dictionary, provided that the first word does not match a fixed pattern. Note that if the first word matches a fixed pattern, the dictionary is not changed by encoding the first word, so both operations can take place in parallel without any hazard.
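
One possible software rendering of this hazard fix is sketched below, reusing the FIFO dictionary model from the earlier sketches; the function names are hypothetical, and match strengths are ordered so that a stronger match compares greater.

```c
/* Illustrative lookup for the second word of a pair encoded in one cycle. */
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t entry[16]; int count, next; } dict_t;

/* Match strength: 0 = none, 1 = 2-byte prefix, 2 = 3-byte prefix, 3 = full. */
static int match_strength(uint32_t w, uint32_t e) {
    if (w == e)                                  return 3;
    if ((w & 0xFFFFFF00u) == (e & 0xFFFFFF00u))  return 2;
    if ((w & 0xFFFF0000u) == (e & 0xFFFF0000u))  return 1;
    return 0;
}

/* w0 is the first word of the pair (w0_fixed = it hit a fixed pattern),
 * w1 is the second word. Returns the best match strength for w1 and the
 * dictionary index it refers to. */
static int lookup_second_word(const dict_t *d, uint32_t w0, bool w0_fixed,
                              uint32_t w1, int *index_out) {
    int best = 0;
    *index_out = -1;
    for (int i = 0; i < d->count; i++) {
        if (!w0_fixed && i == d->next)
            continue;                       /* slot about to be overwritten by w0 */
        int s = match_strength(w1, d->entry[i]);
        if (s > best) { best = s; *index_out = i; }
    }
    if (!w0_fixed) {
        int s = match_strength(w1, w0);     /* w0 logically precedes w1 in the dictionary */
        if (s > best) { best = s; *index_out = d->next; }
    }
    return best;
}
```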

The paper also proposes a cache tag mapping scheme that works with the compression scheme. The paper observes that a fully associative mapping between tags and segmented data slots (as in segmented cache designs) brings only marginal benefit, because the segmented cache is often fragmented, requiring constant compaction and rearrangement. In addition, replacement in segmented caches often cannot follow the LRU order, since multiple lines may need to be evicted to serve a single line fill request. The paper simplifies the segmented cache design by provisioning only twice as many tags as a regular cache. Data slots are still statically bound to tags, except that each data slot can be mapped by two tags. As a result, at most two compressed lines can fit in the same physical slot, achieving a best-case compression ratio of 2:1. On a line fill request, the cache controller first attempts to find a slot that still has a sufficient number of free bytes to hold the new line. If such a slot exists, the newly fetched line is placed into it without any eviction. Otherwise, at most two compressed blocks are evicted from one physical slot.
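
The fill decision can be modeled roughly as follows; the slot size, associativity, and structure names are illustrative placeholders rather than the paper's exact configuration.

```c
/* Rough model of the fill policy: each physical slot holds at most two
 * compressed lines, and a new line is placed only where enough bytes remain. */
#define SLOT_BYTES 64          /* one physical data slot = one uncompressed line */
#define WAYS        4          /* illustrative set associativity */

typedef struct {
    int lines;                 /* 0, 1, or 2 compressed lines resident */
    int used_bytes;            /* bytes occupied by resident compressed lines */
} data_slot_t;

/* Returns the way to place the new compressed line in, or -1 if every
 * candidate slot is too full and the controller must evict one slot's
 * contents (up to two compressed lines at once). */
static int find_fill_slot(const data_slot_t set[WAYS], int new_line_bytes) {
    for (int w = 0; w < WAYS; w++) {
        if (set[w].lines < 2 &&
            set[w].used_bytes + new_line_bytes <= SLOT_BYTES)
            return w;
    }
    return -1;
}
```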