SC2: A Statistical Cache Compression Scheme



Paper Title: SC2: A Statistical Cache Compression Scheme
Link: https://dl.acm.org/doi/10.5555/2665671.2665696
Year: ISCA 2014
Keyword: Cache; Compression; Huffman Encoding




Highlight:

  1. Delegating tasks that are too complicated or too expensive to perform in hardware to idle software threads.

Questions

  1. The paper is inconsistent on whether blocks should be stored compactly in the data slot. It suggests that the compressed size can be determined by taking the difference between two adjacent tags' pointers. It is, however, also indicated in Sec. 3.3 that “Contrary to ECM, compaction runs in the background after the new (or updated) block has been written”. What if an access is made before compaction takes place? Do we get the wrong size for the compressed block?

This paper proposes SC2, a cache compression design that uses Huffman encoding to achieve both a high compression ratio and low decompression latency. The paper observes that most conventional cache compression schemes suffer from a low compression ratio, since they typically trade compression efficiency for low decompression latency. This paper, however, proposes that Huffman encoding be used as the compression and decompression algorithm, despite the common belief that Huffman decompression is unsuitable for hardware implementation.

The paper identifies two major challenges in using Huffman encoding for hardware cache compression. First, unlike dictionary-based encoding where the dictionary is derived from data that has already been decompressed, Huffman encoding requires that a codebook be used as the reference during both encoding and decoding. The generation, maintenance, and switching of the codebook pose a challenge, since the complexity of these operations grows as the number of symbols becomes large. Second, Huffman decoding, if not designed carefully, may require traversing the Huffman tree to decode a codeword, since codewords are variable sized. Tree traversal is resource hungry because it involves a series of data-dependent pointer dereferences on the tree structure.

The paper solves both issues with canonical Huffman encoding and a software-managed codebook, as we discuss below. A canonical Huffman code is generated by first building a Huffman tree with conventional methods; for example, the paper suggests that a heap be used to repeatedly extract the subtree with the minimum frequency. After building the tree, the codeword length of each symbol is obtained by counting the number of pointer jumps from the root to the symbol's leaf, and symbols are then sorted by codeword length; the actual bit patterns from the tree are unimportant. After sorting, a transform is applied to each subset of symbols of the same codeword length as follows. The first subset, which contains the shortest codewords, is assigned codewords 00..00, 00..01, 00..10, and so on, with the same bit width as their non-canonical codewords. Later subsets have their canonical codewords generated in the same manner, except that the first codeword of each subset is not all-zero but the last (largest) codeword of the previous subset plus one, left-shifted to the current codeword length with the lower bits padded with zeros. The “plus one and shift” operation ensures that no codeword in the current subset has a codeword from the previous subset as a prefix. Applying this relation recursively down to the first subset yields an encoding in which no codeword is a prefix of another.
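
To make the “plus one and shift” rule concrete, here is a minimal Python sketch (not the paper's implementation) of canonical codeword assignment. The heap-based tree construction is only used to obtain code lengths; the tie-breaking between equal frequencies and the ordering of symbols within a length group are arbitrary choices of this sketch.

```python
import heapq

def code_lengths(freqs):
    """Build a Huffman tree with a min-heap and return {symbol: code length}.
    Only the depth of each leaf matters; the tree's bit patterns are discarded."""
    heap = [(f, i, [sym]) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    depth = {sym: 0 for sym in freqs}
    tie = len(heap)                        # unique tie-breaker for equal frequencies
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)     # two subtrees with the smallest frequencies
        fb, _, b = heapq.heappop(heap)
        for sym in a + b:                  # every leaf under the merged subtree gets one level deeper
            depth[sym] += 1
        heapq.heappush(heap, (fa + fb, tie, a + b))
        tie += 1
    return depth

def canonical_codes(freqs):
    """Assign canonical codewords: sort symbols by (length, symbol); the first codeword
    of each length group is (last codeword of the previous group + 1) shifted left
    to the new length."""
    code, prev_len, table = 0, 0, {}
    for sym, length in sorted(code_lengths(freqs).items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)       # "plus one and shift": pad the low bits with zeros
        table[sym] = (code, length)        # codeword stored as (integer value, bit length)
        code += 1                          # next codeword in the same subset
        prev_len = length
    return table
```

For example, the frequencies {0x0: 70, 0xFFFFFFFF: 15, 0x1: 10, 0x100: 5} yield code lengths 1, 2, 3, 3 and hence the codewords 0, 10, 110, and 111 for 0x0, 0xFFFFFFFF, 0x1, and 0x100 respectively.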

The numerical sequence property of canonical Huffman codewords (each codeword in a subset is simply the previous codeword plus one, so the subset occupies a consecutive range in the integer value domain) simplifies decoding. As long as the length of a codeword is known, its symbol can be obtained from two pieces of information: a table holding the symbols of the subset in the same order used during codeword generation, and the numeric value of the first codeword in that subset. Decoding is performed by subtracting the first codeword's value from the current codeword and using the result to index the subset's symbol table. Only two table accesses are required, instead of jumping between nodes of a Huffman decoding tree.
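
A small sketch of how these two tables could be built and consulted, assuming the canonical_codes() helper from the sketch above and that the codeword's length is already known (length detection is sketched further below):

```python
def build_decode_tables(table):
    """table: {symbol: (codeword, length)} as produced by canonical_codes().
    Returns the DeLUT (symbols in code-generation order) and, per codeword length,
    the DeLUT offset of that subset and the numeric value of its first codeword."""
    entries = sorted(table.items(), key=lambda kv: (kv[1][1], kv[1][0]))  # by (length, codeword)
    delut = [sym for sym, _ in entries]
    first_code, offset = {}, {}
    for idx, (_, (code, length)) in enumerate(entries):
        first_code.setdefault(length, code)   # starting value of this length
        offset.setdefault(length, idx)        # where this subset begins in the DeLUT
    return delut, first_code, offset

def decode_one(codeword, length, delut, first_code, offset):
    """Two table accesses: subtract the subset's starting value, then index the DeLUT."""
    return delut[offset[length] + (codeword - first_code[length])]
```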

The implementation of the Huffman encoding logic consists of three components. The first is the Value Frequency Table (VFT), which samples values entering the LLC as an approximation of symbol frequencies. The second is the software algorithm that builds the Huffman tree and generates the canonical Huffman code as described above. The last is the hardware Huffman decoder and its decoder lookup tables (DeLUT). We next introduce these components in detail.

The VFT is a set-associative structure mapping symbol values to their frequencies. Each slot holds a frequency counter for the number of times a particular value has entered the LLC. The VFT is accessed and maintained much like a cache, indexed using the lower bits of the value; values are aligned to 32-bit boundaries. New values are entered into the VFT when they are (1) fetched from DRAM, or (2) written back from the upper-level cache (only dirty words are entered); counters of values already present are incremented by one in either case. Set and capacity conflicts during insertion are resolved by evicting the entry with the minimum counter value in the set. When a block is evicted from the LLC, the counters of all its values that exist in the VFT are decremented. The system enters a training phase when an application starts. After a certain number of instructions, training finishes and the contents of the VFT are frozen. The table is then read by software threads to build the Huffman tree and generate the codewords, as we will see below.
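
Below is a toy functional model of the VFT as described; the number of sets, the associativity, and the exact replacement details are illustrative assumptions rather than the paper's parameters.

```python
class ValueFrequencyTable:
    """Toy model of the VFT: a set-associative table mapping 32-bit values to counters."""
    def __init__(self, num_sets=1024, ways=16):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [dict() for _ in range(num_sets)]     # each set: {value: counter}

    def _set_for(self, value):
        return self.sets[value % self.num_sets]           # index with low-order bits of the value

    def observe(self, value):
        """Called for each 32-bit word fetched from DRAM or written back dirty."""
        s = self._set_for(value)
        if value in s:
            s[value] += 1
        elif len(s) < self.ways:
            s[value] = 1
        else:
            victim = min(s, key=s.get)                    # evict the entry with the smallest counter
            del s[victim]
            s[value] = 1

    def forget(self, value):
        """Called for each word of a block evicted from the LLC."""
        s = self._set_for(value)
        if value in s:
            s[value] = max(0, s[value] - 1)

    def snapshot(self):
        """Frozen frequency table handed to the software code-generation thread."""
        return {v: c for s in self.sets for v, c in s.items() if c > 0}
```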

Due to the complexity of the codeword generation process, the paper proposes that the code generation work be delegated to software. A background thread dumps the frequency values from the VFT to a memory region. For each symbol tracked by the VFT, a Huffman codeword is generated based on its frequency using the algorithm described above. Symbols that are not in the VFT represent low-frequency values, for which a special codeword is generated; it serves as a prefix for encoding values in uncompressed form. When the special codeword is seen, the decompressor interprets the 32 bits after it as an uncompressed value. This may result in the compressed data being longer than the original data; to cap the compressed size, a cache line is stored in uncompressed form if its compressed size is larger than 64 bytes.
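
A functional sketch of the per-line compression step, assuming the software-generated table also contains an entry for a hypothetical ESCAPE symbol (standing in for the special codeword described above). The real encoder is hardware; this only models the output format and the 64-byte cap.

```python
ESCAPE = "escape"   # placeholder key for the special codeword (illustrative name)

def compress_line(words, table):
    """Compress a 64-byte line (sixteen unsigned 32-bit words) into a bit string.
    'table' maps 32-bit values (and ESCAPE) to (codeword, length).
    Returns None when the line should be stored uncompressed instead."""
    bits = []
    for w in words:
        if w in table:
            code, length = table[w]
            bits.append(format(code, "0{}b".format(length)))
        else:
            code, length = table[ESCAPE]           # escape prefix ...
            bits.append(format(code, "0{}b".format(length)))
            bits.append(format(w, "032b"))         # ... followed by the raw 32-bit value
    out = "".join(bits)
    return None if len(out) > 64 * 8 else out      # cap: never exceed the original 64 bytes
```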

The decoder uses two lookup tables to map a codeword selected from the input bit stream to its symbol. The first table is the DeLUT already described, which stores the symbols in the same order as they were used for codeword generation. The second table maps a codeword length k to the offset of the corresponding subset of symbols in the DeLUT, as well as to the numeric value of the first codeword in that subset (the starting value of length k). The starting value is subtracted from the codeword, and the result is used as an in-subset index to fetch the symbol from the DeLUT. Codewords are selected from the stream by comparing the first k bits of the stream to the starting value of length k: for every possible codeword length there is a hardwired comparator configured with the starting value of that length. If the first k bits of the input stream fall into the corresponding range, a k-bit codeword has been detected, which is then clipped from the input stream for further decoding.
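
The following sketch models the length detection plus the two-table lookup in one step. The sequential loop stands in for the per-length comparators, which operate in parallel in hardware, and counts[k] (the number of codewords of length k) is an auxiliary table assumed by this sketch.

```python
def detect_and_decode(bitstream, delut, first_code, offset, counts):
    """Decode one symbol from the front of 'bitstream' (a string of '0'/'1' characters).
    A k-bit prefix of the stream is a valid codeword exactly when it lies in
    [first_code[k], first_code[k] + counts[k]).  Returns (symbol, remaining bits)."""
    for k in sorted(first_code):                   # hardware checks every length in parallel
        if len(bitstream) < k:
            break
        prefix = int(bitstream[:k], 2)
        if first_code[k] <= prefix < first_code[k] + counts[k]:
            sym = delut[offset[k] + (prefix - first_code[k])]
            return sym, bitstream[k:]              # clip the detected codeword from the stream
    raise ValueError("bit stream does not match the current codebook")

# Example wiring with the earlier sketches (Counter is from the collections module):
#   table = canonical_codes(vft.snapshot())
#   delut, first_code, offset = build_decode_tables(table)
#   counts = Counter(length for _, length in table.values())   # codewords per length
#   symbol, rest = detect_and_decode(compressed_bits, delut, first_code, offset, counts)
```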

Although the paper claims that code generation only needs to be performed rarely during execution, there are cases where a single code book will not cover the entire execution; this can be alleviated by changing the code book. The paper proposes that two instances of the decoding logic be used in parallel, one for the old code book and the other for the new one. One extra bit is added to the tag to indicate which decompressor to use. When a new code book is generated while the existing one is still in use, the new DeLUT and mapping tables are copied into the idle decompressor instance, and compressed blocks are routed to whichever decompressor their tag bit indicates.
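
A minimal model of the switch, reusing the decoding sketch above; the class and field names are illustrative, and each entry in tables is assumed to be the (DeLUT, first_code, offset, counts) tuple from earlier.

```python
class DualDecompressor:
    """Two decompressor instances; a per-block tag bit selects which code book to use."""
    def __init__(self, tables_old):
        self.tables = [tables_old, None]     # slot 0 active, slot 1 idle
        self.active = 0                      # code-book id assigned to newly compressed blocks

    def install_new_codebook(self, tables_new):
        idle = 1 - self.active
        self.tables[idle] = tables_new       # copy new DeLUT / mapping tables into the idle instance
        self.active = idle                   # subsequent blocks are tagged with this id

    def decode_symbol(self, codebook_bit, bitstream):
        """'codebook_bit' is the extra bit stored in the block's tag."""
        delut, first_code, offset, counts = self.tables[codebook_bit]
        return detect_and_decode(bitstream, delut, first_code, offset, counts)
```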

The cache is organized similarly to the segmented cache in conventional compressed cache designs: tags are extended with an extra pointer into the unified data slot. The only difference is that SC2 uses byte-level alignment to reduce internal fragmentation. Cache lines brought in from DRAM or written back from the upper level are compressed before being placed into the data slot. The paper seems to suggest that blocks must be stored without gaps between each other, implying frequent compaction. If this is not the case in practice, then each tag should also contain a length field, since the compressed block size could not otherwise be determined from the difference between adjacent tag pointers.
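
For illustration, a tag entry along the lines this paragraph argues for might look as follows; the field names and widths are assumptions, not the paper's layout.

```python
from dataclasses import dataclass

@dataclass
class CompressedTag:
    """Illustrative tag entry for the byte-aligned segmented data array."""
    address_tag: int
    offset: int                       # byte offset into the set's unified data slot
    length: int                       # explicit compressed size in bytes (0..64), needed
                                      # if blocks are not kept tightly packed
    uses_new_codebook: bool = False   # extra bit from the code-book switching scheme

def read_compressed(data_slot: bytes, tag: CompressedTag) -> bytes:
    """Fetch the compressed block without relying on the next tag's pointer."""
    return data_slot[tag.offset: tag.offset + tag.length]
```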