HyComp: A Hybrid Cache Compression Method for Selection of Data-Type-Specific Compression Methods
Link: https://dl.acm.org/doi/10.1145/2830772.2830823
Year: MICRO 2015
Keyword: Cache; Compression; HyComp; Hybrid Compression; Floating Point Number Compression
Highlight:
- Using heuristics to determine the data type and perform type-specific compression, rather than running multiple compression circuits in parallel.
- Using a simple algorithm to compress the high bits of the floating point mantissa field.
Questions:
- Some applications tend to use 64-bit integers while others use 32-bit integers. If the heuristic always checks bits assuming 64-bit integers, 32-bit integer arrays would be misclassified as pointers. Although BDI also handles 32-bit integers well (it is designed for both sizes), the paper should clarify this point. The same applies to 32-bit floats vs. 64-bit doubles.
- The hardware overhead is higher than any of the previous schemes, since HyComp implements four different compression logics. In addition, both SC2 and FP-H need two copies of the codebook (they cannot share), which increases the overhead further. Parallel decompression of the exponent and mantissa fields only makes this worse.
- Stack pointers often have their high bits set to 1 (canonical addresses in x86-64). This paper's classification algorithm would fail to recognize them.
This paper proposes HyComp, a cache compression framework that achieves high compression ratios across data types by combining multiple compression algorithms. The paper is motivated by the observation that most compression algorithms can only compress a certain data type with a high compression ratio, while leaving data of incompatible types less compressed, introducing a strong bias depending on the application and the data types it uses. The difference in compression capability is a consequence of different assumptions about the sources of redundancy. For example, FPC is based on the assumption that most redundancy is caused by small integers whose upper bits are all ones or all zeros. It works badly for pointers, which typically contain addresses of user-space data or stack segments. BDI, on the other hand, assumes that redundancy is introduced by dynamic value locality among nearby values. In most cases, such value locality is exhibited by small integers, or by pointers to similarly sized objects allocated from the heap. BDI, however, cannot handle large integers, heterogeneous data types, or floating point values well, since these values differ by large amounts in their raw numeric representation. Huffman encoding, as used in SC2, only compresses values that occur frequently throughout the execution with shorter codewords. It cannot exploit value locality in the form of nearby values, which is common for pointers. Lastly, a special optimization can be applied to zero blocks, i.e., cache blocks that are all zero, which are common for initialized data structures and sparse matrices. Although all of the schemes described above achieve a relatively high compression ratio on zero blocks, Zero Cache Augmentation (ZCA) compresses zero blocks at the highest possible ratio of a single bit per block, which is ideal for zero-dominant workloads. None of the above compression algorithms handles floating point values well, since a floating point value consists of three fields, sign, exponent, and mantissa, a layout that satisfies none of the compression assumptions above.
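To make the value-locality assumption behind BDI concrete, here is a minimal sketch (my own illustration, not from the paper) that checks whether a 64-byte block of 64-bit words can be represented as one base plus small one-byte deltas, which is exactly the case for pointers into the same heap region:

```c
#include <stdint.h>
#include <stdbool.h>

#define WORDS_PER_BLOCK 8  /* 64 B block as eight 64-bit words */

/* Illustration of the BDI intuition (not HyComp's circuit): the block
 * compresses well if every word equals a common base plus a delta that
 * fits in one byte. */
bool bdi_compressible_8_1(const uint64_t block[WORDS_PER_BLOCK]) {
    uint64_t base = block[0];            /* first word serves as the base */
    for (int i = 1; i < WORDS_PER_BLOCK; i++) {
        int64_t delta = (int64_t)(block[i] - base);
        if (delta < INT8_MIN || delta > INT8_MAX)
            return false;                /* delta does not fit in 1 byte */
    }
    return true;  /* 64 B of data fit in 8 B base + 8 x 1 B deltas */
}
```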
Instead of using one fixed compression algorithm for all data types that may occur, HyComp implements four different compression algorithms, namely SC2, BDI, ZCA, and a floating point compression algorithm called FP-H. Heuristics are employed to determine the dominant data type within an uncompressed cache line. Once the data type is determined, HyComp compresses the block using the most suitable algorithm for that type, achieving a higher compression ratio than any single algorithm across all workloads.
We next describe the operation of the cache architecture. HyComp is implemented at the LLC. Cache lines fetched from main memory or evicted by upper-level caches are compressed before being installed into the LLC. Similarly, cache lines evicted from the LLC or fetched by upper levels are decompressed before they are sent over the network. The paper observes that decompression is on the critical path of upper-level data fetches and can therefore affect performance, while compression happens in the background, so a few extra cycles of compression latency are unlikely to hurt performance. In order to hold several compressed blocks per physical slot, each data slot is equipped with two tags that statically map addresses to the slot. In addition to the conventional per-tag bits such as dirty, valid, and coherence state, an offset field, a compressed size field, a metadata field, and a compression type field are added for placement and decompression of compressed blocks. The offset field serves as an indirection pointer that locates the compressed block within the physical slot. The paper suggests that either word-aligned or byte-aligned offsets work, trading off fragmentation against static metadata cost. The compression metadata is a three-bit field that stores algorithm-specific information for decoding the compressed block: for BDI, it stores the three-bit compression type; for ZCA, it stores whether the block is a zero block (zero blocks do not occupy data slots); for SC2 and FP-H, it stores the codebook version, since two instances of the codebook can be active at the same time. The two-bit compression type field selects the decompression logic via a multiplexer; the compressed block is routed to the corresponding decompression circuit based on the value of this field.
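A minimal sketch of the extended tag entry follows. The three-bit metadata field and two-bit type selector come from the paper; all other field widths, and the struct itself, are my assumptions for illustration:

```c
#include <stdint.h>

/* Conceptual sketch of one extended LLC tag entry; only the 3-bit
 * metadata and 2-bit type selector widths are from the paper. */
typedef struct {
    uint64_t tag      : 40;  /* address tag (width assumed) */
    uint8_t  valid    : 1;
    uint8_t  dirty    : 1;
    uint8_t  coherence: 2;   /* e.g. MESI state */
    uint8_t  offset   : 6;   /* word-aligned pointer into the data slot */
    uint8_t  size     : 6;   /* compressed size, in words */
    uint8_t  metadata : 3;   /* BDI type / zero flag / codebook version */
    uint8_t  comp_type: 2;   /* selects the decompression circuit */
} hycomp_tag_t;

enum comp_algo { COMP_SC2 = 0, COMP_BDI = 1, COMP_ZCA = 2, COMP_FPH = 3 };
```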
The cache access protocol does not change much, except that tag lookup and data slot access are serialized due to the extra level of indirection. When a dirty block is written back from upper levels, the compression circuit first encodes the block into a shorter form and then compares the sizes of the old and new blocks. If the new block fits into the old block's storage (either because it is smaller, or because there is free space after the old block), the write-back completes without any eviction. If, on the other hand, the new block cannot fit into the old block's storage, or there is no free tag, at least one LRU tag, together with its data, must be evicted. In the worst case, two tags are evicted to free an entire physical slot. The paper also mentions that zero-block tags should be given lower priority when determining the eviction victim, in addition to conventional LRU, without much elaboration on this point.
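A small sketch of the write-back fit check as I read it; the helper and its parameters are illustrative, not from the paper:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns whether a recompressed block can reuse the old block's
 * storage without triggering an eviction (illustrative helper). */
bool fits_in_place(uint32_t old_size,        /* old compressed size, in words */
                   uint32_t free_after_old,  /* free words after the old block */
                   uint32_t new_size)        /* new compressed size, in words */
{
    /* The write-back avoids eviction if the new block is no larger than
     * the old block plus any free space immediately following it. */
    return new_size <= old_size + free_after_old;
}
```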
The heuristic works as follows. When an uncompressed block is to be processed, the cache controller first guesses the data type of the values in the block, and then selects the best compression algorithm based on that type. The data type is determined by inspecting bits of each aligned 64-bit word (called a "chunk") in the block. A word is classified as a small integer, to be compressed with SC2, if its high 32 bits are all zero or all one. A word is classified as a pointer if the high 16 bits are zero but bits 32-47 are non-zero; the paper claims that this prevents small integers from being recognized as pointers. Floating point numbers are recognized by checking the 7 bits after the MSB of the word, which form part of the double-precision exponent. The circuit compares this exponent fragment of each word with those of its two neighboring words (or one, at the block border), and classifies the word as a floating point value if the exponents match. The paper justifies this by arguing that value locality also applies to floating point numbers: the exponent fields of adjacent values are often identical. Unfortunately, the paper does not elaborate on the exact pattern in which these exponents are compared. Finally, if a block is all zeros or all ones, it is classified as a zero block, which is not compressed at all; the compression type in the tag array is simply set to zero-block compression. For each of the four types, a counter is incremented whenever a word of that type is recognized, and the final decision selects the compression type with the largest counter value, as in the sketch below.
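The following is my reconstruction of the per-word classification and majority vote from the rules summarized above. The neighbor-comparison pattern for floats and the priority order among the checks are guesses, since the paper does not spell them out; the type-to-algorithm mapping (pointers to BDI, floats to FP-H, integers and everything else to SC2) follows the summary:

```c
#include <stdint.h>

enum wtype     { T_INT, T_PTR, T_FLOAT, T_OTHER };
enum comp_algo { A_SC2, A_BDI, A_ZCA, A_FPH };

/* The 7 bits after the MSB: part of a double's exponent field. */
static uint16_t exp_bits(uint64_t w) { return (uint16_t)((w >> 56) & 0x7F); }

enum comp_algo classify_block(const uint64_t w[8]) {
    int all_zero = 1, all_ones = 1;
    for (int i = 0; i < 8; i++) {
        if (w[i] != 0)      all_zero = 0;
        if (w[i] != ~0ULL)  all_ones = 0;
    }
    if (all_zero || all_ones)
        return A_ZCA;                 /* zero block: tag bit only, no data slot */

    int votes[4] = {0};
    for (int i = 0; i < 8; i++) {
        uint32_t hi32   = (uint32_t)(w[i] >> 32);
        uint16_t hi16   = (uint16_t)(w[i] >> 48);
        uint16_t b32_47 = (uint16_t)(w[i] >> 32);
        int fmatch = (i > 0 && exp_bits(w[i]) == exp_bits(w[i - 1])) ||
                     (i < 7 && exp_bits(w[i]) == exp_bits(w[i + 1]));
        if (fmatch)
            votes[T_FLOAT]++;         /* neighboring exponents match */
        else if (hi32 == 0 || hi32 == 0xFFFFFFFFu)
            votes[T_INT]++;           /* small (sign-extended) integer */
        else if (hi16 == 0 && b32_47 != 0)
            votes[T_PTR]++;           /* user-space pointer pattern */
        else
            votes[T_OTHER]++;
    }
    int best = T_INT;                 /* majority vote over the four types */
    for (int t = T_PTR; t <= T_OTHER; t++)
        if (votes[t] > votes[best]) best = t;
    switch (best) {
        case T_PTR:   return A_BDI;
        case T_FLOAT: return A_FPH;
        default:      return A_SC2;   /* integers and everything else */
    }
}
```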
The paper also proposes a floating point compression algorithm, called FP-H. The observation is that although floating point values show significant variation in their raw bit patterns due to the IEEE 754 encoding, the exponent field as well as the higher bits of the mantissa field still exhibit high value locality. Instead of compressing floating point numbers as a whole, FP-H decomposes each floating point number into a sign bit, the exponent, the mantissa-high part, and the mantissa-low part. The latter two are simply an even split of the mantissa field, where the higher half is expected to have better locality and the lower half less. The algorithm then groups each of the four fields across the block into four partitions and compresses the partitions separately with Huffman encoding (except the sign bits, which are always stored uncompressed). The paper suggests that the codebook used for Huffman encoding can be generated by software after an initial profiling stage. Evidence shows that the codebook seldom changes during execution, and therefore sampling is only conducted during the first few seconds after a new application starts.
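A sketch of the field decomposition for IEEE 754 doubles: the even split of the 52-bit mantissa into 26/26 follows the description above, while the struct name and layout are illustrative. Each field of every word in the block would join its own partition, which is then Huffman-coded separately:

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t  sign;      /*  1 bit  */
    uint16_t exponent;  /* 11 bits */
    uint32_t mant_hi;   /* high 26 bits of the mantissa */
    uint32_t mant_lo;   /* low 26 bits of the mantissa  */
} fph_fields_t;

/* Split one double into the four FP-H fields. */
fph_fields_t fph_split(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* reinterpret bits without UB */
    fph_fields_t f;
    f.sign     = (uint8_t)(bits >> 63);
    f.exponent = (uint16_t)((bits >> 52) & 0x7FF);
    f.mant_hi  = (uint32_t)((bits >> 26) & 0x3FFFFFF);
    f.mant_lo  = (uint32_t)(bits & 0x3FFFFFF);
    return f;
}
```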
For decompression, one potential problem is the serial nature of Huffman decoding: because Huffman codes are variable-length, the next codeword cannot be read before the current one is fully matched in the codebook. Fortunately, since the three partitions are compressed separately, the paper suggests that the exponent partition and one of the two mantissa partitions can be decompressed in parallel, since the starting offsets of both are known. To achieve this, the mantissa partition that concludes the compressed block is stored in a bit-reversed manner, i.e., the first bit of its compressed stream is stored as the last bit of the compressed block, and the decompressor reads the block from both ends at once. As an alternative, the paper also suggests explicitly storing offsets to each of the two compressed mantissa partitions, enabling three-way parallel decompression at the cost of extra metadata and decompression hardware.
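A minimal sketch of the two-ended layout, as my own illustration: one stream grows forward from the start of the block while the last mantissa stream grows backward, bit-reversed, from the end, so two Huffman decoders can start at known offsets and walk toward the middle:

```c
#include <stdint.h>

#define BLOCK_BITS 512  /* one 64 B cache block */

static void set_bit(uint8_t *buf, int pos, int v) {
    if (v) buf[pos / 8] |=  (uint8_t)(1u << (pos % 8));
    else   buf[pos / 8] &= (uint8_t)~(1u << (pos % 8));
}

/* Append one bit to the forward stream; *fwd starts at 0. */
void put_forward(uint8_t *block, int *fwd, int bit)  { set_bit(block, (*fwd)++, bit); }

/* Append one bit to the backward stream; *bwd starts at BLOCK_BITS - 1,
 * so the stream is stored bit-reversed from the end of the block. */
void put_backward(uint8_t *block, int *bwd, int bit) { set_bit(block, (*bwd)--, bit); }
```

Encoding succeeds as long as the two cursors never cross; the decompressor mirrors this by reading one stream from bit 0 upward and the other from bit BLOCK_BITS - 1 downward.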