Dual Dictionary Compression for the Last Level Cache



Paper Title: Dual Dictionary Compression for the Last Level Cache
Link: https://dl.acm.org/doi/10.1145/2540708.2540735
Venue: ICCD 2017
Keyword: Cache Compression; Dictionary Compression




Highlight:

  1. Statically compressed values (i.e., values compressed purely with pattern-based compression, without referring to any dictionary entry) can be transferred in compressed form, because they can be easily decompressed without the dictionary.

  2. Separate dictionaries can be used for compressing the LLC and the main memory. This has several advantages, such as simpler dictionary and algorithm design, less coupling between the two levels, etc.

  3. Values compressed with a dictionary entry can be transferred by rewriting the index of the entry that the compressed code word references with the corresponding index on the receiving side. If the entry does not exist on the receiving side, the dictionary entry itself may also be sent with the compressed block to update the receiving side's dictionary.

Comments:

  1. While it shares similarities with C-PACK, the compression algorithm presented in this paper is not C-PACK, because C-PACK is an implicit dictionary algorithm whose per-block dictionary is dynamically regenerated on every compression and decompression.

  2. The paper does not cover how the main memory side dictionary is updated or how recompression works. Other works (e.g., GBDI) have presented solutions such as recompressing on DRAM refresh.

This paper proposes Dual Dictionary Compression (DDC), a compressed cache and main memory architecture that leverages two separate dictionaries for compressing cache and memory data, respectively. The paper is motivated by the memory bandwidth savings of transferring compressed blocks on the bus, which most previous cache compression proposals do not focus on. The design aims at maintaining data in compressed form both in the cache and in the main memory for maximum compression benefit. Data is compressed using different dictionaries located at the LLC and the main memory level, respectively, and compressed data is transferred on the bus in compressed form using a technique called dictionary index swap.

The paper begins by observing that existing cache compression designs focus only on expanding the logical capacity of the cache itself, and on the reduction of memory traffic resulting from lowered miss rates. Meanwhile, little attention is given to the bandwidth saving aspect of compression, i.e., transferring data blocks in compressed form, which also effectively reduces memory traffic, because fewer bits are transferred for compressed blocks. In addition, cache compression may also cooperate with main memory compression to further expand the logical capacity of main memory storage. The latter requires extra metadata and memory controller logic, and this paper solely focuses on the first opportunity, i.e., reducing memory bandwidth consumption by transferring blocks in compressed form.

DDC is based on explicit dictionary compression, in which an explicit dictionary is maintained to serve as the reference for compression. The explicit dictionary can be either statically or dynamically generated, and dictionary entries are supposed to be frequently occurring values in the working set. The dictionary is stored in an associative lookup structure (e.g., a CAM), with each entry containing the value and other metadata, such as the valid bit, the decay bit (which we describe later), and so on. Compression works by comparing each value from the input stream with the dictionary entries; if the value matches an entry, or differs from an entry only by a small delta, then the value is stored as the index of that entry, plus the optional delta value. Compression is achieved in this case by encoding the index and the small delta in fewer bits than an uncompressed value. The algorithm also performs what the paper calls “static compression”, which is essentially pattern-based compression that matches the input value against common patterns. The value is compressed without any dictionary entry, but instead with the pattern ID and other necessary information for restoring the original value. Decompression of dictionary-compressed values requires the same dictionary, and restores the original value by adding the small delta (if any) to the dictionary entry’s value. Statically compressed values are restored based on the pattern ID and the information generated during compression.
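To make the algorithm concrete, here is a minimal Python sketch of the word-by-word scheme, assuming 32-bit words, a signed 8-bit delta, and only two static patterns; the tag names, widths, and pattern set are illustrative assumptions rather than the paper's exact encoding, and the linear scan stands in for what would be a parallel CAM lookup in hardware.

```python
# Minimal sketch of explicit-dictionary compression for 32-bit words.
# Tag names, the delta width, and the pattern set are illustrative
# assumptions, not the paper's exact encoding.

DELTA_BITS = 8  # assumed: deltas fitting one signed byte are compressible

def compress_word(word, dictionary):
    """Compress one 32-bit word against an explicit dictionary."""
    # Static (pattern-based) compression: no dictionary entry needed.
    if word == 0:
        return ("static_zero",)           # assumed pattern: all-zero word
    if word <= 0xFF:
        return ("static_byte", word)      # assumed pattern: zero-extended byte

    # Dictionary compression: exact match or small delta from an entry.
    # A hardware CAM would perform these comparisons in parallel.
    for index, entry in enumerate(dictionary):
        if word == entry:
            return ("dict_match", index)
        delta = word - entry
        if -(1 << (DELTA_BITS - 1)) <= delta < (1 << (DELTA_BITS - 1)):
            return ("dict_delta", index, delta)

    return ("uncompressed", word)

def decompress_word(code, dictionary):
    """Restore the original word from its tagged encoding."""
    tag = code[0]
    if tag == "static_zero":
        return 0
    if tag == "static_byte":
        return code[1]
    if tag == "dict_match":
        return dictionary[code[1]]
    if tag == "dict_delta":
        return dictionary[code[1]] + code[2]
    return code[1]  # uncompressed
```

For example, with dictionary = [0xDEADBEEF, 0x10000000], the word 0x10000004 compresses to ("dict_delta", 1, 4), i.e., an index plus a byte-sized delta instead of a full 32-bit word.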

Note that although the paper claims to adapt C-PACK, the dictionary algorithm actually presented differs from C-PACK by using an explicit dictionary, rather than the implicitly generated, per-block dictionary of the original C-PACK. On the other hand, the algorithm resembles C-PACK in adopting both dictionary compression and pattern-based compression, which together leverage a wider range of value redundancy.

DDC also assumes a compressed hierarchy in which both the LLC and the main memory store data blocks in compressed form. DDC uses two separate dictionaries for blocks in the LLC and in the main memory, respectively, hence the name dual dictionary compression. Due to the larger amount of data in the main memory, the dictionary on the main memory side is also bigger, with more entries, which enables the main memory compressor to potentially encode more values with dictionary entries. The paper recommends 64 entries for the LLC dictionary and 128 for the main memory dictionary (i.e., 6-bit and 7-bit indices, respectively).

The LLC side dictionary is dynamically updated when the compressor encounters a new value that is not in the dictionary and the dictionary is not full. This dynamic behavior helps the dictionary adapt to a changing workload. The paper also points out that with a dynamic dictionary scheme, dictionary entries cannot be easily replaced, because existing blocks in the cache may still need the entry being replaced in order to be decompressed. To address this complication, both the dictionary and the LLC use a “decaying” mechanism, which we describe briefly below. A cache block decays if it has not been accessed in the last “decaying period”, which is typically a few thousand cycles; a decayed block is evicted from the LLC at the end of the period. Similarly, a dictionary entry decays if none of the blocks using the entry has been accessed in the last decaying period, which results in the eviction of the dictionary entry. Since cache blocks also decay at the end of the decaying period, the dictionary entry that is replaced is guaranteed to be free from any reference in the LLC. The decaying mechanism is implemented with one “decay” bit per block and per dictionary entry that indicates whether the block or the entry should decay. The cache controller periodically sets all the bits at the end of every decaying period. During execution, if a cache block is accessed, its “decay” bit is cleared, and so are the “decay” bits of all dictionary entries that the block references. At the end of the decaying period, the cache controller simply evicts the cache blocks and dictionary entries whose “decay” bits are still set.
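The bookkeeping can be sketched as follows, with assumed structure names and a 4096-cycle period standing in for the paper's “a few thousand cycles”:

```python
# Sketch of the joint block / dictionary-entry decay mechanism.
# Structure names and the period length are assumptions for illustration.

DECAY_PERIOD = 4096  # cycles; "a few thousand cycles" per the paper

class DecayState:
    def __init__(self, num_blocks, num_entries):
        self.block_decay = [True] * num_blocks    # decay bit per cache block
        self.entry_decay = [True] * num_entries   # decay bit per dict entry
        # block -> set of dictionary indices its compressed form references;
        # assumed to be maintained by the (de)compression logic.
        self.block_refs = [set() for _ in range(num_blocks)]

    def on_access(self, block):
        # An access clears the block's decay bit and the decay bits of
        # every dictionary entry that the block's encoding references.
        self.block_decay[block] = False
        for index in self.block_refs[block]:
            self.entry_decay[index] = False

    def on_period_end(self):
        # Blocks and entries whose bits are still set were not touched
        # this period: evict them, then re-arm all bits for the next one.
        dead_blocks = [b for b, d in enumerate(self.block_decay) if d]
        dead_entries = [e for e, d in enumerate(self.entry_decay) if d]
        self.block_decay = [True] * len(self.block_decay)
        self.entry_decay = [True] * len(self.entry_decay)
        return dead_blocks, dead_entries
```

The invariant this maintains is exactly the one the paper relies on: an entry whose bit is still set at the end of the period has no live referent, because every block referencing it has also decayed and is evicted in the same step.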

With a dynamically updated LLC side dictionary, and a compressed main memory storing blocks in compressed form, the paper identifies the main challenge as keeping the dictionary consistent for both in-cache and in-memory blocks. A problem would arise if a block were compressed with the LLC side dictionary, written back to the main memory in compressed form, and then the LLC side dictionary were updated with new entries. While the LLC remains consistent thanks to the decaying mechanism, the main memory blocks might be corrupted, since the dictionary entries needed for decompressing the block may no longer exist when the block is read back from the main memory.

To address this issue, the paper proposes that data blocks stored in the main memory be compressed using the main memory side dictionary, which can potentially deviate from the LLC dictionary. The main memory dictionary may also be slowly updated to reflect long-term drift of data patterns, but such updates are rather expensive, since potentially many data blocks stored in the main memory need to be recompressed with the updated entries (a part the paper does not cover).

To save memory bandwidth, data blocks evicted from the LLC are transferred on the memory bus in compressed or partially compressed form. The challenge here is to enable correct decompression of the block on the main memory side, since the main memory side dictionary likely contains different entries from the LLC side dictionary. To address the challenge, DDC leverages two properties of its compression algorithm. First, some values are statically compressed, i.e., without reference to the LLC side dictionary. These values can be transferred in compressed form, and the decompressor can easily restore the original value without a dictionary. Second, for values that are compressed with an index into the LLC side dictionary, the paper proposes an index rewriter on the eviction data path that “fixes” the index by replacing it with the corresponding index into the main memory side dictionary. The remaining values, not covered by the above two cases, must be decompressed before they are transferred.
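A per-word sketch of this decision on the eviction path might look as follows, assuming the tagged encoding from the earlier sketch and a lookup llc_to_mem that stands in for the extended dictionary field described next:

```python
# Sketch of the per-word eviction-path decision. "llc_to_mem" maps an
# LLC dictionary index to the corresponding main memory dictionary
# index, or None when the value has no memory-side entry. The tagged
# encoding follows the earlier compression sketch; all names are
# illustrative assumptions.

def rewrite_for_eviction(code, llc_dict, llc_to_mem):
    tag = code[0]
    if tag.startswith("static") or tag == "uncompressed":
        # Decodable without any dictionary: pass through unchanged.
        return code
    mem_index = llc_to_mem[code[1]]
    if mem_index is not None:
        # Dictionary index swap: retarget the code word at the main
        # memory side entry holding the same value.
        return (tag, mem_index) + code[2:]
    # No memory side counterpart: decompress with the LLC dictionary
    # and transfer the raw value instead.
    value = llc_dict[code[1]]
    if tag == "dict_delta":
        value += code[2]
    return ("uncompressed", value)
```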

To facilitate index rewriting, every entry of the LLC side dictionary is extended with an extra field that stores the index of the entry’s value in the main memory side dictionary. This extra field is kept in sync with the main memory side dictionary when the latter is updated. If the entry does not exist in the main memory side dictionary, which is indicated by a special value in the extra field, the entry itself may also be sent in the same message as the compressed block, and the main memory side may then insert the entry into its own dictionary.
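The following sketch illustrates this piggybacking path as a standalone alternative to the decompress fallback above; the message layout, the ABSENT marker, and the assumptions that the memory side dictionary can always accept a new entry and that index swapping happens after installation are all simplifications for illustration.

```python
# Sketch of piggybacking missing dictionary entries on a writeback.
# ABSENT stands in for the "special value" in the extra field; the
# message layout and the unbounded dictionary are simplifying
# assumptions.

ABSENT = None

def build_writeback(block_codes, llc_dict, llc_to_mem):
    # Collect the LLC entries that the block references but that have
    # no counterpart in the main memory side dictionary yet.
    missing = {}
    for code in block_codes:
        if code[0] in ("dict_match", "dict_delta") and \
                llc_to_mem[code[1]] is ABSENT:
            missing[code[1]] = llc_dict[code[1]]
    return block_codes, missing  # both travel in the same message

def install_and_rewrite(block_codes, missing, mem_dict, llc_to_mem):
    # Memory side: install the piggybacked entries, record their new
    # indices (mirrored back into the LLC's extra fields), then swap
    # every dictionary index in the block to its memory-side index.
    for llc_index, value in missing.items():
        mem_dict.append(value)
        llc_to_mem[llc_index] = len(mem_dict) - 1
    rewritten = []
    for code in block_codes:
        if code[0] in ("dict_match", "dict_delta"):
            rewritten.append((code[0], llc_to_mem[code[1]]) + code[2:])
        else:
            rewritten.append(code)
    return rewritten
```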