Dictionary Sharing: An Efficient Cache Compression Scheme for Compressed Caches



Paper Title: Dictionary Sharing: An Efficient Cache Compression Scheme for Compressed Caches
Link: https://ieeexplore.ieee.org/document/7783704
Year: MICRO 2016
Keyword: Compression; Cache Compression; DISH




Highlight:

  1. Uses fixed-length code words for compression, which results in a fixed compression ratio (4:1). This eliminates problems such as space allocation and compaction, since every line is statically mapped to exactly one possible location in the physical data slot.

Questions

  1. This paper is syntactically and grammatically poor, and in general the delivery of ideas is also weak.

  2. I don’t get why adding another line for compression is that complicated (3 parallel comparisons for B1 and 4 parallel comparisons for B2). Shouldn’t we just load the current dictionary, treat it as if we were halfway done with compression, and compress the incoming line by either generating a code word for an existing dictionary entry or inserting a new entry? Why divide the dictionary into “B0’s dictionary” and “B1’s dictionary”?

  3. When a line is invalidated, how do we invalidate dictionary entries that no longer contribute to the compression of the remaining lines? One way is just to use brute force and compute the reference count of each dictionary entry; entries whose reference count drops to zero should be removed after invalidating the line. Dictionary entries do not need to be compacted, since each entry has a valid/invalid bit (see the sketch below).
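
A minimal sketch of this brute-force cleanup, assuming a scheme I-style slot layout (the struct and field names are hypothetical, not from the paper):

```c
#include <stdbool.h>
#include <stdint.h>

#define DICT_ENTRIES   8   /* scheme I dictionary size */
#define WORDS_PER_LINE 16
#define LINES_PER_SLOT 4

typedef struct {
    uint32_t dict[DICT_ENTRIES];
    uint8_t  dict_valid;                           /* one valid bit per entry */
    uint8_t  code[LINES_PER_SLOT][WORDS_PER_LINE]; /* 3-bit indices, one byte each here */
    bool     line_valid[LINES_PER_SLOT];
} slot_t;

/* After invalidating a line, recount how many remaining valid lines
 * reference each dictionary entry and clear the unreferenced ones. */
static void gc_dictionary(slot_t *s)
{
    unsigned refs[DICT_ENTRIES] = {0};

    for (int l = 0; l < LINES_PER_SLOT; l++) {
        if (!s->line_valid[l])
            continue;
        for (int w = 0; w < WORDS_PER_LINE; w++)
            refs[s->code[l][w]]++;
    }

    /* No compaction needed: each entry carries its own valid bit. */
    for (int e = 0; e < DICT_ENTRIES; e++)
        if (refs[e] == 0)
            s->dict_valid &= (uint8_t)~(1u << e);
}
```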

This paper proposes Dictionary Sharing (DISH), a novel dictionary-based cache line compression scheme with a high compression ratio and low design cost. The paper identifies several issues with previous compressed cache proposals, such as DCC, SCC, and YACC. First, to simplify storage management within a physical slot, these designs mandate that all cache lines sharing a physical slot come from the same super block and fall into the same size class after compression. Under this restriction, even if a physical slot still has room for one more compressed line, the line cannot be stored when its size class does not match the slot’s, causing space wastage. In the worst case, N tags have to be allocated for lines in the same super block if those blocks fall into N distinct size classes. Second, when a dirty cache line is written back from the upper level, chances are that the compressed size of the new line no longer fits into its original physical slot. In this case, the old line’s physical slot is invalidated, and a new slot is allocated by either evicting existing lines or compacting them to reclaim storage that was unusable due to external fragmentation. The paper argues that compacting existing lines is a heavyweight task, since it reads all compressed lines out of the data array, re-aligns them, and writes the compacted block back. Lastly, although these previous schemes perform tagging at super block granularity, where a super block consists of four or more regular cache lines, compression itself is still performed on individual cache lines without any consideration of inter-block redundancy. In fact, inter-block redundancy within a super block is a good starting point for further optimization, as we will see below.

DISH leverages two important observations about value locality. First, value locality exists not only within a single cache line, but also across adjacent cache lines within the same super block. The number of distinct values is significantly smaller than the number of words in a super block, potentially enabling a compression method that collects these values into a dictionary and encodes individual words as shorter code words that index into it. The second type of value locality, called upper data bits locality, suggests that even when values across cache lines are not strictly identical, there is still a good chance that their upper bits are equal, while the lower bits are more random. This is common in arrays, especially pointer arrays over objects of the same type, since malloc() will most likely place these objects in the same memory pool, as close to each other as possible, as a spatial locality optimization. Other compression algorithms that perform delta encoding, such as BDI, also exploit this observation by subtracting a selected base from such values and encoding the delta with fewer bits. DISH, by contrast, stores only the upper bits in the dictionary, where they can be indexed with short code words in the compressed data. The low few bits are stored as-is alongside the index array, and are concatenated with the dictionary entry to recover the original uncompressed value.
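
To make the upper-bits idea concrete, here is a small sketch of the 28/4 split that scheme II uses (the helper names are ours, not the paper’s):

```c
#include <stdint.h>

/* Two neighboring pointers, e.g. 0x00A01230 and 0x00A01238, differ only
 * in their low 4 bits, so both share the dictionary entry 0x00A0123. */
static inline uint32_t high28(uint32_t w) { return w >> 4; }  /* dictionary part */
static inline uint32_t low4(uint32_t w)   { return w & 0xF; } /* stored as-is */

/* Reconstruction concatenates a dictionary entry with the raw low bits. */
static inline uint32_t reconstruct(uint32_t hi, uint32_t lo)
{
    return (hi << 4) | lo;
}
```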

We describe the overall architecture of DISH before introducing its dictionary-based compression algorithm. At a high level, DISH attempts compression at the granularity of a four-block super block, extracting common 32-bit words (or partial words) to form a dictionary and encoding the 32-bit values in cache lines with shorter code words. In the best case, all four blocks can be compressed into the same physical slot. One of the biggest differences between DISH and previously proposed super block compression schemes, however, is that compressed cache lines are always of the same size, as a result of using fixed-length dictionary indices as code words; DISH essentially only ever attempts a 4:1 compression ratio. This fixed-length code word scheme has two obvious advantages. First, cache lines are statically mapped to exactly one possible offset in the data slot of the super block, since all four lines are guaranteed to fit in a physical slot as long as they compress successfully. No dedicated storage management hardware is therefore needed in the cache controller. The second advantage is that no physical slot compaction is required on write back, since the size of a compressed cache line is always 16 bytes, unless compression fails. In the case of a failure, the block that failed to compress must be invalidated in its current physical slot and allocated a new slot before compression is attempted again.

Two compression algorithms are proposed: scheme I and scheme II. Scheme I takes advantage of 32-bit data locality, while scheme II takes advantage of upper bits data locality, as discussed above. In scheme I, up to eight values are selected for encoding one to four cache lines in the super block. The dictionary may contain fewer than eight values if the number of distinct values in the current physical slot is smaller. Each dictionary entry is associated with a valid bit indicating whether the entry is in use. Each 32-bit word of an uncompressed line is represented by a 3-bit index pointing to the full word in the dictionary. A compressed line is therefore 3 * 16 = 48 bits, or 6 bytes, and four compressed lines take 24 bytes. The eight 32-bit dictionary entries take another 32 bytes, and an extra 8-bit vector holds the valid bits of the dictionary entries. The total storage required by scheme I is hence 57 bytes, leaving 7 bytes for per-line metadata such as dirty bits, valid bits, and coherence states. Scheme II, by contrast, stores only the upper 28 bits of each word in the dictionary. The low 4 bits of each word are stored as-is in the compressed data body and are concatenated with the high bits to recover the original word. A scheme II dictionary contains only four entries, totaling 112 bits, or 14 bytes. Each compressed cache line consists of 16 2-bit indices and 16 4-bit low parts, i.e., 96 bits or 12 bytes, so the four lines together take 384 bits, or 48 bytes. Overall, 62 of the 64 bytes of physical storage are used, leaving 2 bytes, or 16 bits, for other purposes.
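
The bit budgets above can be sanity-checked with a few compile-time constants (all numbers are taken from the paper’s text):

```c
enum {
    /* Scheme I: 8-entry dictionary of full 32-bit words. */
    S1_DICT_BITS  = 8 * 32,         /* 32 B */
    S1_VALID_BITS = 8,              /* one valid bit per entry */
    S1_LINE_BITS  = 16 * 3,         /* 16 words x 3-bit index = 6 B per line */
    S1_TOTAL_BITS = S1_DICT_BITS + S1_VALID_BITS + 4 * S1_LINE_BITS,

    /* Scheme II: 4-entry dictionary of 28-bit upper parts. */
    S2_DICT_BITS  = 4 * 28,         /* 14 B */
    S2_LINE_BITS  = 16 * (2 + 4),   /* 2-bit index + 4 raw low bits per word */
    S2_TOTAL_BITS = S2_DICT_BITS + 4 * S2_LINE_BITS,
};

_Static_assert(S1_TOTAL_BITS == 57 * 8, "scheme I fits in 57 of 64 bytes");
_Static_assert(S2_TOTAL_BITS == 62 * 8, "scheme II fits in 62 of 64 bytes");
```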

We next describe the metadata and data layout. Each tag entry contains the super block tag, as in previous designs, plus per-line control information such as valid bits, dirty bits, and coherence states (these can also be placed, or at least partially placed, in the data slot). The compression type is represented by an extra 1-bit flag. Regardless of the compression type, the physical slot is always divided into dictionary storage and compressed line bodies. The dictionary occupies the first part of the physical slot, while the remaining part stores cache lines 0, 1, 2, and 3, in this exact order. Cache lines are statically mapped to their offsets given the compression type, since DISH always attempts a 4:1 compression ratio.
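
Under this layout, a block’s position in the 64-byte slot is a pure function of its block ID and the compression type; a hedged sketch follows (the exact byte offsets are our assumption, since the paper only fixes the ordering):

```c
#include <stddef.h>

/* Dictionary first, then lines 0..3 back to back at fixed offsets. */
static size_t line_offset(int block_id, int scheme)
{
    if (scheme == 1)
        return 33 + (size_t)block_id * 6;  /* 32 B dictionary + 1 B valid bits; 6 B lines */
    else
        return 14 + (size_t)block_id * 12; /* 14 B dictionary; 12 B lines */
}
```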

The access protocol of a DISH cache is similar to that of YACC. On a lookup, the middle bits of the address are used as the set index to locate the set of the super block. The two lowest bits of the block address are not used for set indexing, since they encode the block ID within the super block. After locating the set, all tags are read out and expanded into cache line addresses in parallel, based on the per-line valid bits. A read hit always returns the cache line after decompressing it. A write hit, however, does not always indicate successful completion of the operation. Instead, the cache controller first attempts to compress the dirty written-back cache line. If the line cannot be compressed with the current dictionary, its old copy is invalidated (along with any dictionary entries referenced only by that line), followed by allocation of a new tag and data slot for the new line. If the new line still cannot be compressed, even starting from an empty dictionary, it is stored uncompressed. Evictions use the same algorithm as in previous designs.
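
The write-hit path can be summarized as follows; the helpers are hypothetical stubs standing in for the real compressor and allocator:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stubs for illustration only. */
static bool  try_compress(void *slot, const uint32_t *line) { (void)slot; (void)line; return false; }
static void  invalidate_line(void *slot, int id)            { (void)slot; (void)id; }
static void *allocate_new_slot(void)                        { return 0; }
static void  store_uncompressed(void *slot, const uint32_t *line) { (void)slot; (void)line; }

static void handle_write_hit(void *slot, int block_id, const uint32_t line[16])
{
    if (try_compress(slot, line))
        return;                          /* fits with the current dictionary */

    invalidate_line(slot, block_id);     /* drop old copy + orphaned dictionary entries */
    void *fresh = allocate_new_slot();   /* may evict, as in prior designs */

    if (!try_compress(fresh, line))
        store_uncompressed(fresh, line); /* last resort: keep the line uncompressed */
}
```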

The compression circuit works as follows. If the physical slot that a line is compressed into is new, containing no previous contents (or only invalidated contents), compression starts anew with a fresh dictionary in which all entries are marked invalid. The compressor then processes the input line in units of 32-bit words (the processing order is insignificant). For each 32-bit word, if it is not already in the dictionary, it is first inserted; in either case, the emitted code word is the index of the word in the dictionary. If the dictionary fills up before all words are processed, 4:1 compression fails, and the line is stored uncompressed. Otherwise, compression succeeds with a dictionary and a fixed-length compressed block, both of which are written into the physical slot at their corresponding offsets. If, however, the physical slot already contains valid lines, the current dictionary is first loaded into the compressor’s temporary dictionary, and compression proceeds as if it were already halfway done: the incoming line can incrementally add new entries to the temporary dictionary as well as reference existing ones. If the temporary dictionary becomes full before the block is fully processed, compression fails, and the cache line is allocated new physical storage, on which compression is attempted again. On failure, the temporary dictionary is discarded and no update is made to the physical slot. Otherwise, both the updated temporary dictionary and the compressed body are written.
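
A minimal sketch of this incremental compression step for scheme I (the dictionary match is written as a sequential loop here; in hardware the comparisons happen in parallel):

```c
#include <stdbool.h>
#include <stdint.h>

#define DICT_ENTRIES   8
#define WORDS_PER_LINE 16

typedef struct {
    uint32_t entry[DICT_ENTRIES];
    int      count;   /* number of valid entries */
} dict_t;

/* Encode one line against *d, which holds either a fresh dictionary or a
 * copy of the slot's current one. Returns false when more than 8 distinct
 * words would be needed; the caller then discards *d, leaving the physical
 * slot untouched. */
static bool compress_line(dict_t *d, const uint32_t line[WORDS_PER_LINE],
                          uint8_t code[WORDS_PER_LINE])
{
    for (int w = 0; w < WORDS_PER_LINE; w++) {
        int idx = -1;
        for (int e = 0; e < d->count; e++)
            if (d->entry[e] == line[w]) { idx = e; break; }

        if (idx < 0) {                   /* word not in dictionary yet */
            if (d->count == DICT_ENTRIES)
                return false;            /* dictionary full: 4:1 fails */
            d->entry[d->count] = line[w];
            idx = d->count++;
        }
        code[w] = (uint8_t)idx;          /* fixed-length 3-bit code word */
    }
    return true;
}
```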

Decompression is easier than compression, as it only involves 16 parallel table lookups with the index values. The lookup bandwidth of the dictionary structure determines the decompression latency. In the best case, in which all eight dictionary entries can be muxed to all 16 outputs independently, decompression takes only one cycle, minimizing the performance impact of decompression on the critical data access path.
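
In software terms the decompressor is just a gather over the dictionary; a scheme I sketch:

```c
#include <stdint.h>

/* All 16 lookups are independent, so in hardware they are 16 parallel
 * 8-to-1 muxes, corresponding to the single-cycle best case above. */
static void decompress_line(const uint32_t dict[8], const uint8_t code[16],
                            uint32_t out[16])
{
    for (int w = 0; w < 16; w++)
        out[w] = dict[code[w]];
}
```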

The paper also suggests that the choice between scheme I and scheme II should be made by set dueling. 32 sets of the LLC are configured to always use scheme I, and another 32 sets to always use scheme II. The observed performance of the two sampled groups then decides which scheme the remaining sets follow.
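
The paper does not spell out the dueling mechanism, so the following is a generic set-dueling sketch under the usual saturating-counter assumption:

```c
#include <stdint.h>

static int16_t psel = 0;  /* policy-selection counter, saturating */

/* Misses in the sampled sets nudge the counter toward the other scheme. */
static void miss_in_scheme1_sample(void) { if (psel < 1023)  psel++; }
static void miss_in_scheme2_sample(void) { if (psel > -1024) psel--; }

/* Follower sets adopt scheme II when scheme I's samples miss more often. */
static int follower_scheme(void) { return (psel > 0) ? 2 : 1; }
```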