Data Locality Exploitation in Cache Compression



Paper Title: Data Locality Exploitation in Cache Compression
Link: https://ieeexplore.ieee.org/document/8644558/
Year: ICPADS 2018
Keyword: YACC; DISH; Dual-Block Compression




Highlights:

  1. Use an uncompressed block both as cached data and as dictionary for other blocks in the same super-block;

  2. The DISH dictionary encoding algorithm can be extended by adding more dictionary entries. The enlarged dictionary can no longer fit in the same slot as the compressed data, so another slot is needed to store it, and that slot conveniently holds the uncompressed master block.

  3. The search for the master block need not be restricted to the same super-block. In fact, adjacent super-blocks may also exhibit good value locality, so blocks can be picked from these “buddy super-blocks”. This establishes data dependencies between super-blocks that reside in different sets, so the dependency should be kept as simple as possible (i.e., only the adjacent super-block is searched), and it should be tracked (with the FC field) so that when the master is modified or evicted, the corresponding actions can be taken.

Comments:

  1. The authors seem to use “block” and “slot” interchangeably, which is confusing. For example, Section 3.1 says “Dual-block compression is attempted when multiple blocks in the same sector are not compressible into a single block.” I guess the second “block” should be “slot”.

  2. This paper is really vague about the condition under which Dual-Block Compression is triggered (Section 3.1). To my understanding, it must be that the compression benefit outweighs the current storage usage. For example, if the current layout is that (assuming A, B, C, D are from the same super-block) A and B are compressed together into the same slot and C is uncompressed, then applying Dual-Block Compression makes no sense. However, if block D is also present and uncompressed, then the scheme should be attempted, since it would reduce the number of slots occupied by this super-block from three to two. The paper should really elaborate on this part.

  3. What if a new logical block from the same super-block is to be inserted into the LLC, especially when compression fails on the new block (i.e., it could not be compressed, or it does not fit with other companion blocks)?

This paper proposes Dual-Block Compression, a YACC-based cache compression technique that leverages inter-block redundancy for higher compression ratio. The paper is motivated by a third type of locality: data value locality, in addition to the well-known spatial and temporal locality. Data value locality suggests that there is a high chance that values in nearby memory locations will contain identical or regular values, as a result of loops in the program. As a result, explicit dictionary-based algorithms (i.e., dictionaries are generated dynamically based on the data pattern and co-located with compressed block data, rather than being implicitly encoded in the compressed block) work pretty well on these data patterns, since only a few values are sufficient to encode an entire block. The paper further notices that the pattern can extend across several memory blocks, making it possible to share the same dictionary for these blocks. This approach has a clear advantage of avoiding storing multiple dictionary entries for each block, potentially achieving a higher compression ratio.

The design presented in the paper is based on YACC, a super-block based compressed cache. For simplicity of implementation, YACC uses a coupled tag and data array, meaning that each tag entry is statically mapped to a 64-byte data slot, and the data of logical blocks encoded by the tag entry must be stored in the corresponding slot. Cache blocks are compressed before insertion and decompressed on access or eviction. Each tag entry encodes one to four logical blocks from the same super-block, which consists of four consecutive blocks in the physical address space. The entry carries only one address tag, aligned to a four-block boundary, plus four copies of status bits, so that each logical block can still function as an independent block. The index generation function of the cache is skewed such that all four blocks within the same super-block are mapped to the same set. In addition, multiple tag entries can be allocated to the same super-block if the compressed blocks need more than one data slot.
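The coupled tag/data organization and the set mapping above can be sketched as follows; the field names and the plain modulo index are my own simplification (the real YACC uses a skewed index function):

```python
# Illustrative sketch of a YACC-style coupled tag entry; field names are my
# own, not from the paper. One address tag covers a four-block super-block,
# and each logical block keeps its own copy of the status bits.
from dataclasses import dataclass, field

BLOCK_BYTES = 64       # each tag entry maps to one 64-byte data slot
BLOCKS_PER_SB = 4      # a super-block is four consecutive blocks

@dataclass
class YaccTagEntry:
    sb_tag: int                           # aligned to a four-block boundary
    status: list = field(default_factory=lambda: [None] * BLOCKS_PER_SB)

def set_index(block_addr: int, num_sets: int) -> int:
    """Index on the super-block number so that all four blocks of a
    super-block land in the same set (the real design skews this)."""
    sb_number = block_addr // (BLOCK_BYTES * BLOCKS_PER_SB)
    return sb_number % num_sets
```

With this mapping, blocks at addresses 0, 64, 128, and 192 (super-block 0) all hash to the same set, which is what lets one tag entry cover all four of them.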

The original YACC assumes a single-block compression algorithm, processing blocks individually without considering any inter-block redundancy. Previous work such as DICE has explored designs in which multiple blocks in the same super-block are compressed together using the same dynamic dictionary, leveraging inter-block redundancy as well. The dictionary is shared by all blocks currently in the slot (one to four blocks), and it is constructed by scanning the blocks for unseen “keys”, i.e., 32-bit words that are not yet in the dictionary. Data words are encoded by storing a pointer to a dictionary entry; optionally, a small delta can also be stored with each encoded word. Since the dictionary is small (eight entries) and all code words are of fixed size, decompression can be pretty fast and complete in just a few cycles.
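A minimal sketch of this style of dictionary construction and encoding, under my simplification of exact-match keys only (the real scheme also supports small deltas):

```python
# Minimal sketch of DICE-style shared-dictionary compression, simplified to
# exact-match keys only. Blocks are modeled as lists of 32-bit words.
DICT_ENTRIES = 8

def build_dictionary(blocks):
    """Scan the blocks in order, collecting unseen words as keys; fail if
    more than DICT_ENTRIES distinct values are found."""
    dictionary = []
    for block in blocks:
        for word in block:
            if word not in dictionary:
                if len(dictionary) == DICT_ENTRIES:
                    return None          # too many distinct values
                dictionary.append(word)
    return dictionary

def encode(blocks, dictionary):
    """Replace each word with a fixed-width pointer into the dictionary."""
    return [[dictionary.index(w) for w in block] for block in blocks]

blocks = [[7, 7, 3, 9], [9, 3, 7, 7]]    # two toy four-word "blocks"
d = build_dictionary(blocks)             # three shared keys: [7, 3, 9]
```

Because the pointers are fixed-width and the dictionary is tiny, decoding a word is a single indexed lookup, which is what makes the few-cycle decompression claim plausible.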

Dual-Block Compression improves on DICE by observing that, in some workloads, eight keys are not sufficient to encode all the distinct values in four blocks. The experiments show that these “hard” workloads actually need 9 to 24 keys to cover all the distinct values. Increasing the number of dictionary entries, however, is impossible for DICE, since the compressed data would then be significantly larger than 64 bytes. To deal with this, the paper proposes that, instead of co-locating the dictionary with the compressed blocks in the same slot, two slots can be used: one stores an uncompressed block from the super-block, which serves both as cached data and as a dictionary for the remaining blocks. The remaining one to three blocks are still encoded in DICE’s format and stored in the second slot, except that each pointer can now refer to 16 + 8 = 24 entries, so the pointer width increases. The eight-entry dictionary of the original DICE still co-locates with the compressed blocks and can also be referred to by the code words.

The dual-block scheme works as follows. When multiple blocks from a super-block can be compressed into the same slot, the cache just works as the baseline YACC without any need for inter-block compression, as the compression ratio is already satisfactory. If, however, blocks from the same super-block cannot be compressed into the same slot, and there are already at least three of them (two makes no sense, since after compression we would still use two data slots), Dual-Block Compression is attempted: a block is first selected as the “master block” to serve as the dictionary and is stored uncompressed in a separate slot, and the remaining blocks, called “companion blocks”, are compressed using the master block as well as the eight-entry co-located dictionary as references. As mentioned earlier, since there are 24 dictionary entries, each pointer needs 5 bits.
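Under my reading, companion encoding against the combined 24-entry reference set might look like the sketch below; the function and variable names are assumptions, not from the paper:

```python
# Sketch of companion-block encoding under my reading of the paper: the 16
# words of the uncompressed master block (indices 0-15) plus the 8-entry
# co-located dictionary (indices 16-23) form a 24-entry reference set, so
# each pointer needs ceil(log2(24)) = 5 bits.
import math

WORDS_PER_BLOCK = 16      # a 64-byte block holds 16 32-bit words

def dual_block_encode(master, side_dict, companion):
    """Encode a companion block against the combined reference set;
    return None if some word is not covered (fall back to other schemes)."""
    refs = list(master) + list(side_dict)
    out = []
    for w in companion:
        if w not in refs:
            return None
        out.append(refs.index(w))
    return out

master = list(range(16))                  # 16 distinct master words
side = [100, 200]                         # part of the 8-entry dictionary
codes = dual_block_encode(master, side, [3, 100, 15, 200])  # [3, 16, 15, 17]
assert math.ceil(math.log2(WORDS_PER_BLOCK + 8)) == 5       # 5-bit pointers
```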

When the master block is modified by a write back, all the remaining blocks must first be decompressed and then re-compressed, since the dictionary has changed. If re-compression fails, regular compression is attempted. In addition, if a new logical block from the super-block is to be inserted, compression should first be attempted on the new block. If the new block still fits with the other companion blocks, compression succeeds. The paper does not mention what happens if compression fails, but it is reasonable to assume that the block is simply stored uncompressed in a separate slot (or maybe another master block is tried?). All companion blocks are also evicted if the master block is to be evicted.

The paper also observes that for some workloads, super-block locality is low, meaning that only one or two logical blocks of each super-block are present in the LLC most of the time. Dual-Block Compression does not work in this case, as it requires at least three logical blocks from the same super-block. To address this issue, the paper proposes that the master block can come from either the current super-block or the “buddy super-block”, defined as the previous super-block in the address space if the current super-block number is odd, or the next one if it is even. These two super-blocks are mapped to adjacent sets by the controller, since the controller extracts the lower bits of the super-block number to form the index. In this scheme, when Dual-Block Compression is attempted, the controller probes the other set for a cached buddy block, and if one exists, also attempts compression using the buddy set’s block as the master block.
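The buddy relation and the adjacent-set mapping can be illustrated with a small sketch; this is my interpretation, with a plain low-bits index standing in for the actual index function:

```python
# Sketch of the buddy super-block relation (my interpretation): an odd
# super-block pairs with the previous one, an even super-block with the
# next, and a plain low-bits index maps buddies to adjacent sets.
def buddy_sb(sb_number: int) -> int:
    return sb_number - 1 if sb_number % 2 else sb_number + 1

def sb_index(sb_number: int, num_sets: int) -> int:
    return sb_number % num_sets      # lower bits of the super-block number
```

Note that the relation is symmetric (the buddy of my buddy is me), which keeps the cross-set dependency a simple pairing rather than a chain.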

To represent the various states a block can be in, the paper uses a three-bit Function Code (FC) to encode the master/companion status and the presence of a buddy set that has a data dependency with the current block. The FC also encodes block states such as uncompressed and individually compressed.
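The paper does not list the exact FC code points; a hypothetical assignment that covers the states mentioned might look like:

```python
# Hypothetical assignment of the 3-bit Function Code; the paper does not
# give the exact code points, so these values are illustrative only.
from enum import IntEnum

class FC(IntEnum):
    UNCOMPRESSED       = 0b000   # plain uncompressed block
    COMPRESSED         = 0b001   # individually compressed block
    MASTER             = 0b010   # uncompressed, serves as a dictionary
    COMPANION          = 0b011   # encoded against a master in this set
    MASTER_FOR_BUDDY   = 0b100   # master referenced from the buddy set
    COMPANION_OF_BUDDY = 0b101   # companion whose master is in the buddy set
```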