GBDI: Going Beyond Base-Delta-Immediate Compression with Global Bases



Paper Title: GBDI: Going Beyond Base-Delta-Immediate Compression with Global Bases
Link: https://ieeexplore.ieee.org/document/9773255/
Year: HPCA 2022
Keyword: Memory Compression; BDI; GBDI




Highlights:

  1. Compressing four consecutive blocks in DRAM as a sector, and prefetching them into a prefetch buffer for future accesses. The DRAM controller may also decide whether to prefetch and how many blocks to prefetch, based on the compressed size and the prefetch buffer hit rate.

  2. BDI can be extended to use global bases from a table. Each compressed code word consists of an index into the base table and a fixed-width delta. The code word can still be fixed-width in this case.

  3. The global base table can be generated by software via value binning, selecting the most frequent value from each of the most populated bins as a base.

  4. Further optimizations can be made, such as compressing the base value indices with Huffman encoding, using narrower delta values, and combining GBDI with regular BDI.

Comments:

  1. The paper did not evaluate recompression. Instead, it only shows that, for a single workload, compressibility does not change much over time. The problem is that recompression is inevitable if the workload consists of many different applications on a multicore system. In addition, context switches may be problematic, because the global bases are not part of any context. As a result, the more applications there are in the system, the larger the number of global bases is likely to be. The paper did not evaluate mixtures of different workloads; only homogeneous workloads are simulated. Broadly speaking, all designs that leverage inter-block compression share the same problem: the global bases (clusteroids for Thesaurus, the pattern table for EPC, and the global base table for GBDI) are not scalable, and are heavily bound to the execution context. This is the inherent trade-off of inter-block designs: the larger the search space, the better the potential compression, but as the search space grows, more metadata must be kept for each context.

  2. The algorithm is actually nowhere close to BDI. For example, immediate number compression is not used (because an explicit zero can always be placed in the base table). The fixed-width code word design of BDI is also absent. Compression and decompression latencies are much higher than those of the original BDI, which makes GBDI quite heavyweight. None of these is a fundamental problem, though.

  3. I find it hard to understand what “bin ranges” means and how it relates to the number of bins. Does “15 bin ranges” mean 2^15 (i.e., 32K) bins for sampling? Are bins evenly divided across the 32-bit value domain? The paper should have done a better job describing the binning algorithm.

  4. Does recompression need to scan all of the DRAM contents? The paper mentions that this can be done just like DRAM refresh, so I guess it is just the DRAM controller issuing commands in the background and performing recompression.

This paper proposes Global Base-Delta-Immediate (GBDI), a main memory compression architecture leveraging inter-block compression with global bases. The paper is motivated by two limitations of intra-block BDI. First, BDI only leverages redundancy between similar words within a 64-byte block, without looking into other blocks for redundancy. The restricted search space limits compressibility, since a single block may not contain all the base values needed for efficient compression. Second, the paper observes that floating point numbers are difficult to compress under BDI, because BDI interprets its inputs as integers and computes arithmetic differences. For floating point numbers, this approach can be rather unreliable, because values that are numerically close may have radically different bit patterns when interpreted as integers. This further reduces the effectiveness of intra-block base value search for floating point data.
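
A quick illustration of the floating point issue (my example, not the paper’s): reinterpreting the bits of two numerically close 32-bit floats as integers can yield differences far larger than any small delta budget.

```python
import struct

def as_u32(x):
    """Reinterpret a 32-bit float's bit pattern as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

# 1.0 and 1.001 are numerically close, yet their bit patterns differ by
# thousands of units when read as integers -- far beyond a 5-bit delta.
print(hex(as_u32(1.0)), hex(as_u32(1.001)), as_u32(1.001) - as_u32(1.0))

# 1.0 and -1.0 are only 2.0 apart numerically, but differ in the sign bit,
# i.e., by 2^31 on the unsigned integer domain.
print(as_u32(-1.0) - as_u32(1.0))
```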

The GBDI design consists of two parts. The first part is the main memory architecture, which is only briefly described in the paper. The goal of main memory compression is to reduce the bandwidth consumed by fetching data from DRAM and to reduce data access latency. This is achieved by compressing four consecutive 64-byte blocks of the physical address space into a sector, which is stored at the physical location of the first block (the rest of the storage is simply wasted). The DRAM controller is extended with a prefetch buffer. When any of the blocks is accessed, its neighboring blocks will be prefetched into the prefetch buffer (the paper also mentions that this is not always the case: the DRAM controller may decide not to prefetch, depending on the hit rate of the prefetch buffer and the size of the compressed sector), such that later accesses hit in the prefetch buffer without going to DRAM again.
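
The paper does not spell out the prefetch policy, so the following is a purely hypothetical sketch of the decision logic; the thresholds and the `prefetch_count` helper are my inventions, and only the two inputs (hit rate and compressed sector size) come from the paper.

```python
def prefetch_count(hit_rate, compressed_sector_bytes):
    """Hypothetical policy: prefetch more of the sector when it is cheap
    to fetch (compresses well) and the prefetch buffer has been useful."""
    bursts = -(-compressed_sector_bytes // 64)   # 64B DRAM bursts needed (ceil)
    if hit_rate >= 0.5 and bursts <= 2:
        return 4    # fetch all four blocks of the sector
    if hit_rate >= 0.2:
        return 2    # fetch the demanded block plus one neighbor
    return 1        # buffer not paying off: fetch only the demanded block
```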

The compressed size of each 64-byte block is stored as metadata in a dedicated region of DRAM. The compressed sizes are also fetched when a sector is accessed, in order to delimit the compressed blocks within the sector. To avoid accessing DRAM twice on every request, the paper proposes that the memory controller also maintain a 1K-entry metadata cache, which achieves a metadata hit rate above 95%.

The second part of GBDI is the compression algorithm. GBDI relies on software to load a global table with predetermined bases and delta sizes (the number of bits each delta is encoded with). At compression time, the compressor performs an associative lookup in the table for every 32-bit input word, obtaining the index of a base value. The associative lookup compares the word with all base values and selects the base that yields the smallest delta. If the delta can be encoded in the required number of bits, the encoding succeeds, and the code word is stored in the output as the index plus the delta value. To simplify the design, the paper proposes that the output code word be 16 bits, meaning that for a base table of size 2^N, the delta width for all bases is hard-wired to (16 - N). In the evaluation, N is 11, meaning there are 2048 global bases, and the delta is 5 bits wide with range [-16, 15].
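
Below is a minimal sketch of the per-word encoding step under the evaluated configuration (N = 11, 5-bit deltas); the function name and the linear scan are mine, since hardware would compare against all bases in parallel.

```python
BASE_BITS = 11                            # 2^11 = 2048 global bases
DELTA_BITS = 16 - BASE_BITS               # 5-bit signed delta
DELTA_MIN = -(1 << (DELTA_BITS - 1))      # -16
DELTA_MAX = (1 << (DELTA_BITS - 1)) - 1   # 15

def encode_word(word, bases):
    """Return a 16-bit (index, delta) code word, or None if no base can
    encode `word` within the delta range (i.e., the word is an outlier)."""
    best = None
    for idx, base in enumerate(bases):
        delta = word - base
        if DELTA_MIN <= delta <= DELTA_MAX:
            if best is None or abs(delta) < abs(best[1]):
                best = (idx, delta)
    if best is None:
        return None
    idx, delta = best
    return (idx << DELTA_BITS) | (delta & ((1 << DELTA_BITS) - 1))
```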

In practice, to avoid associative lookups against all entries of the base table, the table is partitioned into several smaller tables based on the base values. An associative lookup is only performed on a particular partition if the input word's value lies within the range of that partition. This is implemented with per-partition bound registers that filter out the word if its value lies outside the partition's range.
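
A sketch of the bound-register filter, assuming each partition tracks the minimum and maximum base values it holds (the representation is mine):

```python
DELTA_MIN, DELTA_MAX = -16, 15

def matching_partitions(word, partitions):
    """partitions: list of (lo, hi, bases) tuples. A base b can encode
    `word` only if b lies in [word - DELTA_MAX, word - DELTA_MIN], so a
    partition is searched only when its base range overlaps that window."""
    lo_w, hi_w = word - DELTA_MAX, word - DELTA_MIN
    return [bases for (lo, hi, bases) in partitions if lo <= hi_w and hi >= lo_w]
```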

If an input word cannot be encoded with any base value (i.e., the required delta size exceeds (16 - N) bits), the word is an outlier and is stored in the output unchanged. The output contains a bit mask indicating which words are outliers and which are encoded.
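
Putting the pieces together, here is a hedged sketch of the block-level output layout (mask, then code words, then raw outliers); the exact bit layout in the paper may differ. It reuses `encode_word` from the sketch above.

```python
import struct

def compress_block(block, bases):
    """Compress a 64-byte block (16 x 32-bit words): a 16-bit outlier mask,
    followed by 16-bit code words, followed by uncompressed 32-bit outliers."""
    words = struct.unpack("<16I", block)
    mask, codes, outliers = 0, [], []
    for i, w in enumerate(words):
        cw = encode_word(w, bases)
        if cw is None:
            mask |= 1 << i          # mark word i as an outlier
            outliers.append(w)
        else:
            codes.append(cw)
    return (struct.pack("<H", mask)
            + struct.pack("<%dH" % len(codes), *codes)
            + struct.pack("<%dI" % len(outliers), *outliers))
```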

Decompression works in the reverse direction of compression. For each code word, the decompressor looks up the base table using the index, then adds the delta to the base value to restore the original word. Outliers are copied directly into the output without any decoding.
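
And the matching decoder for the sketch above, again under my assumed layout:

```python
import struct

def decompress_block(data, bases):
    """Invert compress_block: read the outlier mask, decode each 16-bit
    code word via a table lookup plus delta add, copy outliers through."""
    (mask,) = struct.unpack_from("<H", data, 0)
    n_out = bin(mask).count("1")
    codes = struct.unpack_from("<%dH" % (16 - n_out), data, 2)
    outliers = struct.unpack_from("<%dI" % n_out, data, 2 + 2 * (16 - n_out))
    words, ci, oi = [], 0, 0
    for i in range(16):
        if mask & (1 << i):
            words.append(outliers[oi]); oi += 1
        else:
            cw = codes[ci]; ci += 1
            idx, delta = cw >> 5, cw & 0x1F
            if delta >= 16:                  # sign-extend the 5-bit delta
                delta -= 32
            words.append((bases[idx] + delta) & 0xFFFFFFFF)
    return struct.pack("<16I", *words)
```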

The global base table is generated by background software using sampling. Sampled values are sorted into bins, each of which covers a certain value range (the paper does not describe this part clearly, but I guess the bins are equally divided across the 32-bit value domain). The software then selects the most populated bins (as many bins as there are base table entries), and within each bin, the value with the highest occurrence count is selected as a global base. The paper suggests that the sampling process is very short: merely 200K values would suffice.
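
A sketch of the binning step as I understand it; the bin width (here 2^15 bins over the 32-bit domain, echoing the paper’s “15 bin ranges”) is my guess, since the paper does not define it:

```python
from collections import Counter

def select_bases(samples, n_bases=2048, bin_bits=15):
    """Bin sampled 32-bit values by their high bits (assumed even split of
    the value domain), keep the n_bases most populated bins, and pick the
    most frequent value inside each bin as a global base."""
    shift = 32 - bin_bits
    bins = {}
    for v in samples:
        bins.setdefault(v >> shift, Counter())[v] += 1
    top = sorted(bins.values(), key=lambda c: sum(c.values()), reverse=True)
    return [c.most_common(1)[0][0] for c in top[:n_bases]]
```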

After the base values are generated, software loads them into the compressor’s global base table. The decompressor, on the other hand, must keep two copies of the base table: one for blocks compressed with the global bases from before the update, and another for blocks compressed after it. When an update happens, the contents of the after-table are copied to the before-table, and the after-table is loaded with the newly generated base values. The DRAM controller, in the background, constantly issues recompression commands for every DRAM row, in a way similar to how DRAM is refreshed. The before-table can be discarded after a full round of recompression over all contents of the DRAM device. To distinguish blocks compressed with the two tables, each compressed block carries one bit indicating which version of the global base table it uses. The DRAM controller similarly maintains a sense bit, which is flipped every time the table is updated, and this sense bit is written into the compression output.
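
The two-table versioning reduces to a small amount of state; a sketch, with names of my choosing:

```python
class BaseTables:
    """Decompressor-side state: two table copies plus a sense bit. A block's
    one-bit tag equals the current sense iff it was compressed after the
    most recent table update."""
    def __init__(self, initial_bases):
        self.before = self.after = initial_bases
        self.sense = 0

    def update(self, new_bases):
        self.before = self.after     # old after-table becomes the before-table
        self.after = new_bases       # newly generated bases
        self.sense ^= 1              # blocks compressed from now on get new tag

    def table_for(self, block_tag):
        return self.after if block_tag == self.sense else self.before
```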

The paper also proposes three optimizations to further improve compression. First, blocks for which intra-block compression already yields satisfactory results are not compressed using global bases. Instead, they are compressed with a regular BDI engine, and tagged with two-bit metadata to indicate so. Second, for certain blocks, all deltas are smaller than the hard-wired delta size. In this case, the deltas can be encoded with fewer bits, with the reduced size explicitly stored in the compressed block. The decompressor recognizes this case via the two-bit metadata and decodes accordingly. Lastly, when only a few base values are used within a block, the base value indices themselves can be compressed with Huffman coding, yielding code words smaller than the full 11-bit base index (assuming a 2048-entry base table). The Huffman tree must also be stored in the output so that the decompressor can correctly restore the indices.
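
For the narrower-delta optimization, the per-block delta width can be computed with a scan like the following (a trivial sketch; the helper is mine):

```python
def min_delta_width(deltas):
    """Smallest signed bit-width that can represent every delta in a block;
    if it is below the hard-wired 5 bits, code words can shrink accordingly."""
    width = 1
    while any(not (-(1 << (width - 1)) <= d <= (1 << (width - 1)) - 1)
              for d in deltas):
        width += 1
    return width
```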