Exploiting Inter-block Entropy to Enhance the Compressibility of Blocks with Diverse Data



Paper Title: Exploiting Inter-block Entropy to Enhance the Compressibility of Blocks with Diverse Data
Link: https://ieeexplore.ieee.org/document/9773200
Year: HPCA 2022
Keyword: Memory Compression; EPC; Pattern-Based Compression; Attache




Highlights:

  1. Redundancy between blocks can be utilized by having a machine-generated pattern table that stores commonly used global patterns.

  2. Redundancy within a block can be reduced by: (1) Shrinking data width, which is common in other compression designs as well; (2) Flipping the value from negative to positive, and remembering the sign bit; (3) XOR’ing non-zero values to extract even more bit-level redundancy.

Comments:

  1. Does the XOR stage work well if the first non-zero value of the block does not share a common high-entropy region with the rest of the words?

  2. The hardware overhead is not as small as the paper suggests. For example, 8 copies of the compressor and decompressor circuits are needed. In particular, the last XOR stage cannot operate individually on each 64-bit word; instead, it must operate over all 16 32-bit words, so single-cycle latency may not be achievable. In addition, the 256-entry pattern table implemented as a CAM can be prohibitively expensive, and the lookup latency can be higher than what the paper shows, not to mention the 512-entry candidate table.

  3. Is the pattern table context switched (likely not)? In that case, does performance degrade with more applications (likely yes, but not evaluated)? Most global-pattern-based designs have the same issue, though.

  4. What happens when a new pattern is to be inserted, but the pattern table has no free entry? I guess the only sensible solution is to not insert it, completely wasting the compression opportunity. By the way, if the pattern table is not context switched, I cannot see how profiling works, unless the entire machine runs the same workload with the same data pattern, which is highly unlikely.

This paper proposes Entropy-Based Pattern Compression (EPC), a novel main memory compression and decompression pipeline that utilizes global common patterns for a higher compression ratio than single-block compression. The proposed design compresses memory blocks to be stored in the DRAM using a pattern table, which is trained on the data stream previously written into DRAM. This leverages global pattern information and enables compression to exploit inter-block redundancy. The pattern table stores common bit patterns found in the data blocks, and compression works by replacing matching patterns in the incoming block with indices into the pattern table. Furthermore, the paper proposes three optimizations in the compression pipeline that transform data words, increasing the chance that incoming data blocks can be compressed using the pattern table.

The EPC design uses the Attache architecture to store compressed blocks in DRAM and to track metadata. To elaborate: the Attache design divides a block's storage into two halves mapped to two sub-ranks that can be activated separately. If two consecutive 64-byte blocks each compress to 30 bytes or fewer, the pair can be stored in the space of a single block, one compressed block per sub-rank, and each can be accessed separately by activating only its sub-rank, hence doubling the bandwidth when both blocks are accessed. Blocks that cannot be compressed to <= 30 bytes remain uncompressed and are stored in their original location. Each compressed block is prefixed with a 2-byte magic marker before the actual compressed content. The magic marker is used by the DRAM controller to distinguish a compressed block stored in a sub-rank from an uncompressed block. In the rare case where the first two bytes of an uncompressed block happen to equal the magic marker, an extra bit per block further indicates whether the marker is the result of value aliasing or marks an actual compressed block. These extra bits are stored in a dedicated region of DRAM, constitute only a small fraction of total storage, and are accessed only rarely. Note that the Attache design aims only at increasing DRAM access bandwidth rather than saving any physical storage, and the same is true of EPC.
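A rough sketch of this placement rule is shown below, assuming a made-up marker value and a simple 32-bytes-per-sub-rank layout; the exact encoding used by Attache may differ.

```python
MAGIC = b"\xC0\xDE"                 # hypothetical 2-byte marker prefixing a compressed block
SLOT_BYTES = 64                      # one block-sized slot, split across two sub-ranks
HALF = SLOT_BYTES // 2               # 32 bytes per sub-rank
MAX_COMPRESSED = HALF - len(MAGIC)   # 30-byte budget per compressed block

def pack_pair(comp_a: bytes, comp_b: bytes):
    """Try to place two adjacent compressed blocks into one block-sized slot."""
    if len(comp_a) <= MAX_COMPRESSED and len(comp_b) <= MAX_COMPRESSED:
        half_a = (MAGIC + comp_a).ljust(HALF, b"\x00")   # goes to sub-rank 0
        half_b = (MAGIC + comp_b).ljust(HALF, b"\x00")   # goes to sub-rank 1
        return half_a + half_b
    return None   # fall back to storing both blocks uncompressed

def looks_compressed(slot_half: bytes, alias_bit: bool) -> bool:
    """Controller check: a marker match means 'compressed' unless the alias bit is set."""
    return slot_half[:2] == MAGIC and not alias_bit
```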

EPC extends Attache by adopting a different compression and decompression pipeline that utilizes global pattern information. The compression pipeline consists of four stages that process input data in units of 64-bit words, and compression is performed on 64-byte blocks, the unit of data exchange between DRAM and the cache hierarchy. In the first stage, the compressor identifies the data types stored in the 64-bit word. The paper lists several combinations of narrow integer values, pointer values, and floating-point values, each with a different number of "high-entropy" lower bits. This stage uses five detectors in parallel to determine the number of high-entropy bits, by counting the number of leading zeros and ones in each of the two 32-bit words. The output of this stage is the number of high-entropy lower bits, which is fed to the second stage together with the two 32-bit words.
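A minimal sketch of the first-stage idea, assuming a hypothetical set of candidate widths and treating the leading run of sign bits as the low-entropy part; the paper's detectors and exact width encodings may differ.

```python
def leading_sign_bits(x: int) -> int:
    """Length of the run of leading bits equal to the sign bit of a 32-bit word."""
    sign = (x >> 31) & 1
    n = 0
    for i in range(31, -1, -1):
        if (x >> i) & 1 == sign:
            n += 1
        else:
            break
    return n

def high_entropy_width(x: int, widths=(8, 16, 24, 32)) -> int:
    """Smallest candidate width that still covers the non-redundant lower bits.
    The candidate widths here are an illustrative assumption."""
    needed = 33 - leading_sign_bits(x)   # lower bits to keep, including one sign bit
    for w in widths:
        if needed <= w:
            return w
    return 32
```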

Then, in the second stage, the compressor detects negative values and flips them to positive. The operation is relatively simple: check the sign bit and, if it is set, take the two's complement of the value. The sign bit is preserved as part of the output so that the decompressor can restore the original value. The purpose of this stage is to reduce the number of distinct patterns, since k and -k are both mapped to the same pattern k (where k is a positive value). The output of this stage consists of everything from the first stage, plus the sign bit.
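A small sketch of this sign-flip transform on a 32-bit word; the function names and packaging of the sign bit are illustrative, not the paper's exact encoding.

```python
MASK32 = 0xFFFFFFFF

def flip_negative(word: int):
    """If the 32-bit word is negative, negate it and remember the sign bit,
    so that +k and -k map to the same positive pattern k."""
    sign = (word >> 31) & 1
    if sign:
        word = (-word) & MASK32   # two's complement negation, back to a positive value
    return word, sign

def unflip(word: int, sign: int) -> int:
    """Decompressor side: restore the original value from the flipped word and sign bit."""
    return (-word) & MASK32 if sign else word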

In the third stage, the number of patterns is further reduced by XOR'ing the non-zero low-entropy values together. This stage consists of two steps. In the first step, zero-valued words from the previous stage are omitted and indicated by a "1" bit in the output metadata. In the second step, the remaining non-zero words are XOR'ed with the first non-zero word and indicated by a "0" bit in the output metadata. The values after the XOR are stored in the final output, in addition to the metadata.
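A sketch of this zero-elision plus XOR step, assuming the first non-zero word is kept verbatim so the transform stays invertible; the paper's exact output layout may differ.

```python
def xor_stage(words):
    """Zero words are dropped and marked with a '1' in the bitmap; the remaining words
    are XOR'ed with the first non-zero word, which is kept verbatim here."""
    zero_map = [1 if w == 0 else 0 for w in words]
    nonzero = [w for w in words if w != 0]
    if not nonzero:
        return zero_map, []
    base = nonzero[0]
    return zero_map, [base] + [w ^ base for w in nonzero[1:]]

def xor_stage_inverse(zero_map, payload):
    """Reinsert the zeros and XOR the later words back against the base value."""
    base = payload[0] if payload else 0
    restored = iter([base] + [w ^ base for w in payload[1:]])
    return [0 if z else next(restored) for z in zero_map]
```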

The paper argues that this stage helps reduce the number of patterns in two ways. First, if the block is under initialization, many of its values are zeros, so treating zeros separately pays off. Second, if the block contains a mixture of different types, the XOR can at least extract additional redundancy between the first non-zero word in the block and the rest of the words.

In the last stage, the patterns are compressed against the pattern table. The pattern table is a reference-counted table of patterns that have been seen previously. A match in the pattern table causes the corresponding word to be replaced with its index into the pattern table (which is presumably smaller than the value itself), hence achieving pattern compression. Unmatched data is stored as-is in the form produced by the previous stage, which may still be shorter than the full 32-bit original word.
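A minimal sketch of this lookup step, with a hypothetical tagged output format distinguishing table indices from literals.

```python
def pattern_encode(words, pattern_table):
    """Replace each word that hits in the pattern table with its table index;
    misses keep the (already narrowed) literal value."""
    index = {p: i for i, p in enumerate(pattern_table)}
    out = []
    for w in words:
        if w in index:
            out.append(("idx", index[w]))   # short index instead of the full pattern
        else:
            out.append(("lit", w))          # literal, stored as produced by stage three
    return out
```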

The pattern table can be generated by profiling, or by the hardware. In the former case, a profiling program monitors the memory contents of the application at fixed intervals and selects commonly occurring patterns to be statically loaded into the table. In the latter case, the DRAM controller maintains a candidate table of patterns that have been seen previously, replaced in an LRU fashion. Each candidate entry has a counter that is incremented whenever an input block contains the pattern, and a pattern is moved from the candidate table into the pattern table (which is smaller than the candidate table) once its counter exceeds a threshold. The pattern table itself is reference counted: the counter is decremented when an instance of the pattern no longer exists (which can be tracked as blocks are modified in the LLC), and an entry can only be freed once its reference count reaches zero.
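A sketch of the hardware training path under these rules; the table sizes follow the numbers quoted in the comments above, while the promotion threshold and replacement details are assumptions.

```python
from collections import OrderedDict

CANDIDATE_ENTRIES, PATTERN_ENTRIES, PROMOTE_AT = 512, 256, 16  # threshold is assumed

candidates = OrderedDict()   # pattern -> hit count, kept in LRU order
pattern_table = {}           # pattern -> reference count of live compressed blocks

def observe(pattern):
    """Count a pattern seen in a block written to DRAM; promote it to the pattern
    table once its candidate counter crosses the threshold and a free entry exists."""
    if pattern in pattern_table:
        pattern_table[pattern] += 1              # one more block references this pattern
        return
    count = candidates.pop(pattern, 0) + 1       # refresh its LRU position
    if count >= PROMOTE_AT and len(pattern_table) < PATTERN_ENTRIES:
        pattern_table[pattern] = 1
        return
    candidates[pattern] = count
    if len(candidates) > CANDIDATE_ENTRIES:
        candidates.popitem(last=False)           # evict the least recently seen candidate

def release(pattern):
    """Called when a compressed block using this pattern is overwritten; the entry
    can only be freed once its reference count reaches zero."""
    if pattern in pattern_table:
        pattern_table[pattern] -= 1
        if pattern_table[pattern] == 0:
            del pattern_table[pattern]
```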