Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures

- Hide Paper Summary

Paper Title: Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures
Link: https://dl.acm.org/doi/10.1109/ISCA.2016.37
Year: ISCA 2016
Keyword: Bit-Plane Compression; Compression; GPGPU

Back

Highlight:

Good observation that preceding zeros in delta-based compression can be utilized in a better way by using bit-planes. More zero bits can be extracted and placed together this way, resulting in higher compression ratio.
Also good observation that occasional “abnormal” values will result in two consecutive ones in the bit-plane words, which is compressed using a special code.
I like the discussion of Fixed Width Compression and Variable Width Compression

Questions

The paper is poorly organized. There is neither overall picture of what the algorithm attempts to achieve, nor a detailed description of how compression is performed. I read seven pages, and most of them are just related work (which I already know) and observations by running different workloads (not even interesting, all expected).
The paper also failed to mention that this algorithm is only applicable to 128B blocks, since there has to be 32 words to form 32-bit code words after bit-plane rotation. Although the paper did mention that this algorithm is targeted at GPGPU which may have larger cache blocks due to the regular access pattern.
The paper also does not summarize what kind of inefficiencies in previous approaches it solves. At least give a comparison with BDI like what I did in this paper summary.
The claim that one abnormal value will result in two consecutive bits in the code words is over-simplified. Imagine four consecutive words before compression: 0000…001, 0000…001, 0001…001, 0000…001. (i.e., the third value is abnormal by having a “1” high bit). After taking delta we will get the sequence: 0000…000, 0001…000, 1111…000. In this case, the code word after flipping the bit-plane will be: …001…, …001…, …001…, …011…, …, …000…, …000…, …000…. After the XOR operation, the results will be: …001…, …000…, …000…, …010…, …, …???…, …000…, …000…. So actually, the results will be only one “1” bit, instead of two.

This paper proposes Bit-Plane Compression, a data compression algorithm for better compression ratio. Conventional compression algorithms, such as Base-Delta-Immediate (BDI), may take advantage of value locality of adjacent words by subtracting the value from its neighbor. Since values are expected to be close to each other, these deltas are small, such that only a few bits is sufficient to store them, achieving the goal of compression that fewer bits are used to encode the same information.

Although the paper did not point this out, conventional algorithms such as BDI have a few issues which limit their compression ratio. First, BDI is not always optimal in terms of reducing the number of bits requires to store a delta value. Theoretically speaking, BDI is able to compress delta values to the smallest bit length possible by allowing arbitrarily sized code words. In practice, this is infeasible, because: (1) the bit width of the code words must also be encoded in the final result; (2) This also puts a heavy burden on compressor circuit to detect all possible bit lengths. As a result, BDI only supports a limited set of compressed code word lengths, e.g., 8 bits, 16 bits and 32 bits, which leaves some redundant zero bits in the output code words. In the worst case give the size possibilities above, if all code words can be encoded with 17 bits, BDI will choose the 32-bit compression scheme, resulting in 15 redundant zero bits for each code word.

The second issue with BDI is that it only selects the first value, not the best fitted one, as the base. This is another design compromise that was made to avoid exhaustively searching for the most appropriate base value. It will, however, significantly lower the compressibility of the entire cache line, if the first value is not a proper base. This occurs, for example, when the array starts mid-line or it is an occasional abnormal value. In this case, all delta values will be too large to compress, while still being close to each other and having abundant redundancy.

The last issue is that BDI cannot handle occasional abnormal values well. If such a value occurs, the abnormal delta will determine the bit width of compressed code word, making compression difficult or impossible.

Bit-Plane compression avoids the above issues by performing a bit-place transformation after computing deltas. The transformation generates a new set of code words, where code word i is comprised of all bits at offset i from the delta values. The generated code word will always be zero, if bit i from all delta values are zero, meaning that all high zero bits can be transformed into value zero, solving the first issue above. In addition, an abnormal value will result in a few “1” bits in the generated code word. These sparse “1” bits can be encoded with frequent pattern compression, solving issue three. Issue two is also solved by the same reason, since large delta values tend to have less alternation on higher bits, given that these deltas are close to each other (i.e., second-order value locality), resulting in code words full of ones or zeros.

One particular problem with only bit-plane flipping is that when the deltas have an alternating positive and negative number pattern, the generated code words by flipping the bit-plane will not be all-zeros, but instead, identical code words with non-zero value. To solve this, the paper proposes that all code words be XOR’ed with its previous value (except the first one, which is stored as-is), resulting in a sequence of all-zeros, regardless of the alternating pattern.

In the last step of the compression pipeline, the hardware further compresses each code word based on the pattern. Different from previous proposals, the hardward does not seek to compress large integers or irregular patterns, as their entropy has been reduced to a low level by the previous transformations. The most common values in the code word stream is zero, which is compressed with run-length encoding. The compressor maintains a counter for the current number of zero code words, which is incremented for every zero code word in the stream, and written to the output stream when the non-zero is not seen. The hardware also has seperate encodings for all-ones, single “1” bit, adjacent double “1” bit, and code word in non-XOR’ed form (if the value is zero before XOR’ing, then we only store zero, since it is more efficient to store as many zeros as possible). Each of these patterns has a unique prefix, which is optionally followed by arguments. Note that single “1” bit and double “1” bits represent two special cases. In the first case, the base value changes mid-way through the delta transformation, resulting in significantly different deltas. In the second case, there is one abnormal value in the input stream, resulting in two significantly different delta. Code words that cannot match any of these patterns are stored uncompressed.

Overall, Bit-Plane Compression is supposed to be performed on GPGPUs, where the cache block is larger than those on the CPU. This paper assumes 128 bytes cache blocks, which can be treated as 32 32-bit words. Delta is taken between a word and its previous word, resulting in 31 deltas (the first value is directly compressed with FPC as in the last step). Bit-plane compression is then applied by taking the bit from offset i of delta j to form bit j of code word i, and then XOR’ed with the previous value. The first value is treated as the first value in delta compression. Note that after bit-plane compression, the first value (formed by the LSB of all deltas) is likely uncompressible, and therefore, storing it uncompressed will not affect compression ratio. In the last step, all code words are encoded with FPC as discussed above, and written to the output buffer.