DFPC: A Dynamic Frequent Pattern Compression Scheme in NVM-Based Main Memory



Paper Title: DFPC: A Dynamic Frequent Pattern Compression Scheme in NVM-Based Main Memory
Link: https://ieeexplore.ieee.org/document/8342274/
Year: DATE 2018
Keyword: NVM; FPC; DFPC; Compression




Highlights:

  1. Zero compression can be applied at the 4-bit nibble level using FPC, with the actual pattern expressed as an 8-bit mask over a 32-bit word. Only zero nibbles are compressed away; non-zero nibbles are stored in the payload (see the sketch after this list).

  2. Different applications exhibit different distributions of zero nibbles within 32-bit words, and each distribution stays fairly consistent throughout execution. Unfortunately, this property cannot be leveraged with static patterns, since each application has its own signature distribution.

  3. We can train dynamic patterns by simply counting the occurrences of each pattern and selecting the ones that yield the most compression benefits.

  4. FPC can utilize dynamically generated patterns to better adapt to different workloads.
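
To make highlight 1 concrete, below is a minimal sketch of deriving the 8-bit zero-nibble mask of a 32-bit word; the function name and bit ordering are my own assumptions, not taken from the paper.

```python
def zero_nibble_mask(word: int) -> int:
    """Return an 8-bit mask for a 32-bit word: reading both the mask and
    the word from the most significant end, mask bit i is 1 iff nibble i
    of the word is zero."""
    mask = 0
    for i in range(8):
        nibble = (word >> (28 - 4 * i)) & 0xF
        mask = (mask << 1) | (1 if nibble == 0 else 0)
    return mask

# 0x00A00B00 has zero nibbles everywhere except positions 2 and 5,
# so only those two nibbles would go into the payload.
assert zero_nibble_mask(0x00A00B00) == 0b11011011
```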

Comments:

  1. The pattern table, especially its dynamic entries, should be saved on a context switch so that multiple processes can use DFPC, each with its own dynamic patterns. Since not all processes use DFPC, one optimization is to swap the table lazily, only when the process being swapped in indicates in its PCB that it intends to use the pattern table. This mirrors how FPU state is saved lazily in today's OSes (see the sketch after this list).

  2. The authors may need to polish and proofread certain sections. Some terminology is odd; e.g., a 4-bit unit is not a “character”, which typically refers to an 8-bit ASCII code or an int8_t variable. “Nibble” would be a better term.

  3. I do not get why there are only 128 counters. The pattern mask has 8 bits, so counting the occurrences of all possible masks would seem to require up to 256 counters.
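
Returning to comment 1, here is a minimal sketch of the lazy table-swap idea, modeled after lazy FPU state saving. All names (Process, DFPCContext, uses_dfpc) are hypothetical; the paper does not describe OS integration.

```python
class Process:
    """Hypothetical PCB fragment: a DFPC-usage flag plus saved patterns."""
    def __init__(self, uses_dfpc: bool):
        self.uses_dfpc = uses_dfpc
        self.pattern_table = None         # saved dynamic entries, if any

class DFPCContext:
    """Lazily virtualizes the hardware pattern table across processes."""
    def __init__(self):
        self.owner = None                 # process whose table is live
        self.hw_table = [None] * 4        # the four dynamic entries

    def switch_to(self, nxt: Process) -> None:
        if not nxt.uses_dfpc or self.owner is nxt:
            return                        # lazy: leave the live table alone
        if self.owner is not None:
            self.owner.pattern_table = list(self.hw_table)  # save outgoing
        self.hw_table = list(nxt.pattern_table or [None] * 4)
        self.owner = nxt
```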

This paper proposes Dynamic Frequent Pattern Compression (DFPC), a data compression scheme for reducing NVM wear. The paper is motivated by the fact that current NVM devices endure only a limited number of write-erase cycles, so excessive writes not only consume bandwidth but also harm the device's lifespan. Previous works attempt to address this issue with two techniques: compression and Flip-and-Write (FNW). Cache blocks are first compressed to reduce the number of bits written into the device, and the compressed data is then stored either as the original binary or as its bit complement, whichever minimizes the number of bit flips. In such an architecture, each cache-block-sized region on the NVM is tagged with 2-bit metadata: one bit indicating whether the block is compressed, and the other indicating whether the data is stored flipped.
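
As a rough illustration of FNW (the function names are mine, and the real scheme applies the flip decision at word granularity rather than to the whole block as done here):

```python
def bit_flips(old: int, new: int) -> int:
    """Number of cells whose stored value must change."""
    return bin(old ^ new).count("1")

def flip_n_write(old: int, new: int, width: int = 512) -> tuple:
    """Write `new` as-is or bit-complemented, whichever flips fewer
    cells relative to the currently stored `old`; the choice is
    recorded in the per-block flip bit."""
    inverted = new ^ ((1 << width) - 1)
    if bit_flips(old, inverted) < bit_flips(old, new):
        return inverted, 1                # flip bit set: stored complemented
    return new, 0                         # flip bit clear: stored as-is
```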

This paper, however, points out that static Frequent Pattern Compression (SFPC), which is used by previous work, fails to compress certain patterns that occur frequently at run time. The classical SFPC algorithm divides the input stream into 32-bit words and compares each word against a fixed set of pattern masks (e.g., 00XX, where “0” means the byte must be literal zero and “X” means don't-care; the don't-care bytes are carried in the output code word). The paper observes that the static patterns used by SFPC fail to capture many frequent patterns that are pervasive in some workloads, even though those patterns are quite consistent and compressible. The fact that each application has its own set of frequent but “non-standard” patterns complicates the issue, since we cannot simply improve SFPC by adding new static patterns.
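
A minimal sketch of SFPC-style static matching, assuming the zero-byte formulation used in the example above (the real FPC pattern set also includes sign-extended encodings not shown here):

```python
PATTERNS = [                  # (name, mask of bytes that must be zero)
    ("0000", 0b1111),         # all four bytes zero
    ("000X", 0b1110),         # only the lowest byte may be non-zero
    ("00XX", 0b1100),         # upper halfword zero
    ("XXXX", 0b0000),         # no constraint: stored uncompressed
]

def matches(word: int, zero_mask: int) -> bool:
    for i in range(4):                          # byte 0 = most significant
        must_be_zero = zero_mask & (1 << (3 - i))
        if must_be_zero and (word >> (24 - 8 * i)) & 0xFF:
            return False
    return True

def best_pattern(word: int) -> str:
    for name, mask in PATTERNS:                 # ordered densest-first
        if matches(word, mask):
            return name

assert best_pattern(0x000000AB) == "000X"
```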

To address this limitation, DFPC allows application-specific patterns to be trained dynamically from runtime data and employed by the FPC algorithm for a higher compression ratio. The training is based on the observation that the distribution of zero nibbles in 32-bit words is quite consistent over the execution, and can be both easily identified and exploited. The paper therefore proposes that patterns be trained at the granularity of 4-bit nibbles: for each 32-bit word, the training logic monitors the distribution of zeros across the eight nibbles and selects the patterns that benefit compression the most. The selected patterns are then entered into a dynamic pattern matching table, whose entries function just like static patterns, and are used to encode and decode data exchanged between the cache hierarchy and the NVM.

We next describe the operation of DFPC. The DFPC framework allows each individual process to have its own pattern matching table of size eight. Four of the patterns are statically determined and cannot be changed; these are simply the patterns from the classical SFPC algorithm. The remaining four patterns are trained dynamically at runtime; they are entered during execution and never change once entered, to avoid expensive re-encoding. The DFPC algorithm works the same way as SFPC: data to be compressed is divided into 32-bit words, and each word is compared against the patterns in the pattern table. If one or more matches are found, the word is compressed into a three-bit header identifying the pattern plus a payload stream holding the nibbles not covered by the pattern. Decompression is just the reverse: it reads the header and the payload, and reconstructs the word based on the pattern. Note that DFPC patterns only compress zeros at nibble granularity, and each pattern is encoded as an 8-bit mask. A “1” in the mask means the corresponding nibble is zero and need not be stored in the payload, while a “0” means the nibble at that location is non-zero and will be stored in the payload.
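
Below is a sketch of the per-word encode/decode path as I understand it; the table layout and fallback handling are simplified, and all names are mine:

```python
def nibble_bits(pmask: int) -> int:
    """Expand an 8-bit zero-nibble mask into a 32-bit mask (0xF per set bit)."""
    out = 0
    for i in range(8):
        out = (out << 4) | (0xF if (pmask >> (7 - i)) & 1 else 0)
    return out

def compress_word(word: int, table: list) -> tuple:
    """The 3-bit header is the index into the 8-entry table; the payload
    keeps only the nibbles at positions the mask does not cover."""
    for idx, pmask in enumerate(table):
        if word & nibble_bits(pmask) == 0:       # all required nibbles zero
            payload = [(word >> (28 - 4 * i)) & 0xF
                       for i in range(8) if not (pmask >> (7 - i)) & 1]
            return idx, payload
    return None, None                            # no match: word stored raw

def decompress_word(idx: int, payload: list, table: list) -> int:
    pmask, word, rest = table[idx], 0, iter(payload)
    for i in range(8):
        zero_here = (pmask >> (7 - i)) & 1
        word = (word << 4) | (0 if zero_here else next(rest))
    return word

# Four static entries plus one trained dynamic entry (values are made up)
table = [0b11111111, 0b11110000, 0b00001111, 0b00111100, 0b11011011]
idx, payload = compress_word(0x00A00B00, table)  # matches mask 0b11011011
assert decompress_word(idx, payload, table) == 0x00A00B00
```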

Initially, only the four hardwired static patterns are in the table. As the program executes, the training logic constantly monitors the data stream and attempts to capture dynamic patterns. In the first phase, the training logic counts the occurrences of patterns that are not compressible by the static patterns, storing the statistics in an array of counters (the paper claims that 128 counters are sufficient, but does not give any further explanation). In the second phase, the training logic selects the four patterns that yield the highest compression benefit and enters them into the pattern table. These top-four patterns are selected based on both how frequently they occur (indicated by the occurrence counts) and the number of zero nibbles in the pattern. (Some equations are derived in the paper for precisely computing the score; I did not quite follow them, and I just assume they are a reasonable formalization of what I described above.) The paper also notes that some patterns should be excluded, such as those already in the static table and the incompressible pattern (the one in which none of the nibbles is zero). In the last phase, the dynamically trained patterns are entered into the table, and the DFPC algorithm switches to utilizing all entries of the pattern table. Previously encoded blocks are unaffected, since the static patterns do not change.
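
As a sketch of the selection step (the scoring here is my simplification of the paper's equations: weight each candidate's frequency by the number of zero nibbles its mask would remove, and the static mask set shown is hypothetical):

```python
from collections import Counter

STATIC_MASKS = {0b11111111, 0b11110000, 0b00001111}   # hypothetical static set

def zero_nibble_mask(word: int) -> int:               # as in the first sketch
    m = 0
    for i in range(8):
        m = (m << 1) | int(((word >> (28 - 4 * i)) & 0xF) == 0)
    return m

def train_dynamic_masks(words, k: int = 4) -> list:
    counts = Counter()
    for w in words:
        m = zero_nibble_mask(w)
        if m not in STATIC_MASKS and m != 0:          # exclude static and
            counts[m] += 1                            # incompressible patterns
    # Benefit grows with both frequency and zero nibbles saved,
    # so score each candidate by their product.
    def score(mask: int) -> int:
        return counts[mask] * bin(mask).count("1")
    return sorted(counts, key=score, reverse=True)[:k]
```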