Design and Performance of a Main Memory Hardware Data Compressor



- Hide Paper Summary
Paper Title: Design and Performance of a Main Memory Hardware Data Compressor
Link: https://ieeexplore.ieee.org/document/546466
Year: EUROMICRO 1996
Keyword: Compression; X-Match; X-RL



Back

Highlight:

  1. LRU dictionary replacement with shifting CAM

  2. Runs of zeros can be encoded using a special dictionary entry followed by the length of the zero run

  3. Using “phased binary” representation of dictionary index based on the current dictionary size avoids having fixed sized index field, which helps reducing bits before the dictionary fills up.

  4. Using static Huffman code to encode match type

Questions

  1. There is no straightforward way of parallelization, although the paper is not meant to develop a fully scalable algo.

This paper proposes X-Match and X-RL, a memory compression and decompression algorithm with harrware implementation. The paper identifies three major problems with previous software compression algorithms. First, traditional software compression algorithms tend to introduce large data blowup at the initial stage of compression, since these algorithms are designed for streams of large size, while memory compression works on smaller blocks. The second problem is that software encoding and decoding are designed to be asynchronous with input and output, meaning that they often consume and output bits without a fixed bit rate. Hardware implementations of these algorithms require special interfacing or buffering with other components, since these hardware components most likely expects the same number of bits per cycle. The last issue is that most software algorithms are not meant to be mapped to hardware. Their designs more or less include parts that are not easily transformed into hardware circuits.

From a high level, X-Match is a dictionary-based encoding and decoding scheme using fixed input and output granularity. The input stream is consumed by compressor in four byte granularity, and the decompressor also outputs data in four byte granularity. On the compressing side, a dictionary of four byte words are maintained. Words are compared with dictionary entries to determine whether they can be represented in a shorter form using a combination of dictionary entries and raw literals. The results are packed as variably sized codewords consisting of a few fields for identifying the type, dictionary entry, and/or raw literals. On the decompression side, encoded codewords are consumed field-by-field. The decompressor also maintains a dictionary the same way as the compressor, the content of which is a function of words that have been compressed/decompressed. The decompressor outputs the original word by combining dictionary entries with raw literals (if any) in the codeword.

We next describe the compression algorithm in details. As mentioned above, the compressor maintains a dictionary of four byte words for full and partial matching with incoming words. The size of the dictionary is unspecified. A larger dictionary may result in better compression for data with distant locality (i.e. symbols tend to repeat on large distances), at the cost of using more bits to encode dictionary index. The input word is compared with all dictionary entries, which may result in full match or partial matches. A full match is the case where all four bytes are identical, while in a partial match, two or more bytes at different locations are identical, with the remaining bytes not matching the entry. X-Match allows partial mapping to be also encoded using dictionary entries to attain better compression results. The codeword for a dictionary match is encoded as a “1” bit, followed by the index of the dictionary entry, followed by the match type, which is then followed by byte literals that are not matched, if any. The index of the entry is encoded using log2(k) bits, where k is the current size of the dictionary (before inserting the current word). Such “phased binary” representation of dictionary indices reduce the number of bits required at early stages when the dictionary is not yet full. The matching type is also encoded with static Huffman code, with the more common types of matches using less number of bits. The Huffman code is hardwired into the algorithm, which is generated based on empirical data. Multiple partial matches may qualify for a single word. In this case, priority is given to the full match entry, and then a partial match entry with the least number of byte literals, and then partial match types with the smallest number of bits to minimize the codeword size. If no match can be found, the codeword is encoded with a “0” bit at the beginning, followed by the word literal. This guarantees that in the worst case and in early stages of the algorithm, where no matching can ever be found, the encoded stream will only be slightly larger (1 bit overhead for each 32 bit word) than the original.

As codewords are generated, the content of the directory is also constantly updated. For every word that is not a full match, the word is always inserted into the first place of the dictionary, moving all existing entried down by one. If the dictionary was full before insertion, then the last entry will be evicted. Even on a full word match, the entry being hit will also be promoted to the first location of the dictionary, which mimics the behavior of LRU replacement policy.

The paper also proposes an optimization of X-Match based on the observation that runs of zeros are common pattern in the input data stream. The base dictionary algorithm is not really optimized in this case, since it only processes the stream word-by-word, and each zero will result in a codeword, which is unnecessary. As an optimization, a special entry that cannot be evicted is pre-initialized in the dictionary before compression and decompression. The compressor scans runs of zeros before starting to encode the current word. If a run of zero can be found, the codeword with the special entry’s index is generated, which is followed by the length of the zero run. During decompression, if this entry is seen, the decompressor will interpret the next value as the length of zero runs, and then simply output zero. The optimized algorithm is called “X-RL”, with “RL” standing for “Run Length”.

The X-Match algorithm can be easily translated to hardware circuit. The core of the compressor and decompressor is a CAM array serving as the dictionary. The CAM should support fully associative lookup, producing both full and partial match results. The CAM should also support shifting operation, which shifts all entries down by one. The shift operation can be implemented with virtual index: Instead of using the physical location as index, entries will not be physically moved unless they are evicted. Each entry maintains an “index” field, which is incremented by one on a shifting operation. The entry with the largest index will be replaced by the new entry, whose index is set to zero.

The decompressor is slightly more complicated than the compressor. The dictionary is maintained in the same manner as the compressor. The difficulty with the decompressor is that bits must be read sequentially, since all fields except the first bit is variably sized, the length of which depends on previous fields. To solve this issue, the paper proposes that the decompressor be implemented with pipelines. At each pipeline stage, the decompressor reads the next field and assembles the output word. The decompression latency from the arrival of the first codeword and the generation of the first decompressed word is four cycles. One word can be decompressed per cycle after the initial four cycle latency. The paper suggests that X-Match performs better than previously proposed hardware based on LZ77, since X-Match generates decompressed words in 32 bit granularity, while LZ77 processes data in byte granularity.