Hardware-Assisted Data Compression for Energy Minimization in Systems with Embedded Processors



Paper Title: Hardware-Assisted Data Compression for Energy Minimization in Systems with Embedded Processors
Link: https://dl.acm.org/doi/10.5555/882452.874530
Year: DATE 2002
Keyword: Memory Compression; Delta Encoding; diff123




Highlight:

  1. Compressed main memory (or a compressed region of it) can be organized as (1) fully-associative, which requires address translation through a hardware CAM or some other mapping structure, but needs no backing memory; (2) direct-mapped, which needs an extra level of backing memory to handle replacement, but requires no mapping table or CAM since the tag is co-located with the data; or (3) set-associative, which also needs no dedicated mapping structure and causes fewer replacements than direct-mapped, at the cost of reading more metadata on each access.

  2. Compression can be as simple as truncating the upper bits of a word when it shares those bits with a reference word. The variants differ mainly in how the reference word is chosen and in the degree of value locality they exploit: the reference can be the first word of the block, the previous word, or a specific word selected by an index (see the sketch below).
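
To make the prefix-truncation idea concrete, here is a minimal C sketch; the function name, example values, and the small length field mentioned in the comment are illustrative choices, not the paper's exact encoding:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: compress a word by dropping the high-order bits it
 * shares with a reference word, keeping only the differing low-order suffix
 * plus a small length field that records how many bits were dropped. */

/* Number of leading bits that w shares with the reference word. */
static int common_prefix_bits(uint32_t ref, uint32_t w)
{
    uint32_t diff = ref ^ w;
    int n = 0;
    while (n < 32 && !(diff & (0x80000000u >> n)))
        n++;
    return n;
}

int main(void)
{
    uint32_t ref = 0x08041234u;   /* e.g., first word of the block */
    uint32_t w   = 0x080412A0u;   /* nearby value: shares the upper bits */

    int p = common_prefix_bits(ref, w);
    /* Only the (32 - p) low bits of w need to be stored, plus the length
     * field, so the prefix can later be restored from ref. */
    printf("shared prefix: %d bits, stored suffix: %d bits\n", p, 32 - p);
    return 0;
}
```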

This paper proposes a main memory compression technique for embedded systems. Embedded systems are often deployed in environments where the energy supply is limited and must therefore be used wisely. The paper is motivated by previous proposals that compress instructions to reduce energy consumption, and it extends compression to data memory. The challenge, however, is that data memory must be optimized for both compression and decompression, since both are performed online and are performance critical. Instruction compression, on the other hand, only needs to be fast on decompression, because instructions can be compressed off-line and are unlikely to be modified at runtime.

The paper identifies two reasons why data compression helps save energy. First, fewer data bits are streamed from the main memory on each access, which reduces power related to row buffer activity. Second, since blocks are transferred over the memory bus in compressed form, the bus also enjoys the energy benefit of compression, as fewer bits are transferred. Although previous studies showed that the extra hardware added for compression can offset the energy benefit of reduced data activity, the paper argues that compression can still be a net win if the compression circuit is kept simple enough.

We next describe the overall architecture. The compression scheme applies to data transferred between the LLC and the main memory. The proposed logic is implemented as a Compression and Decompression Unit (CDU) inserted on the datapath between the LLC and the main memory. Data written back from the LLC is compressed before being sent onto the memory bus, and read requests to the main memory are intercepted by the CDU, which fetches the requested block in either compressed or uncompressed form. The CDU is placed on the LLC side so that data transfers between the LLC and the main memory use compressed blocks whenever possible.

Memory blocks are stored in DRAM in compressed form, though not all of them. The paper proposes that part of main memory be dedicated to storing compressed blocks (the compressed region), while the rest continues to store uncompressed data; the paper notes that this is equivalent to placing a compressed DRAM buffer between the LLC and the uncompressed main memory. Blocks stored in the compressed region need extra address translation when accessed, because the cache hierarchy issues memory requests using only canonical addresses. This translation can become a major bottleneck if not handled carefully: the most straightforward implementation keeps a translation table in the CDU that stores the address and offset of every block in the compressed region, which incurs considerable metadata overhead. The paper instead proposes a simpler scheme. The compressed region is divided into fixed-size slots smaller than a full cache line, and only blocks that compress below the slot size are stored there; all other blocks stay at their canonical addresses. The translation table then only needs to store the list of block addresses in the region, without per-block offsets, since the offset of a block is simply its position in the list multiplied by the slot size. The table can be implemented as a CAM in which each entry holds a block address; given a requested address, the CAM outputs the index of the matching entry, which is then used to compute the offset within the compressed region.
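
A rough software model of this translation scheme, with illustrative sizes (a 64-slot compressed region and 16-byte slots; these numbers are assumptions, not the paper's):

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_SLOTS 64          /* illustrative: compressed region = 64 slots  */
#define SLOT_SIZE 16          /* bytes per slot (less than a full cache line) */

/* Software model of the CAM-based translation table: one block address per
 * slot and no per-block offset; the hit index implies the offset directly. */
static uint32_t slot_addr[NUM_SLOTS];   /* canonical address of block in slot i */
static uint8_t  slot_valid[NUM_SLOTS];

/* Returns the byte offset of the block inside the compressed region, or -1
 * if the block is not stored there (i.e., it lives at its canonical,
 * uncompressed location). */
static long translate(uint32_t block_addr)
{
    for (int i = 0; i < NUM_SLOTS; i++)          /* CAM: parallel in hardware */
        if (slot_valid[i] && slot_addr[i] == block_addr)
            return (long)i * SLOT_SIZE;          /* offset = index * slot size */
    return -1;
}

int main(void)
{
    slot_addr[3] = 0x1000; slot_valid[3] = 1;    /* block 0x1000 held in slot 3 */

    printf("0x1000 -> offset %ld\n", translate(0x1000));   /* 48: slot 3       */
    printf("0x2000 -> offset %ld\n", translate(0x2000));   /* -1: uncompressed */
    return 0;
}
```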

Using a CAM for address translation can still be too heavyweight, since CAMs are power hungry. To eliminate the translation table entirely, the paper further proposes co-locating address tags with the compressed block data in the compressed region instead of using an on-chip CAM. Blocks and tags in the compressed region can then be organized as either direct-mapped or set-associative, and memory requests directly probe the tags in the compressed region, much like probing an SRAM cache. If one of the tags hits, the block is streamed out of the compressed region; otherwise, it is fetched from its home location. This optimization, however, increases main memory activity, because every memory access must check the compressed region first, which largely offsets the first benefit discussed above. The bandwidth benefit still holds, though.
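
The direct-mapped variant can be modeled roughly as follows; the slot layout, region size, and field names are assumptions for illustration, not the paper's exact format:

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_SLOTS  64    /* illustrative direct-mapped compressed region */
#define BLOCK_SIZE 16    /* 4-word blocks, as assumed in the paper       */

/* The tag is co-located with the compressed data in DRAM, so no on-chip
 * CAM is needed; every access simply probes one slot of the region first. */
struct slot {
    uint8_t  valid;
    uint32_t tag;                 /* canonical block address */
    uint8_t  data[BLOCK_SIZE];    /* compressed payload      */
};

static struct slot region[NUM_SLOTS];

/* Returns the compressed payload on a hit, or NULL on a miss, in which
 * case the block is fetched from its home (uncompressed) location. */
static const uint8_t *probe(uint32_t block_addr)
{
    struct slot *s = &region[(block_addr / BLOCK_SIZE) % NUM_SLOTS];
    if (s->valid && s->tag == block_addr)
        return s->data;
    return NULL;
}

int main(void)
{
    /* Install block 0x1000 into its direct-mapped slot. */
    struct slot *s = &region[(0x1000 / BLOCK_SIZE) % NUM_SLOTS];
    s->valid = 1;
    s->tag   = 0x1000;

    printf("0x1000: %s\n", probe(0x1000) ? "hit in compressed region"
                                         : "fetch from home location");
    printf("0x2340: %s\n", probe(0x2340) ? "hit in compressed region"
                                         : "fetch from home location");
    return 0;
}
```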

The paper then proposes two types of compression algorithms. The first type relies on offline value profiling to find the N most frequent values (N = 128 in this paper). These values are stored in a CAM structure that is looked up during compression and decompression. During compression, 32-bit words found in the CAM are represented with only log2(N) bits, while the rest are stored in uncompressed form. A bit-vector header is attached at the beginning to indicate which words are compressed and which are not, and the decompression logic uses this bit vector to interpret the code words that follow. The paper argues that while value profiling may not be practical for general-purpose computing, it can work well for embedded systems, since only a small number of programs are executed.
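
A minimal sketch of the frequent-value scheme, assuming a 4-word block and the paper's N = 128 table; for clarity it only computes the encoded size rather than packing the bits:

```c
#include <stdint.h>
#include <stdio.h>

#define N_FREQ 128                    /* the paper's N: 128 profiled values */
#define WORDS_PER_BLOCK 4

static uint32_t freq_table[N_FREQ];   /* filled by offline profiling; held in a CAM */

/* Returns the index of v in the frequent-value table, or -1 if absent. */
static int fv_lookup(uint32_t v)
{
    for (int i = 0; i < N_FREQ; i++)  /* CAM lookup: parallel in hardware */
        if (freq_table[i] == v)
            return i;
    return -1;
}

/* Encode one block: a 4-bit header bit-vector marks which words hit in the
 * table; hits cost log2(128) = 7 bits, misses keep their full 32 bits. */
static int encode_block_bits(const uint32_t *block)
{
    int bits = WORDS_PER_BLOCK;                  /* the bit-vector header */
    for (int w = 0; w < WORDS_PER_BLOCK; w++)
        bits += (fv_lookup(block[w]) >= 0) ? 7 : 32;
    return bits;
}

int main(void)
{
    freq_table[0] = 0x00000000;                  /* e.g., zero is usually frequent */
    freq_table[1] = 0xFFFFFFFF;

    uint32_t block[WORDS_PER_BLOCK] = { 0x0, 0xFFFFFFFF, 0x12345678, 0x0 };
    printf("encoded size: %d bits (vs %d uncompressed)\n",
           encode_block_bits(block), WORDS_PER_BLOCK * 32);
    return 0;
}
```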

The paper further proposes three more compression algorithms based on delta encoding. The observation is that many values in the same cache block are numerically close to each other and therefore likely to share the same upper bits. The first algorithm, diff1, exploits this by comparing the 2nd, 3rd, and 4th words of the block (the paper assumes 4-word blocks) against the first word and computing the length of the common prefix shared with it. This prefix length is stored in a 5-bit field, and for the last three words only the remaining suffix bits are stored (the first word is still stored as a full 32-bit word). On decompression, the prefix is prepended back onto the last three words.
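
A small sketch of diff1 as described above, assuming a 4-word block and a single shared 5-bit prefix-length field; it reports the prefix length and compressed size rather than emitting the packed bit stream:

```c
#include <stdint.h>
#include <stdio.h>

#define WORDS_PER_BLOCK 4

/* Leading bits shared by two 32-bit words. */
static int common_prefix_bits(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    int n = 0;
    while (n < 32 && !(diff & (0x80000000u >> n)))
        n++;
    return n;
}

/* diff1 (as described in the summary): word 0 is kept in full; one 5-bit
 * field records the prefix length p that words 1..3 all share with word 0,
 * and only their (32 - p)-bit suffixes are stored. */
static int diff1_block_bits(const uint32_t *block, int *prefix_len_out)
{
    int p = 32;
    for (int w = 1; w < WORDS_PER_BLOCK; w++) {
        int c = common_prefix_bits(block[0], block[w]);
        if (c < p)
            p = c;                       /* prefix must be common to every word */
    }
    *prefix_len_out = p;
    return 32 + 5 + (WORDS_PER_BLOCK - 1) * (32 - p);
}

int main(void)
{
    /* Four pointers into the same region: the upper bits are identical. */
    uint32_t block[WORDS_PER_BLOCK] = { 0x08041000, 0x08041004, 0x08041008, 0x0804100C };
    int p;
    int bits = diff1_block_bits(block, &p);
    printf("shared prefix: %d bits, block: %d bits (vs 128)\n", p, bits);
    return 0;
}
```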

The paper also proposes two variants of diff1, namely diff2 and diff3. Diff2 uses three length fields, one each for words 2, 3, and 4: the first field encodes the length of the common prefix between the first and second words, the second field encodes the common prefix between the second and third words, and so on (i.e., each word uses the previous word as its reference and has its own length field). Diff3 is similar to diff2 in that each word except the first has its own length field, but in diff3 all words use the first word as the reference when computing the prefix, rather than the previous one.
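
The two variants can be sketched with a single routine that only switches the reference word; again the bit packing is omitted and only the resulting size is computed (block contents and sizes are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

#define WORDS_PER_BLOCK 4

static int common_prefix_bits(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    int n = 0;
    while (n < 32 && !(diff & (0x80000000u >> n)))
        n++;
    return n;
}

/* diff2 and diff3 both give words 1..3 their own 5-bit prefix-length field;
 * they differ only in the reference word: the previous word (diff2) or the
 * first word (diff3). */
static int diff_block_bits(const uint32_t *block, int use_first_as_ref)
{
    int bits = 32;                                   /* word 0 stored in full */
    for (int w = 1; w < WORDS_PER_BLOCK; w++) {
        uint32_t ref = use_first_as_ref ? block[0] : block[w - 1];
        int p = common_prefix_bits(ref, block[w]);
        bits += 5 + (32 - p);                        /* length field + suffix */
    }
    return bits;
}

int main(void)
{
    uint32_t block[WORDS_PER_BLOCK] = { 0x08041000, 0x08041004, 0x08041008, 0x0804100C };
    printf("diff2: %d bits, diff3: %d bits (vs 128)\n",
           diff_block_bits(block, 0), diff_block_bits(block, 1));
    return 0;
}
```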