Energy-Efficient Frequent Value Data Cache Design



Paper Title: Energy-Efficient Frequent Value Data Cache Design
Link: https://dl.acm.org/doi/10.5555/774861.774883
Year: MICRO 2002
Keyword: Cache; Compression; FVC




Highlight:

  1. Probabilistic sub-banking is a good way of reducing power consumption

  2. Some tasks can be moved from the cache controller to one of the pipeline stages to avoid adding an extra cycle penalty

  3. This design only saves energy; it does not reduce cache area.

Questions

  1. The paper does not mention how training is started per application. Does the ISA provide a way of starting training?

  2. It seems difficult to abandon all entries in the dictionary and encoder CAM and start over once they are no longer accurate. How would a dictionary flush be handled? Would you iterate over all cache lines and decompress them?

This paper proposes the frequent value cache (FVC), a cache compression design aimed at reducing energy consumption. The paper begins by identifying that the cache system constitutes a significant part of a processor's energy consumption. In a conventional cache design, each access must read, potentially in parallel, both the tag array and the data array. FVC seeks to reduce the amount of storage that must be activated per access by exploiting the value locality frequently observed in some workloads. The paper points out that many cache and memory accesses read only a small set of frequently occurring values, rather than values evenly distributed over the value space. Several factors may contribute to this observation, such as data initialization, small counter values, or pointers to commonly used data structures.

To leverage such a highly localized value usage pattern, FVC adds a dictionary storing high-frequency values trained from a window of execution of the current application. The data array of the cache is assumed to be organized as 32-bit words, and frequent value detection happens on 32-bit word boundaries. Each 32-bit word is further divided into two smaller sub-banks, each of which can be activated independently. Given N frequent values in the dictionary, a dictionary index takes log2(N) bits, denoted as n. The first sub-bank of a 32-bit word consists of the n bits of dictionary code, plus an extra bit indicating whether the word is compressed as a dictionary entry or not. The second sub-bank of the word has (32 - n) bits, which hold the rest of the word if it is not compressed as a frequent value. Each 32-bit word can be in one of two states at runtime. If the "compressed" bit in the first sub-bank is set, the word is considered compressed, and only the first sub-bank is accessed, reducing the number of bits activated for the access. If, on the other hand, the "compressed" bit is clear, meaning all 32 bits are needed to represent an uncompressed value, the second sub-bank is accessed one cycle after the bit check, increasing access latency by one extra cycle.
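
To make the word layout concrete, here is a minimal Python sketch of the encoding described above. It is not the paper's hardware; the function names (`encode_word`, `decode_word`) and the choice to keep the low n bits of an uncompressed word in the first sub-bank are assumptions made for illustration.

```python
import math

N = 8                         # dictionary size (example value)
n = math.ceil(math.log2(N))   # bits per dictionary index

def encode_word(value, dictionary):
    """Split a 32-bit word into (first_subbank, second_subbank).

    The first sub-bank holds n index bits plus one 'compressed' flag bit.
    The second sub-bank holds the remaining (32 - n) bits and is used
    only when the word is not a frequent value.
    """
    if value in dictionary:
        code = dictionary.index(value)
        return (code << 1) | 1, None            # flag = 1: compressed
    # flag = 0: keep the low n bits next to the flag, the rest in sub-bank 2
    return (value & ((1 << n) - 1)) << 1, value >> n

def decode_word(first, second, dictionary):
    if first & 1:                               # compressed flag set
        return dictionary[first >> 1]           # single dictionary lookup
    return (second << n) | (first >> 1)         # stitch the two sub-banks

dictionary = [0, 1, 0xFFFFFFFF, 4, 8, 16, 32, 64]   # example frequent values
banks = encode_word(0xFFFFFFFF, dictionary)
assert decode_word(*banks, dictionary) == 0xFFFFFFFF
```

Note that the two sub-banks together hold 33 bits per word (the compression flag is extra), which is consistent with the highlight above that the design saves energy rather than area.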

To decode a compressed value represented as an n-bit code word, the cache controller is equipped with a hardware dictionary that maps n-bit code words from the first sub-bank to frequent values. The hardware dictionary is implemented as a multiported register file indexed by the code word. The uncompressed value is stored as an entry in the dictionary and is output when its index appears on one of the input ports. For simplicity of discussion, we assume that the dictionary is already initialized. After initialization, the contents of the dictionary remain static: no write port is needed to alter an individual entry, aside from a special input port used for bulk initialization.
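
As a rough behavioral model (an assumption made here, not the paper's circuit), the decode dictionary can be thought of as a small register file with multiple read ports and a single bulk-initialization path:

```python
class DecodeDictionary:
    """Behavioral model of the decode dictionary (a small register file)."""

    def __init__(self, num_entries):
        self.entries = [0] * num_entries
        self.initialized = False

    def bulk_init(self, frequent_values):
        # One-shot load of trained values; no per-entry write port is modeled.
        assert not self.initialized
        self.entries = list(frequent_values)
        self.initialized = True

    def read(self, *codes):
        # Model multiple read ports: each code word selects one entry.
        return [self.entries[c] for c in codes]

d = DecodeDictionary(8)
d.bulk_init([0, 1, 0xFFFFFFFF, 4, 8, 16, 32, 64])
print(d.read(2, 0))   # decode two code words in the same cycle
```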

We next describe the operation of the cache. On a read, the cache set is selected as usual. After selecting the set, the controller reads the tag array and the data array in parallel. Instead of reading a full 32-bit word for each cache line in the set, only the (n + 1) bits of the first sub-bank are read. The cache controller then, in the rest of the same cycle, checks the one-bit "compressed" flag to see whether the value is compressed, and decodes the value by querying the dictionary if the bit is set. Note that the paper assumes that tag access is slower than accessing the (n + 1)-bit sub-bank and the dictionary serially (probably because the tags are longer than (n + 1) bits), allowing the controller to overlap tag access and value decoding. If, however, the word is not compressed, the second sub-bank is accessed in the next cycle to fetch the remaining uncompressed bits. Two optimizations can be applied. The first exploits the fact that the result of the tag comparison is already known at the end of the first cycle: if a miss is signaled, the second sub-bank no longer needs to be accessed, and if a hit is signaled, only the sub-banks of the cache line at the hit location are accessed, further reducing power consumption. The second optimization is based on the observation that a cache line will likely contain both compressed and uncompressed words; in such cases it always takes two cycles to fully read the line, and since this is likely the common case, it degrades performance. As an optimization, the critical word can be returned to the load unit first if it is compressed, since it is available at the end of the first cycle.
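
The read flow above can be summarized in the following hedged sketch; modeling a line as two sub-bank arrays and the simple cycle accounting are illustrative assumptions, not the paper's controller logic.

```python
def read_word(first_banks, second_banks, word_index, tag_matches, dictionary, n):
    """Return (value_or_None, cycles_spent) for one word of a cache line."""
    # Cycle 1: read only the (n + 1)-bit first sub-bank, overlapped with
    # the tag check (the paper assumes the tag access is the slower path).
    first = first_banks[word_index]
    if not tag_matches:
        return None, 1                 # miss known at end of cycle 1;
                                       # the second sub-bank is never activated
    if first & 1:                      # compressed flag set
        return dictionary[first >> 1], 1
    # Cycle 2: uncompressed word, so activate the (32 - n)-bit second sub-bank
    second = second_banks[word_index]
    return (second << n) | (first >> 1), 2

# Example: word 0 compressed (code 2), word 1 uncompressed value 0x12345678
n = 3
dictionary = [0, 1, 0xFFFFFFFF, 4, 8, 16, 32, 64]
first_banks  = [(2 << 1) | 1, (0x12345678 & 0b111) << 1]
second_banks = [0,            0x12345678 >> 3]
print(read_word(first_banks, second_banks, 0, True, dictionary, n))  # 1 cycle
print(read_word(first_banks, second_banks, 1, True, dictionary, n))  # 2 cycles
```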

FVC handles writes differently from reads. Instead of letting the cache controller query the dictionary at write time, which would add one cycle of penalty to the write critical path, FVC adds a CAM module to the backend of the pipeline that encodes the 32-bit word to be written while the store instruction traverses the pipeline. The CAM module is kept consistent with the cache controller's decoding dictionary (see below). When a store instruction has its data operand ready from the ROB, the operand is encoded by querying the CAM module to see whether the 32-bit value is a frequent value. The store buffer is also extended with an extra bit to indicate whether the value is compressed. On receiving a write request with a compressed value, the cache controller simply activates the first sub-bank, writes the compressed code word, and sets the compressed flag, if the access signals a hit. Misses are handled similarly to read misses as described above.
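
Below is a sketch of the store-side encoding, under the assumption that the encoder CAM behaves like an associative value-to-index map and that the store-buffer entry simply carries one extra flag bit; the names used here are illustrative, not from the paper.

```python
from collections import namedtuple

StoreBufferEntry = namedtuple("StoreBufferEntry", "addr payload is_compressed")

def encode_store(addr, value, encoder_cam):
    """Query the encoder CAM when the store's data operand becomes ready,
    so the cache controller never pays the lookup latency at write time."""
    if value in encoder_cam:
        # Payload is just the n-bit code; the controller writes only sub-bank 1
        return StoreBufferEntry(addr, encoder_cam[value], True)
    return StoreBufferEntry(addr, value, False)

encoder_cam = {0: 0, 1: 1, 0xFFFFFFFF: 2, 4: 3}   # value -> dictionary index
print(encode_store(0x1000, 0xFFFFFFFF, encoder_cam))  # compressed store
print(encode_store(0x1004, 0xDEADBEEF, encoder_cam))  # full 32-bit store
```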

We next describe how the dictionary and the CAM module are initialized. The paper proposes that the CAM module be trained in a small window at the beginning of the application, with FVC turned off (i.e., all banks are always accessed and the compression flag is ignored). Frequent values are ranked as they are read and written. At the end of the window, the CAM module sends the ranked values to the dictionary, after which training stops and FVC is enabled. Given an encoder CAM and dictionary of size N, the training phase uses a CAM of size 2N (which could share its first N entries with the encoder CAM). Each entry in the training CAM consists of the value and a 2-bit saturating counter. When a value is read or written, it is looked up in the training CAM. On a hit, the saturating counter is incremented, and if the increment causes the counter to saturate, the entry is swapped with the next-higher-ranked entry and the counter is cleared. On a miss, the entry with the minimum counter value in the lower half (the lower N entries of the 2N-entry CAM) is evicted, and the new value is inserted. At the end of the training phase, the upper half of the 2N-entry training CAM is copied to the encoder CAM (unnecessary if the two are the same structure) and to the dictionary. The contents of both remain static during the rest of the execution.
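
The training procedure can be sketched as follows. The ranking table is modeled as a Python list ordered from highest rank (index 0) to lowest rank, and details such as tie-breaking on eviction are assumptions rather than the paper's exact policy.

```python
N = 4
table = []            # list of [value, counter], at most 2 * N entries

def train(value):
    for i, entry in enumerate(table):
        if entry[0] == value:
            entry[1] += 1
            if entry[1] == 3:                 # 2-bit counter saturates at 3
                entry[1] = 0                  # clear counter and promote:
                if i > 0:                     # swap with the next higher entry
                    table[i - 1], table[i] = table[i], table[i - 1]
            return
    # Miss: evict the minimum-counter entry from the lower half, then insert
    if len(table) < 2 * N:
        table.append([value, 0])
        return
    victim = min(range(N, 2 * N), key=lambda i: table[i][1])
    table[victim] = [value, 0]

for v in [0, 0, 0, 0, 1, 7, 1, 1, 2, 0]:      # example training window
    train(v)
# At the end of the window, the top N values become the dictionary contents
dictionary = [entry[0] for entry in table[:N]]
print(dictionary)
```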