Frequent Value Locality and Value-Centric Data Cache Design
- Hide Paper Summary
Link: https://dl.acm.org/doi/10.1145/356989.357003
Year: ASPLOS 2000
Keyword: Compression; FVC; Frequent Value
Highlight:
- Partial caching and dictionary encoding can be combined, with non-existing values using a special dictionary entry.
This paper proposes Frequent Value Cache, a compressed victim cache design optimized for frequently accessed values. This paper observes that most workloads exhibit some degrees of frequent value locality, such that a majority of accesses and values that occur in the memory are within a narrow range of values. For example, the paper points out that the most frequent ten values are responsible for roughly 50% of accesses and memory locations, having great potential to be optmized. The paper also presents three important observations with frequent value locality. First, value locality of accesses and value locality over the address space are surprisingly similar, not only on the values themselves, but also on the rankings of these values. The paper further discovered that frequently values are uniformly distributed over the address space. No matter how accesses are localized for certain memory reagions, the frequent values accessed by instructions are usually consistent with the values that exist in active memory. Second, the variety of individual values that may exist in the memory is less than expected, especially consider that loop variables and temporary variables will introduce a large range of distinct values. The explanation given by the paper is that most loop variables and temporary variables or intermediate results are optimized out by compilers, which only exist in registers without being ever written back to memory. Meanwhile, those variables that cannot be easily optimized out by compilers tend to be of less volatile types, such as permanent data structures, in-memory data sets, etc. whose lifetime is much longer than a single function or loop. Depending on the scenario, these values may happen to enjoy more regularity than temporary values. Lastly, the paper also points out that although processor value locality and memory value locality are two different concepts, they often share similar traits in terms of values and their frequencies. Processor’s value locality focuses more on the distribution of values of certain instructions, while the value locality in this paper aims at the overall value distribution in the memory hierarchy.
Based on the above observations, the paper proposes Frequent Value Cache (FVC). The paper assumes a direct-mapped L1 cache design for low-power mobile or embedded platform. The FVC is also organized as a direct-mapped cache to minimize access time, such that access latency of the main cache is unchanged. The FVC acts as a victim cache for reducing cache misses by buffering evicted blocks from the L1 for future reuse. The difference between a conventional victim cache and FVC, however, is that: (1) A conventional victim cache is organized as fully-associative cache, which is limited by access latency and power consumption, and hence could only support a few entries. FVC, on the contrary, is direct-mapped at the cost of higher miss rates; (2) Victim cache store evicted cache lines as-is, while FVC compresses these lines using a static dictionary. By adopting data compression, the FVC can be implemented with less space and power budget without significantly lowering the hit rate.
An FVC entry is similar to a conventional victim cache entry, with a valid bit, dirty bit, and address flag. FVC data slots, however, do not store actual data values. Instead, only “partial” cache lines with frequent values are present. This enables FVC to use short encoding for each frequently occuring value, since the size of the dictionary is small. As a result, FVC stores a cache line in the granularity of 32 bit words. Each 32 bit word, if it is in the dictionary, is represented by three-bit field representing the index into the dictionary. In addition, one of the eight three-bit values are dedicated to represent non-frequent values (the paper uses “111”). Non-frequent values cannot be recovered on-chip, must be treated as a cache miss. FVC maintains the invariant that a cache block can exist in at most one of the L1 cache and the FVC. Dirty blocks are also allowed, which must be written back on eviction. Note that eviction on FVC is trivial, since it is direct-mapped.
The paper does not mention how dictionary is maintained and generated, although it is claimed that frequent values are rather stable throughout the execution, and therefore can be pre-initialized via profiling. The paper also observes that both values and ranks become stable at early stages of the execution. As a result, the profiling process only needs a small part of the execution, after which the dictionary can be generated. The dictionary is expected to be implemented as a simple SRAM register file that can be accessed with low latency, due to the fact that dictionary lookup and FVC lookup are serialized. The ISA should also be extended with instructions for starting the profiling and initialzing the dictionary.
We next describe the operation of FVC as follows. On a cache lookup, the L1 and the FVC are checked in parallel for the requested address. Hits in L1 are handled as usual. If both cache misses, the request is handled as a miss, which is forwarded to the next level of the hierarchy. If the request hits the FVC, and the word contains a frequent value, the dictionary is read in the same cycle to decode the compressed value, after which it is returned to the pipeline. Hits on non-frequent words are treated the same as a cache miss.
Miss handling of FVC differs from the one in conventional cache. When the fill request is completed, the block is inserted into the L1 cache, if the miss is caused by the accessing missing both L1 and FVC. If, however, the miss is caused by a infrequent value miss, the block is first merged with newer words in the FVC, if the FVC indicates dirty state of the block, and then inserted into the L1. The block in FVC is invalidated. Evicted L1 blocks are inserted into FVC, after encoding the frequent values in the block using the dictionary. Since eviction is not on the critical path, the extra dictionary lookup will not increase access latency. On eviction, FVC also writes back cache lines to the memory, potentially performing a read-modify-write if it only contains partial data.
Writes misses in FVC due to infrequent values can be optimized as follows. If the value to be written is in the dictionary, then the miss is unnecessary, since FVC allows partial value to be cached. The store operation simply updates the infrequent value’s code with the code of the frequent value. Next time this block is evicted or merged with a refill block, this newer value will be preserved.