Zero-Value Cache: Cancelling Load That Returns Zero



Paper Title: Zero-Value Cache: Cancelling Load That Returns Zero
Link: https://ieeexplore.ieee.org/document/5260542
Year: PACT 2009
Keyword: Cache; Compression; ZVC; Zero Compression




Highlight:

  1. In a partially inclusive hierarchy like the one described in this paper, the decision to probe or bypass the next level can be driven by a bit vector indicating whether each block is cached by that next level.

  2. Not updating the ZVC on an entry miss that hits in the L1 helps the L2 keep a consistent duplicate of the ZVC tag array: the L2 then sees every request that allocates into the ZVC, so it only needs to replicate the logic the ZVC uses to update its own tag array (e.g. the replacement policy).

Questions

  1. The architecture in this paper seems fragile AF and prone to complicated corner cases and/or coherence anomalies. For example, what about the coherence of the CIB vector when the L2 loads a new block that was previously marked as “not cached by L2”?
  2. Many important details are missing, e.g. in the vertical ZVC-L1d design, what if only part of the word is zero-filled but the rest is non-zero? The cache controller must then perform a second read to fetch the non-zero values.
  3. The paper does not mention how store instructions are handled.
  4. If loads are issued to the ZVC before the LSQ, a load will not see the value the same core wrote earlier to the same address when the ZVC indicates a hit, since the LSQ is only probed on a ZVC miss.

This paper proposes the Zero-Value Cache (ZVC), an L1 cache extension that accelerates loads of zero values. The paper begins by claiming that load latency is among the biggest contributors to processor slowdown in some workloads. It then points out that a non-trivial fraction of loads read only the value zero, and these loads could easily be optimized away if they were identified. Cancelling such critical loads before they reach the memory hierarchy improves instruction throughput and overall performance.

We next describe the organization of the ZVC. The ZVC is organized as a set-associative cache with only a tag array and no data array. Each entry consists of an address tag, a valid bit, a zero bit vector, and a cache indicator bit (CIB) vector. The address tag stores the block address that the entry covers. The ZVC does not necessarily use the same block granularity as the L1d and the underlying L2 cache; in fact, larger blocks are preferred, in order to reduce the tag overhead relative to the effective ZVC size. The valid bit serves the same purpose as in a conventional cache. The zero bit vector indicates whether the corresponding location contains zero; its granularity can be either byte-level or word-level, which is a trade-off between space overhead and detection accuracy. The CIB vector is an optimization that allows the L2 cache to be bypassed when the block is known not to be cached by the L2, as we will see below.
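To make the entry layout concrete, here is a minimal sketch of one ZVC entry, assuming word-granularity zero bits and a ZVC block that spans several L2-sized sub-blocks; the sizes and names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

ZVC_BLOCK_BYTES = 256   # assumed: ZVC block larger than the 64 B L1/L2 block
SUB_BLOCK_BYTES = 64    # assumed L1/L2 block size
WORD_BYTES = 4          # word-granularity zero tracking

@dataclass
class ZVCEntry:
    tag: int = 0                                   # block address the entry covers
    valid: bool = False                            # same purpose as in a conventional cache
    zero_bits: list = field(                       # one bit per word: is this word zero?
        default_factory=lambda: [False] * (ZVC_BLOCK_BYTES // WORD_BYTES))
    cib: list = field(                             # one bit per L2-sized sub-block:
        default_factory=lambda: [False] * (ZVC_BLOCK_BYTES // SUB_BLOCK_BYTES))
                                                   # is that sub-block cached by L2?

def zvc_load_is_zero(entry: ZVCEntry, block_tag: int, word_index: int) -> bool:
    """Return True only when the ZVC can prove the loaded word is zero."""
    return entry.valid and entry.tag == block_tag and entry.zero_bits[word_index]
```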

The ZVC is not necessarily inclusive with respect to the rest of the hierarchy, since it can use a larger block size, which would force the lower-level caches into a sector-cache design if inclusion were enforced. In a non-inclusive design, adding the ZVC at or above the L1 level may require significant changes to coherence handling, since the L2 no longer serves as a coherence filter for the ZVC, and all coherence messages would have to be forwarded to it. To solve this problem, the paper proposes adding a duplicated ZVC tag array to the L2 coherence controller. The duplicated tag array is updated whenever a fetch request from the ZVC is received. The protocol described below guarantees that the duplicated tag array remains consistent with the actual ZVC by not inserting entries on L1 hits. Besides, the CIB vector described above also serves as an optimization for the non-inclusive design, since it is possible that only a subset of the blocks within a ZVC block are stored by the lower-level caches.
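The paper only states that a duplicated tag array sits next to the L2 controller and is updated on ZVC fetch requests; the sketch below is my own reading of how it could act as a coherence filter, with the set/way arithmetic and helper names being assumptions.

```python
# Duplicate copy of the ZVC tags kept at the L2; consulted before any coherence
# message is forwarded up to the ZVC.
ZVC_SETS, ZVC_WAYS = 64, 4
dup_tags = [[None] * ZVC_WAYS for _ in range(ZVC_SETS)]

def zvc_set_index(block_addr: int) -> int:
    return block_addr % ZVC_SETS

def on_zvc_fetch_handled_by_l2(block_addr: int, victim_way: int) -> None:
    """Called only when the L2 (or memory) services a ZVC entry miss.
    Replaying the same allocation decision keeps dup_tags consistent with the ZVC."""
    dup_tags[zvc_set_index(block_addr)][victim_way] = block_addr

def on_external_invalidation(block_addr: int, invalidate_zvc_entry) -> None:
    """Forward the invalidation to the ZVC only if the duplicate tags say it may hit."""
    if block_addr in dup_tags[zvc_set_index(block_addr)]:
        invalidate_zvc_entry(block_addr)   # otherwise the message never reaches the ZVC

# Example: an allocation followed by an external invalidation that gets forwarded.
on_zvc_fetch_handled_by_l2(0x40, victim_way=0)
on_external_invalidation(0x40, lambda a: print(f"invalidate ZVC entry {a:#x}"))
```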

The paper proposes two possible configurations of the ZVC. In the first configuration, the cache is added between the L1d cache and the load-store queue (LSQ), where it serves as a zero-value filter for load instructions. The paper suggests that, under this configuration, the timing of ZVC accesses is critical: the probe should complete in the same cycle in which the load instruction is dispatched. In this case, a load that hits the ZVC takes no time to execute. Loads that miss the ZVC may suffer extra latency, although the paper does not elaborate on why. A minimal model of this load path is sketched below.
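This sketch assumes the ZVC is probed in the dispatch cycle and that a miss falls through to the normal LSQ-forwarding / L1d path; the dictionaries standing in for the LSQ and L1d are illustrative stand-ins, not structures from the paper.

```python
def execute_load(addr, zvc_zero_words, lsq_entries, l1d_data):
    # ZVC hit: the word is known to be zero, so the load never enters the hierarchy.
    if zvc_zero_words.get(addr) is True:
        return 0
    # ZVC data miss or entry miss: resolve through store-to-load forwarding, then L1d.
    if addr in lsq_entries:
        return lsq_entries[addr]
    return l1d_data.get(addr, 0)            # a real miss would continue to L2 / memory

# Example: a load to 0x100 is cancelled, a load to 0x104 reads the LSQ copy.
print(execute_load(0x100, {0x100: True}, {0x104: 7}, {}))   # -> 0
print(execute_load(0x104, {0x100: True}, {0x104: 7}, {}))   # -> 7
```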

Two types of ZVC misses may occur on a load access. The first is a data miss, which occurs when the tag matches but the data is non-zero. In this case, a conventional access is initiated to the L1, which may in turn access the L2 and the rest of the memory hierarchy. The second is an entry miss, which occurs when no tag matches. In this case, the request is also forwarded to the cache hierarchy, but with one difference: if the entry miss is handled by the L2 or memory, a corresponding entry is inserted into the ZVC, with the zero bit vector and the CIB initialized accordingly. The reason the ZVC entry (and its CIB) is not updated on L1 hits is that the duplicated tag array, which serves as the coherence filter described above, resides at the L2 level; it must be updated consistently with the ZVC tag array, which can only happen when the L2, not the L1, processes the request. When an entry miss is handled by the L2 or memory, special zero-detection logic is activated to derive the zero bit vector. The paper does not mention how the CIB is updated if the request hits in the L2. If the request goes to memory, then in addition to fetching the target block to fulfill the L1/L2 fill request, the memory controller also streams out the larger block covered by the ZVC entry and runs the zero-detection logic over the whole block, which is then cached by the ZVC; in this case, only the target block is set in the CIB vector. Although this may increase memory bandwidth, since more data is streamed on the system bus, the paper claims the impact is minimal and is outweighed by the benefit of zero cancellation. When a new entry is inserted and all entries in the set are valid, an existing entry is evicted; the paper suggests LRU for selecting the victim.
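The following sketch separates the two miss types and shows the fill-time zero detection; the word size, the "handled_by" tag, and the decision to set only the target sub-block in the CIB for the L2 case as well are assumptions made for illustration.

```python
WORD_BYTES = 4

def detect_zero_bits(block_bytes: bytes) -> list:
    """Fill-time zero detection: one bit per word of the (larger) ZVC block."""
    return [all(b == 0 for b in block_bytes[i:i + WORD_BYTES])
            for i in range(0, len(block_bytes), WORD_BYTES)]

def handle_zvc_miss(entry, tag_matched, handled_by, block_bytes,
                    target_subblock, num_subblocks):
    if tag_matched:
        # Data miss: the word is non-zero; a normal L1 access resolves the load
        # and the ZVC entry itself is left untouched.
        return entry
    # Entry miss: the ZVC allocates only if the request reaches L2 or memory,
    # so the duplicated tag array at the L2 sees the allocation too.
    if handled_by in ("L2", "memory"):
        entry["valid"] = True
        entry["zero_bits"] = detect_zero_bits(block_bytes)
        # Only the sub-block actually fetched into the hierarchy is marked in the CIB.
        entry["cib"] = [i == target_subblock for i in range(num_subblocks)]
    return entry

# Example: an entry miss served by memory on an all-zero 16-byte ZVC block.
print(handle_zvc_miss({}, False, "memory", bytes(16), 0, 2))
```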

Store instructions that write the value zero directly update the L1 cache, which does not affect correctness, since on a data miss the L1 is always accessed. Store instructions that write non-zero values, however, must clear the corresponding bit if an entry exists in the ZVC, since otherwise later accesses would not read the correct value. This operation does not touch the tag array; only the bit vector is updated.
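A short sketch of this store path, under the assumption that the L1d write is modeled as a callback and the ZVC entry as a plain dictionary; the names are illustrative.

```python
def handle_store(word_index, value, zvc_entry, l1d_write):
    l1d_write(value)                              # the L1d is always updated
    if value != 0 and zvc_entry is not None:
        # Only the bit vector changes; the tag array is untouched.
        zvc_entry["zero_bits"][word_index] = False

# Example: a non-zero store clears the matching zero bit.
entry = {"zero_bits": [True, True]}
handle_store(1, 42, entry, lambda v: None)
print(entry["zero_bits"])                         # -> [True, False]
```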

The paper also proposes allowing speculative zero values from a store to enter the ZVC before the store commits. In the discussion so far, we assumed that store instructions are retained in the LSQ and issued to the ZVC and L1d only when they commit. A more aggressive scheme allows uncommitted LSQ stores to allocate entries in the ZVC, letting more loads bypass zero stores with no cycle penalty when such store-to-load forwarding is frequent. On a speculation failure, however, the hardware must roll back these stores by restoring the bit vector to its original value. Note that simply flipping the affected bits to a fixed value is not correct, since the original bit value may have been either zero or one. The hardware should keep an undo log alongside the ZVC for speculation rollback.
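Here is one way such an undo log could work, as a sketch under the assumption that each speculative update logs the old bit before overwriting it; the structures and function names are my own.

```python
undo_log = []   # list of (entry, word_index, old_bit), youngest last

def speculative_store(entry, word_index, value_is_zero):
    # Log the old bit so a squash can restore it exactly.
    undo_log.append((entry, word_index, entry["zero_bits"][word_index]))
    entry["zero_bits"][word_index] = value_is_zero

def squash():
    # Roll back in reverse order so each bit returns to its pre-speculation value.
    while undo_log:
        entry, word_index, old_bit = undo_log.pop()
        entry["zero_bits"][word_index] = old_bit

def commit():
    undo_log.clear()    # committed updates no longer need undo information

# Example: a speculative zero store is squashed and the original bit survives.
e = {"zero_bits": [False]}
speculative_store(e, 0, True)
squash()
print(e["zero_bits"])               # -> [False]
```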

The second configuration uses the ZVC as a peer of the L1d, probed in parallel with the L1d cache. In this configuration, the benefit of early cancellation no longer applies. Instead, the effective size of the L1d increases because zeros are held in the ZVC, which saves bandwidth from the L1d to the L2 when those zeros would otherwise incur cache misses. This configuration also consumes more power, since both caches are activated in the same cycle. There is no penalty, however, if the access misses the ZVC. Store instructions update both the ZVC and the L1d, since the two are not mutually exclusive according to the paper. When both the L1d and the ZVC miss, if an entry exists in the ZVC and the block is marked as not cached by the L2 (i.e. the block was streamed from DRAM without being fetched into the L2), then a request is sent directly to DRAM, avoiding a traversal of the cache hierarchy. Otherwise, the block is likely cached by the L2, and a request is issued to the L2. A small sketch of this lookup path follows.
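This sketch models the parallel probe as two sequential checks (the hardware would do them in the same cycle) and uses the CIB to decide whether a double miss can skip the L2; all structures here are illustrative stand-ins rather than the paper's exact mechanism.

```python
def lookup(addr, subblock, zvc_entry, l1d_data):
    # ZVC and L1d are probed in the same cycle; modeled sequentially here.
    zvc_zero = (zvc_entry is not None and zvc_entry["zero_bits"].get(addr, False))
    if zvc_zero:
        return 0, "ZVC"                       # zero served without touching the L1d data
    if addr in l1d_data:
        return l1d_data[addr], "L1d"
    # Both miss: if the ZVC says the sub-block is not cached by L2, bypass the L2.
    if zvc_entry is not None and not zvc_entry["cib"][subblock]:
        return None, "DRAM"                   # request sent directly to memory
    return None, "L2"                         # otherwise probe the L2 as usual

# Example: a double miss on a sub-block marked "not cached by L2" goes straight to DRAM.
entry = {"zero_bits": {0x10: False}, "cib": [False]}
print(lookup(0x10, 0, entry, {}))             # -> (None, 'DRAM')
```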