Zero-Content Augmented Caches



Paper Title: Zero-Content Augmented Caches
Link: https://dl.acm.org/doi/10.1145/1542275.1542288
Year: ICS 2009
Keyword: Cache; ZCA; Zero Compression




Highlight:

  1. Using a data-less sector cache organization to store zero blocks. This avoids one of the biggest problems of sector caches: low space utilization when spatial locality is low.
  2. Identifies that value locality observed in virtual addresses (VA) makes little sense in physical addresses (PA) once a page boundary is crossed.

Questions

  1. Writing to zero blocks may incur unnecessary coherence messages for requesting exclusive permission. This is because the sector cache has no way of distinguishing an S state block from an E state block.
  2. The paper does not discuss how the inclusion property is maintained. If the sector cache is only deployed at the L1 level, then the L2 cache must have a larger effective capacity to include all blocks mapped by the L1 cache, which is not always possible. When a zero block is made dirty by writes in the sector cache, according to the paper’s description, that block will be immediately written back to the L2. This causes a problem if the L2 is not inclusive, since there may be no slot in the L2 to hold such a write back. The same applies to coherence, e.g. what if a peer L1 writes the address buffered as a zero block in another core? Does the L2 have a directory entry for a zero block it does not cache?
  3. When a zero block in the sector cache is written by a zero store, do we acquire exclusive permission via coherence or not?

This paper proposes zero-content augmented caches (ZCA caches), a simple cache compression scheme that only compresses zero-filled 64 byte blocks. The paper begins by identifying three important properties of zero-filled blocks in common workloads. First, a non-trivial number of cache blocks are filled with zeros, for reasons such as zero initialization or the use of zero to represent a common state (e.g. NULL pointers, default values). Such regularity in an application’s data values enables the cache controller to perform zero compression by eliminating the need to store these easily compressible blocks. In addition, some applications show a high percentage of zero blocks throughout their entire execution, indicating that zero compression is effective on these workloads. Second, zero blocks themselves exhibit spatial locality. If one cache block is filled with zeros, the adjacent blocks are also likely to be filled with zeros, suggesting a design that uses a larger block size per tag to handle zero blocks. The last observation is that writes to zero blocks also tend to write zeros in some cases, which further supports the effectiveness of zero compression. Compression should therefore also be performed on a per-write basis, detecting zero writes to maximize the chances of compression.

The ZCA design can be deployed at any level of the hierarchy. It is expected to retain the original lookup latency of the cache, and therefore will not affect cache lookup performance. ZCA extends the existing cache with a small data-less sector cache. Each tag in the sector cache represents the zero status of a large range of memory. Since zero blocks do not need any physical storage, the sector cache does not need a data array. Instead, a vector of valid bits suffices to describe the zero status of all standard sized blocks in the sector. The paper also notes that although there is no hard limit on how large a sector can be, sectors larger than a physical page are not recommended, since locality in virtual addresses is only meaningful within a page. Addresses that are close to each other but lie on different virtual pages may be mapped to unrelated physical pages by the underlying OS, which undermines the effectiveness of larger sectors.
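The tag plus valid-bit organization can be sketched as follows. This is only an illustration: the 4 KiB sector size is an assumption (the paper only advises against exceeding a page), and `split_address` and `SectorEntry` are hypothetical names, not taken from the paper.

```python
from typing import Tuple

BLOCK_SIZE = 64      # bytes per cache block, as in the paper
SECTOR_SIZE = 4096   # ASSUMPTION: one sector covers one 4 KiB page
BLOCKS_PER_SECTOR = SECTOR_SIZE // BLOCK_SIZE  # valid bits needed per tag


def split_address(paddr: int) -> Tuple[int, int]:
    """Split a physical address into (sector tag, block index in sector)."""
    sector_tag = paddr // SECTOR_SIZE
    block_index = (paddr % SECTOR_SIZE) // BLOCK_SIZE
    return sector_tag, block_index


class SectorEntry:
    """One data-less sector: a tag and a bit vector, with no data array."""

    def __init__(self, tag: int) -> None:
        self.tag = tag
        self.valid_bits = 0  # bit i set means block i is a known zero block

    def mark_zero(self, block_index: int) -> None:
        self.valid_bits |= 1 << block_index

    def clear(self, block_index: int) -> None:
        self.valid_bits &= ~(1 << block_index)

    def is_zero(self, block_index: int) -> bool:
        return bool((self.valid_bits >> block_index) & 1)
```

Under these assumptions one tag and 64 valid bits describe the zero status of a whole page, which is why the structure stays small even though it covers a large address range.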

To minimize design complexity, the paper also suggests that only clean zero blocks be stored in the sector cache. This design has three benefits. First, no data path is needed to evict a dirty block from the sector cache, as all blocks in all sectors are clean. Second, there is no need to add a dirty bit vector to indicate whether a block has more recent content than its lower level counterparts. Third, in the case of external coherence events, these zero lines are always assumed to be in a clean and potentially shared state. As a consequence, when the cache controller intends to write a zero block in the sector cache, exclusive permission must always be obtained from all peer caches.

On a cache read request, both the normal cache and the sector cache are probed in parallel. At most one of the two can signal a hit, since they hold mutually exclusive sets of blocks. If the hit is on the sector cache, no actual data is returned, and the cache controller simply sets the MSHR for the request to all-zero. If the request misses both caches, the controller forwards the request to the lower level as usual. When the refill response comes back, the controller checks whether the block is a zero block by OR’ing all bits in the block together. If the block is zero, it is inserted into the sector cache: if the sector tag already exists, the corresponding valid bit is set; otherwise, an existing entry is evicted silently and a new entry is inserted. The paper argues that the zero detection circuit is not on the critical path, since the refill data can bypass the current level of the cache and be forwarded directly to upper levels, while zero detection and insertion are performed in the background.
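The read and refill flow described above can be modeled with a small functional sketch. It is not the paper's hardware design: the class name `ZCAReadModel`, the dictionary standing in for the lower level, the unbounded regular cache, the 4 KiB sector size, and the oldest-first sector replacement are all assumptions made for illustration.

```python
BLOCK_SIZE = 64     # bytes per cache block
SECTOR_SIZE = 4096  # ASSUMPTION: sector size equals a 4 KiB page


class ZCAReadModel:
    """Hypothetical functional model of the ZCA read path."""

    def __init__(self, backing, max_sectors=2):
        self.backing = backing   # stand-in for the lower level: addr -> bytes
        self.regular = {}        # regular cache (unbounded here): addr -> bytes
        self.sectors = {}        # sector tag -> valid-bit vector (int)
        self.max_sectors = max_sectors

    def read(self, addr):
        block_addr = addr - addr % BLOCK_SIZE
        tag = block_addr // SECTOR_SIZE
        idx = (block_addr % SECTOR_SIZE) // BLOCK_SIZE

        # Both structures are probed; at most one of them can hit.
        if (self.sectors.get(tag, 0) >> idx) & 1:
            return bytes(BLOCK_SIZE), "zero-hit"   # no data array is read
        if block_addr in self.regular:
            return self.regular[block_addr], "hit"

        # Miss in both: refill from the lower level.
        data = self.backing.get(block_addr, bytes(BLOCK_SIZE))
        if not any(data):  # models OR'ing all bits of the block together
            if tag not in self.sectors and len(self.sectors) >= self.max_sectors:
                # Silent eviction: sector contents are clean, no write back.
                # ASSUMPTION: the oldest inserted sector is the victim.
                del self.sectors[next(iter(self.sectors))]
            self.sectors[tag] = self.sectors.get(tag, 0) | (1 << idx)
        else:
            self.regular[block_addr] = data
        return data, "miss"
```

A second read of a zero block then hits in the sector cache without the block ever occupying a slot in the regular cache.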

Cache write requests are handled differently than in a normal cache. There are two forms of writes, depending on where the sector cache sits in the hierarchy. If the sector cache is deployed at the L1 level, writes are mostly single word updates, which can turn a non-zero block into a zero block, turn a zero block into a non-zero block, or hit an all-zero block and leave its content all-zero (i.e. a zero write). In the first case, the non-zero block must be found in the regular cache, and it becomes dirty due to the write. The block is not transferred to the sector cache, since the sector cache only holds clean zero blocks. In the second case, the zero block is removed from the sector cache by clearing the valid bit in the sector, and is then evicted to the lower level. Note that the design does not try to insert the now non-zero block back into the regular part of the cache, since this insertion would most likely cause another non-zero block to be evicted. In the last case, the block status does not change, since the block still qualifies as a zero block. Theoretically speaking, the coherence state of the block should be dirty, but in practice the block is never explicitly written back when evicted, since the value of the line never changed. In order to detect zero writes, an extra zero detector is added to the data path from the upper level.
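The three L1 write cases can be summarized in a short sketch. The function `handle_store`, its parameters, and the use of Python sets and dictionaries for cache state are hypothetical modeling choices, and coherence permission handling is deliberately left out.

```python
BLOCK_SIZE = 64  # bytes per cache block
WORD_SIZE = 8    # bytes per L1 store


def handle_store(zero_blocks, regular, dirty, lower, block_addr, offset, word):
    """Apply one 8-byte store at L1 and return a label for the case taken.

    zero_blocks: set of block addresses held as clean zero blocks
    regular:     dict of block address -> bytearray (regular cache contents)
    dirty:       set of dirty block addresses in the regular cache
    lower:       dict standing in for the next cache level
    """
    word_is_zero = not any(word)  # zero detector on the store data path

    if block_addr in zero_blocks:
        if word_is_zero:
            # Zero write: the block is still all-zero, nothing changes.
            return "zero-write"
        # Zero block becomes non-zero: clear the valid bit and send the
        # updated block to the lower level without refilling this level.
        zero_blocks.discard(block_addr)
        data = bytearray(BLOCK_SIZE)
        data[offset:offset + WORD_SIZE] = word
        lower[block_addr] = bytes(data)
        return "zero-to-nonzero"

    if block_addr in regular:
        # Regular hit: update in place and mark dirty. Even if the block is
        # now all-zero it stays here, because the sector cache only holds
        # clean blocks.
        regular[block_addr][offset:offset + WORD_SIZE] = word
        dirty.add(block_addr)
        return "regular-hit"

    return "miss"  # miss handling is out of scope for this sketch
```

Note that only the zero-to-nonzero case moves data, and it moves it downward rather than into the regular cache at the same level.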

If the sector cache is deployed at any level except L1, writes from the upper level arrive as full 64 byte cache lines rather than 8 byte words. Most of the handling described above still applies. The only difference is that the zero write detector must be able to detect 64 byte zero writes instead of 8 byte ones.

Although dirty blocks in the regular cache are not transferred to the sector cache when they become zero blocks, as discussed earlier, the paper proposes adding another zero detection circuit to the eviction data path, such that a block detected as a zero block is inserted into the sector cache on eviction. In this case, the block remains in the sector cache in clean status, which reduces cache misses if it is referenced again shortly afterwards.
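A minimal sketch of this eviction path is shown below. The function name `evict_block` and the set and dictionary state are hypothetical, and the ordering (write back a dirty block first, then keep a clean zero copy) is an assumption about how the block ends up clean in the sector cache, not a detail confirmed by the paper.

```python
def evict_block(regular, dirty, zero_blocks, lower, block_addr):
    """Evict one block from the regular cache; return True if it was kept
    as a clean zero block in the (modeled) sector cache.

    regular:     dict of block address -> bytearray (regular cache contents)
    dirty:       set of dirty block addresses in the regular cache
    zero_blocks: set of block addresses held as clean zero blocks
    lower:       dict standing in for the next cache level
    """
    data = regular.pop(block_addr)

    if block_addr in dirty:
        # ASSUMPTION: the dirty data is written back first, so that the
        # lower level holds the up-to-date (zero) content.
        dirty.discard(block_addr)
        lower[block_addr] = bytes(data)

    if not any(data):  # zero detector on the eviction data path
        zero_blocks.add(block_addr)  # retained as a clean zero block
        return True
    return False
```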

If the zero cache design does not bring much performance benefit, which the cache controller evaluates periodically, the paper suggests that the sector cache can be disabled entirely to save the energy spent querying it.