Compaction-free Compressed Cache for High-Performance Multicore System



- Hide Paper Summary
Paper Title: Compaction-free Compressed Cache for High-Performance Multicore System
Link: https://dl.acm.org/doi/10.5555/2691365.2691396
Year: ICCAD 2014
Keyword: Cache Compression



Back

Highlight:

  1. Assign each segment a tag ID such that the data slot read logic can access all segments in parallel, and select those that belong to the tag that gets hit.

  2. Use a crossbar consisting of an array of MUXs for selecting the wanted segments and outputting them to the next stage buffer using per-segment information as control signals.

Questions

  1. I do not get why a per-tag ID register is needed. The ID can just be generated by simple combinational logic from the hit signal, i.e., just use 0 - 15 for physical tags 0 - 15 as the ID. No explicit storage is needed.

  2. I think I understand how the crossbar works, but the paper should shed more light on how segments are dispatched to the output buffer as continuous bits. Is this process purely combinational, or is there a state machine driving a multi-stage shuffling? From the paper, the best I can guess is that the MUXs operate in parallel and hence a single cycle is sufficient for dispatching all segments. But then the problem arises of how the values from the MUXs are written to the output buffer. In particular, why is there a global counter register, if all MUXs operate in a single cycle?

  3. This design may not be power-friendly, since on a cache hit all segments of the 256-byte bank are read out. Compared with a conventional design, in which only a few segments are accessed, this can consume more power than the baseline design. The MUX-based crossbar is also power-hungry.

This paper proposes a compaction-free cache design, which optimizes over the conventional compressed cache by allowing compressed lines to be stored non-continuously in the data slot. The paper points out that conventional compressed caches often suffer from external fragmentation caused by not being able to store compressed lines in a non-continuous manner. In these designs, even if the current set has sufficient storage for a compressed line, if none of the individual vacant regions is large enough for the new line, the insertion operation would still require compacting the physical storage by moving existing lines around and updating their tag pointers, such that vacant segments are merged and the insertion becomes feasible. The compaction operation incurs extra cycles on the insertion data path, requires dedicated circuitry, increases the energy consumption of the cache, and complicates the design of the compression engine.
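The fragmentation scenario above can be illustrated with a small sketch. The segment layout and helper below are hypothetical examples of mine, not from the paper; they only show why a contiguity-requiring design is forced to compact while a compaction-free design is not:

```python
# Occupancy map of one set's segments: True = occupied, False = free.
# (Hypothetical layout with free space interleaved between live lines.)
segments = [True, False, True, False, True, False, True, False]

def largest_free_run(seg):
    """Length of the longest run of contiguous free segments."""
    best = cur = 0
    for occupied in seg:
        cur = 0 if occupied else cur + 1
        best = max(best, cur)
    return best

free_total = segments.count(False)   # 4 free segments in total
need = 3                             # incoming line needs 3 segments

# Enough total space, but no contiguous run long enough: a design that
# requires contiguous storage must compact first; the compaction-free
# design can simply scatter the line across the free segments.
must_compact = free_total >= need and largest_free_run(segments) < need
print(must_compact)  # True
```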

One workaround is to statically partition the combined physical slot into equal-sized parts, with each tag implicitly assuming a fixed offset in the data slot. This approach may work if the compression ratio is rather stable and matches the partition granularity. This paper, however, points out that different workloads demonstrate radically different compression ratios. To make things worse, even the same workload may demonstrate significantly different compression ratios at different stages of execution.

This paper assumes the following compressed cache architecture. The baseline cache has four physical ways, providing 256 bytes of storage per set. The compressed design adds 3x more tags per set, allowing a maximum 4x compression ratio and 16 logical ways per set. Each tag, as in a conventional design, contains its own address tag, coherence bits, control bits, and other metadata. There are no per-tag pointers, and hence no access indirection is required. The proposed design also honors the segmented cache paradigm. The data slots of all physical ways are merged into a single monolithic piece of storage that can be addressed by all tags (physically, it is still implemented as several smaller banks for faster parallel access). The segment size is 4 bytes, while the size of a compressed line can vary from 8 bytes to 64 bytes, which translates to 2 to 16 segments. Although the compression algorithm is orthogonal to the proposal, the paper indicates that FPC is used.
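The geometry above can be double-checked with a few lines of arithmetic (the constant names below are mine; the numbers come from the paper):

```python
# Per-set geometry of the assumed cache architecture.
WAYS = 4                  # physical ways
LINE_BYTES = 64           # uncompressed cache line size
SET_BYTES = WAYS * LINE_BYTES            # 256 B of data storage per set
SEGMENT_BYTES = 4
SEGMENTS_PER_SET = SET_BYTES // SEGMENT_BYTES   # 64 segments per set
TAGS_PER_SET = WAYS * 4   # 3x extra tags -> 16 logical ways per set

# A compressed line of 8-64 bytes occupies 2-16 segments.
MIN_LINE_BYTES, MAX_LINE_BYTES = 8, 64
print(MIN_LINE_BYTES // SEGMENT_BYTES, MAX_LINE_BYTES // SEGMENT_BYTES)
```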

To enable the compaction-less feature, each segment in the data storage is associated with a tag ID register tracking the tag whose data it stores. Segments also have valid bits to indicate whether the segment stores data for any of the tags or is inactive.

On a read operation, the cache controller first compares tags as usual, and in the meantime reads out the data bank, tag ID bank, and valid bit bank of the set. If the access hits, the tag ID of the hit entry is generated by the lookup circuit and used as the input to the filter logic. The filter logic compares this tag ID with the per-segment tag IDs and generates a bit mask indicating which segments contain the correct data; the valid bits are used to mask off invalid segments. Then, in the next stage, the collapsing logic shuffles the 256-byte data, moving all segments that constitute the logical line being accessed into a buffer as a continuous bit stream. This is achieved with a crossbar implemented as an array of multiplexers. With a 256-byte set and 4-byte segments, there are 64 1-to-16 MUXs, each accepting an input from one segment and outputting it to one of the 16 output segments (since a compressed line contains at most 16 segments). The bit mask generated by the filter logic is used as the control signal to steer input data to the corresponding output buffer slot. The shuffled line is then sent to the decompression engine. If the read instead indicates a miss, eviction is invoked to select a victim and write it back if it is dirty. Eviction does not trigger compaction.
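The two-stage read path can be sketched in software, with Python lists standing in for the data, tag ID, and valid bit banks, and a list comprehension standing in for the MUX crossbar. The `read_line` function and the example values are hypothetical illustrations of mine, not the paper's implementation:

```python
def read_line(data, tag_ids, valid, hit_tag):
    """Gather the segments of the logical line owned by hit_tag."""
    # Stage 1: filter logic - per-segment match mask, gated by valid bits.
    mask = [v and t == hit_tag for t, v in zip(tag_ids, valid)]
    # Stage 2: collapsing crossbar - move the matching segments, in
    # order, into a contiguous buffer for the decompression engine.
    return [seg for seg, m in zip(data, mask) if m]

# Example set state: tag 3's line is scattered across segments 0, 2, 4.
data    = ["A0", "B0", "A1", "--", "A2"]
tag_ids = [ 3,    7,    3,    0,    3 ]
valid   = [True, True, True, False, True]
print(read_line(data, tag_ids, valid, hit_tag=3))  # ['A0', 'A1', 'A2']
```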

On an insert operation, the cache controller similarly performs a lookup on the tag array. On a miss or a size change, the cache controller first evicts one or more compressed lines, if necessary, and then starts the insertion process. Insertion also works in two stages. In the first stage, the filter logic takes the inverted per-segment valid bits as the bit vector indicating which segments can be used for storing the incoming compressed line. Then, in the second stage, the incoming line is unshuffled into a buffer using a crossbar similar to the one discussed above, with the bit mask from the first stage as control signals. Note that not all free segments are necessarily used for storing the line, and hence only the mask bits of the selected segments are set, while the rest are cleared. The unshuffled data is then written back to the data bank, and both the valid bits and tag IDs of the affected segments are updated.
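A minimal sketch of the insertion path in the same list-based model, where inverting the valid bits yields the free-segment mask and the line is scattered into the selected slots. The `insert_line` helper is hypothetical, and real hardware would evict a victim rather than raise an error:

```python
def insert_line(data, tag_ids, valid, tag, line_segments):
    """Scatter a compressed line into free segments of the set."""
    # Stage 1: inverted valid bits identify the usable segments.
    free = [i for i, v in enumerate(valid) if not v]
    if len(free) < len(line_segments):
        raise RuntimeError("evict first: not enough free segments")
    # Stage 2: unshuffle the line into the selected slots and update
    # the per-segment metadata (only the chosen segments are touched).
    for slot, seg in zip(free, line_segments):
        data[slot] = seg
        tag_ids[slot] = tag
        valid[slot] = True

# Example: insert tag 5's two-segment line into the free slots 1 and 3.
data    = ["X", None, "Y", None]
tag_ids = [ 1,   0,    2,   0  ]
valid   = [True, False, True, False]
insert_line(data, tag_ids, valid, tag=5, line_segments=["P", "Q"])
print(data)  # ['X', 'P', 'Y', 'Q']
```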