Touche: Towards Ideal and Efficient Cache Compression By Mitigating Tag Area Overhead



Paper Title: Touche: Towards Ideal and Efficient Cache Compression By Mitigating Tag Area Overhead
Link: https://dl.acm.org/doi/10.1145/3352460.3358281
Year: MICRO 2019
Keyword: Touche; Cache Compression; Cache Tag




This paper proposes Touche, a cache compression tagging scheme that reduces the tag storage overhead of compressed cache slots. Conventional cache compression schemes often add more tags for the same number of data slots and loosen the static mapping between tag and data slots. This way, more logical lines can be stored in compressed form within a single data slot, with more than one tag used to address them. A set lookup must read out all the tags and compare them against the requested address to determine whether the request hits.

The paper assumes a compressed Last-Level Cache (LLC) design and is agnostic to the compression algorithm, as long as the compressed block size can be identified from a few bits indicating the compression type. The paper observes that a trade-off must be made between cache area overhead and the effective cache size gained by compression. Overprovisioning tags per slot of course increases the chance that more compressed blocks can be stored in the same data slot. This, however, requires more parallel tag reads and comparisons on every LLC access, which increases both area and power consumption. On the other hand, some prior designs propose superblocks to reduce the number of tags per data slot. Superblocks are consecutive compressed blocks whose address tags can be encoded as a base address tag plus an implicit offset within the data slot, similar to sectors in a sector cache. In such a design, the cache set loses most of the associativity provided by overprovisioned tags, since only adjacent blocks can be stored in a way, resulting in a reduced effective cache size.

The Touche design resolves this trade-off using hashed address tags. Hashing reduces the size of the tag, provided that the chance of collision is low. The hashed tag, called a signature, can be stored in a tag array designed for full-sized address tags, alongside the signatures of the other compressed blocks in the same data slot. Signatures allow the cache lookup protocol to rule out misses early, since a miss can be declared immediately after a signature mismatch. Cache hits, however, cannot rely on signatures alone, since signature collisions, however unlikely, must be addressed. The paper therefore proposes storing the full address tags at the end of the data slot; they are read out and compared against the requested address only if one or more signatures hit. Given the low probability of collision, the cost of reading the data slot will very likely be paid anyway when a signature hits, rendering this scheme essentially free in terms of storage and latency in most cases. At the end of the paper, a similar but more space-efficient scheme is also proposed for storing superblocks consisting of four adjacent compressed blocks. The superblock is encoded in a way that is distinguishable from regular Touche blocks, also with zero extra tag storage overhead. We next discuss each of these designs in detail.

Signatures are generated when a request with an address is received; the set lookup protocol itself does not change. The 27 address bits immediately above the index bits are used to generate the signature as follows. First, the 27 bits are divided into three 9-bit segments. The three segments are then XOR'ed together to form a 9-bit string, which is further split into a 4-bit part and a 5-bit part. The two parts index into two tables of 16 and 32 randomly generated entries, whose entry sizes are 4 and 5 bits, respectively. The two entries read from the tables are concatenated to form the final 9-bit signature. At system startup, the cache controller populates the two tables with randomly generated strings, which are used for signature generation throughout the rest of the power cycle.
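Below is a minimal C sketch of this signature generation, assuming the two translation tables are filled once at startup. The exact bit ordering of the segments and of the concatenation, and all names used here, are our own illustrative assumptions rather than details taken from the paper.

```c
#include <stdint.h>
#include <stdlib.h>

static uint8_t table4[16];  /* 16 entries x 4 bits, filled at startup */
static uint8_t table5[32];  /* 32 entries x 5 bits, filled at startup */

/* Populate the two translation tables with random strings once per power cycle. */
void init_signature_tables(void)
{
    for (int i = 0; i < 16; i++) table4[i] = rand() & 0xF;   /* 4-bit entry */
    for (int i = 0; i < 32; i++) table5[i] = rand() & 0x1F;  /* 5-bit entry */
}

/* tag_bits: the 27 address bits immediately above the set-index bits. */
uint16_t make_signature(uint32_t tag_bits)
{
    /* Fold the 27 bits into 9 by XOR-ing three 9-bit segments. */
    uint32_t seg0 = (tag_bits >> 0)  & 0x1FF;
    uint32_t seg1 = (tag_bits >> 9)  & 0x1FF;
    uint32_t seg2 = (tag_bits >> 18) & 0x1FF;
    uint32_t folded = seg0 ^ seg1 ^ seg2;          /* 9-bit result */

    /* Split into a 4-bit and a 5-bit index, translate through the two
     * tables, and concatenate the results into the final 9-bit signature. */
    uint8_t idx4 = folded & 0xF;                   /* low 4 bits */
    uint8_t idx5 = (folded >> 4) & 0x1F;           /* high 5 bits */
    return ((uint16_t)table4[idx4] << 5) | table5[idx5];
}
```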

Once generated, the signature is either used for tag comparison or inserted into the tag array. The paper assumes a 27-bit tag in the LLC, plus five extra control bits: one valid bit, one dirty bit, and three replacement bits (assuming an 8-way set-associative LLC). In the uncompressed case, the valid and dirty bits carry their usual meanings. If the data slot holds compressed blocks, however, these two bits take on a unique combination (dirty set but valid clear) that never occurs during normal, uncompressed operation, telling the cache controller to interpret the rest of the tag and the data slot in compressed form, as described below. In that case, the 27-bit tag field stores the three 9-bit signatures of the compressed blocks in the data slot; otherwise it stores the regular 27-bit address tag. If one or more compressed blocks are invalid, the corresponding signature is set to a pre-defined value that can never be produced by table translation, avoiding false hits and unnecessary lookups of the data slot. Since both tables are relatively small, such a non-existent value is easy to generate. The replacement bits are used as usual; Touche does not maintain separate replacement bits per compressed line. On eviction, the victim is selected on a per-data-slot basis, and if the selected slot contains compressed lines, the eviction logic randomly evicts one or more of them.
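The tag-entry encoding described above might be sketched as follows; the bit packing, the choice of sentinel value, and the helper names are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* One Touche tag entry: 27 bits of tag/signatures plus five control bits. */
typedef struct {
    unsigned tag_or_sigs : 27; /* full address tag, or three 9-bit signatures */
    unsigned valid       : 1;
    unsigned dirty       : 1;
    unsigned repl        : 3;  /* replacement state for the 8-way set */
} tag_entry_t;

/* The "impossible" combination (dirty set, valid clear) marks a compressed slot. */
static inline bool is_compressed(const tag_entry_t *e)
{
    return e->dirty && !e->valid;
}

/* Assumed sentinel for an unused signature position: a 9-bit value that is
 * excluded when the translation tables are filled, so it can never be
 * produced by table translation. */
#define SIG_INVALID 0x1FF

/* Extract signature i (0..2) from a compressed entry. */
static inline uint16_t get_signature(const tag_entry_t *e, int i)
{
    return (e->tag_or_sigs >> (9 * i)) & 0x1FF;
}
```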

Signature matches alone are insufficient for determining cache hits, since different tags may hash to the same signature value, although such collisions are rare compared with true hits. On a signature hit, the actual address tag, stored in the data array, must be checked as well. Note that Touche serializes tag and data array accesses. Although this adds a few cycles of latency, unlike in an L1 cache, where latency is critical for performance, an LLC design is likely to serialize tag and data array accesses in its original read logic anyway, as a way of avoiding fully associative data array accesses. Such accesses are either power hungry or incur longer latency than accessing a single tag array, since in some designs the data array has significantly longer latency than the tags.

Compressed blocks are stored one after another in the data slot, with at most three per slot. The block boundaries can be inferred from the compressed size information. Metadata is stored at the end of the data slot: three metadata fields are statically allocated, and each consists of a full address tag, valid and dirty bits, the compression type, and other per-line properties such as the coherence state. On a tag signature hit, the data slot on the corresponding way is read out and the full address tag is compared; a cache hit is signaled if the full tag matches.
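Building on the earlier sketches, the end-of-slot metadata and the two-stage lookup might look roughly like this. The 64-byte line size, field widths, and function names are assumptions, and the caller is assumed to have already detected the compressed encoding of the way.

```c
/* Per-line metadata stored at the tail of the data slot (layout assumed). */
typedef struct {
    unsigned full_tag  : 27;
    unsigned valid     : 1;
    unsigned dirty     : 1;
    unsigned comp_type : 3;  /* compression algorithm / size class */
    /* coherence state and other per-line bits would also live here */
} slot_metadata_t;

/* The three metadata fields occupy the end of an assumed 64-byte data slot. */
typedef struct {
    uint8_t         bytes[64 - 3 * sizeof(slot_metadata_t)];
    slot_metadata_t meta[3];
} data_slot_t;

/* Returns true on a confirmed hit; *blk_idx names the matching sub-block. */
bool lookup_compressed_way(const tag_entry_t *tag, const data_slot_t *slot,
                           uint32_t req_tag_bits, int *blk_idx)
{
    uint16_t sig = make_signature(req_tag_bits);
    for (int i = 0; i < 3; i++) {
        /* Stage 1: a signature mismatch rules out this sub-block early. */
        if (get_signature(tag, i) != sig) continue;
        /* Stage 2: read the data slot and confirm with the full tag,
         * which filters out the rare signature collisions. */
        if (slot->meta[i].valid && slot->meta[i].full_tag == req_tag_bits) {
            *blk_idx = i;
            return true;
        }
    }
    return false;
}
```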

Note that the three statically allocated metadata fields always occupy the end of the cache line, no matter how many blocks are actually stored in the data slot. This slightly reduces the effective size of the cache; the paper claims, however, that the impact is minimal.

The paper also proposes support for superblocks in Touche using a superblock marker, or SMARK. Superblocks are four consecutive blocks, aligned to the four-block boundary, that compress to fit within a single data slot. Superblocks further improve the effective cache size, since more blocks can be stored in a data slot. Although one more block is stored than in the regular Touche case, superblocks can be supported without extra tag bits because of their spatial locality. Instead of storing three independent hash values in the tag array, a superblock uses only the aligned starting address of its first cache line to generate a hash value, which is then concatenated with a special bit string, the SMARK. The SMARK lets the cache controller identify a superblock: it is a randomly generated value that does not appear in the two tables used for generating regular hash values. On a tag access, the cache controller first checks whether the tag contains a SMARK. If so, the stored hash value is compared with the hash of the aligned request address (obtained by clearing the lowest two bits). If the two values match, the data slot is read out for the full tag check. Per-block tags are not maintained; instead, the aligned address tag is stored at the end of the data slot, in addition to per-slot status bits.
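A rough sketch of the SMARK check, reusing the earlier types: which signature position holds the SMARK, the concrete reserved value, and the exact alignment arithmetic are assumptions not pinned down by the paper.

```c
/* Assumed: a second reserved 9-bit value, distinct from SIG_INVALID, that the
 * translation tables can never produce. */
#define SMARK 0x1FE

/* Which signature position holds the SMARK is an illustrative choice. */
static inline bool is_superblock(const tag_entry_t *e)
{
    return is_compressed(e) && get_signature(e, 2) == SMARK;
}

/* req_aligned_tag_bits: the 27 tag-region bits of the request after its block
 * address has been aligned down to the four-block boundary (lowest two bits
 * cleared).  A match here only filters candidates; the full aligned tag at
 * the end of the data slot still confirms the hit. */
static inline bool superblock_sig_match(const tag_entry_t *e,
                                        uint32_t req_aligned_tag_bits)
{
    return is_superblock(e) &&
           get_signature(e, 0) == make_signature(req_aligned_tag_bits);
}
```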

The paper does not discuss how superblocks are discovered, except to say that the cache controller should be locality-aware and should always check neighboring blocks for the possibility of forming a superblock when inserting a block into the data slot. The actual SMARK algorithm also differs slightly from our description. The original paper admits the possibility of the marker colliding with hash values, but does not explain how a SMARK is distinguished from a regular hash value ("always checking the data slot" is not a valid argument, since we do not even know the layout of the data slot at that point). In our description, we simply define the SMARK as a value that can never be generated by concatenating strings from the two tables. Such a value is easy to generate with a random bit string generator since both tables are quite small.