Cache Compression with Efficient in-SRAM Data Comparison



Paper Title: Cache Compression with Efficient in-SRAM Data Comparison
Link: https://ieeexplore.ieee.org/document/9605440/
Year: NAS 2021
Keyword: LLC; Compression; In-SRAM Compression



Highlight:

  1. In a Xeon LLC slice, the data of a single way is distributed over 4 banks. Cache blocks are stored as individual 64-bit words in SRAM arrays. Within each bank, 64-bit words from adjacent sets are stored in the same array, physically next to each other. Banks at the same index in different ways share the same data bus, and only one bank on a given data bus can be active at a time.

  2. The above data organization creates an opportunity for inter-set compression at the granularity of 64-bit words, since words from different blocks in adjacent sets are stored next to each other. Similarly, inter-way compression can also be performed, because words from blocks in different ways are accessed via the same data bus (see the mapping sketch after this list).

  3. In the baseline cache, the physical location of a word is determined by the set index and the way number obtained from tag comparison. In the compressed cache, the physical location, i.e., the in-array set index and the way number, must be stored explicitly for every bank, so that a compressed word can be placed in a location other than its default one.

  4. Compression is performed at the granularity of 64-bit words, by comparing each word with words that are physically close to it in the same array or in an adjacent array. If the values match, the word is deduplicated by pointing its explicitly stored physical location at the existing copy.
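
To make the adjacency argument in items 1 and 2 concrete, here is a minimal C sketch of one possible baseline mapping from (set, way, word index) to a physical location; the word-to-bank assignment is an illustrative assumption, not taken from the paper.

```c
/* Minimal sketch of the physical adjacency described in items 1-2: the same
 * 64-bit word position of adjacent sets lands in the same SRAM array,
 * physically next to each other, and banks at the same index in every way
 * hang off one shared data bus. The concrete word-to-bank assignment
 * (word w of a block goes to bank w/2) is an assumption for illustration. */
#include <stdio.h>

typedef struct {
    int bank;  /* bank index; also selects which shared 64-bit data bus is used */
    int pos;   /* position within that bank's array; adjacent sets are adjacent */
    int way;   /* which way's bank drives the shared bus                        */
} word_loc_t;

word_loc_t locate(int set, int way, int word) {
    return (word_loc_t){ .bank = word / 2, .pos = set, .way = way };
}

int main(void) {
    word_loc_t a = locate(100, 2, 0);  /* word 0 of (set 100, way 2)            */
    word_loc_t b = locate(101, 2, 0);  /* same array, next position: inter-set  */
    word_loc_t c = locate(100, 5, 0);  /* same bank index, other way: inter-way */
    printf("a=(%d,%d,%d) b=(%d,%d,%d) c=(%d,%d,%d)\n",
           a.bank, a.pos, a.way, b.bank, b.pos, b.way, c.bank, c.pos, c.way);
    return 0;
}
```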

Comments:

  1. The metadata overhead on the tag array is huge. Given 2048 sets and 8 ways (the Xeon LLC slice example in the paper), 11 + 3 = 14 bits are needed for every bank, which amounts to 56 extra bits per tag entry solely for indexing (a back-of-the-envelope sketch follows). This scheme assumes a full search over the entire array and all ways. The paper suggests that the search can be restricted to a small number of sets, which would reduce the metadata cost, but four physical pointers per tag entry are still needed. By contrast, a conventional compressed cache with 4-byte segments only adds a segment index (7 bits) and a compressed size (7 bits) per tag, which sums up to just 14 extra bits. Besides, the reference count array (which also serves as the allocation map) is another big source of extra metadata, which the paper does not even quantify.
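
As a back-of-the-envelope check of the numbers above, the tiny C sketch below reproduces the metadata arithmetic; the field widths (11-bit set index, 3-bit way, 7-bit segment index and compressed size) are those quoted in this comment, not taken directly from the paper.

```c
#include <stdio.h>

int main(void) {
    /* In-SRAM scheme (2048 sets, 8 ways, 4 banks per way):
     * one (in-array set index, way) pointer per bank. */
    int set_bits = 11;                   /* log2(2048) */
    int way_bits = 3;                    /* log2(8)    */
    int banks    = 4;
    printf("in-SRAM scheme: %d extra bits per tag entry\n",
           banks * (set_bits + way_bits));          /* 4 * 14 = 56 */

    /* Conventional segmented compressed cache with 4-byte segments:
     * one segment index plus one compressed-size field per tag. */
    int seg_index_bits = 7;
    int size_bits      = 7;
    printf("segmented scheme: %d extra bits per tag entry\n",
           seg_index_bits + size_bits);             /* 7 + 7 = 14 */
    return 0;
}
```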

This paper proposes in-SRAM data compression, a novel cache compression technique that leverages the physical organization of SRAM storage elements to perform compression and indexing of compressed data. The paper is motivated by the observation that most existing cache compression designs treat the data array as a flat scratchpad and base the design on that abstraction. In reality, however, the cache’s data array is not a flat storage structure; it consists of smaller units that are separately addressed and may share control and/or data signals. This storage organization can complicate a seemingly simple design built on the flat storage abstraction, because data is indexed in a particular pattern that may or may not be compatible with the access pattern required by the compressed cache. Besides, in these designs, the unit of compression is typically the cache block. If a cache block is compressed using another block as the reference, then both blocks must be fetched and decompressed at access time. Both operations incur extra latency on the read critical path, and as a result, the benefits of compression diminish.

This paper proposes a different way of performing compression, which builds the compression and decompression logic into the SRAM access logic. In addition, the unit of compression is chosen to be consistent with the granularity of storage and data indexing, which both simplifies the design and enables finer-grained compression.

The compression design is based on the last-level cache organization of Xeon processors, which we describe as follows. In Xeon, the LLC is partitioned into slices by cache sets, and each core has its own slice. Each cache slice can be considered a separate cache in our context, because each slice has its own storage elements and access logic. Each way of a slice is implemented as a column, which contains all sets in that way, and each column is divided into four banks. Banks operate independently from each other, and each bank has its own access logic, which can read or write 64 bits of data on the data bus in a single cycle. Banks at the same index in different ways share a data bus. Overall, the cache slice has four 64-bit data buses, shared among banks from different ways. Given a set index generated by the cache controller and a way index generated by the tag comparison logic, the four banks of the selected way operate in parallel to access 256 bits of data in a single cycle. The full 512-bit cache block is therefore accessed in two cycles by alternating the signal that specifies whether the lower or the upper half of the cache block is to be accessed. Note that since all banks at the same index in different ways share the same control signal and data bus, at most one way (i.e., the selected way) can be active at a given time.
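
A small C model of the baseline read path in this paragraph may help; it assumes 2048 sets, 8 ways, and 4 banks per way, abstracts each bank's internal pair of arrays into a single two-word store per set, and the word-to-cycle assignment is an illustrative assumption.

```c
/* Toy model of the baseline read path: the selected way drives the four
 * shared 64-bit buses in parallel, and a full 512-bit block takes two
 * cycles, toggled by the lower/upper-half control signal. */
#include <stdint.h>

#define SETS  2048
#define WAYS  8
#define BANKS 4   /* one shared 64-bit data bus per bank index */

/* storage[way][bank][set][half]: each bank holds 128 bits (two 64-bit
 * words) per set; the two-array internals of a bank are abstracted away. */
static uint64_t storage[WAYS][BANKS][SETS][2];

/* One access cycle: 4 banks x 64 bits = 256 bits for the selected way. */
static void read_cycle(int set, int way, int half, uint64_t bus[BANKS]) {
    for (int b = 0; b < BANKS; b++)
        bus[b] = storage[way][b][set][half];
}

/* A full 512-bit block is read in two cycles. */
void read_block(int set, int way, uint64_t block[8]) {
    for (int half = 0; half < 2; half++) {
        uint64_t bus[BANKS];
        read_cycle(set, way, half, bus);
        for (int b = 0; b < BANKS; b++)
            block[half * BANKS + b] = bus[b];
    }
}
```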

As mentioned earlier, the 512-bit cache block is stored in four banks, and each bank is responsible for 128 bits of data. Within a bank, the 128-bit data is stored as two 64-bit words in two separate SRAM arrays. Each SRAM array stores the 64-bit word for all sets in the slice, and the two arrays operate in parallel. Since the data bus of each bank is only 64 bits wide, only half of the 64-bit word from each array can be read on each access. The cache controller, therefore, must also generate an extra signal that specifies whether the upper half or the lower half is read in the current access cycle. Given the set index, and with the array selected, each array reads either the high 32 bits or the low 32 bits from the location indicated by the set index, and the results are then concatenated and sent to the 64-bit data bus. Note: the above description simplifies the actual implementation. In practice, a single SRAM array may not be able to hold a 64-bit word for all sets. In this case, more than one SRAM array is employed, and the set index is further decoded into an array selection and an offset within the selected array. The paper uses an example with 256 * 256-bit arrays, where each 256-bit row stores the 64-bit words of four sets, so each array can store data for 1024 sets. The slice, however, consists of 2048 sets, meaning that two arrays are needed to cover all sets for each 64-bit word position, and each bank therefore consists of four SRAM arrays.
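
The set-index decoding implied by the 256 * 256-bit array example can be sketched roughly as follows; the exact bit slicing (which index bits select the array, the row, and the within-row offset) is an assumption for illustration.

```c
/* Decoding an 11-bit set index (2048 sets) for one 64-bit word position of
 * a bank: two 256x256-bit arrays, each row holding the words of four sets. */
#include <stdint.h>

typedef struct {
    unsigned array;  /* which of the two arrays for this word position (0..1) */
    unsigned row;    /* row inside the selected 256x256-bit array (0..255)    */
    unsigned col;    /* which of the four 64-bit words in the row (0..3)      */
} array_addr_t;

array_addr_t decode_set(unsigned set_index) {   /* set_index: 0..2047 */
    array_addr_t a;
    a.array = set_index >> 10;            /* 1024 sets per array     */
    a.row   = (set_index & 0x3FF) >> 2;   /* 4 sets per 256-bit row  */
    a.col   = set_index & 0x3;            /* this set's word in row  */
    return a;
}

/* Per cycle, only 32 of a word's 64 bits leave each array; the extra control
 * bit selects the half, and the bank concatenates the halves from its two
 * arrays onto its 64-bit data bus. */
uint32_t select_half(uint64_t word, unsigned upper_half) {
    return upper_half ? (uint32_t)(word >> 32) : (uint32_t)word;
}
```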

In the baseline setting as described above, the physical location of each 64-bit word in a cache block is uniquely determined by the set index, which is generated from the physical address, and the way number, which is generated by tag comparison. With compression, the physical location of a 64-bit word can change, and hence must be stored explicitly in the tag entry. The paper proposes that both the set number within the array and the way number used to select the bank be stored explicitly in the tag entry. On a cache access that hits a tag entry, the storage locations are retrieved from the tag entry and then used to access the data. Since four banks operate in parallel on each access, four copies of the physical location are stored in the tag entry, one for each bank. Note that the bank index and the array index (i.e., the extra one-bit control signal) of each 64-bit word are still hardcoded, such that the banks can still operate in parallel to read 256 bits in a single cycle. The tag array itself is also 2x over-provisioned, such that up to twice as many blocks as in an uncompressed cache can be tracked by the compressed cache.
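
One possible layout of the extra tag metadata described here, as this summary reads it: one 14-bit physical pointer per bank (11-bit in-array set index plus 3-bit way number), with the bank and array indices still implied by the word's position in the block. Field names and widths are assumptions for illustration.

```c
/* Sketch of a compressed-cache tag entry with per-bank physical pointers. */
#include <stdint.h>
#include <stdbool.h>

#define BANKS 4

typedef struct {
    unsigned set : 11;   /* in-array location of this bank's data (2048 sets) */
    unsigned way : 3;    /* which way's bank actually holds it (8 ways)       */
} bank_ptr_t;

typedef struct {
    uint64_t   tag;
    bool       valid;
    bank_ptr_t loc[BANKS];   /* 4 x (11 + 3) = 56 extra bits of location info */
} tag_entry_t;

/* On a hit, the data access uses the stored pointer for each bank instead of
 * the (set, way) pair that located the tag entry itself. */
bank_ptr_t data_location(const tag_entry_t *e, int bank) {
    return e->loc[bank];
}
```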

Compression is performed at block insertion time by deduplicating every individual 64-bit value in the block against values already stored in the arrays. If an identical value is found, the 64-bit word of the newly inserted line is not stored; instead, the tag entry simply records the physical location of the existing value. To find a potentially duplicated value, a search is performed within the same array and over all other banks at the same index. This essentially deduplicates the value against values at the same offset in blocks of the same way but different sets, or of a different way. The search space can be chosen flexibly, as long as the candidate locations can be addressed by the tag metadata and do not prevent the banks from operating in parallel. If the value cannot be deduplicated, it is allocated a physical location within the tag entry's addressable range, which requires extra allocation logic, as in a conventional compressed cache.
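
The insertion path might look roughly like the sketch below, which models a single 64-bit word position of one bank as a small array with per-slot reference counts. The search window (a few neighbouring sets, all ways) and the free-slot policy are illustrative assumptions; in the actual design the comparison happens inside the SRAM access logic rather than in software.

```c
/* Toy model of insertion-time deduplication for one word position of one
 * bank: slot_ref == 0 marks a free slot, > 0 counts the blocks sharing it. */
#include <stdint.h>

#define SETS   2048
#define WAYS   8
#define WINDOW 4   /* assumed number of neighbouring sets searched */

typedef struct { unsigned set : 11; unsigned way : 3; } word_ptr_t;

static uint64_t slot_val[WAYS][SETS];
static uint8_t  slot_ref[WAYS][SETS];

/* Place one 64-bit word of a newly inserted block and return the physical
 * pointer that the tag entry records for this bank. */
word_ptr_t place_word(int home_set, int home_way, uint64_t value) {
    /* 1. Search physically close candidates (nearby sets, all ways). */
    for (int s = home_set; s < home_set + WINDOW; s++) {
        for (int w = 0; w < WAYS; w++) {
            int set = s % SETS;
            if (slot_ref[w][set] > 0 && slot_val[w][set] == value) {
                slot_ref[w][set]++;   /* deduplicated: just point to it */
                return (word_ptr_t){ (unsigned)set, (unsigned)w };
            }
        }
    }
    /* 2. No duplicate: take the first free slot in the same window. */
    for (int s = home_set; s < home_set + WINDOW; s++) {
        for (int w = 0; w < WAYS; w++) {
            int set = s % SETS;
            if (slot_ref[w][set] == 0) {
                slot_val[w][set] = value;
                slot_ref[w][set] = 1;
                return (word_ptr_t){ (unsigned)set, (unsigned)w };
            }
        }
    }
    /* 3. Toy fallback only; a real design would evict or refuse to insert. */
    slot_val[home_way][home_set] = value;
    slot_ref[home_way][home_set] = 1;
    return (word_ptr_t){ (unsigned)home_set, (unsigned)home_way };
}
```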

When a dirty write-back from the upper level updates an existing block in the LLC, the old block is first removed by releasing all of its 64-bit words in all banks, and the new data is then recompressed and re-inserted. When a block is evicted from the LLC, all of its 64-bit words are likewise released, and the tag entry is invalidated. To garbage-collect 64-bit words that are no longer referred to by any valid block, a reference count is maintained for every 64-bit word slot. The reference count is decremented whenever a block referring to the word is removed from the cache, and a slot whose reference count reaches zero can be reallocated to fulfill insertion requests.
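
The corresponding bookkeeping might be sketched as below, again for one word position of one bank; the 8-bit counter width and the array layout are assumptions for illustration.

```c
/* Toy model of reference-count garbage collection: a slot is free (and thus
 * allocatable by the insertion logic) exactly when its count is zero. */
#include <stdint.h>
#include <stdbool.h>

#define SETS 2048
#define WAYS 8

typedef struct { unsigned set : 11; unsigned way : 3; } word_ptr_t;

static uint8_t slot_ref[WAYS][SETS];   /* 0 means the slot is free */

/* Called for every word pointer of a block that is evicted, or for the old
 * copy of a block that is recompressed after a dirty write-back. */
void release_word(word_ptr_t p) {
    if (slot_ref[p.way][p.set] > 0)
        slot_ref[p.way][p.set]--;
}

/* The allocation side of insertion only hands out slots whose count is 0. */
bool slot_is_free(word_ptr_t p) {
    return slot_ref[p.way][p.set] == 0;
}
```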