Restrictive Compression Techniques to Increase Level 1 Cache Capacity



Paper Title: Restrictive Compression Techniques to Increase Level 1 Cache Capacity
Link: https://ieeexplore.ieee.org/document/1524171/
Year: ICCD 2005
Keyword: Compression; L1 Compression




Highlight:

  1. Explicitly points out that a feasible compression algorithm for the L1 must support random access to code words; decompression-on-access is not acceptable, since only the requested word is sent to the LSU, unlike lower-level caches, where the entire line is sent to the upper level.

  2. When there are two compressed lines in a physical slot and the entire slot is to be invalidated, the slot's LRU value should be the larger (i.e., more recently used) of the two lines' values.

  3. Exceptions (occasional words that cannot be compressed) can be represented using an extra 16-bit data slot and a per-logical-line bit mask. Adding multiple extra data slots is also fine, but the bit mask must then indicate which slot each exception uses.

  4. Address tags can also be compressed, based on the observation that most address tags share their higher-order bits. Storing only the distinct lower bits together with one shared copy of the higher bits also saves SRAM storage.

Questions

  1. Although the compression and decompression algorithms are indeed simple, the paper ignores implicit complexities, such as finding a slot for insertion. In an uncompressed cache, this is as simple as selecting the bottom of the LRU stack. Here, we must also decide whether to evict a compressed line or an uncompressed line. One possible argument is that this is not on the critical path, so it does not directly affect access latency. It is still likely, however, that the extra tag accesses will contend with normal accesses.

  2. The paper also does not discuss how eviction candidates are selected. For example, when a set contains both compressed and uncompressed slots and we are inserting a compressed line, should we use the compressed-line eviction logic or the uncompressed-line eviction logic? We certainly cannot evict two arbitrary compressed lines, since they may not reside in the same slot, and it is unlikely that compaction will happen across slots.

  3. For address tag compression, is the cache controller required to search for the most appropriate physical slot to insert a line, or is any slot fine?

This paper proposes a restrictive cache compression scheme for improving L1 cache effective capacity. Applying compression to L1 caches can further improve performance and reduce energy consumption, since more data blocks are cached in the L1 and fewer requests are sent to the lower-level L2. The paper points out at the beginning, however, that conventional compression algorithms are not feasible for L1 compression, since they introduce a few cycles of latency during compression and decompression. Even though the absolute number of cycles is small, it is still significant relative to the L1 cache's access latency, which is typically only a few cycles.

The paper observes that all previous compression schemes change the bit offsets of words in the compressed line in a way that requires explicit computation, incurring extra cycles. To design a compression algorithm for the L1, individual words in the compressed line must be easy to recover with random access, such that they can be fed into the pipeline immediately after the compressed line is read out. The paper calls this property "restrictive", as it narrows the design space down to only a few possibilities.

The L1 cache architecture is described as follows. Each physical cache line slot is over-provisioned with an extra tag, which supports up to a 2x compression ratio. Each extended tag contains at least an address tag, a MESI state tag, and an LRU tag. In addition, each physical slot has one bit indicating whether compression is enabled for the slot. If compression is enabled, the slot stores two lines, each compressed to half of the slot size, and each of the two tags describes one compressed line. Otherwise, the slot stores a single uncompressed line, and only one tag is active.
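To make the structure concrete, here is a minimal sketch of the per-slot metadata in C; the field widths and the 64-byte slot size are illustrative assumptions, not from the paper:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: one extended tag per logical line. */
struct line_tag {
    uint32_t addr_tag;    /* address tag */
    uint8_t  mesi;        /* MESI coherence state */
    uint8_t  lru;         /* per-logical-line LRU position */
    bool     valid;
};

/* Each physical slot carries two tags plus one compression bit. */
struct physical_slot {
    bool            compressed;  /* 1: holds two half-size compressed lines */
    struct line_tag tag[2];      /* second tag active only when compressed */
    uint8_t         data[64];    /* data bank (64-byte slot assumed) */
};
```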

Compression is performed as follows. On insertion into the L1 cache, the compression circuit checks whether all bits in the higher half of each word equal the sign bit of the lower half. If true, the word is compressed to half of its original size by eliminating the higher bits. Otherwise, the word is incompressible, which renders the entire line incompressible. The compression circuit is extremely simple: on the insertion data path, an array of comparators checks the higher bits of all words in parallel, and a multiplexer then selects which version of the data is written. The per-slot bit is also set if the line is compressible and the slot is currently empty (if the slot already contains a compressed line, the new line is simply filled into the free half).
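As a sketch of the comparator logic, the following C code checks the compressibility condition for one 32-bit word and for a whole line; the function names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* A 32-bit word is compressible when its upper 16 bits all equal the
 * sign bit (bit 15) of its lower 16 bits, i.e. the word is a
 * sign-extended 16-bit value. */
static bool word_compressible(uint32_t w)
{
    uint32_t sign  = (w >> 15) & 1u;   /* sign bit of the lower half */
    uint32_t upper = w >> 16;          /* upper 16 bits */
    return upper == (sign ? 0xFFFFu : 0x0000u);
}

/* A line compresses only if every word in it is compressible; in
 * hardware this loop is an array of comparators evaluated in parallel. */
static bool line_compressible(const uint32_t *line, int nwords)
{
    for (int i = 0; i < nwords; i++)
        if (!word_compressible(line[i]))
            return false;
    return true;
}
```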

Decompression is exactly the reverse of compression. If the cache line accessed by the LSU resides in a compressed slot (the per-slot bit is set to "1"), the byte offset into the uncompressed cache line is right-shifted by one, which translates it into the byte offset of the corresponding word in the compressed line. The half-word at that offset is then read, and the original value is recovered by duplicating the sign bit into the upper half. Multi-word accesses can be implemented similarly, except that the access size is also right-shifted by one. The decompression circuit is even simpler: the shift and sign extension can be performed by combinational logic without any shift register or state. The final result is delivered to the LSU through a multiplexer, which selects between the uncompressed and compressed versions using the per-slot bit.
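A minimal sketch of this access path, assuming 32-bit words, word-aligned accesses, and a hypothetical `slot_half_words` array holding the compressed line's 16-bit code words:

```c
#include <stdint.h>

/* Read one word from a compressed slot: halve the byte offset into the
 * logical (uncompressed) line to locate the 16-bit code word, then
 * sign-extend it back to 32 bits. */
static uint32_t read_compressed_word(const uint16_t *slot_half_words,
                                     uint32_t uncompressed_byte_off)
{
    uint32_t compressed_off = uncompressed_byte_off >> 1;  /* shift right by one */
    uint16_t half = slot_half_words[compressed_off / 2];   /* 16-bit code word */
    return (uint32_t)(int32_t)(int16_t)half;               /* duplicate sign bit */
}
```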

Eviction is more complicated, since there are now two possibilities: either a physical slot is invalidated, evicting both compressed lines (or the single uncompressed line), or only one compressed line is evicted, freeing a half-slot for another compressed line. The paper suggests two different LRU schemes for selecting the eviction candidate. When a compressed line is to be evicted, the eviction algorithm uses per-logical-line LRU, which is associated with each valid tag in the set. When a full slot is to be evicted, the algorithm uses per-slot LRU, which is not explicitly stored, but can easily be computed as the more "recent" of the two per-line LRU values when two valid lines exist. This guarantees that non-LRU lines will not be evicted. The paper, however, does not define when a compressed versus uncompressed line should be evicted.
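A small sketch of the derived per-slot LRU, assuming (as in highlight 2 above) that a larger counter value means more recently used:

```c
#include <stdbool.h>

/* The per-slot LRU position is not stored; it is derived as the more
 * recently used of the two per-line LRU values (assumed: larger value
 * == more recent), so whole-slot eviction never discards a line that
 * is not the set's LRU. */
static unsigned slot_lru(unsigned lru_a, bool valid_a,
                         unsigned lru_b, bool valid_b)
{
    if (valid_a && valid_b)
        return lru_a > lru_b ? lru_a : lru_b;  /* more recent of the two */
    return valid_a ? lru_a : lru_b;            /* only one valid line */
}
```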

The paper also proposes a few enhancements to the baseline algorithm. The simplest allows a compressed cache line to have at most one incompressible word, for a total of two extra half-words per physical slot. To achieve this, each physical slot's data bank is extended with two half-words, each potentially storing the upper half of an incompressible word in a logical line. To identify which word the extra half-word belongs to, each logical line also carries a bit vector in which at most one bit can be set. If bit i is set, then word i is incompressible in the original line, and the word is restored by concatenating the extra half-word as the upper half with the half-word at the compressed location as the lower half. This modification either adds one more cache read cycle to fetch the upper half of the incompressible word, or adds an extra level of multiplexers to select between the two versions of the restored word for all words.
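A hypothetical sketch of the restore logic for one word, where `exc_mask` is the per-logical-line bit vector and `extra_upper` is the extra half-word:

```c
#include <stdint.h>

/* Restore word i of a compressed logical line that may contain one
 * exception. exc_mask has at most one bit set; if bit i is set, the
 * extra half-word supplies the real upper 16 bits instead of sign
 * extension. */
static uint32_t restore_word(uint16_t code_word, uint16_t extra_upper,
                             uint32_t exc_mask, int i)
{
    if (exc_mask & (1u << i))                      /* exception: real upper half */
        return ((uint32_t)extra_upper << 16) | code_word;
    return (uint32_t)(int32_t)(int16_t)code_word;  /* normal: sign-extend */
}
```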

The second enhancement allows more exceptions per logical line, achieved by adding more half-words per physical slot. The paper proposes one scheme in which four half-words are added: extra half-word slots 0, 1, 2 can be used by the first compressed line, and slots 1, 2, 3 by the second. The bit mask is extended to indicate which half-word slot each exception uses; it contains two bits per word, whose values map to slots 0, 1, 2 for the first line and slots 1, 2, 3 for the second, respectively.
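The paper does not spell out the exact 2-bit encoding; one plausible sketch, assuming value 0 means "no exception" and values 1-3 select consecutive slots offset by the logical line index:

```c
/* Map a word's 2-bit mask entry to an extra half-word slot index.
 * Assumed encoding: 0 = word compressed normally; 1..3 = slot
 * (value - 1 + line_index), giving line 0 slots 0..2 and line 1
 * slots 1..3. */
static int exception_slot(unsigned mask2, int line_index)
{
    if (mask2 == 0)
        return -1;                       /* no exception for this word */
    return (int)mask2 - 1 + line_index;  /* slots 0-2 or 1-3 */
}
```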

The last enhancement is address tag compression. This technique is based on the observation that most address tags share common higher-order bits; storing both tags separately therefore wastes tag storage, which is already over-provisioned. The paper proposes maintaining one shared higher-order tag of 20 bits per slot. Two lower-order tags of 2 bits each are concatenated after the shared 20-bit tag to form the full tag for address lookup. Two cache lines can only be stored in the same physical slot if their tags share the highest 20 bits.
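A sketch of the compressed-tag match, assuming a 22-bit full tag split into the 20 shared high-order bits and 2 per-line low-order bits as in the paper's example:

```c
#include <stdbool.h>
#include <stdint.h>

/* Reconstruct a line's full tag from the slot's shared high bits and
 * the line's private low bits, then compare against the lookup tag.
 * A line may be placed in this slot only if its high 20 bits match
 * shared_hi20. */
static bool tag_match(uint32_t addr_tag,     /* full 22-bit address tag */
                      uint32_t shared_hi20,  /* per-slot shared high bits */
                      uint32_t line_lo2)     /* per-line 2-bit low tag */
{
    uint32_t reconstructed = (shared_hi20 << 2) | (line_lo2 & 0x3u);
    return addr_tag == reconstructed;
}
```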