A Unified Compressed Memory Hierarchy



Paper Title: A Unified Compressed Memory Hierarchy
Link: https://ieeexplore.ieee.org/document/1385941
Year: HPCA 2005
Keyword: Cache; Cache Compression; IIC; IIC-C




Highlight:

  1. The trade-off between decompression latency and overall compression ratio can be dynamically adjusted based on the current miss ratio. Cache miss frequency serves as an indicator of whether more blocks should be compressed.

  2. From another perspective: not all blocks need to be stored compressed. They can be kept uncompressed to support fast reads, especially if the block’s compressibility is poor.

  3. Using the same block and segment layout helps unify cache and memory compression.

Questions

  1. There is no description of memory and bus compression. Even the cache compression scheme is a low-effort one based on a previous design. What is the novelty of this paper?

This paper proposes IIC-C, a unified cache, bus, and memory compression scheme. The paper summarizes previous work on cache, bus, and memory compression as follows. First, cache compression either reduces SRAM access overhead for lower power consumption, or increases effective cache size for better performance. Second, data transferred on the system bus can also be encoded with frequent patterns. The sending end and the receiving end both keep a dictionary whose contents are pre-synchronized to stay consistent. Frequent patterns such as all-ones or all-zeros are encoded as dictionary codewords to reduce the number of bits on the bus. Since these patterns are common in both addresses and data, bus compression also helps increase effective bandwidth and reduce latency. Memory compression, as pointed out by the paper, increases effective memory capacity by remapping memory blocks to alternative addresses for more compact storage. Memory compression can be enabled for only a specific range or type of pages, or enabled globally and transparently for all physical pages. Although not mentioned by this paper, memory compression schemes can also serve the purpose of saving bandwidth without having any capacity benefit. In such a scheme, compressed lines are fetched and transferred over the bus, and then decompressed by the cache controller at the receiving end.
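To give a rough feel for the bus-compression idea, here is a minimal sketch (not the paper's actual encoding) in which sender and receiver share a pre-synchronized static dictionary and frequently occurring 32-bit words are replaced with short codewords; the two patterns and the code values are assumptions made for the example.

```c
/* Minimal sketch of frequent-pattern bus encoding with a pre-synchronized
 * static dictionary shared by sender and receiver. The patterns and code
 * values below are illustrative assumptions, not the paper's encoding. */
#include <stdint.h>
#include <stdio.h>

enum { CODE_ZEROS = 0, CODE_ONES = 1, CODE_RAW = 2 };

/* Encoder: emit a short code if the word matches a dictionary pattern,
 * otherwise emit CODE_RAW followed by the full 32-bit word. */
static int encode_word(uint32_t w, uint32_t *payload) {
    if (w == 0x00000000u) return CODE_ZEROS;   /* short code, no payload   */
    if (w == 0xFFFFFFFFu) return CODE_ONES;    /* short code, no payload   */
    *payload = w;                              /* uncompressible: send raw */
    return CODE_RAW;
}

/* Decoder: reconstruct the word from the code (and payload for raw words). */
static uint32_t decode_word(int code, uint32_t payload) {
    switch (code) {
    case CODE_ZEROS: return 0x00000000u;
    case CODE_ONES:  return 0xFFFFFFFFu;
    default:         return payload;
    }
}

int main(void) {
    uint32_t words[] = { 0x00000000u, 0xFFFFFFFFu, 0xDEADBEEFu };
    for (int i = 0; i < 3; i++) {
        uint32_t payload = 0;
        int code = encode_word(words[i], &payload);
        printf("0x%08X -> code %d -> 0x%08X\n",
               (unsigned)words[i], code, (unsigned)decode_word(code, payload));
    }
    return 0;
}
```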

This paper claims that a unified cache, bus, and memory compression scheme can improve performance further than any of the individual schemes alone. The unified scheme is designed to take all aspects of the memory hierarchy into consideration without costly interfacing at component boundaries, such as the cache-bus boundary and the bus-memory boundary. For example, this paper assumes that a single compression algorithm, LZSS, is used for all compressed components. The memory hierarchy also assumes a unified block size of 128 bytes or even larger. The larger block size not only improves the effectiveness of compression, but also reduces metadata cost in the DRAM, as metadata is allocated per block. Compressed blocks are not decompressed before being placed on the bus. Instead, they are transferred directly over the bus and stored in compressed form in both the LLC and the DRAM. Caches above the LLC still store uncompressed data for minimum access latency, though.
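To make the unified layout concrete, the sketch below models the per-block bookkeeping that the LLC and DRAM could share, assuming 128-byte blocks divided into four 32-byte segments (the four-segment division is described later for the cache); the struct fields and segment size are illustrative assumptions rather than the paper's exact metadata format.

```c
/* Sketch of unified block/segment bookkeeping: 128-byte blocks, 32-byte
 * segments (four per block), with the same granularity in the LLC and in
 * DRAM so a compressed block can cross the bus as-is. Field widths and the
 * segment size are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE    128u   /* unified block size across LLC, bus, DRAM */
#define SEGMENT_SIZE  32u    /* BLOCK_SIZE / 4                           */

/* Per-block metadata kept alongside both the LLC tag and the DRAM mapping. */
struct block_meta {
    uint8_t compressed;   /* 1 if stored in LZSS-compressed form */
    uint8_t n_segments;   /* 1..4 segments actually occupied     */
};

/* Number of segments needed to hold 'bytes' of (possibly compressed) data. */
static uint8_t segments_needed(uint32_t bytes) {
    return (uint8_t)((bytes + SEGMENT_SIZE - 1) / SEGMENT_SIZE);
}

int main(void) {
    /* e.g. a block that LZSS shrank from 128 to 50 bytes fits in 2 segments */
    struct block_meta m = { .compressed = 1, .n_segments = segments_needed(50) };
    printf("compressed=%u segments=%u of %u\n",
           (unsigned)m.compressed, (unsigned)m.n_segments,
           (unsigned)(BLOCK_SIZE / SEGMENT_SIZE));
    return 0;
}
```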

The paper focuses on the compressed cache architecture. The cache organization is based on the Indirect Index Cache (IIC). All tags in the IIC are mapped to data slots via indirect pointers, implying sequential tag and data accesses. The IIC is a dynamic-associativity design based on tag chaining and consists of two parts. The first part is a conventional LLC instance where cache lines are statically mapped to sets based on their addresses, and a line can be mapped to any of the ways in its set. The second part consists of “free” tags and data slots. When a set needs extra data slots, instead of evicting an existing block as in a conventional cache design, the cache controller allocates a free tag and its associated data slot to dynamically increase the associativity of the set. The tag is added to the set via the set’s link pointer, and tags may further form a linked list via pointers stored in each tag. An IIC lookup, therefore, must probe not only the conventional part of the tags, but also the chained tags by following the links. The baseline IIC design may also over-provision more tags than there are data slots. When a tag is allocated to a set but no data slot is available, an existing data slot from another set can be evicted. Such eviction decisions can be made using global usage information, and are therefore closer to optimal, since the data array is fully associative in terms of tag mapping.
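The following minimal sketch illustrates the resulting lookup path, assuming a 4-way primary part and a singly linked chain of overflow tags (structure layout, field names, and the way count are assumptions for illustration): the statically indexed ways are probed first, then the chained tags, and only then can the indirectly addressed data slot be read.

```c
/* Sketch of an IIC-style lookup: probe the statically indexed ways first,
 * then follow the set's chain of dynamically allocated overflow tags.
 * Layout and the 4-way figure are illustrative assumptions. */
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define PRIMARY_WAYS 4

struct iic_tag {
    uint64_t tag;            /* block address tag                       */
    bool     valid;
    int      data_slot;      /* indirect pointer into the data array    */
    struct iic_tag *next;    /* chain to further overflow tags, or NULL */
};

struct iic_set {
    struct iic_tag  ways[PRIMARY_WAYS]; /* conventional, statically mapped */
    struct iic_tag *chain;              /* extra tags added on demand      */
};

/* Returns the data slot index on a hit, or -1 on a miss. Note the lookup is
 * inherently sequential: the tag (and possibly several chained tags) must be
 * examined before the data slot can be read. */
int iic_lookup(const struct iic_set *set, uint64_t tag) {
    for (int w = 0; w < PRIMARY_WAYS; w++)
        if (set->ways[w].valid && set->ways[w].tag == tag)
            return set->ways[w].data_slot;
    for (const struct iic_tag *t = set->chain; t != NULL; t = t->next)
        if (t->valid && t->tag == tag)
            return t->data_slot;
    return -1; /* miss: a free tag may be chained in and a slot allocated */
}
```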

The IIC also features a customized eviction algorithm called generational replacement, which makes two distinctive design decisions. First, replacement decisions are made globally among all data slots, rather than within a single set, so global access information must be considered to make good decisions. Second, the locality information visible at the LLC is only a filtered access trace: frequently accessed blocks may appear not recently used at the LLC level, rendering LRU unhelpful. The generational algorithm therefore tracks access frequency rather than recency using four priority pools, each organized as a FIFO queue. Blocks can be promoted or demoted into higher- or lower-priority pools only when they reach the head of the queue, and a block inserted into a pool always enters at the tail. Promotion and demotion are driven by cache misses: every time a miss occurs, the head block of each pool is demoted if its reference bit is not set, and promoted otherwise. The per-block reference bit is set when a request from the upper level hits the block, and cleared when the block is moved. The evicted block is always selected from the lowest-priority pool.
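A minimal sketch of this pool mechanism follows, assuming singly linked FIFO queues and the promotion/demotion rule described above; saturating at the top and bottom pools is an assumption where the edge behavior is not spelled out in the summary.

```c
/* Sketch of generational replacement: four priority pools, each a FIFO
 * queue. On every miss the block at the head of each pool is moved: up a
 * pool if its reference bit is set, down a pool otherwise, re-entering at
 * the tail of the destination pool. Victims come from pool 0. */
#include <stdbool.h>
#include <stddef.h>

#define NPOOLS 4

struct block {
    bool ref;              /* set on an upper-level hit, cleared on a move */
    struct block *next;
};

struct pool { struct block *head, *tail; };

void pool_push(struct pool *p, struct block *b) {      /* enqueue at tail */
    b->next = NULL;
    if (p->tail) p->tail->next = b; else p->head = b;
    p->tail = b;
}

struct block *pool_pop(struct pool *p) {               /* dequeue at head */
    struct block *b = p->head;
    if (b) { p->head = b->next; if (!p->head) p->tail = NULL; }
    return b;
}

/* Called on every cache miss: age the head block of each pool. */
void on_miss(struct pool pools[NPOOLS]) {
    struct block *moved[NPOOLS];
    /* Detach the heads first so a block moved between pools is not
     * processed twice within the same miss. */
    for (int i = 0; i < NPOOLS; i++)
        moved[i] = pool_pop(&pools[i]);
    for (int i = 0; i < NPOOLS; i++) {
        struct block *b = moved[i];
        if (!b) continue;
        int dst = b->ref ? (i < NPOOLS - 1 ? i + 1 : i)   /* promote */
                         : (i > 0 ? i - 1 : i);           /* demote  */
        b->ref = false;                /* reference bit cleared on a move */
        pool_push(&pools[dst], b);
    }
}

/* Victim selection: take the head of the lowest-priority pool. */
struct block *select_victim(struct pool pools[NPOOLS]) {
    return pool_pop(&pools[0]);
}
```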

We next describe the changes this paper proposes to the IIC to support compression. First, the data array is divided into segments, which are finer-grained storage units than blocks. The paper suggests that each regular block be divided into four segments, so that a compressed block, in the best case, uses as little as one fourth of the original size. The second change is that each tag now has four indirect pointers instead of one. Up to four segments can be addressed to form the logical data slot of a compressed line, although not all four pointers are always used. The last change is a selective compression policy that balances storage efficiency against decompression latency. The observation is that if the working set already fits in the cache, compression actually hurts performance, because the decompression latency sits on the critical path of upper-level reads. On the other hand, when the working set is larger than the cache, compression can alleviate or eliminate the extra cache misses by utilizing storage more efficiently. The selective compression policy seeks a balance between these two scenarios using cache misses as the indicator. In the modified algorithm, a block is always stored in uncompressed form when it is accessed, and is only re-compressed when it is moved out of its current pool. Recall that pool promotions and demotions are driven by cache misses. Frequent cache misses, meaning that the current working set is larger than the cache, cause more blocks to be stored in compressed form, increasing the overall compression ratio at the cost of longer access latency. On the other hand, if cache misses are relatively infrequent, most lines stay uncompressed, reducing the performance impact to a minimum.
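The sketch below puts the tag extension and the selective compression policy together: each tag carries up to four segment pointers, a block is kept uncompressed while it is being accessed, and it is only (re)compressed when generational replacement moves it between pools. The names, field layout, and the toy stand-in for the LZSS compressor are assumptions made for illustration.

```c
/* Sketch of an IIC-C-style tag and the selective compression policy.
 * lzss_compressed_size() is a toy stand-in for a real compressor. */
#include <stdint.h>
#include <stdbool.h>

#define SEGMENTS_PER_BLOCK 4
#define SEGMENT_SIZE       32u          /* 128-byte block / 4 */

struct iic_c_tag {
    uint64_t tag;
    bool     compressed;
    uint8_t  n_segments;                   /* 1..4 pointers in use       */
    int      seg_ptr[SEGMENTS_PER_BLOCK];  /* indices into the data array */
};

/* Toy stand-in: counts non-zero bytes. This is NOT LZSS; it only serves to
 * keep the sketch self-contained. */
uint32_t lzss_compressed_size(const uint8_t *data, uint32_t len) {
    uint32_t n = 0;
    for (uint32_t i = 0; i < len; i++) if (data[i]) n++;
    return n ? n : 1;
}

/* On an access (hit from the upper level): store uncompressed so the
 * decompression latency stays off the critical path of subsequent reads. */
void on_access(struct iic_c_tag *t) {
    t->compressed = false;
    t->n_segments = SEGMENTS_PER_BLOCK;
}

/* On a pool move (driven by misses): re-compress, freeing unused segments.
 * Frequent misses therefore push more blocks into compressed form. */
void on_pool_move(struct iic_c_tag *t, const uint8_t *data) {
    uint32_t csize = lzss_compressed_size(data, SEGMENT_SIZE * SEGMENTS_PER_BLOCK);
    uint8_t need = (uint8_t)((csize + SEGMENT_SIZE - 1) / SEGMENT_SIZE);
    if (need < SEGMENTS_PER_BLOCK) {
        t->compressed = true;
        t->n_segments = need;      /* remaining segments can be reclaimed */
    } else {
        t->compressed = false;     /* incompressible: keep uncompressed   */
        t->n_segments = SEGMENTS_PER_BLOCK;
    }
}
```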

The paper does not give many implementation details for bus and memory compression. Bus compression is merely a consequence of sending compressed data over the bus without decompressing it first, thanks to the unified storage layout. Compressed blocks in memory are also stored in a sectored manner. A sector mapping table located at the beginning of the DRAM hardware address space tracks the per-block mapping from the physical address space to the hardware address space. DRAM blocks are stored in segments just like in the LLC, and physical storage is saved as long as fewer segments are used for compressed blocks.
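As a rough illustration of the DRAM side under the same layout, the sketch below models a mapping table that translates a physical block number into up to four segment locations in the hardware address space; the entry format and the translation routine are assumptions for the example, not the paper's design.

```c
/* Sketch of a DRAM sector mapping table: one entry per physical block,
 * translating it into up to four segment addresses in the hardware address
 * space. Field sizes and the translation routine are illustrative. */
#include <stdint.h>

#define SEGMENTS_PER_BLOCK 4
#define SEGMENT_SIZE       32u

struct sector_map_entry {
    uint8_t  n_segments;                    /* segments in use (1..4)  */
    uint32_t seg_addr[SEGMENTS_PER_BLOCK];  /* hardware segment addrs  */
};

/* Translate a physical block number and a byte offset within the stored
 * (possibly compressed) block image into a hardware address. */
uint64_t phys_to_hw(const struct sector_map_entry *table,
                    uint64_t block_no, uint32_t offset) {
    const struct sector_map_entry *e = &table[block_no];
    uint32_t seg = offset / SEGMENT_SIZE;   /* which segment holds the byte */
    return (uint64_t)e->seg_addr[seg] + offset % SEGMENT_SIZE;
}
```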