An on-chip cache compression technique to reduce decompression overhead and design complexity



Paper Title: An on-chip cache compression technique to reduce decompression overhead and design complexity
Link: https://www.sciencedirect.com/science/article/pii/S1383762100000308
Year: Journal of Systems Architecture 2000, Issue 46
Keyword: Cache; SCMS; Cache Compression




Highlight:

  1. Letting the OS decide which blocks should be compressed, while the hardware only follows, is a good illustration of division of responsibility (the OS initializes the page table and the compressed data layout, and the hardware follows this layout unless a fat write occurs).

  2. The seamless integration between page-based memory compression and super-block based cache compression saves metadata overhead. Compressed blocks are not decompressed and recompressed when transferred over the system bus, maintaining a uniform layout between memory and the LLC.

  3. Compressing only adjacent even-odd block pairs is a good balancing point between compression efficiency and design simplicity.

  4. Storing a small compression bit vector in the page table entry to describe the page layout achieves minimal DRAM metadata overhead.

  5. Using adjacent sets to store compressed blocks in the LLC minimizes LLC metadata overhead. This is just a simple skewed design, in which the candidate sets for a compressed pair are the sets of blocks 2i and (2i + 1). This essentially doubles associativity, since the pair can be stored in either of the two sets.

To summarize 3 - 5: there are two major trade-offs in the paper: (1) the trade-off between compression effectiveness and cache management difficulty; (2) the trade-off between DRAM savings and page-mapping metadata cost.

Comments

  1. It seems that a small page (half-sized page with compressed blocks) can only be used when all even-odd block pairs are compressed together? As long as there is one block that cannot be compressed into an even-odd pair, the page must be a large page.

  2. Related to (1): for small pages there is no need to maintain the bit vector, because there is only one possible layout. For large pages the bit vector is necessary, since only some of the pairs may be compressed together.

  3. Although the LLC and DRAM are in the same compression domain, the LLC still uses the logical block address (uncompressed address) rather than the actual physical location. The paper should have made this point more explicit.

This journal article proposes the Selective Compressed Memory System (SCMS), a unified cache and memory compression scheme for reducing storage overhead and increasing performance. The paper identifies three challenges of cache and memory compression designs. First, decompression is on the critical path of access operations. The decompression architecture must be able to deliver raw bytes with high throughput and low latency. For a unified LLC and DRAM compression scheme, the decompression bottleneck occurs on the datapath between the LLC and the upper-level cache, which requires dedicated hardware structures to handle. Second, compression algorithms may, in the worst case, produce blocks even larger than the uncompressed size due to extra metadata bits in the compressed stream. The compression scheme must be able to identify memory blocks that are not easily compressible and disable compression on them to minimize the negative impact. The last challenge is that the size of compressed blocks may change on writes, requiring potential layout changes if compressed blocks are stored compactly. The article calls this the "fat write" problem. When fat writes occur, either the layout of compressed blocks changes, or the compression mechanism leaves some "slack" to absorb such changes to some degree.

SCMS solves the above issues using a unified DRAM and LLC compression architecture. First, compressed blocks are transferred in compressed form over the system bus; no extra compression or decompression is performed when data is exchanged between DRAM and the LLC. Second, SCMS only compresses two adjacent even-odd numbered cache-line-sized blocks into one block, achieving a maximum compression ratio of 2:1. This design decision not only simplifies the decompression algorithm due to the lower requirement on compression ratio, but also alleviates the "fat write" problem, since it can tolerate slight size changes after writes as long as the compressed pair still fits into a single slot (i.e., the compression ratio stays at or above 2:1). Third, SCMS features simple layouts of compressed blocks in both DRAM and the LLC. Blocks are still stored as pages in DRAM, just with two page size classes, which can be addressed with simple offset arithmetic. In the LLC, two compressed blocks always occupy the same physical slot, which can be in either of the two sets these blocks' addresses map to. The simple layout reduces metadata overhead to a minimum, as neither extra tags in the cache nor extra metadata storage in DRAM is required to perform address mapping for compressed blocks. Lastly, SCMS relies on the operating system to make compression decisions. Compression decisions are made when the OS initializes a page's contents and its page table entry. Hardware simply follows the compression decision as best it can, and only notifies the OS when necessary (e.g., when a new page must be allocated due to layout changes). This way, applications and the OS can identify pages that need compression based on program semantics. Uncompressed pages are accessed as normal at full speed, while pages with poor compressibility can also be avoided.

The compressed cache follows the conventional LLC design, with a static mapping between tag and data slots, and without tag over-provisioning or indirection. Each tag is only extended with a "compressed" bit (C bit) to track whether the slot contains two adjacent blocks in compressed form or a single uncompressed block. The mapping between block addresses and physical sets is slightly different from the conventional scheme, since in SCMS, blocks i and i + 1 (assuming i is an even-numbered block) are compressed into the same physical slot, while they map to two adjacent sets in the cache. SCMS allows the compressed composite to be stored in either set i or set (i + 1), with the tag always set to the tag of block i and the C bit set to one. On a cache lookup, both set i and set (i + 1) are probed in parallel, and a cache hit is declared if the full tag matches. The C bit is then checked to determine the layout of the physical slot: if the C bit is cleared, the slot contains uncompressed data; otherwise, the slot contains a compressed pair, whose block addresses can be recovered by concatenating the tag with i and (i + 1). To support parallel tag checks on two adjacent sets, the paper proposes partitioning the tag array into two sub-banks, one for even-numbered sets and the other for odd-numbered sets. On a set lookup, the index bits are extracted with the lowest bit masked off to obtain i; sets i and i + 1 are then read from the two sub-banks in parallel. This design essentially doubles the associativity of compressed blocks, which helps reduce unnecessary evictions when compressed blocks are inserted into the LLC.
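Below is a minimal C sketch of this paired-set lookup, assuming 64-byte blocks, 1024 sets, and 4 ways; the names (scms_lookup, line_t) are mine, and the rule that an uncompressed block must sit in its own natural set is my reading rather than something the paper states explicitly.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SHIFT 6                 /* 64-byte blocks (assumed)              */
#define NUM_SETS    1024              /* power-of-two set count (assumed)      */
#define WAYS        4
#define INDEX_BITS  10                /* log2(NUM_SETS)                        */

typedef struct {
    bool     valid;
    bool     compressed;              /* C bit: slot holds the pair i, i+1     */
    uint64_t tag;
    uint8_t  data[64];                /* one raw block or one compressed pair  */
} line_t;

static line_t cache[NUM_SETS][WAYS];

/* Probe both sets of the even/odd pair; return the matching line or NULL. */
line_t *scms_lookup(uint64_t addr)
{
    uint64_t blk     = addr >> BLOCK_SHIFT;
    uint64_t even    = blk & ~1ULL;                     /* block i (even)      */
    uint64_t tag     = even >> INDEX_BITS;              /* same tag for i, i+1 */
    uint64_t own_set = blk & (NUM_SETS - 1);            /* conventional set    */
    uint64_t sets[2] = { even & (NUM_SETS - 1),         /* set i               */
                         (even + 1) & (NUM_SETS - 1) }; /* set i + 1           */

    for (int s = 0; s < 2; s++) {                       /* parallel in hardware */
        for (int w = 0; w < WAYS; w++) {
            line_t *ln = &cache[sets[s]][w];
            if (!ln->valid || ln->tag != tag)
                continue;
            /* A compressed slot serves both blocks of the pair; an
             * uncompressed slot is assumed to sit in its own natural set. */
            if (ln->compressed || sets[s] == own_set)
                return ln;
        }
    }
    return NULL;
}
```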

On accesses to compressed blocks in the LLC from the upper level, the compressed block is sent to the decompressor for decoding. The decompressor has a buffer, which lies between the LLC and upper-level caches and serves as a small cache for the most recently decompressed blocks. The paper proposes two optimizations for the decompressor. First, the critical word of the request can be delivered as soon as it is output from the decompressor. Second, if the decompressor is processing a block read from DRAM (which happens when an access also misses the LLC), the transfer of the block and the decompression process can be pipelined, i.e., decompression can start as soon as the leading bytes are received from the system bus. This reduces access latency since it essentially overlaps bus transfer with decompression. When dirty blocks are written back from the upper level, the cache controller first checks whether the block is currently stored in the LLC in compressed form. If so, the compressed block is first read from the cache, decompressed, updated with the evicted dirty block, and then recompressed. If the recompressed block still fits into a physical slot, the physical slot is updated. Otherwise, only uncompressed blocks will be stored in the LLC: one of the two uncompressed blocks uses the current slot, while the other needs to find an empty slot in the other set. Note that decompressed blocks remain in the decompression buffer as a means of providing fast access to the most recently accessed data. The buffer is checked in parallel with an LLC lookup, and takes priority in providing uncompressed contents to the upper-level cache on a hit.
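A rough sketch of this write-back path follows, reusing line_t, BLOCK_SHIFT, and scms_lookup from the lookup sketch above; compress_pair, decompress_pair, and place_uncompressed_sibling are hypothetical helpers standing in for the hardware (de)compressor and the slot-allocation/eviction logic, not interfaces defined by the paper.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* line_t, BLOCK_SHIFT, and scms_lookup() are as in the lookup sketch above. */
bool compress_pair(const uint8_t even_blk[64], const uint8_t odd_blk[64],
                   uint8_t out[64]);           /* true iff the pair fits 64 B  */
void decompress_pair(const uint8_t in[64],
                     uint8_t even_blk[64], uint8_t odd_blk[64]);
void place_uncompressed_sibling(uint64_t addr, const uint8_t blk[64]);

/* Handle a dirty 64-byte block written back from the upper-level cache. */
void scms_writeback(uint64_t addr, const uint8_t blk[64])
{
    line_t *ln = scms_lookup(addr);
    bool is_even = ((addr >> BLOCK_SHIFT) & 1) == 0;

    if (ln == NULL) {
        /* LLC miss: allocate a slot for the uncompressed block (omitted). */
        return;
    }
    if (!ln->compressed) {
        memcpy(ln->data, blk, 64);             /* plain in-place update        */
        return;
    }

    /* The block lives in a compressed pair: read, modify, recompress. */
    uint8_t even[64], odd[64], packed[64];
    decompress_pair(ln->data, even, odd);
    memcpy(is_even ? even : odd, blk, 64);

    if (compress_pair(even, odd, packed)) {
        memcpy(ln->data, packed, 64);          /* still fits the slot          */
    } else {
        /* Fat write: keep this block uncompressed in the current slot and
         * move its sibling to the other set of the pair (eviction omitted). */
        ln->compressed = false;
        memcpy(ln->data, is_even ? even : odd, 64);
        place_uncompressed_sibling(addr ^ (1ULL << BLOCK_SHIFT),
                                   is_even ? odd : even);
    }
}
```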

When blocks are evicted from the LLC, they are transferred directly on the bus without having to be decompressed first. Compressed blocks are still organized into pages, and are stored in an address-aligned manner on the physical page (i.e., both uncompressed blocks and compressed pairs are stored at block boundaries). SCMS conserves memory by using two page size classes. Half-sized pages are used if compressing adjacent blocks can reduce the size of the page to at most half of the conventional page size; full-sized pages are used otherwise. The OS is still responsible for allocating pages. The paper suggests that both half and full pages align to the corresponding size boundaries to ease hardware address translation. One more bit is added to the physical page number in the PTE to support half-sized pages. The layout of compressed blocks is described by a small bit vector in the page table entry. Bit i in the vector indicates whether blocks 2i and (2i + 1) are compressed together (note that the vector only needs half as many bits as there are logical blocks per page). The mapping from virtual block addresses to physical block addresses happens during the MMU page walk. Once the page walker locates the PTE, the bit vector is also fetched for computing block offsets. A "1" at bit position i indicates that blocks 2i and (2i + 1) are mapped to the same block offset. The block offset is added to the base physical address stored in the PTE to derive the physical block address.
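A small sketch of how the PTE bit vector could be turned into a physical block offset during the page walk, assuming 4 KiB pages and 64-byte blocks (64 logical blocks, hence a 32-bit vector); the running-sum computation is my interpretation of the layout described above, not hardware the paper spells out.

```c
#include <stdint.h>

#define BLOCKS_PER_PAGE 64
#define PAIR_BITS       (BLOCKS_PER_PAGE / 2)    /* 32 pair bits (assumed)     */

/* vec bit i == 1  ->  blocks 2i and 2i+1 share one physical block slot. */
uint64_t scms_block_offset(uint32_t vec, unsigned logical_blk)
{
    unsigned pair = logical_blk / 2;
    unsigned off  = 0;

    /* Each earlier pair contributes one slot if compressed, two if not. */
    for (unsigned p = 0; p < pair; p++)
        off += ((vec >> p) & 1) ? 1 : 2;

    /* Within an uncompressed pair, the odd block sits one slot later. */
    if (!((vec >> pair) & 1) && (logical_blk & 1))
        off += 1;

    return (uint64_t)off * 64;                   /* byte offset within page    */
}

/* Physical block address = page base from the PTE + this offset. */
```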

When the OS allocates a page and initializes its page table entry, the OS determines how the blocks in the page should be laid out by setting the bit vector in the PTE. Block layouts are also updated if a dirty block is written back from the LLC and the compressibility of the block changes. If the block was not previously compressed, but the write-back request carries a compressed block, then both the block and its adjacent neighbor will be stored in compressed form in the page, after which the bit vector is also updated. Page shrinking may also be necessary, but it is not on the critical path. On the other hand, if a block was stored compressed but is written back in uncompressed form, the DRAM controller should first decompress the compressed pair on the DRAM page, then unpack it into two uncompressed blocks, after which the write-back is applied. The rest of the blocks in the page must also be shifted to accommodate the newly unpacked block. If the page is already full, which only happens for half-sized pages, the OS is interrupted to allocate a larger page.
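A sketch of the DRAM-side unpacking on such a fat write, modeling the physical page as an array of 64-byte slots; all names and parameters are illustrative, and the failure case mirrors the "page already full" trap to the OS described above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SLOT 64   /* 64-byte block slots (assumed) */

/* Unpack the compressed pair at slot 'pair_offset_slot' into two raw blocks.
 * Returns false if the page has no room left (half-sized page is full), in
 * which case the OS must be interrupted to allocate a full-sized page. */
bool scms_fat_write(uint8_t *page, unsigned page_slots, unsigned used_slots,
                    uint32_t *vec, unsigned pair,
                    const uint8_t even_blk[SLOT], const uint8_t odd_blk[SLOT],
                    unsigned pair_offset_slot)
{
    if (used_slots + 1 > page_slots)
        return false;                             /* page full: trap to OS     */

    /* Shift every block after the pair one slot to the right. */
    memmove(page + (pair_offset_slot + 2) * SLOT,
            page + (pair_offset_slot + 1) * SLOT,
            (size_t)(used_slots - pair_offset_slot - 1) * SLOT);

    /* Store both blocks uncompressed and clear the pair's vector bit. */
    memcpy(page + (pair_offset_slot + 0) * SLOT, even_blk, SLOT);
    memcpy(page + (pair_offset_slot + 1) * SLOT, odd_blk,  SLOT);
    *vec &= ~(1u << pair);
    return true;
}
```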

The paper proposes using the dictionary-based X-RL compression algorithm. The hardware compressor and decompressor work on blocks twice the size of an LLC cache line. The input stream is first divided into two equally sized sub-blocks, which are then compressed independently by two compressors. The output streams are stored in a byte-interleaved manner (until one of them runs out of bytes, after which the remaining bytes of the other stream are placed sequentially), such that decompression of both streams can begin as soon as the first few leading bytes are received from the bus transfer. A special header is prepended to the resulting stream as metadata. To further accelerate decompression, the paper also suggests that the decompression buffer be flash-cleared to all zeros before decompression starts. If the algorithm recognizes a consecutive run of zero bytes, no actual data needs to be written, and the circuit can simply skip these zero bytes for better throughput.
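A minimal sketch of the byte interleaving of the two independently compressed sub-block streams; it only illustrates the packing order, not the X-RL encoding itself, and the function name is mine.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack two compressed sub-block streams a and b into out, alternating bytes
 * so that both decompressors can start working on the leading bytes as they
 * arrive from the bus. Returns the total packed length. */
size_t interleave_streams(const uint8_t *a, size_t alen,
                          const uint8_t *b, size_t blen,
                          uint8_t *out)
{
    size_t i = 0, j = 0, o = 0;

    /* Alternate bytes while both streams still have data... */
    while (i < alen && j < blen) {
        out[o++] = a[i++];
        out[o++] = b[j++];
    }
    /* ...then append the remainder of the longer stream sequentially. */
    while (i < alen) out[o++] = a[i++];
    while (j < blen) out[o++] = b[j++];
    return o;
}
```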