A Space-Efficient Compressed Cache Organization for High Performance Computing



Paper Title: A Space-Efficient Compressed Cache Organization for High Performance Computing
Link: https://link.springer.com/chapter/10.1007%2F978-3-540-30566-8_109
Year: ISPA 2004
Keyword: FCMS; Cache Compression; Memory Compression




Highlights

  1. Uses a super-block tag entry, but does not encode all blocks in the super-block. Instead, only three blocks out of the 16 are encoded. As a result, the offsets of the three blocks must also be stored in the tag entry.

  2. No need to store the compressed size if blocks are stored compactly and in a pre-determined order.

  3. I did not expect this paper to come up with what seems to be an early proposal of LCP. The actual LCP paper was published 9 years after this paper.

Comments

  1. The paper uses a super-block size of 16. I am curious whether there is any justification for this design choice. Large super-blocks may incur tag contention on the same set, especially when spatial locality is strong, since a consecutive range of blocks is mapped to the same set.

This paper proposes the Fine-Grained Compressed Memory System (FCMS) as an improvement over a previous proposal, SCMS. The authors note that while SCMS reduces memory bandwidth consumption and increases effective cache capacity, its maximum compression ratio is limited to two, due to the fact that SCMS only compresses adjacent even-odd blocks into the same data slot. The paper observes that with a moderately good compression algorithm, many cache blocks can be compressed to less than 50% of their original size, which causes storage under-utilization: such blocks leave unused storage in the data slot (which the paper calls internal fragmentation).

FCMS differs from SCMS by assigning multiple size classes to compressed blocks, rather than only two (i.e., uncompressed and half-sized). Compressed blocks are assigned to different “size buckets” based on their compressed size. The paper proposes using 16 buckets, meaning that for 64-byte cache blocks (the paper seems to suggest using 128-byte blocks, but this does not affect our discussion), a block compressed to between 4i and (4i + 3) bytes is assigned to bucket i. Cache metadata is also organized such that compressed blocks can occupy a variable number of data buckets in the data array, allowing more flexible management of cached data.
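
As a quick illustration of the bucket rule (my own sketch, not code from the paper), the bucket index is just the compressed size divided by the 4-byte bucket granularity; how an incompressible 64-byte block is flagged is not described above, so it is treated as a separate case:

```c
#include <assert.h>

/* Hypothetical helper for the bucket rule above: with 64-byte blocks and
 * 16 four-byte buckets, a block compressed to between 4*i and 4*i + 3
 * bytes falls into bucket i. A block that does not compress below 64
 * bytes would presumably be stored uncompressed and handled separately. */
static inline unsigned size_bucket(unsigned compressed_bytes)
{
    assert(compressed_bytes < 64);  /* compressed blocks only */
    return compressed_bytes / 4;    /* bucket i covers sizes [4i, 4i + 3] */
}
```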

FCMS is based on SCMS. SCMS is a unified compressed LLC and memory design, in which blocks are transferred between the LLC and the DRAM in compressed form on block fetch and eviction. SCMS only attempts to compress adjacent even-odd blocks individually and store them in the same 64-byte data slot. On the LLC side, SCMS manages compressed entries as super-blocks of size two, meaning that a tag entry stores the tag of the even-numbered block, plus an extra bit to indicate whether the odd-numbered block is also present in compressed form. Metadata such as dirty bits and coherence states is duplicated in each tag entry as if there were two logical blocks per entry (replacement state is not duplicated, though, as both blocks in the entry are always evicted together).
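
The tag entry layout this implies might look roughly like the following sketch (field names and widths are my guesses; the paper's concrete encoding is not given above):

```c
#include <stdint.h>

/* Rough sketch of an SCMS tag entry under the description above: the entry
 * names the even-numbered block, one bit says whether the odd-numbered
 * companion is also resident in compressed form, per-block state is
 * duplicated, and replacement state is shared by the pair. */
struct scms_tag_entry {
    uint64_t tag;          /* address tag of the even-numbered block          */
    uint8_t  pair_present; /* 1 if the odd-numbered companion is also stored  */
    uint8_t  valid[2];     /* per-logical-block valid bits                    */
    uint8_t  dirty[2];     /* per-logical-block dirty bits                    */
    uint8_t  coh_state[2]; /* per-logical-block coherence states              */
    uint8_t  repl;         /* shared replacement state (pair evicted together) */
};
```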

SCMS also skews the index generation function, such that blocks 2i and (2i + 1) can be stored in either of the two sets computed from addresses 2i and (2i + 1) (these two sets are adjacent if the index generation function just uses the lowest few bits of the block address). On a cache lookup, both sets need to be probed, which essentially doubles the associativity of the SCMS cache.
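
A minimal sketch of the double probing, assuming the plain low-bits index function mentioned in the parenthesis (the helper name is mine):

```c
#include <stdint.h>

/* A block may reside in the set indexed by its own address or by its
 * even/odd partner's address (blk_addr ^ 1), so a lookup probes both. */
static inline void candidate_sets(uint64_t blk_addr, unsigned set_bits,
                                  unsigned *set_a, unsigned *set_b)
{
    unsigned mask = (1u << set_bits) - 1;
    *set_a = (unsigned)(blk_addr & mask);        /* set from own address       */
    *set_b = (unsigned)((blk_addr ^ 1) & mask);  /* set from partner's address */
}
```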

SCMS manages uncompressed and compressed blocks in DRAM at the granularity of pages. Two size classes are supported, namely full-sized pages and half-sized pages. In half-sized pages, all even-odd pairs are stored in compressed form. In full-sized pages, some blocks are stored in uncompressed form. Both uncompressed blocks and compressed pairs are aligned to 64-byte boundaries for easy address generation. A bit vector in the page table entry of a full-sized page describes the data layout of the page and is used to compute block offsets on an access request. The OS maintains pools of both size classes, and is responsible for allocating pages, performing data migration on size class changes, and maintaining page table entries.
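
The offset computation could work as in the sketch below; the exact bit-vector encoding is not described above, so I assume one bit per even-odd pair, set when the pair shares a single 64-byte slot and clear when both blocks occupy their own slots:

```c
#include <stdint.h>

/* Hedged sketch: byte offset of even-odd pair `pair_idx` within a
 * full-sized page, given a bit vector with one bit per pair (assumed
 * encoding). A compressed pair takes one 64-byte slot, an uncompressed
 * pair takes two. With 4 KB pages and 64-byte blocks there are 32 pairs. */
static uint32_t pair_byte_offset(uint32_t pair_bits, unsigned pair_idx)
{
    uint32_t slots = 0;
    for (unsigned i = 0; i < pair_idx; i++)
        slots += ((pair_bits >> i) & 1) ? 1 : 2;
    return slots * 64;
}
```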

FCMS uses a different cache and DRAM page organization. On the cache side, FCMS adopts super-block-based index generation to spend fewer metadata bits on address tags. Each entry has only one address tag, plus three super-block offset fields that encode up to three arbitrary blocks within the super-block. The super-block size is 16, meaning that 16 aligned and consecutive blocks in the physical address space are mapped to the same set. Given a block address, the index generation function selects the middle few bits as the set index, the higher bits as the tag, and the lower four bits as the super-block offset. Both the tag and the offset are compared against the requested address on lookups. Per-block metadata states are duplicated three times, such that each tag entry encodes at most three logical blocks. FCMS does not use skewed index generation.
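
The address split can be sketched as follows (operating on block addresses, i.e., the byte offset within the 64-byte block has already been stripped; the number of set-index bits is left as a parameter, and the names are mine):

```c
#include <stdint.h>

#define SB_OFFSET_BITS 4  /* 16 blocks per super-block */

struct fcms_addr {
    uint64_t tag;        /* high bits: super-block tag                    */
    unsigned set;        /* middle bits: set index                        */
    unsigned sb_offset;  /* low 4 bits: block position in the super-block */
};

/* Illustrative decomposition of a block address into tag / set / offset. */
static struct fcms_addr split_block_addr(uint64_t blk_addr, unsigned set_bits)
{
    struct fcms_addr a;
    a.sb_offset = (unsigned)(blk_addr & ((1u << SB_OFFSET_BITS) - 1));
    a.set       = (unsigned)((blk_addr >> SB_OFFSET_BITS) & ((1u << set_bits) - 1));
    a.tag       = blk_addr >> (SB_OFFSET_BITS + set_bits);
    return a;
}
```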

FCMS does not decouple data slots from tag entries, meaning that there is still a one-to-one correspondence between a tag entry and the data slot storing the blocks encoded by that entry. Data slots are segmented into buckets for storing variable-sized compressed blocks. Each block must be stored in consecutive buckets, and all blocks in the same tag entry are stored compactly (i.e., with no external fragmentation) and in a pre-determined order. The latter property simplifies size computation, since the size of a compressed block in units of buckets can be derived by subtracting its starting offset from the starting offset of the next block in the pre-determined order. The tag entry contains offset fields that store the starting bucket of each compressed block.
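
The implicit-size rule can be sketched as below. How the end of the last block in the fixed order is delimited is not stated above, so I assume a per-entry count of buckets in use (the struct and names are illustrative):

```c
#include <stdint.h>

struct fcms_entry_layout {
    uint8_t start_bucket[3];  /* starting bucket of each block, in the fixed order */
    uint8_t buckets_used;     /* assumed: total buckets occupied by this entry     */
};

/* Size of block i in buckets: next block's start minus its own start. */
static unsigned block_size_in_buckets(const struct fcms_entry_layout *l, int i)
{
    unsigned end = (i == 2) ? l->buckets_used : l->start_bucket[i + 1];
    return end - l->start_bucket[i];
}
```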

On the main memory side, cache blocks are managed at page granularity. The paper does not mention whether there are different page size classes, and if so, how many (this would be the same across different proposals, so I do not really mind it not being discussed). To simplify address generation for blocks within a page, blocks are not stored compactly; otherwise, the offset of each block would need to be maintained somewhere, and frequent shifting and data compaction would be required whenever the compressed size of a block changes. Instead, FCMS proposes something similar to Linearly Compressed Pages (LCP, which was published in 2013, while this paper was written in 2004): for each page, only a single slot size is maintained, which is stored in the page table, cached by the TLB, and used by the memory controller to address blocks. The page-level slot size is chosen by the OS using an unspecified criterion, and may be subject to change if the compressibility of blocks on the page changes significantly. The paper justifies this design choice by arguing that blocks on most pages have similar compression ratios due to data locality (this is also the major motivation of LCP). As a result, the address of a block on the page can be generated easily by adding to the page base address the block index within the page multiplied by the per-page slot size.
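
The resulting address computation is just a multiply-add, as in this sketch (names are illustrative, not from the paper):

```c
#include <stdint.h>

/* Locate block `block_index` within a compressed page whose per-page
 * slot size (from the page table / TLB) is `slot_size` bytes. */
static inline uint64_t block_phys_addr(uint64_t page_base,
                                       unsigned block_index,
                                       unsigned slot_size)
{
    return page_base + (uint64_t)block_index * slot_size;
}
```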