A Robust Main-Memory Compression Scheme
Link: https://dl.acm.org/doi/10.1109/ISCA.2005.6
Year: ISCA 2005
Keyword: Compression; Paging
This paper proposes a simple yet efficient main memory compression scheme. The proposed scheme is meant to be general enough to apply to any system with minimal application-level modification, while still saving space effectively under the existing virtual memory framework. The paper identifies three major challenges in designing a main memory compression scheme. First, decompression is on the critical path: the processor must stall the read uop until decompression completes. Compression, on the other hand, is only triggered when a line is evicted, which can usually be performed in the background. Second, a compressed main memory no longer maps virtual addresses to physical addresses linearly within a page. The mapping depends on the compressibility of the page as well as the placement policy and page layout, which inevitably adds per-page metadata. A carefully designed compression scheme should prevent this metadata from consuming extra memory bandwidth and/or occupying too much on-chip space. The last challenge is that compressed pages or blocks may not always be stored compactly within a page. Fragmentation is always a concern when both blocks and pages are variably sized; it reduces the efficiency of space saving and complicates OS storage management.
To solve the first challenge, the paper observes that in many cases cache lines are filled entirely with zeros, which makes them perfect candidates for highly efficient compression. Besides, frequent pattern compression (FPC) also works well on most workloads. Compared with more complicated, dictionary-based schemes, using a variant of FPC optimized for zero compression has the following benefits. First, these algorithms are easy to implement in hardware, and they only add a few extra cycles on the critical path for decompression. Second, zero is a special value whose usage is ubiquitous, e.g., for initialization, sparse matrices, and NULL pointers. Optimizing for zeros achieves a reasonable compression ratio even compared with more complicated algorithms; in fact, the paper suggests that a 50% compression ratio is achievable with this simple variant of FPC. The paper also notes that the choice of compression algorithm is orthogonal to the compression scheme discussed in later sections: any reasonable algorithm fits into the framework as long as it can be implemented in hardware.
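To make this concrete, here is a minimal C sketch of zero detection plus an FPC-style compressed-size estimate, assuming 64-byte lines of 32-bit words; the pattern set and per-pattern bit costs are illustrative stand-ins, not the paper's exact encoding:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64
#define WORDS_PER_LINE (LINE_BYTES / sizeof(uint32_t))

/* Returns true if the cache line contains only zero bytes; such lines
 * compress to size class p1 (zero bytes of payload). */
static bool line_is_zero(const uint32_t *line) {
    for (size_t i = 0; i < WORDS_PER_LINE; i++)
        if (line[i] != 0)
            return false;
    return true;
}

/* Simplified FPC-style size estimate: each 32-bit word is matched against
 * a few frequent patterns and charged a prefix plus payload cost in bits. */
static size_t fpc_compressed_bits(const uint32_t *line) {
    size_t bits = 0;
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        uint32_t w = line[i];
        if (w == 0)
            bits += 3;                    /* prefix only: zero word */
        else if ((int32_t)w >= -128 && (int32_t)w < 128)
            bits += 3 + 8;                /* sign-extended 8-bit value */
        else if ((int32_t)w >= -32768 && (int32_t)w < 32768)
            bits += 3 + 16;               /* sign-extended 16-bit value */
        else
            bits += 3 + 32;               /* uncompressible word */
    }
    return bits;
}
```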
The overall model is as follows. In addition to existing memory management units such as pages and blocks, the paper proposes dividing logical pages into smaller units called subpages; given 8KB page frames, subpages can be 1KB. Subpages are introduced to avoid excessive relocation of cache lines within the page when the compressed size of a line changes. Compression is performed on cache lines only. Compressed lines are classified into size classes; the paper uses n to denote the number of size classes p1, p2, …, pn. Among these, p1 always holds zero-sized lines (after compression) consisting of only zeros, while pn holds uncompressed lines. The use of size classes simplifies address mapping, since the address of a line can be generated by summing the size classes of the lines before it. Similarly, subpages and full pages are classified into size classes; the paper assumes there are also n size classes for subpages and for full pages. The size class of a full page determines how much physical storage the compressed page occupies.
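As a sketch of how a compressed length might be rounded up to a size class, assuming n = 4 with illustrative class boundaries (only p1 = 0 bytes and pn = the uncompressed 64 bytes follow from the description above; the intermediate boundaries are a design parameter):

```c
#include <stddef.h>

/* Illustrative block size classes in bytes for n = 4. */
static const size_t size_class_bytes[] = { 0, 21, 44, 64 };
#define NUM_CLASSES (sizeof(size_class_bytes) / sizeof(size_class_bytes[0]))

/* Round a compressed length up to the smallest class that can hold it. */
static unsigned size_class_for(size_t compressed_len) {
    for (unsigned c = 0; c < NUM_CLASSES; c++)
        if (compressed_len <= size_class_bytes[c])
            return c;
    return NUM_CLASSES - 1;   /* does not fit: store uncompressed */
}
```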
Cache lines and subpages are not stored compactly within their containing units. A small amount of extra space, called the hysteresis, is padded after each block and subpage, so that if the size of a block or subpage increases slightly, no global relocation occurs. For cache blocks, this means that if the size of a block changes after it is written and evicted from the LLC, no cache line needs to be moved, as long as the increase can be absorbed by the hysteresis area. Even if the size increase does trigger a cache line relocation, as long as the increase does not exceed the extra space after the subpage, no other subpage within the same page needs to be relocated. Similarly, when a line shrinks after compression, the hysteresis also prevents other lines and subpages from being relocated.
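The propagation of a size change can be sketched as below, with hypothetical per-block and per-subpage slack parameters standing in for the hysteresis space; this simplifies the layout and is not the paper's exact policy:

```c
#include <stddef.h>

typedef enum { NO_RELOC, RELOC_WITHIN_SUBPAGE, RELOC_ACROSS_SUBPAGES } reloc_t;

/* Decide how far a block's size change propagates. Shrinks and small
 * growths are absorbed by the block's trailing padding; larger growths
 * shift later lines, and only subpage overflow moves other subpages. */
static reloc_t relocation_needed(size_t old_size, size_t new_size,
                                 size_t block_slack, size_t subpage_slack) {
    if (new_size <= old_size + block_slack)
        return NO_RELOC;              /* padding absorbs the change */
    size_t overflow = new_size - (old_size + block_slack);
    if (overflow <= subpage_slack)
        return RELOC_WITHIN_SUBPAGE;  /* shift later lines inside the subpage */
    return RELOC_ACROSS_SUBPAGES;     /* later subpages must move as well */
}
```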
Size classes of blocks and subpages are encoded in page table entries. With n size classes, log2(n) bits are needed to encode the class of each block and each subpage. The size class of the page is also stored as part of the page's properties. The paper does not mention how this extra information fits into the existing page table layout, in which only a few unused bytes are available for extension. On a TLB miss, the page table walker fetches both the translation information and these size classes. The paper suggests that size classes can be stored either at the memory controller, or co-located with the LLC in a buffer called the Block Size Table (BST). In the former design, the memory controller is responsible for compressing a block when it is evicted from the LLC and decompressing it when fetched; in the latter, the LLC controller handles these tasks. It should be noted that co-locating the BST with the LLC has the obvious advantage of fast reads of zero-filled lines, since no memory request is generated once the size of the line is determined.
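Since the paper leaves the entry layout unspecified, the following is one plausible BST entry packing, assuming 8KB pages of 128 64-byte blocks, 8 subpages, and n = 4 (hence 2 bits per size class):

```c
#include <stdint.h>

/* Hypothetical BST entry for one 8KB page. */
struct bst_entry {
    uint64_t block_class[4];   /* 128 blocks x 2 bits = 256 bits */
    uint16_t subpage_class;    /* 8 subpages x 2 bits = 16 bits */
    uint8_t  page_class;       /* 2 bits used */
};

/* Extract the 2-bit size class of block blk (32 blocks per 64-bit word). */
static unsigned get_block_class(const struct bst_entry *e, unsigned blk) {
    return (unsigned)((e->block_class[blk / 32] >> ((blk % 32) * 2)) & 0x3);
}
```

Packing the per-block classes into 256 bits keeps a whole page's metadata within a few memory words, which is presumably what lets the walker fetch it alongside the translation.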
Address translation between the uncompressed and compressed address spaces happens when a line is fetched into or evicted from the LLC. The tagging logic is unchanged and still uses physical addresses in the uncompressed address space. When a block address is to be translated, the page's entry in the BST is accessed. The translation logic first converts the size classes of all preceding subpages, and of all preceding blocks within the same subpage, to actual sizes, and then sums them up to produce the block's final offset within the page. A few cascaded adders can achieve this with negligible area and cycle overhead. In the case of an LLC eviction, the compression logic is invoked in parallel to hide some of the compression overhead. For line fetches, the decompression logic is invoked after the line is read from memory.
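Reusing the bst_entry and size_class_bytes sketches above, plus an equally illustrative table for subpage classes, the cascaded-adder translation amounts to the following (hysteresis padding is ignored for brevity):

```c
#include <stddef.h>

/* Illustrative subpage size classes in bytes for a 1KB subpage (n = 4). */
static const size_t subpage_class_bytes[] = { 0, 336, 704, 1024 };
#define BLOCKS_PER_SUBPAGE 16   /* 1KB subpage / 64B lines */

static unsigned get_subpage_class(const struct bst_entry *e, unsigned sp) {
    return (e->subpage_class >> (sp * 2)) & 0x3;
}

/* Byte offset of block blk inside its compressed page: sum the sizes of
 * all earlier subpages, then of earlier blocks in the same subpage.
 * Hardware would realize these sums with a few cascaded adders. */
static size_t block_offset(const struct bst_entry *e, unsigned blk) {
    size_t off = 0;
    unsigned sp = blk / BLOCKS_PER_SUBPAGE;
    for (unsigned s = 0; s < sp; s++)
        off += subpage_class_bytes[get_subpage_class(e, s)];
    for (unsigned b = sp * BLOCKS_PER_SUBPAGE; b < blk; b++)
        off += size_class_bytes[get_block_class(e, b)];
    return off;
}
```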
When a line is evicted from the LLC, if the compressed size is smaller than the original size but within the hysteresis, no relocation is needed, as discussed above. If the size reduction goes beyond the hysteresis, lines and subpages after the target line may be relocated, if doing so reduces the size of the page. (The paper does not say much about size reduction, and simply assumes that the page is compressed once it is fetched from disk.) If the compressed size is larger and cannot be absorbed by the hysteresis, existing lines and subpages are relocated by shifting them toward higher addresses. The operating system maintains page pools of different size classes. If the size class of a page changes after relocation, an exception is raised so the OS can relocate the entire page to a different size class. In other cases, the memory controller can simply perform DMA writes to relocate cache blocks in the background. BST entries are also updated whenever any size class of the page changes. The updated BST entry must be synchronized with the page table (in either write-back or write-through mode) to ensure correct future translations. If multiple BSTs are present, their contents must be kept consistent with one another using a mechanism similar to cache coherence and/or TLB shootdown.
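Putting the eviction path together under the same assumptions, and simplifying the hysteresis to the unused space inside a block's current size class, the controller's decision might look like the sketch below; the constants and the page-class mapping are illustrative, and the caller would still perform the actual relocation, OS upcall, and BST write-back:

```c
typedef enum { WB_IN_PLACE, WB_RELOCATE, WB_OS_EXCEPTION } wb_action_t;

/* Overwrite the 2-bit size class of block blk in the BST entry. */
static void set_block_class(struct bst_entry *e, unsigned blk, unsigned cls) {
    uint64_t *w = &e->block_class[blk / 32];
    unsigned sh = (blk % 32) * 2;
    *w = (*w & ~(0x3ULL << sh)) | ((uint64_t)cls << sh);
}

/* Illustrative mapping from total payload bytes to a page size class,
 * in equal 2KB steps up to the uncompressed 8KB frame. */
static unsigned size_class_for_page(size_t bytes) {
    size_t step = 8192 / 4;
    unsigned c = (unsigned)((bytes + step - 1) / step);
    return c ? c - 1 : 0;
}

/* On a dirty LLC eviction: recompress, update the BST, and decide whether
 * the change stays local, shifts later blocks, or needs an OS exception. */
static wb_action_t handle_writeback(struct bst_entry *e, unsigned blk,
                                    const uint32_t *line) {
    size_t len = line_is_zero(line) ? 0
                                    : (fpc_compressed_bits(line) + 7) / 8;
    unsigned new_cls = size_class_for(len);
    if (new_cls == get_block_class(e, blk))
        return WB_IN_PLACE;              /* absorbed by in-class slack */

    set_block_class(e, blk, new_cls);
    size_t page_bytes = 0;               /* re-derive the page size class */
    for (unsigned b = 0; b < 128; b++)
        page_bytes += size_class_bytes[get_block_class(e, b)];
    if (size_class_for_page(page_bytes) != e->page_class)
        return WB_OS_EXCEPTION;          /* OS moves page to another pool */
    return WB_RELOCATE;                  /* background DMA shifts blocks */
}
```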
While data is being relocated within a page, all accesses to the page are blocked by the controller to avoid tricky race conditions. This might harm performance, since it incurs stalls that are not present with an uncompressed main memory. The paper argues, however, that this only happens when a dirty block is evicted from the LLC; by the nature of the replacement algorithm, this suggests the page has not been recently accessed by the upper-level caches, so it is unlikely that a fetch request to the same page is generated at the same time.