Translation-optimized Memory Compression for Capacity



Paper Title: Translation-optimized Memory Compression for Capacity
Link: https://ieeexplore.ieee.org/document/9923870
Year: MICRO 2022
Keyword: Memory Compression; TMCC; Deflate




Highlights:

  1. The extra level of indirection from the physical address to the storage location can be embedded in the page table page by compressing the PTEs and making space for the translation entry.

  2. The standard implementation of Deflate on ASIC is overly general-purpose and hence incurs high overhead. We can reduce the overheads by customizing it towards memory compression.

Comments:

  1. This paper lacks a high-level overview of the design and focuses too much on implementation details. As a result, the overall picture of how TMCC works is not clear. For example, the paper does not describe how accesses to compressed pages are detected, or how these pages are moved between ML1 and ML2.

  2. It is not clear how TMCC would interact with 2MB huge pages. I can imagine the PTE trick continuing to work, but how would 2MB pages be compressed with the customized compressor? Are they simply treated as individual 4KB pages?

This paper proposes TMCC, a hardware-based main memory compression technique modeled after the existing OS-supported memory compression framework. The paper focuses on two aspects of the design to make it efficient. First, existing hardware designs introduce an extra level of indirection between the physical address and the storage location of compressed pages. This extra level of indirection can incur significant overhead on the access critical path and should therefore be reduced. Second, prior works on hardware dictionary compression proposed specialized ASIC compression and decompression engines. However, these engines are designed for general-purpose usage scenarios and emphasize compatibility, resulting in less efficient operation. The paper proposes a more specialized ASIC design that eliminates these overheads by tailoring the algorithm to the memory compression use case.

The work of this paper builds on a software-based memory compression approach implemented within the OS kernel (the OS-inspired approach in the paper). In the software-based approach, physical memory is managed at page granularity in two levels, namely memory level 1 (ML1) and memory level 2 (ML2). ML1 consists of uncompressed physical pages, for which the virtual-to-physical mapping is set up normally as in an uncompressed system. ML2, by contrast, consists of compressed physical pages, which are not directly mapped via the virtual memory system. Instead, accesses to virtual addresses that map to ML2 pages trigger page faults, upon which the OS decompresses the page into a free page frame and sets up the mapping, thereby moving the page from ML2 to ML1. Due to the high overhead of accessing ML2 pages, the OS kernel only moves a page from ML1 to ML2 when the page is deemed “cold”, i.e., when the system is under memory pressure and the page has not been accessed for a while. Cold pages are detected using an LRU list called the “Recency List”, which is updated for only a small fraction of accesses to approximate the actual access recency (although the paper does not explain how this sampling is implemented).
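
To make the ML1/ML2 split concrete, here is a minimal sketch of the OS-inspired scheme under simplified assumptions: a strict LRU Recency List updated on every access, and zlib standing in for the kernel's compressor. The class and method names are illustrative, not from the paper.

```python
import zlib
from collections import OrderedDict

class TwoLevelMemory:
    """Toy model: ML1 holds uncompressed pages, ML2 holds compressed pages,
    and an OrderedDict doubles as the Recency List (front = coldest)."""

    def __init__(self, ml1_capacity):
        self.ml1_capacity = ml1_capacity
        self.ml1 = OrderedDict()   # page id -> raw bytes, ordered by recency
        self.ml2 = {}              # page id -> compressed bytes

    def access(self, page_id):
        if page_id in self.ml1:
            self.ml1.move_to_end(page_id)  # Recency List update (sampled in the real system)
            return self.ml1[page_id]
        # The equivalent of the page-fault path: decompress from ML2 back into ML1.
        data = zlib.decompress(self.ml2.pop(page_id))
        self.install(page_id, data)
        return data

    def install(self, page_id, data):
        # Under memory pressure, move the coldest ML1 pages down into ML2.
        while len(self.ml1) >= self.ml1_capacity:
            cold_id, cold_data = self.ml1.popitem(last=False)
            self.ml2[cold_id] = zlib.compress(cold_data)
        self.ml1[page_id] = data

# Usage: with room for two ML1 pages, installing a third pushes the coldest to ML2,
# and accessing it again faults it back in.
mem = TwoLevelMemory(ml1_capacity=2)
for pid in range(3):
    mem.install(pid, bytes(4096))
assert mem.access(0) == bytes(4096)
```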

In the software approach, the OS maintains two free lists consisting of physical storage at the page or sub-page level. The first list belongs to ML1 and always maintains storage at page granularity. This list can also be regarded as the OS physical page allocator. The second list belongs to ML2, and it is further divided into different size classes to accommodate compressed pages of different sizes. Physical storage can move around in both lists. When a page is moved from ML1 to ML2, it is broken down into smaller sub-pages of the desired size class. On the other hand, when free ML2 sub-pages form a whole page, the page can also be optionally moved to ML1.
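
A similarly simplified sketch of the two free lists is shown below; the size classes and the splitting/coalescing policy are assumptions for illustration, not the actual kernel data structures.

```python
PAGE_SIZE = 4096
SIZE_CLASSES = [256, 512, 1024, 2048]   # assumed sub-page size classes for ML2

class FreeLists:
    def __init__(self, frames):
        self.ml1_free = list(frames)                   # whole free page frames (page allocator)
        self.ml2_free = {c: [] for c in SIZE_CLASSES}  # size class -> free (frame, offset) sub-pages

    def alloc_compressed(self, compressed_size):
        """Place a compressed page into the smallest size class that fits it."""
        size_class = next(c for c in SIZE_CLASSES if c >= compressed_size)
        if not self.ml2_free[size_class]:
            # Move a page from ML1 to ML2 by splitting it into sub-pages of this class.
            frame = self.ml1_free.pop()
            for offset in range(0, PAGE_SIZE, size_class):
                self.ml2_free[size_class].append((frame, offset))
        return self.ml2_free[size_class].pop()

    def free_compressed(self, size_class, slot):
        self.ml2_free[size_class].append(slot)
        # Optionally: if every sub-page of slot's frame is free again, the whole
        # frame could be removed from ml2_free and returned to ml1_free.
```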

The paper identifies two issues when the above mechanism is implemented in hardware. First, in a hardware implementation, compression is transparent to the OS kernel, so the kernel no longer adjusts the virtual-to-physical mapping when pages are compressed. Consequently, the hardware must manage a mapping of its own, which inevitably adds another level of indirection between the physical address space and the storage location of the page. Unfortunately, this mapping lookup lies on the critical path of memory accesses, on top of regular address translation, as it must be performed on LLC misses after the physical address is already available. Prior works attempted to alleviate this cost by adding an extra translation metadata cache at the memory controller. The metadata cache stores frequently used entries such that most translations can be performed with low latency.
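
The cost of this prior approach can be modeled roughly as below; the cache organization (fully associative, LRU) and the page-granularity keys are my assumptions for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 4096

class MetadataCache:
    """Toy model of a translation metadata cache at the memory controller:
    maps a physical page to its storage location, charging one extra memory
    access on every miss before the data access can begin."""

    def __init__(self, capacity, backing_table):
        self.capacity = capacity
        self.entries = OrderedDict()         # physical page -> storage location
        self.backing_table = backing_table   # full translation table kept in memory

    def translate(self, phys_addr):
        page, offset = divmod(phys_addr, PAGE_SIZE)
        if page in self.entries:
            self.entries.move_to_end(page)
            extra_accesses = 0               # hit: translation adds little latency
        else:
            self.entries[page] = self.backing_table[page]
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)
            extra_accesses = 1               # miss: one more memory access on the critical path
        return self.entries[page] + offset, extra_accesses
```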

The second issue is that, by compressing only cold pages as in the software approach, the raw compression ratio of the algorithm must be higher than in a scenario where all pages are compressed, in order to reach the same overall capacity gain. Prior works have relied on Deflate compressors and decompressors implemented on ASICs to deliver a high raw compression ratio. However, the paper argues that the actual ASIC implementation is still sub-optimal. For example, the circuit offers a moderate peak bandwidth but incurs a long setup time whenever it starts processing a new page that is independent of the previous one. As a result, the overall compression bandwidth is much lower than the peak bandwidth, which can also negatively affect overall system performance.
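
A back-of-the-envelope model makes the bandwidth gap concrete; the numbers below are illustrative, not measurements from the paper.

```python
def effective_bandwidth(page_bytes, peak_gb_per_s, setup_ns):
    """Effective bandwidth when every page pays a fixed setup cost.
    1 GB/s is treated as 1 byte/ns, so the units line up."""
    transfer_ns = page_bytes / peak_gb_per_s
    return page_bytes / (setup_ns + transfer_ns)   # bytes per ns == GB/s

# Example: a 4 KB page at 10 GB/s peak with a 1000 ns per-page setup time
# yields only ~2.9 GB/s of effective compression bandwidth.
print(effective_bandwidth(4096, 10, 1000))
```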

To address the first problem, the paper proposes embedding the translation entries (called CTEs) that map physical addresses to hardware storage locations into the regular page table. The CTEs are fetched during the page table walk by the MMU page table walker and inserted into a special translation cache located next to the L2 cache. In order to embed CTEs into regular page table entries, the paper proposes compressing the regular PTEs on each page table page (called a PTB). The paper also notes that PTE compression works very well in practice, because (1) many pages, especially those mapped by the same PTB, share the same permission bits, and (2) the higher bits of the physical address tend to be identical, especially on systems that do not use the whole physical address space. As a result, if the PTEs on a PTB are compressed, the CTEs belonging to the next level of translation can be embedded in the same PTB, occupying the storage freed up by compression.

For regular pages that store data, the CTE contains status bits indicating the compression status of the page, as well as the storage location of the page if it is compressed. For pages that store PTEs, i.e., PTB pages, the compression metadata is maintained in the upper-level PTB. In this case, the metadata describes the layout of the compressed PTB (since some PTEs may not be compressible) rather than a translated physical address (the physical address of the next-level PTB is always the one stored in the upper-level PTE). When the compression ratio is limited, only a small number of CTEs can be embedded in the PTB, which upper-bounds the number of pages that can be compressed within the address range covered by that PTB.
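
The following sketch illustrates the PTE-compression idea under heavy assumptions (a 16-bit delta per PTE, a 32-bit CTE, and permission bits shared across the whole PTB); the actual encoding in the paper differs, so treat this only as the shape of the argument.

```python
PTE_BITS = 64
PTES_PER_PTB = 512          # a 4KB page table page holding 64-bit PTEs
DELTA_BITS = 16             # assumed width of the per-PTE compressed residue
CTE_BITS = 32               # assumed size of one embedded CTE

def ptb_compression(ptes):
    """ptes: list of (physical_frame_number, permission_bits).
    Returns (bits_used, number_of_CTEs_that_fit_in_the_freed_space)."""
    base_high = ptes[0][0] >> DELTA_BITS
    base_perm = ptes[0][1]
    compressible = all(pfn >> DELTA_BITS == base_high and perm == base_perm
                       for pfn, perm in ptes)
    if not compressible:
        return len(ptes) * PTE_BITS, 0              # incompressible PTB: no room for CTEs
    bits_used = PTE_BITS + len(ptes) * DELTA_BITS   # one shared header + small deltas
    bits_freed = len(ptes) * PTE_BITS - bits_used
    return bits_used, bits_freed // CTE_BITS

# 512 PTEs that share permission bits and high PFN bits free enough space
# to embed hundreds of CTEs in the same PTB.
ptes = [(0x00AB0000 + i, 0b111) for i in range(PTES_PER_PTB)]
print(ptb_compression(ptes))
```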

During a page walk, the MMU page table walker fetches both the regular translation entry and the CTE from the memory hierarchy. The CTEs are then inserted into a translation cache called the CTE buffer. The CTE buffer is keyed by physical address and translates physical addresses to storage locations in physical memory. On regular memory accesses, the physical address is translated by the CTE buffer, and the resulting address is used to access main memory. Page table walk accesses, however, are not translated by the CTE buffer; the physical address is used to access main memory directly, since PTBs always reside in ML1.
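
Put together, the access path might look like the sketch below; the structure names and the exact redirection for compressed pages are simplified assumptions.

```python
PAGE_SIZE = 4096

class CTEBuffer:
    """Translation cache next to the L2 cache, keyed by physical page number."""
    def __init__(self):
        self.entries = {}   # physical page -> (is_compressed, storage_location)

    def insert(self, phys_page, is_compressed, storage_location):
        # Called by the page table walker after it fetches the CTE alongside the PTE.
        self.entries[phys_page] = (is_compressed, storage_location)

    def target_address(self, phys_addr, is_page_walk_access):
        # Page table pages (PTBs) always reside uncompressed in ML1,
        # so walker accesses bypass the translation entirely.
        if is_page_walk_access:
            return phys_addr
        page, offset = divmod(phys_addr, PAGE_SIZE)
        is_compressed, location = self.entries.get(page, (False, None))
        if not is_compressed:
            return phys_addr     # ML1 page: physical address used directly
        return location          # ML2 page: redirect to the compressed storage block
```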

To address the second problem, the paper presents a new Deflate implementation on ASIC that is specifically optimized for throughput. The proposed implementation differs from the standard implementation as follows. First, the paper observes that the LZ compression stage (the first stage of Deflate) requires a large CAM array for string matching. The CAM array serves as the dictionary that records a truncated history of the bytes already processed. In practice, however, since pages are only 4KB in size, a smaller 1KB CAM is sufficient to handle most cases without losing much compression ratio. Compared with the original 32KB CAM array, the reduced-size CAM saves energy and enables lower access latency. Second, the paper proposes using a 256-character alphabet for LZ, instead of the 286-character one in standard Deflate. The 256-character alphabet also makes the algorithm more efficient, as all output words can be encoded with 8 bits rather than 9 bits.

Next, the paper observes that the majority of the setup overhead in the prior work comes from preparing the Huffman tree stored in hardware registers (part of the second stage of Deflate). To reduce this latency, the paper proposes using a smaller Huffman tree so that the preparation phase can be shortened; specifically, it uses a Huffman tree of 16 elements instead of the standard 286, with only a marginal loss in compression ratio. Lastly, the paper observes that the two stages of compression, i.e., LZ compression and Huffman encoding, do not overlap in the prior implementation, because the Huffman tree must be built from the full distribution of characters in the LZ output. To eliminate this limitation, the paper suggests that the distribution can be approximated using only a small fraction of the LZ output. In the final implementation, the distribution is computed from only the first few characters of the LZ output, after which the Huffman stage starts and processes the LZ output in a pipelined manner, reducing the overall latency of the operation.
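
The pipelining idea (building the Huffman code from only a prefix of the LZ output) can be sketched as follows; the 256-byte sample size and the software-style code construction are assumptions for illustration, not the paper's exact circuit.

```python
import heapq, itertools
from collections import Counter

def huffman_code_lengths(freqs):
    """Standard Huffman construction returning a code length per symbol."""
    tie = itertools.count()                    # tie-breaker keeps heap entries comparable
    heap = [(f, next(tie), [s]) for s, f in freqs.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1                    # merging two subtrees deepens all their symbols
        heapq.heappush(heap, (f1 + f2, next(tie), s1 + s2))
    return lengths

def approximate_huffman(lz_output, sample_bytes=256):
    """Estimate the symbol distribution from only a prefix of the LZ output,
    so Huffman encoding can start while LZ is still producing the rest."""
    return huffman_code_lengths(Counter(lz_output[:sample_bytes]))

# Example: the code built from a short prefix is then used for the whole page.
lengths = approximate_huffman(bytes(range(64)) * 64)
```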