Linearly Compressed Pages: A Low-Complexity, Low-Latency Main Memory Compression Framework
Link: https://ieeexplore.ieee.org/document/7847624/
Year: MICRO 2013
Keyword: Memory Compression; LCP
This paper proposes Linearly Compressed Pages (LCP), a main memory compression framework featuring simple algorithm design, low-cost deployment, and straightforward memory management. Broadly speaking, main memory compression schemes exist to satisfy one or both of two goals. The first goal is to reduce memory usage by allowing cache lines or pages to be packed into a more compact form after compression. This way, less physical memory needs to be allocated for a full 4KB page frame, and hence more pages can be mapped into virtual memory at the same time. The second goal is to save bandwidth when fetching cache lines into the cache hierarchy. Memory designs either employ compression as a prefetching scheme, allowing adjacent cache lines to be brought into the LLC in the same transaction that fulfills a cache miss, or redesign the memory bus protocol such that less data needs to be transferred per transaction, increasing bus throughput.
The paper points out that both commercial memory compression schemes and academic proposals face several challenges that make the above two goals difficult to achieve. First, memory compression changes one of the most fundamental abstractions: that the virtual address space is linearly mapped onto the physical address space. The linear map not only simplifies physical address computation, but also simplifies memory management, since the OS can conveniently divide the physical address space into uniformly sized pages and manage it as a fixed-size page buffer pool. With compression, pages can be variably sized, which introduces external fragmentation as in malloc(). Second, by compressing cache lines, we store them at a hardware address different from the translated physical address. This complicates the semantics of cache tags, since, without compression, the cache tag identified both the physical address and the hardware address at once. With compression, there are now three address spaces: the virtual address space, the physical address space, and the compressed address space. Some designs, such as IBM MXT, explicitly define three address spaces and require the physical address to be translated on every DRAM access. This translation typically lies on the critical path of DRAM operations, which can degrade performance unless speculation is employed to partially overlap the access with the address mapping. Speculation itself, however, introduces extra complexity and verification cost. The last challenge is that even if compression is performed at a coarser granularity, such as pages, there is still the problem of locating a compressed line within the page. Because compressed lines may not have uniform sizes, in order to access a line in the middle of a page, the DRAM controller must either compute the prefix sum of all previous cache line sizes, or pre-compute and cache such information in a dedicated structure. This also complicates the design and increases DRAM access latency.
LCP, on the other hand, avoids the above complexities by taking advantage of the following observations. First, cache lines on the same page typically have similar compressibility, since they often store data of the same kind from the same process. It is, in most cases, beneficial to assume that these cache lines can be compressed to similar sizes. The second observation is that both address computation and cache tag generation can be performed with trivial hardware if the mapping is linear, i.e., the hardware address of a compressed line can be decomposed into a base address plus the line index multiplied by a constant. Both observations imply that if we sacrifice some compression efficiency in exchange for uniformity of compressed cache line sizes, the complexity of the resulting design can be significantly lowered.
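The linear-mapping observation can be made concrete with a minimal sketch (my own illustration, not code from the paper): once a page's compressed slot size is fixed, locating any line is just a multiply and an add.

```c
#include <stdint.h>

/* Once a page's compressed slot size is fixed, the in-memory location of
 * line i is simply base + i * slot_size; no prefix sums are needed. */
static inline uint64_t compressed_line_addr(uint64_t page_base,
                                            unsigned line_index,
                                            unsigned slot_size)
{
    return page_base + (uint64_t)line_index * slot_size;
}
```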
Compressed pages in LCP are rounded up to a few pre-defined sizes. The paper suggests using 512B, 1KB, 2KB and 4KB size classes, which correspond to different compression ratios (there is no fixed target ratio). The small set of page size classes also simplifies OS memory management, since the OS can still manage the physical address space as a page buffer pool, except that there are multiple size classes, much like how malloc() works.
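As a rough sketch (the size-class values follow the summary above; the selection helper is my own, hypothetical code), choosing a class is just a lookup for the smallest class that fits:

```c
/* Size classes suggested in the paper; 4KB covers uncompressed pages. */
static const unsigned lcp_size_classes[] = { 512, 1024, 2048, 4096 };

/* Return the smallest size class that can hold `bytes` of compressed data. */
static unsigned lcp_pick_size_class(unsigned bytes)
{
    for (unsigned i = 0; i < sizeof(lcp_size_classes) / sizeof(lcp_size_classes[0]); i++)
        if (bytes <= lcp_size_classes[i])
            return lcp_size_classes[i];
    return 4096;   /* does not compress well: keep a flat 4KB page */
}
```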
We next describe the page layout. Each compressed page, regardless of its size class, consists of three parts. The first part holds the compressed cache lines, linearly mapped into fixed-size slots. The slot size is determined by the compression algorithm and is therefore known once the algorithm is selected. Note that every cache line in the page has a slot in this part; lines that cannot be compressed to fit their slot are classified as “exceptions” and are stored in the overflow area instead, as we will see below. The second part is the metadata field, which contains the index needed to interpret the overflow area. The metadata field consists of a bit field indicating which slots of the overflow area are valid, and an index array storing, for each overflow slot, the offset of the line within the page. Accessing a line requires checking the metadata first, after which the line can be read from either the compressed area or the overflow area. The last part is the overflow area, whose size is determined by the page size class and the compressed cache line size.
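A structural sketch of one LCP page might look like the following; field names and widths are my own illustration (assuming 64B lines in a 4KB page), not the paper's exact encoding:

```c
#include <stdint.h>

#define LINES_PER_PAGE     64   /* 4KB page / 64B cache lines          */
#define MAX_OVERFLOW_SLOTS 16   /* depends on the size class; assumed  */

/* Part 2: metadata interpreting the overflow (exception) area. */
struct lcp_metadata {
    uint32_t slot_valid;                    /* bit i set: overflow slot i holds a line */
    uint8_t  slot_line[MAX_OVERFLOW_SLOTS]; /* line index stored in overflow slot i    */
};

/* Conceptual page layout (sizes vary with the size class):
 *   [ Part 1: LINES_PER_PAGE fixed-size compressed slots ]
 *   [ Part 2: struct lcp_metadata                        ]
 *   [ Part 3: overflow area for exception lines          ]
 */
```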
LCP also requires adding extra information to the page table to help identify the page type and locate the metadata on the page. The paper suggests storing the compressed page size class, the compression algorithm (which implies the compressed cache line size, since most algorithms have one or a few target sizes), and the sub-page base address (since compressed pages can start at a sub-page boundary). This information is acquired by the MMU page walker on a TLB miss and cached in the TLB. Cache requests must also carry the compression type and page type information, and when an LLC miss occurs, this information is sent along to the memory controller.
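A hypothetical sketch of the per-page state cached in the TLB (names and widths are assumptions for illustration):

```c
#include <stdint.h>

struct lcp_tlb_entry {
    uint64_t comp_base;   /* base address of the compressed page, sub-page aligned      */
    uint16_t size_class;  /* 512, 1024, 2048, or 4096 bytes; 4096 = uncompressed here   */
    uint8_t  comp_type;   /* compression algorithm; implies the compressed slot size    */
};
```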
When accessing DRAM, the memory controller operates as follows. First, it locates the page using the address provided by the LLC (the OS explicitly allocates pages in the compressed address space); the size and type of the page are sent to the memory controller in the same request. The controller then computes the address of the metadata field from the page size and type information, and reads the page's metadata. If the intended cache line is stored in the overflow area, indicated by a set bit in the bit field and a matching entry in the index array, the line is read out and sent directly to the LLC. Otherwise, the line is stored in the compressed area, and the memory controller decompresses it using the compression type information in the request. Cache lines are stored uncompressed in the cache hierarchy.
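The read path might look roughly like the sketch below, building on the structs sketched above; dram_read(), metadata_addr(), overflow_addr(), slot_size_for() and decompress() are hypothetical helpers standing in for memory-controller internals.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers (declarations only). */
void     dram_read(uint64_t addr, void *buf, size_t len);
uint64_t metadata_addr(const struct lcp_tlb_entry *pte);
uint64_t overflow_addr(const struct lcp_tlb_entry *pte, unsigned slot);
unsigned slot_size_for(uint8_t comp_type);
int      decompress(uint8_t type, const uint8_t *in, unsigned len, uint8_t out[64]);

/* Read one 64B cache line from a compressed page. */
int lcp_read_line(const struct lcp_tlb_entry *pte, unsigned line, uint8_t out[64])
{
    struct lcp_metadata md;
    dram_read(metadata_addr(pte), &md, sizeof(md));          /* may hit the metadata cache */

    for (unsigned s = 0; s < MAX_OVERFLOW_SLOTS; s++) {
        if (((md.slot_valid >> s) & 1u) && md.slot_line[s] == line) {
            dram_read(overflow_addr(pte, s), out, 64);        /* exception: stored uncompressed */
            return 0;
        }
    }

    uint8_t  slot[64];
    unsigned slot_size = slot_size_for(pte->comp_type);
    dram_read(pte->comp_base + (uint64_t)line * slot_size, slot, slot_size);
    return decompress(pte->comp_type, slot, slot_size, out);  /* regular case */
}
```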
To avoid complicating cache tag computation, all levels in the cache hierarchy use a tagging scheme consistent with the compressed page, rather than the uncompressed physical addresses produced by an unmodified TLB translation. The MMU is thus modified to output the base address of the compressed page for each memory request. Note that although the line offset within the page is not changed by LCP, the in-DRAM location of a compressed line no longer equals the page base plus that offset. The tagging scheme, therefore, uses the combination of the compressed page base address and the offset of the line within the page.
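As a purely illustrative sketch (my interpretation, not the paper's exact encoding), the tag can be formed by concatenating the compressed page base with the line index inside the page:

```c
#include <stdint.h>

/* Tag a 64B line by its compressed-page base plus its index within the
 * 4KB page (64 lines per page); the concatenation below is illustrative. */
static inline uint64_t lcp_cache_tag(uint64_t comp_page_base, unsigned line_index)
{
    return (comp_page_base << 6) | (line_index & 0x3f);
}
```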
To avoid the extra metadata lookup during DRAM access, the paper proposes adding a metadata cache to the memory controller, which caches the metadata fields of the most recently accessed pages. Experiments show that this cache has a high hit rate for most workloads, and can thus eliminate most metadata accesses, further saving DRAM bandwidth.
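A minimal sketch of such a cache, reusing struct lcp_metadata from the layout sketch and assuming a direct-mapped organization indexed by the compressed page base (the organization and capacity are my assumptions, not the paper's):

```c
#include <stdint.h>
#include <stddef.h>

#define MD_CACHE_ENTRIES 512   /* assumed capacity */

struct md_cache_entry {
    int                 valid;
    uint64_t            page_base;
    struct lcp_metadata md;        /* copy of the page's metadata field */
};

static struct md_cache_entry md_cache[MD_CACHE_ENTRIES];

/* Return cached metadata for the page, or NULL on a miss. */
static struct lcp_metadata *md_cache_lookup(uint64_t page_base)
{
    struct md_cache_entry *e = &md_cache[(page_base >> 12) % MD_CACHE_ENTRIES];
    return (e->valid && e->page_base == page_base) ? &e->md : NULL;
}
```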
When a page is initially created or read from disk (if it was previously swapped out), the memory controller first attempts to compress the page with each of the programmed compression algorithms, and then selects the algorithm that yields the best compression ratio. The page size class is chosen as the smallest class that can accommodate the compressed page. When a cache line is written back, the memory controller first attempts to compress the line with the page's current compression algorithm. If the line still fits into its compressed slot, the controller simply writes it to the slot (invalidating the line's overflow slot, if one was in use). Otherwise, if the line already has an overflow slot, the controller writes it there. If the line does not yet have an overflow slot, meaning it was previously stored in the regular compressed part, the controller allocates a slot from the overflow area, sets the corresponding valid bit, records the line's index, and then writes the line into the overflow slot.
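A rough sketch of this write-back path, again building on the earlier structs; compress(), dram_write(), find_or_alloc_overflow_slot() and clear_overflow_mapping() are hypothetical helpers, and the overflow-full case is deferred to the next paragraph.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers (declarations only). */
unsigned compress(uint8_t type, const uint8_t in[64], uint8_t *out);
void     dram_write(uint64_t addr, const void *buf, size_t len);
int      find_or_alloc_overflow_slot(struct lcp_metadata *md, unsigned line);
void     clear_overflow_mapping(struct lcp_metadata *md, unsigned line);
uint64_t overflow_addr(const struct lcp_tlb_entry *pte, unsigned slot);
unsigned slot_size_for(uint8_t comp_type);

/* Write back one 64B line into its compressed page. Returns -1 when the
 * overflow area is full and OS involvement is needed (see below). */
int lcp_writeback_line(const struct lcp_tlb_entry *pte, unsigned line,
                       const uint8_t data[64], struct lcp_metadata *md)
{
    uint8_t  buf[64];
    unsigned slot_size = slot_size_for(pte->comp_type);
    unsigned n = compress(pte->comp_type, data, buf);

    if (n <= slot_size) {                                   /* still fits its slot */
        dram_write(pte->comp_base + (uint64_t)line * slot_size, buf, n);
        clear_overflow_mapping(md, line);                   /* release any old overflow slot */
        return 0;
    }

    int s = find_or_alloc_overflow_slot(md, line);          /* reuse or allocate a slot */
    if (s < 0)
        return -1;                                          /* overflow area full */
    md->slot_valid |= 1u << s;
    md->slot_line[s] = (uint8_t)line;
    dram_write(overflow_addr(pte, s), data, 64);            /* exceptions stored uncompressed */
    return 0;
}
```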
If the overflow area is full and a new line needs to be inserted into it, the memory controller raises an exception to the MMU, which in turn interrupts the OS. The OS either migrates the page to a larger size class to provide more overflow slots, or, if the old page is already 2KB, stops compressing the page and instructs the memory controller to expand it into a full, flat 4KB page.
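A sketch of the OS-side decision, with hypothetical helper names (os_recompress_page(), os_expand_to_flat_page(), next_size_class()):

```c
#include <stdint.h>

/* Hypothetical OS helpers (declarations only). */
void     os_recompress_page(uint64_t vaddr, unsigned new_size_class);
void     os_expand_to_flat_page(uint64_t vaddr);
unsigned next_size_class(unsigned size_class);

/* Handle an LCP overflow exception for the page containing vaddr. */
void lcp_overflow_handler(uint64_t vaddr, const struct lcp_tlb_entry *pte)
{
    if (pte->size_class < 2048)
        os_recompress_page(vaddr, next_size_class(pte->size_class)); /* more overflow room */
    else
        os_expand_to_flat_page(vaddr);  /* already 2KB: give up and use a flat 4KB page */
}
```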
LCP also takes advantage of prefetching when the memory controller reads a line from DRAM. The observation is that existing bus protocols always transfer the same amount of data per transaction, which is fixed to the size of an uncompressed cache line; a single transfer can therefore carry several adjacent compressed lines. LCP leverages this to prefetch adjacent cache lines into the LLC, improving the hit rate for applications with good locality. To further increase the chance that a prefetched line will actually be used, a stream prefetcher may be consulted at the LLC level to discard lines that are predicted not to be used, avoiding cache pollution.
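For concreteness (the numbers below are illustrative assumptions, not from the paper): with 64B bus transfers and 16B compressed slots, each transfer brings in four adjacent lines.

```c
#include <stdio.h>

int main(void)
{
    const unsigned transfer_bytes = 64;  /* per-transaction bus transfer (one uncompressed line) */
    const unsigned slot_bytes     = 16;  /* example compressed slot size                         */
    printf("compressed lines per transfer: %u\n", transfer_bytes / slot_bytes);  /* prints 4 */
    return 0;
}
```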