Transparent Dual Memory Compression Architecture



Paper Title: Transparent Dual Memory Compression Architecture
Link: https://ieeexplore.ieee.org/document/8091246
Year: PACT 2017
Keyword: Compression; Memory Compression; DCM; Dual Memory Compression




Highlights:

  1. Using two compression schemes to leverage access locality; addresses with low access probability are compressed with the more heavyweight algorithm

  2. Not using overflow space is an improvement over LCP. This optimization results in a lower compression ratio, but that is acceptable here because LCP is only used for the small fraction of recently accessed data

Questions:

  1. This paper is poorly written, both syntactically and grammatically. Terms are used before being introduced (e.g. “slice controller” in Fig. 5). I would also suggest that the authors use fewer instances of “the” in the text.

  2. The design applies to conventional DRAM without any problem. Why use a separate section to describe HMC? I know Samsung has published many research papers on HMC, but this paper is really not a good place to advertise it.

  3. The mapping table translates at 1KB granularity, but DMC compresses 32KB ranges with LCP. The resulting table structure allows multiple entries to map to the same base address. How are these duplicated entries cached (caching all of them may waste space; if only the first one is cached, how is the size of the mapping known before looking into the compression metadata)? Or are there two caches for the different granularities?

  4. The paper mentions free-list allocation, but does not discuss what happens when one free list is depleted while other lists still have free blocks. Are chunks split or merged as in a buddy system? Or is allocation simply declared a failure (which is hardly feasible, since the size classes are not used equally, not even roughly)?

This paper proposes Transparent Dual Memory Compression (DMC), a compressed main memory design that uses two separate compression algorithms to optimize for both space efficiency and access latency. The paper makes the observation that most cache and memory compression algorithms are designed for low decompression latency, since decompression is often on the critical access path. These algorithms, however, achieve a lower compression ratio than classical streaming compression algorithms designed for large files, such as LZ. The streaming algorithms, on the other hand, perform badly on small blocks. In addition, the paper observes that the transparency of the compression scheme affects system design. For example, in a system where the OS explicitly manages the compressed address space, the translation between VA and the compressed address space must be installed by the OS and performed by the MMU. This design has a few disadvantages. First, whenever the compression status of a page changes, e.g. when the page migrates to a different address, which is not uncommon given recompression and compaction, a TLB shootdown must be issued to keep the entry synchronized in all local TLBs. Second, since the physical address space is no longer linear and the translation must be performed by the MMU, bus devices that lack an MMU, such as DMA devices, cannot access compressed memory, because they have no way of translating an uncompressed PA into a compressed PA. Third, OS-managed compressed pages conflict with other virtual memory techniques, such as huge pages, for which the compression overhead is prohibitive.

DMC is an OS-transparent, dual-algorithm compression scheme that aims for both ease of deployment and a high compression ratio with low latency. Instead of applying a single algorithm to every block, DMC classifies memory blocks into cold and hot blocks based on the access locality of the working set. Cold blocks are compressed with the slower but better compression algorithm, LZ77, while hot blocks are compressed with the faster but “worse” algorithm, LCP, to enable fast access to individual cache blocks. Bus transactions, including the OS’s memory accesses, assume an uncompressed physical address space. Translation between uncompressed and compressed physical addresses is performed by the memory controller using an in-memory mapping table. A translation cache is also added to filter out most accesses to the mapping table.

DMC works as follows. The mapping table resides at a static location (not specified, but preferably address zero) in physical DRAM, which the memory controller does not map into the physical address space visible to the OS. The memory controller handles both address mapping and the view of the memory subsystem presented to the OS at initialization. The paper suggests that a compression ratio can be assumed at system startup time. The amount of usable physical memory is then computed from this preliminary compression ratio and reported to the OS. During execution, if the monitored compression ratio deviates from the initial assumption, the usable physical memory is recalculated and reported to the OS again. Modern OSs are already equipped with modules for hot-swapping DRAM, which can handle such size changes. For example, to deal with a shrinking physical address space, the OS can install a balloon driver that requests physical memory on behalf of a process, causing the process’s inactive pages to be paged out via normal virtual memory management; the released physical pages are then removed from the allocatable page pool. Conversely, if the address space is inflating, the OS simply adds more usable pages to the page pool. The mapping table translates physical addresses at 1KB granularity, although data can be compressed at a larger, still aligned granularity. For example, DMC compresses 1KB blocks with LZ77, while with LCP it compresses 32KB contiguous ranges, which is larger than the original LCP proposal (4KB). The mapping table is directly indexed by a bit slice of the requested address, but the paper does not elaborate on how the table supports two different translation granularities at the same time.
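
To make the translation step concrete, here is a minimal behavioral sketch of one plausible table layout, assuming a flat array indexed at 1KB granularity. The entry fields (comp_base, comp_type, size_class) and the idea that 32 consecutive LCP entries share one base address are my guesses at a reasonable layout, not the paper's format.

```python
from dataclasses import dataclass

PAGE = 1024                       # 1KB translation granularity

@dataclass
class MapEntry:
    comp_base: int                # base address in the compressed physical space
    comp_type: str                # "LCP" (hot data) or "LZ77" (cold data)
    size_class: int               # size in bytes of the allocated compressed block

# Flat mapping table: one entry per 1KB of uncompressed physical address space.
# For LCP, 32 consecutive entries would presumably share one 32KB base address
# (which is exactly what question 3 above is asking about).
mapping_table: list[MapEntry] = []

def translate(upa: int) -> tuple[MapEntry, int]:
    """Map an uncompressed physical address to its table entry and block offset."""
    return mapping_table[upa // PAGE], upa % PAGE
```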

To accelerate mapping table access, a metadata cache is added to the controller, organized like a conventional TLB. Extra bits are added per entry to store the compression type and compression metadata. On a memory request, the cache is checked first. On a miss, the entry is fetched from the mapping table, after evicting an existing entry if necessary. If the entry hit is an LCP entry, the cache line is read directly from the corresponding offset and decompressed before being sent to the upper level. If the entry is an LZ77 entry, the handling is a little more complicated. First, the block is decompressed and the requested line is sent to the upper level, as in the previous case. Second, the memory controller also converts this block and its 31 neighboring blocks in the same 32KB LCP range to the LCP type by decompressing and recompressing them. More than one block is converted in this second step because of the difference in compression granularity between the two types.
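
A rough sketch of that read path follows, reusing MapEntry, PAGE, and mapping_table from the previous snippet. The metadata cache is modeled as a plain dict, and the codec helpers (lz77_decompress, lcp_fetch_line, promote_range_to_lcp) are stand-ins for hardware the paper does not detail.

```python
LCP_RANGE = 32 * 1024                         # 32KB LCP compression range
metadata_cache: dict[int, MapEntry] = {}      # keyed by 1KB frame number

def lz77_decompress(entry: MapEntry) -> bytes:               # stand-in for the real codec
    return b"\0" * PAGE

def lcp_fetch_line(entry: MapEntry, offset: int) -> bytes:   # stand-in
    return b"\0" * 64

def promote_range_to_lcp(first_frame: int) -> None:
    """Decompress the 32 LZ77 blocks of a 32KB range, then recompress them as LCP."""
    buf = b"".join(lz77_decompress(mapping_table[f])
                   for f in range(first_frame, first_frame + LCP_RANGE // PAGE))
    # ...recompress buf with LCP and update the 32 table entries...

def read_line(upa: int) -> bytes:
    frame = upa // PAGE
    entry = metadata_cache.get(frame)
    if entry is None:                         # metadata-cache miss
        entry = mapping_table[frame]          # fetch from the in-memory table
        metadata_cache[frame] = entry         # (eviction policy omitted)

    if entry.comp_type == "LCP":              # hot path: one line, fast codec
        return lcp_fetch_line(entry, upa % PAGE)

    # Cold path: serve the request, then convert the enclosing 32KB range to LCP.
    data = lz77_decompress(entry)
    promote_range_to_lcp((upa // LCP_RANGE) * (LCP_RANGE // PAGE))
    off = (upa % PAGE) & ~0x3F                # 64B cache-line offset within the block
    return data[off:off + 64]
```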

One extra bit per 512KB range is added to track recently accessed address ranges. Choosing 512KB as the access-tracking granularity balances metadata cost against recompression overhead. Execution is divided into epochs consisting of hundreds of millions of instructions. Within an epoch, the tracking bit is set if the corresponding 512KB range is accessed. At the end of an epoch, the memory controller scans the bit vector and selects the ranges whose bit is still zero for recompression. Recompression first gathers all LCP pages in the 512KB range; this may involve accessing several scattered memory locations, since blocks in the same range are likely not stored together. These LCP pages are then decompressed into a 512KB buffer. In the last step, the buffer is recompressed with LZ77 for a higher compression ratio. Decompression and recompression can also be pipelined to increase recompression throughput. Conversion from LZ77 to LCP is performed when an LZ77 block is accessed, as discussed above: 32 LZ77 blocks are gathered from compressed memory and decompressed into a buffer before recompression happens.
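
A small sketch of the epoch mechanism, under the assumption of a simple Boolean array of tracking bits; MEM_SIZE and the recompression helper are placeholders, not values from the paper.

```python
MEM_SIZE = 4 * 1024**3              # assumed 4GB of physical DRAM (not from the paper)
TRACK_RANGE = 512 * 1024            # access-tracking granularity

accessed = [False] * (MEM_SIZE // TRACK_RANGE)    # one bit per 512KB range

def note_access(upa: int) -> None:
    """Set the tracking bit when any address in a 512KB range is touched."""
    accessed[upa // TRACK_RANGE] = True

def recompress_with_lz77(base: int) -> None:      # stand-in for the real routine
    # Gather the range's LCP pages, decompress them into a 512KB buffer,
    # then recompress the buffer with LZ77 (the two steps can be pipelined).
    pass

def end_of_epoch() -> None:
    """Epoch-end scan: ranges left untouched for the whole epoch are demoted."""
    for i, touched in enumerate(accessed):
        if not touched:
            recompress_with_lz77(i * TRACK_RANGE)
        accessed[i] = False             # clear for the next epoch
```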

The memory controller maintains free memory in free lists, which are initialized at system startup time. The paper suggests that blocks of 64B to 1KB be used for LZ77, and blocks of 2KB to 32KB for LCP. No further details on the free lists are given in the paper.
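
Given how little the paper says, the following is only a guess at what the size-class free lists might look like; the fallback to a larger class when the matching list is empty is exactly the speculation raised in question 4 above.

```python
# Power-of-two size classes: 64B-1KB blocks back LZ77 data, 2KB-32KB blocks back LCP.
SIZE_CLASSES = [64 << i for i in range(10)]            # 64, 128, ..., 32768 bytes
free_lists: dict[int, list[int]] = {sz: [] for sz in SIZE_CLASSES}

def alloc(size: int) -> int | None:
    """Return the compressed-space address of a free block of at least `size` bytes."""
    for sz in SIZE_CLASSES:
        if sz >= size and free_lists[sz]:
            return free_lists[sz].pop()
        # The paper does not say what to do when the matching class is empty;
        # falling through to a larger class (as this loop does) is one guess.
    return None                                         # out of compressed memory

def free(addr: int, size: int) -> None:
    free_lists[size].append(addr)
```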

The paper also proposes three optimizations at the end. First, zero cache lines can be optimized away by storing a bit vector in the translation entry; a set bit indicates that the corresponding block is all zeros and is not backed by actual storage. Second, DMC puts an upper limit on the number of recompressions allowed within a time window, to prevent this process from consuming system bandwidth and degrading performance. The last optimization is constant monitoring of performance, turning DMC off when it provides no performance advantage. The memory controller tracks the number of translation cache misses, recompressions, hits on LCP and LZ77 pages, and so on, and estimates CPI with and without compression. If the CPI with compression is higher, DMC is turned off, since it is hurting performance.
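
The paper does not give its CPI model, so the sketch below only shows the shape of that decision, with made-up event counters and an assumed per-event cycle cost table.

```python
from dataclasses import dataclass

@dataclass
class EpochStats:                    # per-epoch counters the controller would track
    mdcache_misses: int = 0
    recompressions: int = 0
    lcp_hits: int = 0
    lz77_hits: int = 0
    instructions: int = 1

def should_disable_dmc(s: EpochStats, cost: dict[str, float]) -> bool:
    """Disable DMC when the estimated CPI with compression exceeds the CPI without.
    `cost` maps each event type to an assumed cycle penalty (values are hypothetical)."""
    extra = (s.mdcache_misses * cost["mdcache_miss"]
             + s.recompressions * cost["recompress"]
             + s.lcp_hits * cost["lcp_decompress"]
             + s.lz77_hits * cost["lz77_decompress"])
    cpi_without = cost["base_cpi"]
    cpi_with = cpi_without + extra / s.instructions
    return cpi_with > cpi_without
```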