Enabling Transparent Memory Compression on Commodity Memory
Link: https://ieeexplore.ieee.org/document/8675200
Year: HPCA 2019
Keyword: Memory Compression
This paper presents Practical and Transparent Memory Compression (PTMC), an OS-transparent DRAM compression scheme that runs on commodity DRAM modules with low metadata overhead. The paper points out that prior DRAM compression proposals suffer from the following overheads and difficulties. First, in order to locate a cache-line-sized block in the DRAM module given a physical address, the compression scheme needs to translate the physical address into the actual address that stores the compressed line, since a compressed line may not be stored in its home location. Depending on the degree of associativity (i.e., the number of possible locations in which a block may be stored), the translation is performed using a translation table of varying complexity. The translation table inevitably incurs two types of overhead. The first is bandwidth overhead, since each DRAM access must first consult the table to determine the location of the line. As the paper points out, this bandwidth overhead may offset the bandwidth benefit brought by compression. The second is storage overhead, especially when the compressed working set is large. The paper estimates that even if each line needs only 1 bit of table storage (the bare minimum, which is the case for 2-way associative placement schemes), the total table size can still be as large as 32MB (for instance, 16GB of memory holds 2^28 64-byte lines, and 2^28 bits is 32MB). These two types of overhead make the translation scheme clumsy to deploy and use.
The second challenge is OS involvement. Some designs do not use a dedicated translation table; instead, they delegate the task of mapping uncompressed physical addresses to DRAM addresses to the OS, and let the OS determine the mapping, which is then written into the page table and cached in the TLB. In such designs, the OS must be notified on every compression status change, and the affected TLB entries must be modified. This introduces several issues such as TLB coherence, OS overhead, and software compatibility.
The last challenge is to implement a compression scheme suitable for commodity DRAM. Some proposals change the DRAM access protocol to allow smaller or larger blocks to be bursted. This requires not only a redesign of the DRAM controller, but may also introduce compatibility issues for commodity systems.
The goal of PTMC is to reduce memory bandwidth consumption by taking advantage of free prefetching. If two cache lines can be compressed and co-located in the same block, then when either of them is read using the normal access protocol, the other is fetched to the LLC for free, achieving the effect of nearby-block prefetching. The paper suggests that the prefetched block only be inserted into the LLC to avoid disrupting locality in the upper-level caches. Note that PTMC does not attempt to conserve memory capacity by utilizing the slots freed by compression; instead, marker values are stored in these unused slots so that reads can be redirected correctly, as we will see below.
Cache blocks in PTMC have limited associativity to reduce the complexity of address translation. Starting from address zero, every four consecutive blocks form a group. There are three cases. In the first case, neither blocks 0 and 1 nor blocks 2 and 3 can be compressed into a single block. In this case, all four blocks are stored uncompressed in their home locations, and no address mapping is required. In the second case, blocks 0 and 1, or blocks 2 and 3, or both pairs, can each be stored in a single slot after compression. In this case, compressed blocks 0 and 1 are stored in the home address of block 0, and/or compressed blocks 2 and 3 in the home address of block 2; the remaining blocks, if any, are stored uncompressed in their home locations. In the last case, all four blocks can be compressed into a single slot, and they are all stored in the home address of block 0.
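To make the three cases concrete, the following C sketch (my own reconstruction, not code from the paper) decides the layout of a four-block group from the compressed sizes of its blocks; the 60-byte payload limit comes from the in-line marker described later.

```c
/* Layout decision for a four-block group (a reconstruction, not the
 * paper's code). Slot i is the home location of block i; each slot offers
 * 60 bytes of payload because the last 4 bytes hold an in-line marker.
 * Compressed sizes are treated as additive, which is a simplification. */
#define SLOT_PAYLOAD 60

enum layout { UNCOMPRESSED, PAIRWISE, FOUR_TO_ONE };

enum layout choose_layout(const int csize[4]) /* compressed size per block */
{
    if (csize[0] + csize[1] + csize[2] + csize[3] <= SLOT_PAYLOAD)
        return FOUR_TO_ONE;   /* all four blocks go to block 0's home slot */
    if (csize[0] + csize[1] <= SLOT_PAYLOAD ||
        csize[2] + csize[3] <= SLOT_PAYLOAD)
        return PAIRWISE;      /* 0+1 into slot 0 and/or 2+3 into slot 2 */
    return UNCOMPRESSED;      /* every block stays in its home slot */
}
```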
All blocks in the group except block 3 have at most two possible locations, meaning that the block can always be found with at most two probes: one to the home address, the other to the alternate address. Block 0 can only be stored in its home address. Block 1 can be stored in either slot 0 (2:1 or 4:1 compression) or slot 1. Block 2 can be stored in either slot 0 (4:1 compression) or slot 2. Block 3, however, has three possible locations. Besides its home location, if blocks 2 and 3 are compressed together (2:1 compression), then block 3 is stored in slot 2; in the case of 4:1 compression, block 3 is found in slot 0.
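The candidate locations can be summarized as a small lookup table (again a sketch, not from the paper):

```c
/* Candidate slots per block, derived from the layout rules above;
 * -1 marks an unused entry. Only block 3 has three candidates. */
static const int candidate_slots[4][3] = {
    { 0, -1, -1 },  /* block 0: home slot only                      */
    { 1,  0, -1 },  /* block 1: home, or slot 0 (2:1 or 4:1)        */
    { 2,  0, -1 },  /* block 2: home, or slot 0 (4:1)               */
    { 3,  2,  0 },  /* block 3: home, slot 2 (2:1), or slot 0 (4:1) */
};
```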
In order to find block 3 with at most two probes, the paper proposes that we store an invalid block marker if a slot is unused due to compression. For example, in the case of 4:1 compression, all slots expect slot 0 will be written an invalid marker, which is a 64 byte sequence randomly generated at boot time (collisions are dealt with in later paragraphs). With invalid marker, block 3 can be located with at most two probes. We always probe slot 2 first. If slot 2 contains an invalid marker, then it must be a 4:1 compression, and block 3 can be found in slot 0. Otherwise, if slot 2 contains an uncompressed block, we know block 3 must reside in slot 3, since otherwise it would be stored in slot 3 in compressed form. If slot 2 contains compressed blocks, then it must be the case that block 2 and 3 are compressed together in slot 2. Although the paper suggests that a predictor always be used to determine the order of access, this cannot be applied to block 3 in each group, since we may need three probes in order to read the block correctly in the worst case. Another improvement is to have two different types of invalid markers. The first type indicates 2:1 compression has been applied to the slot data, and the second type indicates 4:1 compression. This way, we can also always figure out the correct slot to read using the predictor within two probes.
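A sketch of the slot-2-first lookup for block 3 follows; the marker names are hypothetical stand-ins for the boot-time patterns described above and below:

```c
#include <stdint.h>
#include <string.h>

/* Boot-time 64-byte invalid marker for 4:1 compression, and the 4-byte
 * in-line marker of a 2:1-compressed slot (hypothetical names). */
extern const uint8_t INVALID_4TO1[64];
extern const uint8_t MARKER_2TO1[4];

/* Resolve block 3 in at most two probes by inspecting slot 2 first.
 * slot[] holds the four 64-byte slots of the group. */
int locate_block3(const uint8_t slot[4][64])
{
    const uint8_t *s2 = slot[2];

    if (memcmp(s2, INVALID_4TO1, 64) == 0)
        return 0;  /* 4:1 compression: block 3 lives in slot 0     */
    if (memcmp(s2 + 60, MARKER_2TO1, 4) == 0)
        return 2;  /* blocks 2 and 3 compressed together in slot 2 */
    return 3;      /* slot 2 uncompressed: block 3 is in its home  */
}
```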
In order to reduce the number of probes, especially since the extra probe can severely reduce effective bandwidth, PTMC uses a predictor to determine which address to probe first (the home address or the alternate address; for block 3, the home address or slot 2). The underlying conjecture is that in most cases, cache lines on the same page have similar compressibility. The predictor therefore consists of a table of 2-bit entries recording the last observed compression status. The physical address is hashed to one of the entries, and the predicted location is accessed first. If the prediction is wrong, the table is updated and the alternate location is accessed. The paper reports that the predictor has an accuracy of 98%, which suffices to eliminate the second probe in most cases.
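A minimal sketch of such a predictor, assuming a table size and page-granularity hash of my own choosing (the paper specifies only a hashed table of 2-bit entries):

```c
#include <stdint.h>

#define PRED_ENTRIES 4096               /* table size is an assumption */

static uint8_t pred[PRED_ENTRIES];      /* 2-bit compression status */

static unsigned pred_index(uint64_t paddr)
{
    /* Hash on the page address so lines in a page share an entry. */
    return (unsigned)((paddr >> 12) % PRED_ENTRIES);
}

/* The predicted status selects the first probe location. */
uint8_t predict_status(uint64_t paddr)
{
    return pred[pred_index(paddr)];
}

/* On a misprediction, remember the status actually observed in DRAM. */
void update_predictor(uint64_t paddr, uint8_t actual_status)
{
    pred[pred_index(paddr)] = actual_status & 0x3;
}
```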
PTMC uses in-line markers to indicate the compression status of a line that contains valid data (unused slots are filled with the invalid marker discussed above). The in-line marker is a 4-byte word of a specific pattern, also generated at boot time. Because the marker occupies the last 4 bytes of the slot, the effective storage for holding compressed lines is reduced to 60 bytes, which, according to the paper, only marginally decreases the opportunity for compression. When two or four lines are compressed into the same 60-byte block, the marker corresponding to the compression ratio (4:1 or 2:1) is written at the end of the block. When the block is accessed, the DRAM controller first checks whether the block is compressed, and then decompresses it using the corresponding algorithm.
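The slot layout can be sketched as follows; the marker arrays are hypothetical stand-ins for the boot-time patterns:

```c
#include <stdint.h>
#include <string.h>

/* Boot-time 4-byte in-line marker patterns (hypothetical names). */
extern const uint8_t MARKER_2TO1[4], MARKER_4TO1[4];

/* Write a compressed slot: up to 60 bytes of compressed payload, then the
 * in-line marker identifying the compression ratio in the last 4 bytes. */
void write_compressed_slot(uint8_t slot[64], const uint8_t *payload,
                           size_t payload_len, int four_to_one)
{
    memcpy(slot, payload, payload_len);              /* payload_len <= 60 */
    memset(slot + payload_len, 0, 60 - payload_len); /* pad the remainder */
    memcpy(slot + 60, four_to_one ? MARKER_4TO1 : MARKER_2TO1, 4);
}
```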
Ordinary, uncompressed data may happen to collide with a marker by having its last four bytes match the marker pattern. Although the paper claims that the chance of this is very low for randomly generated markers, it must be handled properly to avoid data corruption. The paper proposes that whenever a cache line is evicted from the LLC, the controller checks whether its last four bytes match one of the two markers. If so, the bits of the line are inverted, and the address of the line is inserted into a small on-chip table. When the line is later read from DRAM, its physical address is checked against the table; on a hit, the content read from DRAM is inverted to restore the logical value. Given that collisions are rare, the table need not be large, and the paper suggests that 16 entries suffice. In the very rare case of a table overflow, the DRAM controller may dump the table into a memory-mapped area to ensure correctness.
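A sketch of the collision-handling path, with the 16-entry table size taken from the paper and the rest assumed:

```c
#include <stdint.h>
#include <string.h>

#define INVERT_TABLE_SIZE 16            /* per the paper */

static uint64_t invert_table[INVERT_TABLE_SIZE];
static int      invert_count;

extern const uint8_t MARKER_2TO1[4], MARKER_4TO1[4];

static int matches_marker(const uint8_t line[64])
{
    return memcmp(line + 60, MARKER_2TO1, 4) == 0 ||
           memcmp(line + 60, MARKER_4TO1, 4) == 0;
}

/* On LLC eviction: invert colliding lines and remember their address. */
void on_evict(uint64_t paddr, uint8_t line[64])
{
    if (matches_marker(line)) {
        for (int i = 0; i < 64; i++)
            line[i] = (uint8_t)~line[i];
        if (invert_count < INVERT_TABLE_SIZE)
            invert_table[invert_count++] = paddr;
        /* else: overflow; the paper suggests spilling the table to a
         * memory-mapped area. */
    }
}

/* On DRAM read: restore the logical value if the address was recorded. */
void on_fill(uint64_t paddr, uint8_t line[64])
{
    for (int i = 0; i < invert_count; i++)
        if (invert_table[i] == paddr) {
            for (int j = 0; j < 64; j++)
                line[j] = (uint8_t)~line[j];
            return;
        }
}
```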
PTMC also modifies the replacement behavior of the LLC by forcing all cached lines that share a compressed slot to be evicted when any line in the slot is evicted. In the worst case, one LLC eviction may trigger three more evictions, which slightly reduces the locality we can exploit, since these are adjacent lines in the physical address space. Note that this “gang eviction” does not necessarily cost more bandwidth, as long as the group can still be compressed and stored in the same slot. If, however, the compression ratio decreases after the eviction from 4:1 to 2:1 or even to 1:1 (uncompressed), more than one slot needs to be updated, worsening bandwidth consumption. “Gang eviction” also avoids a read-modify-write sequence when evicting a line belonging to a compressed group, since it maintains the invariant that the lines sharing a slot are either all present in the cache or all absent, so a write-back never needs to first read the missing lines from DRAM.
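A sketch of gang eviction, with hypothetical LLC hooks (llc_contains, shares_slot, llc_evict) standing in for the cache controller's internals:

```c
#include <stdint.h>

/* Hypothetical LLC hooks (not from the paper). */
int  llc_contains(uint64_t paddr);
int  shares_slot(uint64_t a, uint64_t b); /* compressed into the same slot? */
void llc_evict(uint64_t paddr);

/* When any line of a compressed slot is evicted, evict its co-compressed
 * neighbors too, so the slot can be rewritten without first reading the
 * missing lines back from DRAM. */
void gang_evict(uint64_t victim)
{
    uint64_t group_base = victim & ~(uint64_t)(4 * 64 - 1); /* 4-line group */

    for (int i = 0; i < 4; i++) {
        uint64_t neighbor = group_base + (uint64_t)i * 64;
        if (neighbor != victim && llc_contains(neighbor) &&
            shares_slot(neighbor, victim))
            llc_evict(neighbor);
    }
    llc_evict(victim);
}
```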
PTMC may decrease performance by turning clean evictions into memory writes when the evicted line is clean but compressible. In addition, the gang eviction mechanism introduces extra writes when the evicted group can no longer be stored in the same slot after compression. Both incur extra DRAM traffic, which may decrease system throughput. To deal with such degradation, the paper proposes Dynamic-PTMC, which leverages set sampling to decide whether PTMC is turned on or off for a period of time. 1% of the LLC sets always use PTMC, and they monitor events that increase or decrease performance. The remaining 99% of the sets turn PTMC on or off depending on the observed behavior of the 1% sample. When a bandwidth-benefit event occurs (e.g., a prefetched line is hit), a 12-bit saturating counter is incremented; when a bandwidth-cost event occurs, the same counter is decremented. Whether PTMC is enabled depends solely on whether the MSB of the counter is 1 or 0. The paper reports that the dynamic scheme keeps performance loss under 1% in all benchmarks, even those with low locality and little cache line reuse.
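A sketch of the decision counter; the 12-bit width and MSB rule come from the paper, while the event hooks and initial value are assumptions:

```c
#include <stdint.h>

#define CTR_MAX 0xFFF                 /* 12-bit saturating counter */

static uint16_t ctr = 0x800;          /* start at the midpoint (assumption) */

void on_bandwidth_benefit(void)       /* e.g., hit on a freely prefetched line */
{
    if (ctr < CTR_MAX) ctr++;
}

void on_bandwidth_cost(void)          /* e.g., extra write caused by compression */
{
    if (ctr > 0) ctr--;
}

int ptmc_enabled(void)
{
    return (ctr >> 11) & 1;           /* MSB of the 12-bit counter */
}
```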