Designing Hybrid DRAM/PCM Main Memory Systems Utilizing Dual-Phase Compression



Paper Title: Designing Hybrid DRAM/PCM Main Memory Systems Utilizing Dual-Phase Compression
Link: https://dl.acm.org/doi/10.1145/2658989
Year: ACM Transactions 2014
Keyword: Compression; NVM; DRAM Cache




Highlight:

  1. Use different compression algorithms for data with different access frequencies and latencies. For frequently accessed data stored in the low-latency device, a low-latency algorithm is used for faster delivery. For less frequently accessed data, a more complicated algorithm is used for a better compression ratio, at the cost of more cycles. The argument is that NVM access latency is also larger, so the larger decompression latency has only a minor effect on performance.

  2. The low-latency compression algorithm only compresses adjacent identical words, which allows the critical word to be delivered first, a nice feature to have since it further reduces read latency.

  3. The eviction policy uses a multi-stage decision process, considering cache line size, dirtiness, and LRU position in that order.

  4. The local wear-leveling scheme shifts the compressed data (less than 64 bytes) within a line slot to evenly distribute writes over all cells.

  5. The global wear-leveling scheme is based on page migration, and is built into the OS’s virtual memory system, which saves an extra hardware mapping structure.

Questions

  1. What I cannot get is this: if smaller lines are favored for eviction, then multiple lines may be evicted for a single insertion, which hurts effective capacity, as the paper itself claims. However, this also does no good to the NVM, since more data will be written back as well. Why does the paper claim this is good for the NVM?

This paper proposes Dual-Phase Compression (DPC), a novel NVM-based memory architecture featuring memory compression with lower access latency and better NVM lifetime. The paper points out that previous proposals using memory compression face a few issues. First, these proposals usually radically modify the existing cache and memory hierarchy with specialized hardware, which brings compatibility and hardware cost challenges. Second, memory compression increases main memory access latency, sometimes significantly, which can become problematic when the compression algorithm is a general-purpose one. Third, most previous proposals only optimize for either bandwidth or endurance of the NVM device, but not both. Lastly, these proposals may also introduce undesirable components, such as mapping tables for address translation, which can potentially degrade performance and complicate the design.

DPC addresses the above issues with the following design features. First, DPC is a purely modularized design: the only components modified are the DRAM controller and the NVM controller, which keeps it transparent to the processor and the cache hierarchy. In addition, it relies on a general-purpose NVM architecture without any specialized hardware. Second, DPC adopts the so-called “Dual-Phase Compression”, which uses two different compression algorithms for data with different characteristics. Frequently accessed data is assumed to be buffered by an L4 DRAM cache and is compressed only with a fast, lightweight algorithm, such that decompression latency is negligible compared with access latency. In addition, the algorithm itself is designed such that the critical word can be delivered first to the processor pipeline, further shortening the critical path. Rarely accessed data is further compressed using a second-stage algorithm, which yields both a higher compression ratio, hence better bandwidth savings, and higher access delay. The increased access delay, however, is not a concern either, since NVM accesses also take more cycles. Third, DPC combines compression with both local and global wear-leveling techniques to reduce repeated writes to the same NVM cells, achieving bandwidth savings and longer NVM lifetime at the same time. Lastly, DPC does not introduce extra levels of indirection, notably mapping tables, on the access critical path. Compression is performed not to reduce NVM storage, but to increase NVM endurance. This way, it is sufficient to delegate address remapping to the OS using the existing virtual memory system.

DPC assumes the following system architecture. The system is equipped with NVM as the main memory device and an L4 DRAM cache for fast access and reduced NVM write traffic. The DRAM cache might be implemented with on-chip DRAM modules for even better performance. The paper, however, also notes that the design works equally well, with only minor changes, for systems without a DRAM buffer. Frequently accessed cache lines are brought into the DRAM cache. Any DRAM cache design would work with DPC, but the paper assumes a simple design where DRAM rows are used as sets, and all tags are stored in the same row as the data. The DRAM cache can store up to 4 uncompressed lines per row.
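As a point of reference before compression enters the picture, here is a minimal sketch of that baseline DRAM cache organization (rows as sets, tags co-located with data, four uncompressed lines per row). The cache size, struct names, and lookup helper are illustrative assumptions of mine, not the paper's.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE      64
#define LINES_PER_ROW  4
#define NUM_ROWS       (1 << 16)   /* assumed geometry: 16 MB of cached data */

typedef struct {
    uint64_t tag;
    bool     valid;
    bool     dirty;
} TagEntry;

typedef struct {
    TagEntry tags[LINES_PER_ROW];              /* tags live in the same DRAM row */
    uint8_t  data[LINES_PER_ROW][LINE_SIZE];   /* four uncompressed 64 B lines */
} DramRow;

/* Map a physical line address to its set (one DRAM row), open the row, and
 * probe all tags, mirroring the "read the whole row, then check tags" flow. */
static int dram_cache_lookup(const DramRow *rows, uint64_t line_addr) {
    const DramRow *row = &rows[line_addr % NUM_ROWS];   /* row == set */
    uint64_t tag = line_addr / NUM_ROWS;
    for (int w = 0; w < LINES_PER_ROW; w++)
        if (row->tags[w].valid && row->tags[w].tag == tag)
            return w;                                   /* hit: way index */
    return -1;                                          /* miss */
}
```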

We next describe the first-stage compression algorithm. First-stage compression is performed when a cache line is evicted from the LLC to the L4 DRAM cache. This algorithm, as discussed above, should have low decompression latency and support critical word delivery. The algorithm uses simple word-level matching between adjacent 32-bit words. During compression, the word at offset i is compared with the word at offset (i - 1). If their values match, a mask bit “0” is output, and the word at offset i is removed from the output stream. Otherwise, a “1” bit appears in the bit mask, and the word at offset i is copied to the output stream. The output of the algorithm consists of a 16-bit mask (where the first bit is always “1”, since the first word needs no comparison and is always stored) and 1 to 15 words. If the line after compression is larger than or equal to the original line, the uncompressed line is used instead. Decompression is just the reverse of compression. To decompress the word at offset i, the decompressor simply computes the prefix sum of the first i bits in the mask and uses it as the offset into the compressed words. This way, the critical word accessed by the load operation can be decompressed first and bypass the cache hierarchy to reach the LSU. The rest of the line is decompressed in the background and then inserted into the LLC.
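Below is a small sketch of how this first-stage scheme could look. The 16-bit mask, the adjacent-word comparison, and the prefix-sum lookup of a single word follow the description above; the function names, the popcount helper, and the fallback convention are my own assumptions.

```c
#include <stdint.h>

#define WORDS_PER_LINE 16   /* 64-byte line viewed as 32-bit words */

/* Compress one line: emit a 16-bit mask (bit i = 1 iff word i is stored) and
 * the stored words. Returns how many words were written to out[]; the caller
 * falls back to the uncompressed line when nothing was saved (n == 16). */
static int stage1_compress(const uint32_t line[WORDS_PER_LINE],
                           uint32_t out[WORDS_PER_LINE], uint16_t *mask) {
    int n = 0;
    *mask = 0;
    for (int i = 0; i < WORDS_PER_LINE; i++) {
        if (i == 0 || line[i] != line[i - 1]) {   /* first word is always kept */
            *mask |= (uint16_t)(1u << i);
            out[n++] = line[i];
        }
    }
    return n;
}

/* Fetch only word i from a compressed line: the number of "1" bits in mask
 * positions [0, i] gives the index (plus one) of the stored word that holds
 * word i's value. This is what makes critical-word-first delivery possible. */
static uint32_t stage1_read_word(const uint32_t *comp, uint16_t mask, int i) {
    int idx = __builtin_popcount(mask & ((1u << (i + 1)) - 1)) - 1;
    return comp[idx];
}
```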

The L4 DRAM cache benefits from compression by storing more lines compactly in each set, which results in a potentially larger effective size. The uncompressed DRAM cache design includes four logical lines per DRAM row, with four tags and the corresponding status bits. On each access, the DRAM controller reads the entire row into the row buffer in one access, after which the tags are checked. In the compressed design, however, since more logical lines are potentially stored and the size of each line is not static, the paper proposes that: (1) the number of tags be extended to 16, enabling a maximum compression ratio of 4; (2) compressed lines stored in the data region no longer be statically bound to tags. Instead, the data region is divided into 16-bit segments. A compressed line can start from any segment boundary, and each tag also contains a segment pointer and a size field for locating the compressed data. Compressed lines are always stored compactly in the DRAM cache, which implies that if fragmentation occurs after a write or eviction operation, the DRAM controller should also compact the row and rearrange valid lines before the row is closed.
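The following sketch illustrates one plausible layout of such a compressed row's metadata and the compaction step. The 16 tags, 16-bit segments, and per-tag segment pointer and size field come from the description above; the exact field widths, struct names, and compaction routine are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define TAGS_PER_ROW   16                 /* up to 4x the 4 uncompressed lines */
#define SEG_BITS       16                 /* segment granularity stated above */
#define SEG_BYTES      (SEG_BITS / 8)
#define ROW_DATA_BYTES (4 * 64)           /* data region of one row */

typedef struct {
    uint64_t tag;
    bool     valid;
    bool     dirty;
    uint8_t  seg_ptr;                     /* first segment of the compressed line */
    uint8_t  seg_cnt;                     /* line size, in segments */
} CompTag;

typedef struct {
    CompTag tags[TAGS_PER_ROW];
    uint8_t data[ROW_DATA_BYTES];         /* compactly packed, variably sized lines */
} CompRow;

/* Re-pack a row after a write or eviction leaves a hole: slide valid lines
 * toward segment 0 in order of their current position and fix up seg_ptr,
 * so the row is compact again before it is closed. */
static void compact_row(CompRow *row) {
    bool done[TAGS_PER_ROW] = { false };
    uint8_t next_seg = 0;
    for (int pass = 0; pass < TAGS_PER_ROW; pass++) {
        int best = -1;                    /* left-most valid line not yet moved */
        for (int t = 0; t < TAGS_PER_ROW; t++)
            if (row->tags[t].valid && !done[t] &&
                (best < 0 || row->tags[t].seg_ptr < row->tags[best].seg_ptr))
                best = t;
        if (best < 0) break;
        CompTag *tg = &row->tags[best];
        memmove(&row->data[(size_t)next_seg * SEG_BYTES],
                &row->data[(size_t)tg->seg_ptr * SEG_BYTES],
                (size_t)tg->seg_cnt * SEG_BYTES);
        tg->seg_ptr = next_seg;
        next_seg += tg->seg_cnt;
        done[best] = true;
    }
}
```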

The paper also proposes eviction algorithms for the DRAM buffer. Because cache lines are variably sized, classical LRU may not work best for a compressed cache. The paper, for example, discusses two extremes. In the first extreme, larger lines (uncompressed lines) are preferred over smaller lines for eviction. This favors the effective size of the cache, since more lines can potentially fit into the empty slot after eviction, at the cost of extra write traffic to the NVM, since evicted lines will be written back to the NVM. In the second extreme, smaller lines are favored over larger lines. In this case, write traffic to the NVM can be reduced, but the effective cache size might be impaired, since more than one smaller line might be evicted.

As a balance, the paper proposes a two-stage eviction policy. In the first stage, cache lines are selected based on size. Depending on the characteristics of the workload, this stage may favor uncompressed lines (optimizing for NVM write traffic) or cache capacity (optimizing for read-dominant workloads). In the second stage, cache lines are selected based on whether the line is dirty. Clean lines are always preferred over dirty lines, since they require no NVM write. LRU information is still maintained as a tie-breaker.
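A minimal sketch of this multi-stage victim selection, assuming a simple lexicographic priority of size, then dirtiness, then LRU age; the policy knob for the size direction, the struct, and the function names are my own illustration:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid;
    bool     dirty;
    uint8_t  size_segs;     /* compressed size, in segments */
    uint32_t lru_age;       /* larger = older */
} WayState;

typedef enum { FAVOR_LARGE, FAVOR_SMALL } SizePolicy;

/* Pick a victim way: size first (write-traffic vs. capacity trade-off),
 * then prefer clean over dirty (no NVM write-back), then LRU as tie-breaker. */
static int pick_victim(const WayState *w, int n_ways, SizePolicy pol) {
    int victim = -1;
    for (int i = 0; i < n_ways; i++) {
        if (!w[i].valid) continue;
        if (victim < 0) { victim = i; continue; }
        /* stage 1: size */
        if (w[i].size_segs != w[victim].size_segs) {
            bool better = (pol == FAVOR_LARGE)
                        ? (w[i].size_segs > w[victim].size_segs)
                        : (w[i].size_segs < w[victim].size_segs);
            if (better) victim = i;
            continue;
        }
        /* stage 2: clean lines cost no NVM write */
        if (w[i].dirty != w[victim].dirty) {
            if (!w[i].dirty) victim = i;
            continue;
        }
        /* tie-breaker: evict the least recently used */
        if (w[i].lru_age > w[victim].lru_age) victim = i;
    }
    return victim;
}
```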

The paper then proposes the second-stage compression algorithm, which is applied to lines that have already been compressed in the L4 DRAM cache. The second-stage algorithm is very similar to FPC: it recognizes common bit patterns in the 32-bit words produced by the first-stage compression, and encodes each of them using a three-bit type field followed by only the data necessary for decompression. The second-stage algorithm takes more cycles to compress and decompress, but the latency is insignificant compared with NVM access latency.
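The sketch below shows an FPC-style per-word encoder in the same spirit. The three-bit type field matches the description above, but the concrete pattern set and encodings are borrowed loosely from the original FPC proposal as an illustration; they are not necessarily the exact table used by DPC.

```c
#include <stdint.h>

typedef struct {
    uint8_t  type;      /* 3-bit pattern id */
    uint32_t payload;   /* payload bits (width depends on type) */
    uint8_t  bits;      /* payload width, for packing into the output stream */
} FpcToken;

/* Encode one 32-bit word as a 3-bit type plus the minimum payload needed
 * to reconstruct it. */
static FpcToken fpc_encode_word(uint32_t w) {
    int32_t s = (int32_t)w;
    if (w == 0)                    return (FpcToken){0, 0, 0};           /* zero word */
    if (s >= -8 && s <= 7)         return (FpcToken){1, w & 0xF, 4};     /* 4-bit sign-ext */
    if (s >= -128 && s <= 127)     return (FpcToken){2, w & 0xFF, 8};    /* 8-bit sign-ext */
    if (s >= -32768 && s <= 32767) return (FpcToken){3, w & 0xFFFF, 16}; /* 16-bit sign-ext */
    if ((w & 0xFFFF) == 0)         return (FpcToken){4, w >> 16, 16};    /* low half zero */
    if ((w >> 16) == (w & 0xFFFF)) return (FpcToken){5, w & 0xFFFF, 16}; /* repeated halves */
    if (((w >> 24) & 0xFF) == (w & 0xFF) &&
        ((w >> 16) & 0xFF) == (w & 0xFF) &&
        ((w >> 8)  & 0xFF) == (w & 0xFF))
                                   return (FpcToken){6, w & 0xFF, 8};    /* repeated bytes */
    return (FpcToken){7, w, 32};                                         /* uncompressible */
}
```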

Finally, the paper proposes two wear-leveling techniques. The local wear-leveling technique is based on compression and aims at evenly wearing all bits within a cache line. The global wear-leveling technique is largely independent of compression and aims at distributing writes evenly across 4 KB pages.

The local wear-leveling works as follows. Each cache line in the NVM is associated with a Line Counter (LC), which tracks the number of times the entire line has been written. Compressed lines are still mapped to device addresses as if they were uncompressed, but instead of always storing the compressed line starting at bit zero of its slot, the paper proposes that the line be shifted within the slot on each write. To elaborate: the internal FTL of the NVM maintains an offset field and a size field to track the position and size of the compressed line. Lines are always stored on segment boundaries (four-byte boundaries) within a slot. Whenever a write is performed on a line slot, the compressed line is stored right next to the last segment of the previous version, hence evenly distributing writes over all segments in the slot. Out-of-bound writes are wrapped around to the first bit of the slot. The NVM controller shifts the line back after reading it out of the slot and before sending it to the decompressor.
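Here is a small sketch of this rotating placement, assuming a 64-byte line slot, four-byte segments, and per-line metadata holding the offset, size, and Line Counter. The metadata fields mirror the text; the names and the byte-copy loop are illustrative.

```c
#include <stdint.h>

#define SLOT_BYTES  64
#define SEG_BYTES   4                /* four-byte segment boundaries */
#define SLOT_SEGS   (SLOT_BYTES / SEG_BYTES)

typedef struct {
    uint32_t line_counter;           /* LC: total writes to this line slot */
    uint8_t  offset_segs;            /* where the current compressed line starts */
    uint8_t  size_segs;              /* its size, in segments */
} LineMeta;

/* Write a new compressed version (new_segs segments) into the slot: place it
 * right after where the previous version ended, wrapping around the slot, so
 * writes are spread evenly over all segments over time. */
static void wear_level_write(LineMeta *m, uint8_t slot[SLOT_BYTES],
                             const uint8_t *comp, uint8_t new_segs) {
    uint8_t start = (uint8_t)((m->offset_segs + m->size_segs) % SLOT_SEGS);
    for (uint8_t s = 0; s < new_segs; s++) {
        uint8_t dst = (uint8_t)((start + s) % SLOT_SEGS);   /* wrap around slot */
        for (int b = 0; b < SEG_BYTES; b++)
            slot[dst * SEG_BYTES + b] = comp[s * SEG_BYTES + b];
    }
    m->offset_segs = start;          /* the controller un-shifts on reads using this */
    m->size_segs   = new_segs;
    m->line_counter++;
}
```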

Global wear-leveling is performed at page granularity. Its goal is to bound the difference in write count between the most frequently written and the least frequently written pages. DPC monitors page write history by taking the sum of the LCs of all lines within the page (which, as the paper observes, is a better metric than a single per-page counter). The OS is responsible for reading these values on a page write and determining whether a page migration is needed. If the page currently being written is deemed to have received too many writes, the OS first copies the content of the page to another, less-written page frame, and then updates the page table mapping. This way, no extra mapping structure is needed, since the OS itself performs address remapping explicitly.
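A rough sketch of the OS-side migration check under these assumptions: a page's write count is the sum of its lines' LCs, and a page whose count exceeds that of the least-written page by more than a threshold is migrated by copying it and remapping the page table. The threshold value and all helper hooks (read_line_counter, find_least_written_frame, copy_page, remap_page) are hypothetical, not real kernel APIs.

```c
#include <stdint.h>

#define PAGE_SIZE     4096
#define LINE_SIZE     64
#define LINES_PER_PG  (PAGE_SIZE / LINE_SIZE)
#define WEAR_GAP_MAX  100000u        /* assumed bound on the write-count gap */

/* Assumed hooks into the memory controller and OS; not real APIs. */
extern uint32_t read_line_counter(uint64_t page_frame, int line);
extern uint64_t find_least_written_frame(void);
extern void     copy_page(uint64_t dst_frame, uint64_t src_frame);
extern void     remap_page(uint64_t vaddr, uint64_t new_frame);

/* Page write count = sum of the LCs of all lines in the page. */
static uint64_t sum_line_counters(uint64_t frame) {
    uint64_t sum = 0;
    for (int l = 0; l < LINES_PER_PG; l++)
        sum += read_line_counter(frame, l);
    return sum;
}

/* Called on a write to vaddr, currently mapped to frame: migrate the page if
 * its write count has drifted too far above the least-written frame's. */
static void maybe_migrate(uint64_t vaddr, uint64_t frame) {
    uint64_t cold_frame = find_least_written_frame();
    uint64_t hot  = sum_line_counters(frame);
    uint64_t cold = sum_line_counters(cold_frame);
    if (hot > cold + WEAR_GAP_MAX) {
        copy_page(cold_frame, frame);      /* move contents to a cold frame */
        remap_page(vaddr, cold_frame);     /* update the page table mapping */
    }
}
```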