MemZip: Exploring Unconventional Benefits from Memory Compression



Paper Title: MemZip: Exploring Unconventional Benefits from Memory Compression
Link: https://ieeexplore.ieee.org/document/6835972
Year: HPCA 2014
Keyword: Compression; Memory Compression; Subranking; MemZip




Highlight:

  1. Taking advantage of shorter burst lengths with subranking to fetch partial blocks, which fits the paradigm of memory compression very well

  2. Letting the OS simply not allocate the pages that store metadata, rather than having the memory controller perform complicated address mapping

Questions

  1. I don’t get why you don’t just store the burst length in the header of the compressed line. In that case you would just burst the first 8 bytes, check the number, and read the rest with more bursts (the metadata cache would still be needed, though). By placing metadata at the end of the row, you have to activate the subranks that store it, which costs one more activation. Furthermore, even if the metadata is stored at the end of the row, you still have to burst at least 8 bytes to fetch it, so this placement has no obvious advantage.

  2. The metadata scheme creates “holes” in the physical address space. It is fine for the OS to reserve pages under the 4KB scheme, since a hole only occurs every 128 * 8KB = 1MB. But for 2MB and 1GB super pages this does not work, since every super page would contain holes. The controller should really just mask off these pages and perform address translation on them to avoid paging issues.

This paper proposes MemZip, an application of memory compression that aims at improving the performance and power efficiency of the memory system. The paper points out at the beginning that most existing memory compression schemes aim at improving storage efficiency via flexible allocation of variable-sized pages and clever placement of data. These schemes are not always optimal for several reasons. First, although cache lines become shorter because of compression, the DRAM access protocol still bursts 64 bytes as in uncompressed memory, which over-fetches the line. The paper argues that over-fetching only makes sense if adjacent lines are also stored together in compressed form, and when access locality is high. Second, memory compression schemes often need to maintain a large amount of metadata in order to remap addresses from the uncompressed address space to the compressed address space. Since compressed cache lines are no longer stored at their home addresses, metadata accesses must be serialized with data accesses, putting both on the critical data access path. Although previous designs also propose metadata caches to hide this extra cost in the common case, the effectiveness of a metadata cache is usually limited to smaller working sets, since metadata entries tend to be large (e.g., 64 bytes per 4KB page). The last, less mentioned reason is that memory compression does not work well with ECC, since most previous schemes do not store ECC data explicitly with compressed lines. As cache lines can be stored at arbitrary locations, ECC bit accesses will likely not result in row buffer hits with compressed data. This both complicates memory design for ECC-protected systems and adds extra latency and power consumption on the critical path, costs that are hardly evaluated in previous works.

MemZip solves the above issues by not seeking to reduce the memory footprint at all, instead aiming only at reducing bandwidth and power consumption while staying compatible with existing ECC schemes. Compressed cache lines are always stored at their home locations, eliminating the need for address translation. In addition, only the compressed line body is transferred over the memory bus, at 8-byte granularity instead of the original 64-byte granularity, which saves bandwidth and energy when the compressed size is significantly smaller than 64 bytes. The paper also proposes mechanisms for storing ECC bits and other special encoding bits in the “padding” space at the end of the last 8-byte word, which is left unused due to alignment. These extra bits can further reduce ECC access cost, or optimize bus data transfers.

The MemZip architecture is based on DRAM subranking, a technique for reducing the access granularity of the DRAM access protocol. The paper assumes a DDR3 interface, in which every DRAM access involves eight consecutive bursts of 8 bytes each, fetching a full 64-byte cache line per access request. The number of bursts, however, is not changeable in the DDR3 protocol, which is unfortunate, since a memory access must then always fetch 64 bytes as long as the granularity of a single burst is eight bytes. Memory subranking, on the other hand, reduces the access granularity to one byte by selectively activating only a subset of the chips within a rank. With subranking, each DDR3 burst sequence fetches eight 1-byte blocks, totaling 8 bytes per access. Previous publications have pointed out that subranking helps reduce power consumption and improve performance, at the cost of more access transactions and longer access latency. In this paper, subranking is leveraged to fetch compressed data, which may use significantly less storage than a full 64-byte cache line. In this case, subranking not only saves bus bandwidth by not over-fetching unused bits from memory, but also improves performance, since shorter compressed lines require only a few burst cycles to read.
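To make the savings concrete, here is a minimal sketch (our own illustration, not from the paper) of the burst-count arithmetic under the assumed 8-byte burst granularity:

```c
#include <stdio.h>

/* Number of 8-byte bursts needed to fetch a compressed line of
 * `size_bytes` (header plus body), i.e. ceil(size / 8). */
static unsigned bursts_needed(unsigned size_bytes) {
    return (size_bytes + 7) / 8;
}

int main(void) {
    /* A line that compresses to 20 bytes takes 3 bursts instead of
     * the 8 bursts an uncompressed 64-byte line always costs. */
    printf("20 bytes -> %u bursts\n", bursts_needed(20)); /* 3 */
    printf("64 bytes -> %u bursts\n", bursts_needed(64)); /* 8 */
    return 0;
}
```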

MemZip operates as follows. Cache lines are compressed and decompressed when they enter and leave the memory controller, respectively. MemZip does not mandate any particular compression algorithm, but suggests using BDI and FPC together for a reasonable compression ratio and fast decompression (the paper claims that both can be implemented with 1-2 cycles of decompression latency). A compressed line is always stored at its home location without any address translation, so read and write accesses always use the physical address generated by the on-chip MMU. Cache lines written back from the upper level are compressed with both BDI and FPC in parallel, and the better result of the two is selected. Compression metadata is stored in the first byte at the home location. The initial three bits indicate the compression scheme (of the eight encodings, seven are used by BDI variants and the last is left for FPC), while the remaining five bits store DBI information, which we discuss later. If DBI is used, as indicated by the DBI field in the metadata byte, the next one to three bytes store the DBI bits, which must be read before any data transfer since DBI optimizes data transfer via bit flipping.
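To make the header layout concrete, the following C sketch packs and unpacks the one-byte metadata header described above; the bit order and names are our assumptions, as the paper only fixes the field widths (three scheme bits plus five DBI bits):

```c
#include <stdint.h>

/* Low 3 bits: compression scheme (values 0-6 assumed to be the seven
 * BDI variants, 7 left for FPC). High 5 bits: DBI configuration. */
#define SCHEME_FPC 7u

static inline uint8_t pack_header(uint8_t scheme, uint8_t dbi_cfg) {
    return (uint8_t)((scheme & 0x7u) | ((dbi_cfg & 0x1Fu) << 3));
}

static inline uint8_t header_scheme(uint8_t h)  { return h & 0x7u; }
static inline uint8_t header_dbi_cfg(uint8_t h) { return (h >> 3) & 0x1Fu; }
```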

After compression, a dirty cache line may turn out to be larger than its old version. Since every line’s home location still reserves the full 64 bytes, this requires no extra handling, unlike previous proposals in which the line may “overflow” to a special region or compaction must be performed in the background.

Recall that MemZip reads a compressed line with a variable number of bursts. This number is stored in a separate, per-row metadata area. Assuming 8KB DRAM pages and 8-byte bursts, each compressed line needs four bits to represent the number of bursts for accessing the line body. Zero bursts means that the line is all-zero, which requires no extra DRAM access. Eight bursts means that the line is stored uncompressed in raw format without the metadata header (otherwise the line would be longer than 64 bytes). Since there are 128 lines per page, consuming a total of 512 bits or 64 bytes of metadata, the paper proposes that the last 64-byte block in each DRAM page be used to store the burst lengths. The 128th line of each page is remapped to a dedicated overflow page, with one overflow page allocated for every 127 DRAM pages. To perform the address mapping, the memory controller simply divides the page number by 128, and uses the last page in the page group to access the 128th cache line of each page. The OS should reserve the last two 4KB pages of every 256 4KB pages to avoid allocating the metadata page. A metadata cache is also added to the memory controller for fast access to recently used pages. The metadata cache stores metadata at 64-byte granularity, with each entry covering a full 8KB DRAM page.
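The controller-side mapping is simple enough to sketch in C. The structure and slot assignment below are our assumptions; the paper only states that the page number is divided by 128 and the displaced line goes to the last page of the group:

```c
#include <stdint.h>

#define LINES_PER_PAGE  128u  /* 8KB DRAM page / 64B line         */
#define PAGES_PER_GROUP 128u  /* 127 data pages + 1 overflow page */

typedef struct { uint64_t page; uint32_t line; } dram_loc_t;

/* The last 64B slot of every page holds burst-length metadata, so the
 * displaced 128th data line of each page is redirected into the last
 * page of its 128-page group (which the OS never allocates). */
static dram_loc_t map_line(uint64_t page_no, uint32_t line_no) {
    dram_loc_t loc = { page_no, line_no };
    if (line_no == LINES_PER_PAGE - 1) {
        uint64_t group = page_no / PAGES_PER_GROUP;
        loc.page = group * PAGES_PER_GROUP + (PAGES_PER_GROUP - 1);
        loc.line = (uint32_t)(page_no % PAGES_PER_GROUP); /* 0..126 */
    }
    return loc;
}
```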

The paper proposes two extra optimizations based on the observation that the compressed data body is not always aligned to the 8-byte burst boundary. In this case, the last 8-byte word read from the DRAM contains useless “padding” bits at the end, which can be leveraged to store extra information such as ECC and/or DBI bits. Conventional ECC uses a ninth chip per rank to provide an extra 8 bits for every 8 bytes of data. In a conventional scheme, the extra chip is activated together with the other eight chips, so ECC accesses incur no extra latency. MemZip, on the contrary, does not activate all chips when reading compressed data, requiring an extra activation on the subrank that stores ECC if memory protection is enabled. The paper proposes storing ECC bits in the padding at the end of the compressed data body as an optimization, since the ECC is then read out essentially for free. If the number of padding bits is insufficient, the ECC bits are still stored in their original form and location.
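As a back-of-the-envelope check (our own, assuming one ECC byte per 8-byte word as in the conventional ninth-chip scheme above; the paper’s exact budgeting may differ), the padding absorbs the ECC bits only when the compressed payload sits far enough from the burst boundary:

```c
#include <stdbool.h>

/* True if the alignment padding of a compressed line of
 * `payload_bytes` can hold its ECC bytes (1 per 8-byte word). */
static bool ecc_fits_in_padding(unsigned payload_bytes) {
    unsigned bursts  = (payload_bytes + 7) / 8;
    unsigned padding = bursts * 8 - payload_bytes;
    return padding >= bursts;   /* e.g. 20B: 4 >= 3; 24B: 0 < 3 */
}
```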

The second optimization is DBI (data bus inversion), a technique for reducing bit flips on the data bus when transmitting byte by byte. The observation is that the dynamic energy of each bus transfer cycle is proportional to the number of bit flips from the previous cycle to the current cycle. If the number of bit flips exceeds four, then instead of transmitting the original value, the value is bit-flipped before transmission. Since the receiving end otherwise has no way of knowing whether the value was inverted, an extra bit is stored together with the data to be transmitted, indicating whether the byte is sent with its bits flipped. These extra bits can also be stored in the padding area, but at the beginning of the compressed body, pushing the body towards the 8-byte boundary. In addition, if space is a concern, the bit need not be stored for each byte: several configurations are allowed, such as storing one bit per two bytes, per four bytes, etc. The DBI configuration bits in the metadata header of each compressed line encode this configuration, which is used to interpret the DBI bit vector that follows.
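The per-byte variant of DBI is easy to sketch; the following C function (our own illustration, assuming an all-zero initial bus state) inverts a byte whenever transmitting it unchanged would flip more than four bus wires, and records one inversion bit per byte:

```c
#include <stddef.h>
#include <stdint.h>

static int popcount8(uint8_t x) {
    int n = 0;
    while (x) { n += x & 1; x >>= 1; }
    return n;
}

/* Encodes `len` bytes in place; dbi_bits[i] is set when data[i] was
 * inverted. Returns the total number of bus wire flips after DBI
 * (at most 4 per cycle by construction). */
static int dbi_encode(uint8_t *data, uint8_t *dbi_bits, size_t len) {
    uint8_t prev = 0;   /* assumed initial bus state */
    int flips = 0;
    for (size_t i = 0; i < len; i++) {
        int f = popcount8((uint8_t)(data[i] ^ prev));
        if (f > 4) {                    /* inverting flips 8 - f <= 3 */
            data[i] = (uint8_t)~data[i];
            dbi_bits[i] = 1;
            f = 8 - f;
        } else {
            dbi_bits[i] = 0;
        }
        prev = data[i];
        flips += f;
    }
    return flips;
}
```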