Attache: Towards Ideal Memory Compression by Mitigating Metadata Bandwidth Overheads



Paper Title: Attache: Towards Ideal Memory Compression by Mitigating Metadata Bandwidth Overheads
Link: https://ieeexplore.ieee.org/document/8574551/
Year: MICRO 2018
Keyword: Attache; Memory Compression




This paper proposes Attache, a main memory compression framework that reduces memory bandwidth consumption using special marker signatures. Prior work proposes main memory compression for two main purposes. First, compression reduces bus bandwidth consumption by transferring compressed blocks, which are variably sized and smaller than the uncompressed block, instead of fixed-size cache blocks. Second, more blocks can fit into the same amount of physical storage, if compressed blocks are stored in a way that lets the freed space be collected and reused. Conventional main memory compression schemes face two major challenges. First, compressed blocks must be relocated in main memory for the capacity advantage to be realized, so the memory controller must maintain a mapping table that locates a compressed line given its physical address. Second, compressed lines must identify themselves so that the decompression engine can recognize the compression algorithm and process the line properly.

Attache focuses on reducing bandwidth without increasing effective memory capacity, and therefore uses a rather simple address mapping scheme. Compressed lines must still identify themselves to the decompression engine, since some lines are stored uncompressed while others have different compressibility. Conventional memory compression schemes often use a metadata cache to enable fast retrieval of metadata for recently accessed blocks. Metadata caches, however, have several problems. First, the cache cannot be very large, due to the area and budget constraints of SRAM. If the working set exceeds the cache size, each memory access incurs two DRAM references: one for the metadata of the requested address and one for the actual data. The benefit of bandwidth reduction diminishes as the number of DRAM operations doubles. Even worse, in such a scenario the application is likely bandwidth hungry, so bandwidth reduction matters most exactly when its benefit becomes marginal. Second, a metadata cache incurs fill and eviction traffic on every miss; this traffic further consumes memory bus bandwidth, since the metadata ultimately resides in DRAM. Lastly, with a metadata cache, the metadata access and the data access are serialized, because the controller does not know the exact location and/or the compression scheme of the requested address until the metadata arrives. The metadata lookup therefore lengthens DRAM latency and sits on the critical path of every DRAM access.

Attache optimizes metadata handling by eliminating the dedicated metadata area and instead storing a special marker and a bit indicating compression status at the beginning of each compressed DRAM block. The memory controller checks for the marker after a block is read. If the marker is present and the compression status bit indicates that the line is indeed compressed, the line is sent to the decompression engine for further processing. Marker conflicts, however, may arise if an uncompressed block happens to begin with the same bit pattern as the marker. To resolve this rare coincidence, the bit immediately after the marker serves as a disambiguation bit: if the marker is present and this bit is set, the block is truly compressed. Otherwise, if the marker is not found, or if the bit is cleared, the block is treated as uncompressed. For an uncompressed block that collides with the marker, one bit of the block is repurposed as the disambiguation bit, so its actual value has to be stored elsewhere; the paper suggests using a dedicated area in DRAM as the storage for this extra bit.
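The following is a minimal C sketch of the block formats as I understand them from the paper; the exact bit ordering, the 64-byte line size, and the helper names (`tag_compressed`, `store_uncompressed`) are my own illustrative assumptions rather than details taken from the paper.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define MARKER_MASK 0x7FFFu   /* low 15 bits of the first 16-bit word: the marker   */
#define FLAG_BIT    0x8000u   /* the disambiguation bit right after the marker      */

/* Tag a compressed payload: the first 16-bit word holds the marker plus a set flag. */
static void tag_compressed(uint8_t block[64], uint16_t marker,
                           const uint8_t *payload, size_t payload_len)
{
    uint16_t header = (uint16_t)((marker & MARKER_MASK) | FLAG_BIT);
    memcpy(block, &header, sizeof(header));
    memcpy(block + sizeof(header), payload, payload_len);
}

/*
 * Store an uncompressed block. If its first 15 bits happen to equal the marker,
 * clear the flag bit so readers treat it as uncompressed, and return the bit's
 * true value so the caller can save it in the dedicated DRAM area.
 */
static bool store_uncompressed(uint8_t block[64], uint16_t marker,
                               const uint8_t line[64], bool *stolen_bit)
{
    memcpy(block, line, 64);
    uint16_t first;
    memcpy(&first, block, sizeof(first));
    if ((first & MARKER_MASK) == (marker & MARKER_MASK)) {   /* marker conflict  */
        *stolen_bit = (first & FLAG_BIT) != 0;               /* remember real bit */
        first &= (uint16_t)~FLAG_BIT;                        /* force flag to 0   */
        memcpy(block, &first, sizeof(first));
        return true;                                         /* extra write needed */
    }
    return false;
}
```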

One of the most critical observations is that metadata access is required on every memory request, so it should be highly optimized and ideally be latency- and bandwidth-free. With a marker value, most uncompressed lines can be ruled out by a simple marker comparison after fetching the line from the row buffer, since the probability of a marker conflict decreases exponentially with marker size; marker conflicts can therefore be handled in a slower, sub-optimal manner. In Attache, when a marker conflict occurs, an extra memory access fetches the stored bit to restore the original content of the block. The performance cost of this extra access is negligible given how rare the case is.
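As a back-of-the-envelope check on how rare conflicts are (my own calculation, assuming uncompressed data is uniformly random with respect to the marker), an n-bit marker collides with probability

$$P(\text{conflict}) = 2^{-n}, \qquad 2^{-15} \approx 3 \times 10^{-5}$$

for the 15-bit marker described below, i.e., roughly one uncompressed block in 32,768 would need the extra DRAM access.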

Attache assumes a sub-ranked memory interface, in which data can be accessed at half-line granularity. This access mode allows the lower and higher sub-ranks to be activated and read in parallel, each providing a half-line sized block. The observation is that if compressed blocks fit into a single sub-rank (i.e., at least 2:1 compression), then two compressed blocks can be accessed at the same time, potentially doubling the effective DRAM bandwidth.
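To spell out the arithmetic (my own restatement, assuming 64-byte cache lines split into two 32-byte sub-ranks): an uncompressed line occupies both sub-ranks, so one full-width access delivers one line, whereas two lines each compressed to at most one sub-rank can share the two sub-ranks of a single access, giving

$$\frac{2 \times 64\ \text{B of data delivered}}{64\ \text{B of bus traffic}} = 2\times$$

effective bandwidth in the best case of fully compressible traffic.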

We next discuss the details of operation. As described above, each compressed block is preceded by a marker value and a disambiguation bit. The marker is a 15-bit random string generated at system initialization time, stored in the memory controller, and used throughout the power cycle. A block written back from the LLC is stored in compressed form if its compressed size is no larger than 30 bytes, since a sub-rank can hold at most 32 bytes and the first two bytes are occupied by the marker and the disambiguation bit. Any block larger than 30 bytes after compression is stored uncompressed.
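A minimal sketch of this size rule in C, assuming 64-byte lines and 32-byte sub-ranks as above; the constant and function names are mine, not the paper's:

```c
#include <stddef.h>
#include <stdbool.h>

#define SUBRANK_BYTES        32                              /* half of a 64-byte line */
#define HEADER_BYTES         2                               /* 15-bit marker + 1 bit  */
#define MAX_COMPRESSED_BYTES (SUBRANK_BYTES - HEADER_BYTES)  /* 30 bytes               */

/* A writeback is stored compressed only if the compressed payload fits in one
 * sub-rank behind the two-byte header; otherwise it is stored uncompressed. */
static bool store_compressed(size_t compressed_size)
{
    return compressed_size <= MAX_COMPRESSED_BYTES;
}
```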

To increase the chance that two compressed lines can be accessed in parallel, the paper suggests that compressed lines on even rows be stored in the lower sub-rank, while compressed lines on odd rows be stored in the higher sub-rank. Uncompressed lines on even rows are stored across both sub-ranks as-is, but for an uncompressed line on an odd row, the halves in the lower and higher sub-ranks are swapped. The reason is that the memory controller cannot know whether a line is compressed before it reads the sub-rank that potentially contains the marker: for odd rows that is the higher sub-rank, so the first half of an uncompressed line is placed there, allowing the controller to always read that sub-rank first and then check for the marker.
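A small sketch of this placement rule, under my reading that "row" here is simply the even/odd index of the line's location; the enum and function names are illustrative:

```c
#include <stdbool.h>

enum sub_rank { LOWER = 0, HIGHER = 1 };

/* Sub-rank that holds a compressed line, and that the controller reads first:
 * lower for even rows, higher for odd rows. */
static enum sub_rank first_sub_rank(unsigned row)
{
    return (row % 2 == 0) ? LOWER : HIGHER;
}

/* For uncompressed lines, the half holding the start of the line must sit in
 * the sub-rank read first, so odd-row uncompressed lines store their halves swapped. */
static bool halves_swapped(unsigned row, bool compressed)
{
    return !compressed && (row % 2 == 1);
}
```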

Upon receiving a memory request, the memory controller first computes the sub-rank to be accessed and activates the row. After the row is in the row buffer, the controller checks whether the first 15 bits of the data read from that sub-rank equal the marker string. If not, the sub-rank contains the first half of an uncompressed line; the other sub-rank is then accessed, and both halves are concatenated to form the full cache line. Otherwise, if the first 15 bits match the marker, the controller checks the next bit. If this bit is set, the rest of the sub-rank is treated as a compressed line and sent to the decompression engine. In the rare case that the bit after the marker is zero, a marker conflict has occurred: the memory controller retrieves the missing bit from the dedicated area, whose base address is stored in a controller register, generating an extra DRAM read. After the bit is retrieved, the other sub-rank is also read, and the original content of the uncompressed block is restored by substituting the retrieved bit for the disambiguation bit.
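The decision sequence on the read path can be summarized with the following sketch, reusing the bit layout assumed in the earlier write-path example (again my own illustration, not the paper's code):

```c
#include <stdint.h>
#include <string.h>

#define MARKER_MASK 0x7FFFu   /* first 15 bits of the sub-rank data   */
#define FLAG_BIT    0x8000u   /* disambiguation bit after the marker  */

enum line_kind { UNCOMPRESSED, COMPRESSED, MARKER_CONFLICT };

/* Classify the 32 bytes returned by the first sub-rank read. */
static enum line_kind classify(const uint8_t sub_rank_data[32], uint16_t marker)
{
    uint16_t first;
    memcpy(&first, sub_rank_data, sizeof(first));

    if ((first & MARKER_MASK) != (marker & MARKER_MASK))
        return UNCOMPRESSED;       /* no marker: read the other sub-rank too        */
    if (first & FLAG_BIT)
        return COMPRESSED;         /* marker + set bit: send to the decompressor    */
    return MARKER_CONFLICT;        /* marker + cleared bit: fetch the stored bit,   */
                                   /* read the other sub-rank, restore the bit      */
}
```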

The above access sequence serializes the two sub-rank accesses when the cache line is uncompressed. Such serialization is largely unnecessary, because the sub-ranked DRAM interface supports reading both sub-ranks in parallel. To better exploit this parallelism, the paper proposes reading both sub-ranks in the same time window when the cache line is expected to be uncompressed. There is, however, a circular dependency: the content of the block is needed to determine whether it is compressed, yet without knowing the compressibility, the controller cannot decide whether to read one sub-rank or both. To break this circle, the paper introduces a three-level predictor for line compressibility. The top level, the global predictor, consists of four 2-bit saturating counters and tracks whether compression is globally likely using recent access results. Addresses are direct-mapped to one of the four counters; a counter is incremented on a correct prediction and reset to zero on an incorrect one. The second level, the page predictor, tracks the compressibility of a page as a whole and is indexed by the page number of the requested address. A page predictor counter is incremented when its prediction is correct; otherwise, its value is copied from the corresponding global predictor. The last level, the line predictor, tracks the compressibility of individual lines using a bit vector with one bit per cache block in a page, indexed by page number just like the page predictor. When a line is predicted compressible and the prediction is correct, its bit is set; otherwise the bit is cleared. In addition, on each line predictor update, the page predictor of the line is consulted: if it indicates that the page as a whole is compressible, i.e., most recently accessed lines in the page were compressed, the bits of the two neighboring lines are also set in the line predictor, reflecting the fact that one compressible line in a page often indicates the compressibility of spatially adjacent lines.
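The sketch below captures the update rules of this predictor hierarchy as described above. The table sizes, the threshold used to decide that a page "looks compressible", and the fact that the final prediction is taken from the line-level bit are my own assumptions to make the example concrete; the paper summary does not spell out how the three levels are combined into the final prediction.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINES_PER_PAGE 64     /* 4 KiB page / 64 B lines                      */
#define NUM_PAGES      1024   /* illustrative table size, not from the paper  */

static uint8_t  global_ctr[4];          /* four 2-bit saturating counters      */
static uint8_t  page_ctr[NUM_PAGES];    /* 2-bit saturating counter per page   */
static uint64_t line_vec[NUM_PAGES];    /* one bit per line in a page          */

/* In this sketch the final prediction simply comes from the line-level bit. */
static bool predict(uint64_t addr)
{
    uint64_t page = (addr / (LINES_PER_PAGE * 64)) % NUM_PAGES;
    unsigned line = (unsigned)((addr / 64) % LINES_PER_PAGE);
    return (line_vec[page] >> line) & 1;
}

static void update(uint64_t addr, bool predicted, bool was_compressed)
{
    uint64_t page    = (addr / (LINES_PER_PAGE * 64)) % NUM_PAGES;
    unsigned line    = (unsigned)((addr / 64) % LINES_PER_PAGE);
    unsigned g       = (unsigned)((addr / 64) % 4);   /* direct-map to a global counter */
    bool     correct = (predicted == was_compressed);

    /* Global level: count correct predictions, reset on a misprediction. */
    if (correct) { if (global_ctr[g] < 3) global_ctr[g]++; }
    else         { global_ctr[g] = 0; }

    /* Page level: increment when correct, otherwise inherit the global value. */
    if (correct) { if (page_ctr[page] < 3) page_ctr[page]++; }
    else         { page_ctr[page] = global_ctr[g]; }

    /* Line level: set the bit when the line was predicted compressible and the
       prediction was correct, otherwise clear it. */
    if (predicted && correct) line_vec[page] |=  (1ULL << line);
    else                      line_vec[page] &= ~(1ULL << line);

    /* If the page looks compressible overall (threshold assumed here), also set
       the two neighboring lines, exploiting spatial locality of compressibility. */
    if (was_compressed && page_ctr[page] >= 2) {
        if (line > 0)                  line_vec[page] |= 1ULL << (line - 1);
        if (line < LINES_PER_PAGE - 1) line_vec[page] |= 1ULL << (line + 1);
    }
}
```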