A Many-Core Oriented Compressed Cache



Paper Title: A Many-Core Oriented Compressed Cache
Link: https://dl.acm.org/doi/10.1145/2830772.2830828
Year: MICRO 2015
Keyword: Cache; Compression; Log-structured Cache




Highlight:

  1. Large block compression, an optimization usually reserved for DRAM, can also be deployed in the LLC by using large log objects.

Questions

  1. If two addresses are mapped to the same log and use the same LMT entry, how do you know which address the status bits represent? A conservative approach is to always check tags as long as the LMT entry is valid, but in that case storing MESI states is overkill, as a single valid bit suffices.

  2. The paper does not mention how conventional per-line status bits are stored. Are they compressed together with the tags using BDI? I suspect they are, but BDI would then be less effective, since the tag slots contain not only address tags but also status bits.

  3. I did not get the benefit of log-structured updates besides the compression ratio benefit. By not overwriting cache lines that are stale, you essentially create external fragmentation, which is hard to get rid of without compaction. I do get the benefit of the fully-associative mapping between LMT and log entries, and of the large log objects, though.

  4. The paper seems to suggest that when appending a block, we can individually encode the block and then append it (Sec. 3.1: “The data is then compressed and appended to the log (as shown in Figure 5)”). Earlier and later sections of the paper suggest otherwise, saying that all blocks are compressed as a single stream.

  5. If the LMT does not store address tags, how can the LMT eviction logic know which tag to invalidate? Do you ensure that at most one tag in the log can be mapped by an LMT entry? How?

  6. In general, many important details are missing and cannot be inferred from the existing information. Can’t believe it got into MICRO.

This paper proposes MORC, a novel log-structured last-level cache (LLC) design optimized for high throughput at the cost of longer access latency. The paper observes that on modern multicore architectures, the performance of a single core is less important than the overall system throughput. As in previous works, MORC takes advantage of cache compression to increase overall system throughput by providing a larger effective cache, allowing more accesses to be fulfilled by the LLC rather than by the more power-hungry and higher-latency DRAM. The paper then identifies three critical problems that are common among existing cache compression schemes. First, most existing algorithms have a very limited compression ratio, ranging between 2:1 and 8:1. The compression ratio depends not only on the effectiveness of the algorithm itself, but also on the over-provisioned tags. In practice, it is difficult to achieve the maximum compression ratio due to the size of the block and/or the number of over-provisioned tags. Second, most compression schemes adopt the segmented cache approach, in which the data slots within a set are unified into a single chunk, which is then divided into fixed-size segments. Each segment can be mapped by only one tag and stores part or all of that tag's compressed data. If the size of the compressed block is not a multiple of the segment size, the last segment cannot be fully utilized, resulting in internal fragmentation. Lastly, to reduce circuit complexity, most compressed cache designs allocate segments using only simple policies, such as random or FIFO. In addition, there are often other constraints on the placement of segments, e.g., all segments mapped by a tag must store compressed data in ascending order. These constraints can prevent a free segment from being used, which is called external fragmentation and can only be resolved by compaction. The paper claims that cache compaction incurs major circuit area and power overheads and is therefore not advisable. Both internal and external fragmentation reduce the effective cache size, but can be alleviated by reducing the granularity of tracking at the cost of more on-chip metadata.
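
To make the internal-fragmentation point concrete, here is a tiny back-of-the-envelope sketch in C++; the 16-byte segment size and 41-byte compressed block size are made-up numbers for illustration, not values from the paper:

```cpp
#include <cstdio>

// Back-of-the-envelope illustration of internal fragmentation in a segmented
// compressed cache. The 16-byte segment size and 41-byte compressed block are
// made-up numbers, not values from the paper.
int main() {
    const int segment_bytes    = 16;
    const int compressed_bytes = 41;
    const int segments = (compressed_bytes + segment_bytes - 1) / segment_bytes;  // 3 segments
    const int wasted   = segments * segment_bytes - compressed_bytes;             // 7 bytes lost
    std::printf("segments used: %d, bytes lost to internal fragmentation: %d\n",
                segments, wasted);
    return 0;
}
```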

MORC avoids all three problems above using a log-structured approach to organize compressed cache lines. First, by compressing multiple cache lines together into large logs (512 bytes, as suggested in the paper), dictionary information can be better utilized to encode duplicated entries. An average of 24x and a maximum of 64x compression ratio are reported by the paper. Second, a compressed log is stored compactly, with blocks within the log aligned to byte boundaries, reducing internal fragmentation to a minimum. Lastly, no entry in the log is ever overwritten by a line fill response or write back request. Individual blocks are always appended to the end of the log, eliminating the need for compaction. Furthermore, MORC allocates tag storage at the end of the log. Tags are also compressed, reducing the physical over-provisioned storage needed to hold the tags of compressed blocks.

We next describe the overall organization. Compressed blocks are stored in logs, which are allocated to distinct physical slots. The paper suggests that 512 byte logs be used. All blocks in a log are compressed together, and a block can only be inserted into a log by first decompressing the log, appending the block data, and then recompressing it. The data array is implemented as an indexable array of logs. Each log also has an associated tag array, which is stored in another physical bank. Tags in each individual tag slot are compressed to leverage the locality of tag addresses. Tags are stored in the same order as the log data. Depending on the compression algorithm, appending a tag entry to the end of the tag store may or may not require decompression (in the latter case, some extra state may need to be associated with the tag array). Status bits and coherence bits are presumably also compressed with the tags, although this is not mentioned by the paper. To locate the log object for a requested address, an extra Log Mapping Table (LMT) maps requested addresses to log objects. The LMT is implemented as a conventional direct-mapped tag array, except that the address tag is not stored with the LMT entry. Address tags are verified only after the compressed tag array of the log object is located and decompressed; the tags are then checked in parallel. Due to the absence of address tags in LMT entries, it is possible that an LMT entry points to a log object but the tag check fails, which we call an “aliased miss”. Two extra designs help optimize the lookup. First, to allow fast detection of non-aliased misses, in which case decompression of the tag array is not needed, each LMT entry also stores the status bits of the cache line it maps. The LMT lookup logic can then immediately determine that the requested cache line is not cached when the LMT entry is not in use by any address. Second, the LMT lookup is performed as in a column-associative cache with an effective associativity of two. Two different hash functions map an address to two possible locations, and the LMT lookup logic must check both for possible hits, which potentially doubles the cache lookup latency. Note that the actual “associativity” of MORC is significantly larger than two, since each LMT entry points to a log, and the cache block could be stored at essentially any offset within that log.
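
Below is a minimal sketch of how I picture the LMT and its column-associative lookup. The entry fields, table size, and hash functions are my own illustrative assumptions, not the paper's parameters:

```cpp
#include <cstdint>

// Field widths, hash functions, and sizes here are illustrative guesses, not the
// paper's parameters; the point is the structure: a direct-mapped table without
// address tags, probed at two hashed locations (column-associative).
struct LmtEntry {
    bool     valid     = false;  // entry currently maps some address to a log
    uint8_t  state     = 0;      // per-line status bits kept directly in the LMT
    uint32_t log_index = 0;      // which log object in the data array holds the line
};

constexpr uint32_t NUM_LMT_ENTRIES = 1u << 15;   // arbitrary size for the sketch
static LmtEntry lmt[NUM_LMT_ENTRIES];

// Two different hash functions give every address two candidate LMT slots.
static uint32_t hash1(uint64_t addr) { return (addr >> 6) % NUM_LMT_ENTRIES; }
static uint32_t hash2(uint64_t addr) { return ((addr >> 6) * 0x9E3779B1ull) % NUM_LMT_ENTRIES; }

// Probe the LMT. Because no address tag is stored here, a valid entry is only a
// *candidate*: the compressed tag array of the pointed-to log must still be
// decompressed and checked, so an "aliased miss" is possible.
static LmtEntry* lookup_lmt(uint64_t addr) {
    if (LmtEntry* e = &lmt[hash1(addr)]; e->valid) return e;
    if (LmtEntry* e = &lmt[hash2(addr)]; e->valid) return e;
    return nullptr;   // non-aliased miss: the line is definitely not cached
}
```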

We now discuss the operation of MORC. On a read access, the address is first hashed into the LMT using the column-associative scheme described above. The LMT status bits are checked first to ensure that the entry is valid and possibly maps the requested address. Then the tag array of the log object is decompressed to perform a tag check. If the tag check fails, a miss is signaled, and a refill request is sent to the DRAM below. Otherwise, a hit is signaled, the compressed data is decompressed as well, and the block at the same offset as the matching tag is returned. Note that decompression only needs to restore the block being accessed; the decompression circuit can stop as soon as that block has been fully decompressed.
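
A sketch of this read path follows; the helper functions are hypothetical placeholders (with trivial stub bodies) standing in for the LMT probe, tag decompression, and data decompression, and the 64-byte line size is an assumption:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct Block { uint8_t bytes[64]; };

// Placeholder helpers standing in for the structures sketched earlier; real
// implementations would probe the LMT, the compressed tag bank, and the log data.
struct LmtEntry { bool valid = false; uint32_t log_index = 0; };
static std::optional<LmtEntry> lookup_lmt(uint64_t) { return std::nullopt; }
static std::vector<uint64_t>   decompress_tags(uint32_t) { return {}; }
static Block                   decompress_block(uint32_t, int) { return {}; }
static void                    refill_from_dram(uint64_t) {}

// Read path, following the description above.
std::optional<Block> llc_read(uint64_t addr) {
    uint64_t tag = addr >> 6;                       // 64-byte line granularity assumed
    auto entry = lookup_lmt(addr);                  // 1. probe the two hashed LMT slots
    if (!entry || !entry->valid) {                  // 2. non-aliased miss: no decompression needed
        refill_from_dram(addr);
        return std::nullopt;
    }
    auto tags = decompress_tags(entry->log_index);  // 3. decompress the log's tag array
    for (size_t i = 0; i < tags.size(); i++) {
        if (tags[i] == tag)                         // 4. full tag check
            return decompress_block(entry->log_index, (int)i);  // 5. hit: decompress up to block i
    }
    refill_from_dram(addr);                         // aliased miss: valid LMT entry, failed tag check
    return std::nullopt;
}
```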

Writes are handled differently from reads. After the tag array is decompressed, if a hit is signaled, the tag is invalidated by setting its per-line status to invalid. Invalid tags are ignored in future tag comparisons. Misses are also possible, since the paper assumes a non-inclusive LLC, which is sometimes the case for large multicore systems. In both cases, a new log object is selected from a few currently active logs, and the new block is appended to that log by first decompressing the log, then appending the block, and finally recompressing it. The compression circuit also monitors the size of the compressed log. If it grows larger than the physical slot size, the compression process is immediately aborted, and a victim log is selected using FIFO and evicted back to DRAM. The new block is then compressed and written into the slot just released by the eviction. The status of the LMT entry is set to the M state. In addition, if the LMT entry does not map the current block, the existing mapping is also evicted by first invalidating the tag entry in the log, and then writing back the decompressed block at the corresponding location. (I don’t get how they can find the log block mapped by the LMT entry, since the LMT does not track address tags.) In all cases, the LMT entry is updated to point to the log object that has just been written. Line refill responses are handled similarly to writes, except that the status of the line is not set to M.
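
Below is a sketch of the append-on-write flow under my reading of the paper; all helpers (including the crude compressed-size estimate) are hypothetical stand-ins, and the handling of the overflow case reflects my interpretation rather than the paper's exact circuit:

```cpp
#include <cstdint>
#include <vector>

struct Block { uint8_t bytes[64]; };

constexpr size_t LOG_SLOT_BYTES = 512;  // physical slot size used in this sketch

// Placeholder helpers; a real design would drive the (de)compression pipeline
// and the LMT/tag updates described in the text.
static std::vector<Block> decompress_log(uint32_t) { return {}; }
static size_t             compressed_size(const std::vector<Block>& b) { return b.size() * 20; }  // crude stand-in
static void               write_log_slot(uint32_t, const std::vector<Block>&) {}
static uint32_t           pick_fifo_victim() { return 0; }
static void               evict_log_to_dram(uint32_t) {}

// Append a written-back (or refilled) block to the currently active log.
void append_block(uint32_t active_log, const Block& nb) {
    std::vector<Block> blocks = decompress_log(active_log);  // 1. decompress the active log
    blocks.push_back(nb);                                     // 2. append the new block
    if (compressed_size(blocks) <= LOG_SLOT_BYTES) {
        write_log_slot(active_log, blocks);                   // 3a. still fits: recompress in place
    } else {
        uint32_t victim = pick_fifo_victim();                 // 3b. overflow: abort, pick a FIFO
        evict_log_to_dram(victim);                            //     victim, evict it to DRAM, and
        write_log_slot(victim, {nb});                         //     write the block into the freed slot
    }
    // The LMT entry for nb's address is then updated to point at the log just
    // written, with its state set to M for writes (not shown here).
}
```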

One optimization to the log-structured append process is that, when a victim is to be selected for eviction, a log object whose entries are all invalid is selected with the highest priority, in which case no actual eviction traffic is needed. This is similar to garbage collection in a log-structured file system, where fully stale data blocks can be reused without any data movement.
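
A small sketch of this victim-selection policy (the bookkeeping fields and names are illustrative, and the list of candidate logs is assumed non-empty):

```cpp
#include <cstdint>
#include <vector>

// Per-log bookkeeping assumed for this sketch.
struct LogMeta {
    uint32_t index;           // which physical log slot this describes
    int      valid_entries;   // number of still-valid tags in this log
    uint64_t fill_time;       // insertion time used for FIFO ordering
};

// Prefer a log whose entries are all invalid: it can be reclaimed with no
// writeback, analogous to cleaning a fully-stale segment in a log-structured
// file system. Otherwise fall back to FIFO (oldest fill time).
uint32_t pick_victim(const std::vector<LogMeta>& logs) {
    const LogMeta* fifo_choice = nullptr;
    for (const LogMeta& m : logs) {
        if (m.valid_entries == 0) return m.index;   // free reclamation, highest priority
        if (!fifo_choice || m.fill_time < fifo_choice->fill_time)
            fifo_choice = &m;
    }
    return fifo_choice->index;
}
```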

The paper also suggests that multiple active logs be maintained for append, each responsible for one or a few data types. This optimization is based on the observation that data of the same type can be compressed better due to their encoding and/or dynamic value range. This, however, requires software hints or heuristics for determining the type of data to be written.

The proposed compression algorithm, called Large Block Encoding (LBE), is based on C-Pack, which is a hybrid of pattern-based encoding and dictionary-based encoding. The input data stream is divided into 32 bit words. For each word, the hardware checks whether it can be encoded with fewer bits using a known pattern, or whether it is the prefix of an entry in the 32 bit word dictionary. The output is encoded such that both the compression type and the data needed to recover the original value are included. The issue with C-Pack, however, is that blocks are encoded at a fixed 32 bit granularity, which causes space inefficiency, since each dictionary index takes 4 bits per 32 bits of data, a 12.5% overhead. LBE is specifically tuned for encoding large blocks by using multiple dictionaries. The hardware implementation still keeps a physical 32 bit dictionary, but three extra logical dictionaries for encoding 64, 128 and 256 bit blocks are added. These dictionaries index into the 32 bit dictionary to provide short encodings of combinations of aligned 32 bit words in the input stream. To adapt to this change, the LBE algorithm processes the input stream in 256 bit blocks. Each block is hashed into the four dictionaries, and the match with the largest size is selected for encoding. Unmatched blocks are inserted into the dictionaries of the corresponding sizes. The paper does not provide pseudocode or a detailed description of the LBE algorithm (and I am not an expert in compression algorithms), so further details cannot be recovered from the description.
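
Since the paper gives no pseudocode, the following is only my speculative reconstruction of the widest-match lookup across the four dictionaries; the containers, alignment rules, and insertion policy are guesses on my part:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Dictionaries for 32/64/128/256 bit entries, keyed by the aligned 32 bit words
// they cover. In hardware only the 32 bit dictionary is physical; the wider ones
// are logical and hold indices into it. Here all four are plain sets for clarity.
using Word = uint32_t;
static std::array<std::set<std::vector<Word>>, 4> dict;   // widths of 1, 2, 4, 8 words

// Return the widest dictionary match (in 32 bit words) starting at aligned
// position `pos` within an 8-word (256 bit) chunk, or 0 if nothing matches.
// Unmatched spans would be pattern-/literal-encoded and inserted into the
// dictionary of the corresponding width.
static int widest_match(const std::vector<Word>& chunk, size_t pos) {
    const int widths[] = {8, 4, 2, 1};            // try 256, 128, 64, then 32 bits
    const int levels[] = {3, 2, 1, 0};
    for (int i = 0; i < 4; i++) {
        const int w = widths[i];
        if (pos % w != 0 || pos + w > chunk.size()) continue;   // keep alignment
        std::vector<Word> cand(chunk.begin() + pos, chunk.begin() + pos + w);
        if (dict[levels[i]].count(cand)) return w;
    }
    return 0;
}
```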