Cocoa: Synergistic Cache Compression and Error Correction in Capacity Sensitive Last Level Caches



- Hide Paper Summary
Paper Title: Cocoa: Synergistic Cache Compression and Error Correction in Capacity Sensitive Last Level Caches
Link: https://dl.acm.org/doi/10.1145/3240302.3240304
Year: MEMSYS 2018
Keyword: Cache Compression; BDI; ECC



Back

Highlights:

  1. Cache compression and ECC code can be combined to support error correction and detection without sacrificing effective cache capacity. Broadly speaking, compression can be applied to many scenarios where extra data or metadata needs to be stored for the cache.

  2. Different ECC codes have different trade-offs. We can apply different code in a fine-grained, per-segment manner based on segment property.

Comments:

  1. The paper claims that fine-grained remapping/disabling approaches do not work well under high defect rates. Then later on in the paper, it is shown that actually the defect rate is not that high, with most words containing less than 3 errors. In the case where the defect rate is really high, Cocoa would not work well as well, because Cocoa falls back to the fine-grained disabling approach if a single word in a compressed block has more than two faulty bits.

  2. Actually, Cocoa does not need to store the begin segment of each compressed block in the tag entry, because (1) BDI metadata already implicitly encodes compressed size, and (2) the second block can just be stored backwards from the last bit.

  3. If a word has more than 2 defective bits, the data slot can still be partially used, if the defective bit only occurs at one half of the slot. This should be a likely case, since 2 bit errors per 32-bit are very rare. On a second thought, This adds extra bit per tag entry, but it only provides very limited benefit since the case is so rare that the capacity loss should not be noticeable.

This paper proposes Synergistic Cache Compression and Error Correction (Cocoa), a technique that enables the LLC to operate under low voltage while maintaining low error rate. Cocoa is motivated by the power benefit of operating caches on low voltages, at the cost of increased error rate. Cocoa addresses the issue with extra error correction and detection code stored in a few dedicated way of the data array. To counter the performance degradation caused by a smaller data array for strong data, cache compression is applied to the rest of the ways such that more logic blocks can be stored in compressed form. The resulting design enables the LLC to operate at a much lower voltage with low error rate, hence harvesting the power benefit, without hurting performance.

The main challenge of low-voltage cache design is the increased error rate, which is a natural result of randomness from the manufacturing process. While operating nicely on the normal voltage, small manufacturing variations on SRAM cells will become a bigger concern when the cache operates under a lower voltage, where some cells are more prone to suffer bit errors than other cells. To address this problem, several attempts have been made by prior works, which we describe as follows. First, prior works have proposed bigger and more complicated SRAM cell structure, which is more tolerant to random variations and hence perform better in terms of error rate under low voltage. These cells, however, generally require much more transistors per cell, which increases the area and power overhead of the SRAM cache, partially offsetting the purpose of low voltage operation.

Second, prior works have also attempted to dynamically disable or remap faulty SRAM bits that are detected beforehand. This task can be carried out in either coarse or fine granularity. In the former case, an entire line, set or way can be disabled if they are known to be error-prone. This often results in severe and unnecessary loss of capacity, since one single bit defect can cause a large chunk of non-faulty bit to be disabled as well. In the latter case, complicated hardware is required to remap SRAM cells at fine granularity. Besides, the paper claims that these techniques are not really effective with high defect density, as still suffer great capacity loss in these situations.

Previous works also propose adding extra redundancy to the data array, such that defective blocks or sets can be remapped to the redundant SRAM storage. While preserving the logical cache capacity, this approach is upper bounded the number of defects it can fix. When the number of defective bits exceed the maximum amount of redundancy, this approach would still fall back to the previous one, and suffer cache capacity loss.

Lastly, error detection and correction code has been applied to provide extra redundancy. The paper points out that a single type of error detection and correction code is often not sufficient, because of the rigid trade-offs between capability, hardware complexity, and storage overhead. For example, BCH code is optimized for storage overhead, but it is only feasible for correcting a small number of bits (1 or 2 bits mostly), due to the hardware complexity for encoding and decoding. On the other hand, OLSC code enables fast encoding and decoding, but it incurs larger capacity overhead, which can be as big as 50% of total cache capacity in some proposals.

Cocoa addresses the challenge by combing two different error correction and detection code with cache compression. Cache blocks are protected with a combination of weak SECDED and stronger OLSC in the unit of 32-bit words. One out of four ways in the LLC is dedicated to store ECC codes for the remaining three ways of data, resulting in a 25% capacity loss without compression. With compression, each data slot in the remaining three ways can store two compressed logical blocks, which increases the effective capacity of the remaining three ways by at most 2x. Cocoa can accomplish low error rate without paying any significant storage penalty with a moderate compression ratio. In certain cases when the compression ratio is high, Cocoa may even gain a performance advantage due to an even larger logical LLC capacity than a regular cache.

We first describe the compressed cache design. Cocoa adopts BDI as the compression algorithm, due to its simplicity and moderate compression ratio. Tag array entries are still statically mapped to data array slots, and each tag array entry contains two copies of address tags as well as state bits (including BDI metadata). The tag array always operates under normal voltage, to avoid tag entry corruption. Two compressed blocks can co-reside in a data slot if their compressed sizes do not exceed slot capacity which is 64 bytes. In this case, the first compressed block is stored at offset zero, while the second compressed block is stored backwards from the last bit of the slot. When a new block is to be inserted, either the cache controller finds an existing slot that has enough storage for it, in which case no eviction happens, or one or two blocks are evicted from a data slot (depending on the size of the newly inserted block), and the new block is inserted into the slot. Note that compression related metadata only needs to be added to every three out of four ways of the LLC.

We next describe the ECC design. Cocoa reserves one out of four ways to store the ECC code for the remaining three ways. ECC is maintained for compressed blocks at 32-bit word granularity. For each 32-bit word in the data slot, if the word is prone to 1-bit error, then the regular SECDED code is used to correct ont bit error and detect two bits errors for the word. Otherwise, if the word is prone two 2-bit error, then the more storage hungry OLSC code is used, which is capable of correcting two bit errors. The paper indicates that more than 2-bit error per 32-bit word is very rare, and can hence be handled by simply disabling the entire data slot. The tag entry has a “disabled” bit, which, if set, excludes the tag entry from associative lookups. To indicate which word uses which type of ECC code, every tag entry also stores a “side info” field, which is a bitmask whose size equals the number of 32-bit words in a slot (16, given 64 byte slots). Each bit in the “side info” mask indicates whether the corresponding word uses SECDED or OLSC encoding. In addition, the offset of the ECC bits of a given word can also be inferred from the “side info” mask.

To determine the number of error-prone bits per block, the LLC is tested during system startup using the Built-In Self Test (BIST) module. The module repeatedly writes a test pattern into the LLC, and reads it back to evaluate the number of defective bits in each words. The result of the evaluation is then written into a special memory region, which is later read by the LLC controller when the cache switches to low power mode. The cache controller also computes the “side info” and “disabled” field for every tag entry using the information.