BCD Deduplication: Effective Memory Compression Using Partial Cache-Line Deduplication



Paper Title: BCD Deduplication: Effective Memory Compression Using Partial Cache-Line Deduplication
Link: https://dl.acm.org/doi/10.1145/3445814.3446722
Year: ASPLOS 2021
Keyword: Compression; Memory Compression; Deduplication; Inter-Block Compression




Highlight:

  1. Use two-stage data compression. The first stage is simple data clustering: hash the higher bits of a block and take the diff against a matching base. The second stage is ordinary deduplication of the diff.

  2. Using the higher bits plus a regular hash as the data's signature. This is a simple “content-sensitive hashing” that effectively and efficiently clusters blocks sharing common higher bits.

  3. Using a single-level global translation table to perform address mapping for reads. This reduces critical path length to at most three accesses.

  4. Putting the engine at the LLC, such that the LLC is also part of the compressed address space and can serve as a unified cache for metadata and compressed data. This also helps increase the effective LLC capacity.

Questions:

  1. How to guarantee that all these complicated DRAM operations during a write are performed atomically? Using hardware 2PL is an option for correctness, but that would: (1) lock an entire bucket (which is mostly fine, because lines are mapped to the same bucket by their contents, not their addresses, so locality is not an issue); (2) incur more synchronization traffic, especially if there are multiple controllers. Note that this problem cannot be solved by address partitioning, since any two or more blocks can share the same base and diff. One option is a fast buffer tracking all buckets currently being operated on; controllers communicate via this shared “lock buffer” to avoid seeing an unfinished operation. Or you just serialize all operations, which I suppose would be the actual solution.

  2. How do you perform a reverse address lookup during diff hash table compaction (from a diff back to the translation table entries, in order to rewrite their pointers)? As far as I understand, the diff table does not store the original address.

  3. The hardware heap maintained in the overflow region is not as simple as it seems to be. The heap allocator should support fine-grained allocation and free, not only page-sized allocation. MXT can do that because it compresses pages in really large units, such that both alloc and free are also potentially large. This paper’s proposal requires 16-byte level alloc and free, which is significantly more difficult (e.g., requires more metadata).

  4. The OS needs to determine the compressed memory size at startup time. Although many other designs share this problem, BCD deduplication makes it worse, since there are two more hash tables whose parameters must also be determined. Are they determined statically? By the BIOS according to memory size? By OS configuration?

  5. Dynamically disabling BCD deduplication is not always possible, since the home physical address may still be in use by some other data or metadata, requiring very complicated reverse lookups and recursive decompression. It is not a big problem, though, because you can just restart the machine.

This paper presents BCD deduplication, a memory compression technique that combines block clustering with inter-block compression. The paper points out the limitations of previous deduplication and compression proposals as follows. First, many previous schemes rely on in-memory mapping tables for address remapping, because blocks become variable-sized after compression. These mapping tables inevitably incur extra memory accesses during both reads and writes, hurting performance. Second, previous deduplication schemes, despite being able to catch a wider range of duplicate blocks by operating on the entire address space rather than a cached subset, still underperform state-of-the-art compression because the deduplication granularity is too coarse: a whole cache block. Lastly, most existing compression algorithms either look for special word patterns within a block, such as BDI, FPC, and BPC, or rely on a dictionary to encode frequent words with fewer bits. In the former case, the compression ratio is sub-optimal, despite low compression and decompression latency, since the algorithm only exploits redundancy within a cache block without considering inter-block redundancy. In the latter case, statistics must be generated in advance and changed infrequently thereafter, so that correctness is guaranteed.

BCD deduplication, on the other hand, combines inter-block diff compression with deduplication. Its compression algorithm consists of two steps. In the first step, cache blocks that are likely to contain similar contents are clustered into the same hash table bucket and compared with each other. The bit-level diff is taken in a specific form such that the diff is only three-fourths the size of an uncompressed block.
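To make the three-fourths figure concrete, using the word layout described later in this summary (the upper 16 bits of every 64-bit word are shared with the base):

```latex
\underbrace{8 \times 2\,\mathrm{B}}_{\text{higher bits, shared with the base}}
+ \underbrace{8 \times 6\,\mathrm{B}}_{\text{lower-bit diff}}
= 64\,\mathrm{B},
\qquad
\frac{48\,\mathrm{B}}{64\,\mathrm{B}} = \frac{3}{4}.
```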

In the second step, the bit-level diff is compressed and hashed into another hash table to perform deduplication. This two-level scheme exploits both inter-line redundancy, by storing only line diffs, and intra-line redundancy, by compressing the diff bits and deduplicating the diffs. It is therefore expected to yield a higher compression ratio than a single-level scheme that only takes advantage of one of the two types of redundancy.

We next describe the operations of BCD deduplication. Blocks in BCD deduplication are no longer stored at their physical addresses. Instead, a block is stored as two parts: a base block, which can serve as the common base for many similar blocks at different addresses, and a diff block, which can also be shared among different blocks. Address translation is performed on each access by indexing a direct-mapped translation table with the physical address used by the cache hierarchy. Each translation table entry consists of two address pointers and two status bits. The pointers are the base pointer and the diff pointer, which point to the base data and the compressed diff, if one exists. The two status bits indicate whether a diff exists and whether the diff is compressed. Reads do not access any in-memory metadata other than the translation table. On a read operation, both pointers are used to fetch the base data and the compressed diff, if any; the diff is then decompressed and combined with the base to reconstruct the original block content. Read operations therefore require at most three DRAM accesses, which bounds the worst case.
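A minimal sketch of a translation table entry and the read path, assuming little-endian words; the field names, the flat DRAM model, and the `lzc_decompress` helper are my own illustrations, not the paper's definitions.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64    /* uncompressed cache block              */
#define DIFF_MAX   48    /* lower-bit diff before LZC compression */

/* One 64-bit translation table entry: two 31-bit pointers plus two status bits. */
typedef struct {
    uint64_t base_ptr        : 31;  /* 64 B-aligned base block address   */
    uint64_t diff_ptr        : 31;  /* 16 B-aligned diff address, if any */
    uint64_t has_diff        : 1;   /* 0 = exact duplicate of the base   */
    uint64_t diff_compressed : 1;   /* diff is stored in LZC form        */
} tt_entry_t;

extern uint8_t dram[];              /* DRAM modeled as a flat byte array */

/* Hypothetical helper: expand an LZC-compressed diff back to 48 bytes in place. */
void lzc_decompress(uint8_t diff[DIFF_MAX]);

/* Read path: the entry lookup is DRAM access #1; at most two more follow. */
void bcd_read(const tt_entry_t *e, uint8_t out[BLOCK_SIZE])
{
    /* Access #2: fetch the base block. */
    memcpy(out, &dram[(uint64_t)e->base_ptr * BLOCK_SIZE], BLOCK_SIZE);

    if (e->has_diff) {
        uint8_t diff[DIFF_MAX];
        /* Access #3: fetch the (possibly compressed) diff. */
        memcpy(diff, &dram[(uint64_t)e->diff_ptr * 16], DIFF_MAX);
        if (e->diff_compressed)
            lzc_decompress(diff);
        /* Overwrite the low 6 bytes of each 8-byte word with the diff. */
        for (int w = 0; w < 8; w++)
            memcpy(out + w * 8, diff + w * 6, 6);
    }
}
```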

The size of a translation table entry is 64 bits, of which each pointer uses 31 bits. Since base blocks are always stored at block-aligned addresses, a 31-bit base pointer can address as much as 128GB of memory. Diff blocks are stored on 16-byte boundaries (for reasons discussed below), so a 31-bit diff pointer can address at most 32GB of diff storage. Overall, the translation table occupies a constant 1/8 of total storage, since every memory block address has a direct-mapped entry in the table.
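The address-reach and overhead arithmetic behind those numbers, for reference:

```latex
2^{31} \times 64\,\mathrm{B} = 2^{37}\,\mathrm{B} = 128\,\mathrm{GB},
\qquad
2^{31} \times 16\,\mathrm{B} = 2^{35}\,\mathrm{B} = 32\,\mathrm{GB},
\qquad
\frac{8\,\mathrm{B\ (entry)}}{64\,\mathrm{B\ (block)}} = \frac{1}{8}.
```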

Write operations require detecting similar blocks and performing full or partial deduplication on them. BCD deduplication uses a hash table to cluster cache blocks with similar contents. The clustering algorithm takes the upper 16 bits of each 64-bit word, combines them into a 16-byte segment called the “higher bits”, and hashes it into a one-byte partial signature. In the meantime, the full cache block is also hashed into a full signature. Both values are used during the write operation.
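A sketch of this clustering step, assuming little-endian 64-bit words; the one-byte hash below is a placeholder, since the actual signature functions are not disclosed in this summary.

```c
#include <stdint.h>
#include <stddef.h>

#define WORDS_PER_BLOCK 8

/* Gather the upper 16 bits of each 64-bit word into the 16-byte "higher bits". */
void extract_higher_bits(const uint64_t block[WORDS_PER_BLOCK],
                         uint16_t higher[WORDS_PER_BLOCK])
{
    for (int w = 0; w < WORDS_PER_BLOCK; w++)
        higher[w] = (uint16_t)(block[w] >> 48);
}

/* Placeholder 1-byte hash (rotate-and-XOR fold); not the paper's function. */
static uint8_t hash8(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint8_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = (uint8_t)((h << 1) | (h >> 7)) ^ p[i];
    return h;
}

/* Partial signature: hash of the 16-byte higher bits.
 * Full signature:    hash of the entire 64-byte block. */
void make_signatures(const uint64_t block[WORDS_PER_BLOCK],
                     uint8_t *partial_sig, uint8_t *full_sig)
{
    uint16_t higher[WORDS_PER_BLOCK];
    extract_higher_bits(block, higher);
    *partial_sig = hash8(higher, sizeof(higher));
    *full_sig    = hash8(block, WORDS_PER_BLOCK * sizeof(uint64_t));
}
```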

The 16-byte higher bits are then used to generate the hash index into the base hash table. The paper does not disclose how the hash index is generated, but the effect of hashing the higher bits is that blocks with the same higher bits are clustered into the same bucket. Each bucket consists of N ways, supporting at most N different base blocks hashed to the same index. Each way consists of a one-byte partial signature, a one-byte full signature, a two-byte reference count, and a 64-byte base block. Physically, these signatures and base blocks are stored consecutively, such that the signatures can be accessed with one or more DRAM bursts. In fact, the paper suggests making the hash table 32-way, so that one 64-byte burst is sufficient to access all signatures and another burst can access all reference counts.
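A possible in-memory layout for one 32-way bucket; grouping the metadata into consecutive arrays follows the burst-friendly description above, but the exact ordering is my assumption.

```c
#include <stdint.h>

#define WAYS 32   /* 32 ways: 32 B + 32 B of signatures fill exactly one 64 B burst */

/* One bucket of the base hash table, laid out so that all signatures can be
 * read in a single DRAM burst and all reference counts in another. */
typedef struct {
    uint8_t  partial_sig[WAYS];   /* 32 B: hashes of the higher bits        */
    uint8_t  full_sig[WAYS];      /* 32 B: hashes of the whole blocks       */
    uint16_t refcount[WAYS];      /* 64 B: number of sharers per base block */
    uint8_t  base[WAYS][64];      /* 2 KB: the base blocks themselves       */
} base_bucket_t;
```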

After generating the hash values, the deduplication logic compares the incoming block’s partial and full signatures with those stored in the bucket. On a full signature match, a possible duplicate has been found, so the base block is read and compared with the incoming block. If it is indeed a full match, deduplication is performed by not storing the incoming line anywhere (an identical line already exists in the bucket); instead, the base block’s address is written into the translation table entry and both status bits are cleared. The reference counter of the base block is also incremented by one. If the full signatures match but the lines do not, the partial signature is compared instead, as described below.
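A sketch of the full-deduplication attempt on a write, reusing the bucket layout above (repeated here so the snippet is self-contained); the control flow is my reading of the text, and the caller is assumed to handle the translation table update.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define WAYS 32

typedef struct {
    uint8_t  partial_sig[WAYS];
    uint8_t  full_sig[WAYS];
    uint16_t refcount[WAYS];
    uint8_t  base[WAYS][64];
} base_bucket_t;

/* Full-deduplication attempt.  On success the incoming block is not stored
 * at all: the caller points the translation table entry at base[*way_out]
 * and clears both status bits (no diff, nothing compressed). */
bool try_full_dedup(base_bucket_t *bkt, const uint8_t block[64],
                    uint8_t full_sig, int *way_out)
{
    for (int w = 0; w < WAYS; w++) {
        if (bkt->refcount[w] == 0)          /* empty way, nothing stored here   */
            continue;
        if (bkt->full_sig[w] != full_sig)   /* cheap filter before reading data */
            continue;
        if (memcmp(bkt->base[w], block, 64) == 0) {
            bkt->refcount[w]++;             /* one more sharer of this base     */
            *way_out = w;
            return true;
        }
    }
    return false;   /* fall back to the partial-signature (partial dedup) path */
}
```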

If the full signature does not match but the partial signature does, BCD deduplication attempts a partial deduplication by comparing the higher bits of the incoming block and the base block. If they match, the lower bits are gathered as the diff, forming a 48-byte segment. The lower bits are then compressed with Leading Zero Compression (LZC), which removes additional redundant bits by stripping leading zeros from the lower bits. The resulting diff is then deduplicated again using a standard deduplication hash table. This secondary hash table is indexed with the hash of the LZC-compressed lower bits and operates by detecting exact duplicates.
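A sketch of gathering the lower bits and one plausible LZC encoding that yields the 48/32/16-byte sizes mentioned below; the exact LZC format is not spelled out in this summary, so the per-word truncation rule here is an assumption. Little-endian layout is assumed.

```c
#include <stdint.h>
#include <string.h>

#define WORDS 8

/* Gather the lower 48 bits of each word into a 48-byte diff. */
void gather_lower_bits(const uint64_t block[WORDS], uint8_t diff[48])
{
    for (int w = 0; w < WORDS; w++)
        memcpy(diff + w * 6, &block[w], 6);   /* low 6 bytes of each word */
}

/* Assumed Leading Zero Compression: keep only the low 2 or 4 bytes of every
 * word's 48-bit lower part when the dropped bytes are all zero.
 * Returns the compressed size: 16, 32, or 48 bytes. */
int lzc_compress(const uint64_t block[WORDS], uint8_t out[48])
{
    int keep = 2;                             /* try the tightest encoding first */
    for (int w = 0; w < WORDS; w++) {
        uint64_t lower = block[w] & 0xFFFFFFFFFFFFULL;   /* 48-bit lower part */
        if (lower >> 16) keep = keep < 4 ? 4 : keep;     /* needs >2 bytes    */
        if (lower >> 32) keep = 6;                       /* needs all 6 bytes */
    }
    for (int w = 0; w < WORDS; w++)
        memcpy(out + w * keep, &block[w], keep);         /* low `keep` bytes  */
    return keep * WORDS;                                 /* 16, 32, or 48     */
}
```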

The secondary diff hash table is organized similarly. Each entry also consists of a hash value field and a diff value field. The difference is that, since LZC compresses the diff into one of three possible sizes, 48 bytes (uncompressible), 32 bytes, or 16 bytes, the diff hash table must be able to store all three kinds in the same data store. As a result, the diff hash table is segmented, and compressed diffs are always stored on 16-byte boundaries. A hash table bucket may require compaction if the remaining storage in a way is sufficient for a new diff but contiguous segments cannot be allocated due to fragmentation. In this case, all segments are read out and written back contiguously, and the affected translation table entries are updated to reflect the change.
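A rough sketch of the compaction step over a 16-byte-segmented diff data store; the directory of live diffs and the store size are invented for illustration, and the caller is assumed to rewrite the diff pointers of any translation table entries that reference moved diffs.

```c
#include <stdint.h>
#include <string.h>

#define SEG_SIZE   16
#define STORE_SEGS 32              /* illustrative data-store size: 512 B       */

typedef struct {
    uint16_t seg_offset;           /* first segment of this live diff           */
    uint16_t seg_count;            /* 1, 2, or 3 segments (16 / 32 / 48 bytes)  */
} diff_slot_t;

/* Slide all live diffs to the front of the data store so that a new diff can
 * be allocated contiguously; records the new offsets back into the slots. */
void compact_diff_store(uint8_t store[STORE_SEGS * SEG_SIZE],
                        diff_slot_t slots[], int nslots)
{
    uint8_t  tmp[STORE_SEGS * SEG_SIZE];
    uint16_t next = 0;             /* next free segment index after compaction */

    for (int i = 0; i < nslots; i++) {
        memcpy(tmp + next * SEG_SIZE,
               store + slots[i].seg_offset * SEG_SIZE,
               slots[i].seg_count * SEG_SIZE);
        slots[i].seg_offset = next;
        next += slots[i].seg_count;
    }
    memcpy(store, tmp, (size_t)next * SEG_SIZE);
}
```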

In all cases, if the address previously referred to a valid block or diff, and that block or diff is shared, the reference counts of the old base and diff are decremented. If a reference count drops to zero, the corresponding block or diff is freed.
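A minimal sketch of that bookkeeping, assuming reference counts are kept per way as in the bucket layout above:

```c
#include <stdint.h>
#include <stdbool.h>

/* Drop one reference from the old base (or diff) that a write is replacing.
 * Returns true when the way becomes unreferenced and its storage can be
 * reused by later inserts (or returned to the overflow heap). */
bool release_way(uint16_t refcount[], int way)
{
    if (refcount[way] > 0)
        refcount[way]--;
    return refcount[way] == 0;
}
```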

If new entries cannot be allocated from either the base block table or the diff table (note that the hash tables never evict valid entries; the only way an entry becomes invalid is its reference count dropping to zero), the data is stored in an overflow region. The overflow region is maintained as a hardware heap with several allocation sizes: 64 bytes, 48 bytes, 32 bytes, and 16 bytes, of which the last three are used for storing compressed diffs. Blocks and diffs in the overflow region cannot serve as deduplication bases, since they are not searched during write operations. They are freed when the block is written again and the new block or diff can either be deduplicated or be allocated an entry in the hash tables.
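A sketch of how such a heap could be managed with per-size-class free lists threaded through the overflow region itself; this encoding is an assumption for illustration, not the paper's allocator.

```c
#include <stdint.h>
#include <string.h>

/* Size classes: full blocks plus the three possible compressed-diff sizes. */
enum { CLASS_16 = 0, CLASS_32, CLASS_48, CLASS_64, NUM_CLASSES };
const uint32_t class_bytes[NUM_CLASSES] = { 16, 32, 48, 64 };

#define NONE (~(uint64_t)0)

/* Free chunks are threaded through the region itself: the first 8 bytes of a
 * free chunk hold the offset of the next free chunk in the same class. */
typedef struct {
    uint8_t  *region;                    /* base of the overflow region         */
    uint64_t  free_head[NUM_CLASSES];    /* head offset per class, NONE = empty */
} overflow_heap_t;

/* Pop a chunk of the requested class; returns its offset, or NONE if empty. */
uint64_t overflow_alloc(overflow_heap_t *h, int cls)
{
    uint64_t off = h->free_head[cls];
    if (off == NONE)
        return NONE;                     /* class exhausted; caller must handle */
    memcpy(&h->free_head[cls], h->region + off, sizeof(uint64_t));
    return off;
}

/* Push a chunk back onto its class's free list. */
void overflow_free(overflow_heap_t *h, int cls, uint64_t off)
{
    memcpy(h->region + off, &h->free_head[cls], sizeof(uint64_t));
    h->free_head[cls] = off;
}
```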

The paper proposes a few optimizations. The first is that both the translation table and the reference counts can be cached by a hardware structure. These caches do not need to be standalone entities: the paper proposes placing the BCD deduplication engine between the L2 and the LLC, such that the engine uses the LLC as a cache for metadata and base/diff data, and the LLC actually “sees” translated memory addresses rather than physical addresses. The paper claims that this also benefits the LLC, since it can potentially hold more data in compressed form without changing its data layout.

The second optimization is a “lite” mode in which diff deduplication is disabled but partial deduplication against bases is still performed. This reduces the extra overhead of deduplicating diffs when the diffs are not actually compressible by deduplication. Under the lite mode, diffs are not stored in the diff hash table; instead, they are written directly into the overflow region, as if the hash table were always full. Neither the read nor the write logic changes in this mode, which also simplifies the hardware design.

Alternatively, BCD deduplication can be turned off completely during normal operation. This requires more intensive collaboration from the OS, since turning off BCD deduplication disables the address translation function, meaning that every block currently active in the translation table must be decompressed and stored back at its physical address.