Reducing DRAM Footprint with NVM in Facebook



Paper Title: Reducing DRAM Footprint with NVM in Facebook
Link: https://dl.acm.org/doi/10.1145/3190508.3190524
Year: EuroSys 2018
Keyword: NVM; RocksDB; NVM Cache




Highlights:

  1. RocksDB is based on log-structured merge trees and can benefit from a DRAM-based block cache that stores data blocks from the persistent files. However, if we replace this caching layer with block-interface NVM, performance degrades due to (1) large block sizes causing excessive bandwidth consumption; (2) repeated evictions shortening the NVM lifetime; and (3) the software path of interrupt-based I/O becoming the bottleneck.

  2. We can address these issues by (1) using smaller block sizes together with two-level per-file indexing, a shared-dictionary compression algorithm, and block placement optimized for alignment; (2) an admission control algorithm leveraging a simulated LRU list; and (3) polling-based I/O instead of interrupts.

This paper presents MyNVM, a caching layer design for RocksDB that leverages NVM as a fast alternative to DRAM. The goal of MyNVM is to maintain acceptable latency for the RocksDB key-value store while replacing the DRAM caching layer with cheaper but slower block-interface NVM storage. MyNVM achieves this by carefully tuning the engineering aspects of the RocksDB storage layer to match the performance characteristics of NVM devices. Consequently, MyNVM suffers only marginal slowdowns compared with systems using only DRAM as the caching layer, while offering a significant reduction in storage device expenses. In addition, compared with unmodified RocksDB using NVM as the caching layer, MyNVM demonstrates a clear performance advantage in both average and P99 latency.

MyNVM is built upon RocksDB, a key-value store engine that achieves high-performance writes by using log-structured merge trees. RocksDB consists of two levels of storage. The first level is an in-memory index mapping keys to values, implemented as a high-performance skip list. Most requests are satisfied by the in-memory index and hence never query the next level. The second storage level consists of multiple layers of sorted key-value pairs stored on persistent storage (e.g., flash drives). Each layer consists of multiple files, and each file contains a subset of the sorted keys residing in that layer. A file consists of a metadata header, an array of data blocks, a bloom filter encoding the keys in the file, and finally an index. Data blocks within a file are 16KB in size and contain key-value pairs in sorted order. The bloom filter enables quick rejection of keys absent from the file. The index is constructed from the first key of each block and is the entry point for key lookups within the file. A sketch of this per-file layout follows.
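Here is a minimal sketch of the per-file layout just described, assuming simplified in-memory representations (the real format is an on-disk byte layout; the class names, the toy bloom filter, and its parameters below are all illustrative):

```python
import hashlib

BLOCK_SIZE = 16 * 1024  # 16KB data blocks, as in stock RocksDB

class BloomFilter:
    """A toy bloom filter; real implementations use cheaper hash functions."""
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

class SortedFile:
    """One second-level file: data blocks, a bloom filter, and an index."""
    def __init__(self, sorted_pairs):          # pairs must arrive sorted by key
        self.blocks = []                       # each block: sorted (key, value) list
        self.index = []                        # first key of each block
        self.bloom = BloomFilter()
        block, size = [], 0
        for key, value in sorted_pairs:
            self.bloom.add(key)
            entry_size = len(key) + len(value)
            if block and size + entry_size > BLOCK_SIZE:
                self.blocks.append(block)      # block is full; start a new one
                block, size = [], 0
            if not block:
                self.index.append(key)         # record the block's first key
            block.append((key, value))
            size += entry_size
        if block:
            self.blocks.append(block)
```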

Key insertions in RocksDB are performed directly at the main memory level and are hence very fast. Key lookups first query the in-memory index structure. If the query misses, the lookup turns to the second level. For each layer in that level, the procedure first performs a binary search across the files in the layer to locate the file whose key range covers the key. After checking the file's bloom filter, the per-file index is queried to locate the candidate block, and the procedure finally performs another binary search within the block. This process can repeat multiple times as the lookup misses in one layer and proceeds to the next; the lookup fails if the key is not found in any layer.
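Continuing the sketch above, the lookup path might look as follows; the linear scan over files stands in for the binary search across key ranges, and `memtable` is modeled as a plain dict:

```python
import bisect

def lookup_in_file(f, key):
    """Look up key in a SortedFile f (from the sketch above); None on a miss."""
    if not f.bloom.may_contain(key):
        return None                            # bloom filter: definitely absent
    i = bisect.bisect_right(f.index, key) - 1  # last block whose first key <= key
    if i < 0:
        return None                            # key sorts before the whole file
    block = f.blocks[i]
    j = bisect.bisect_left(block, (key,))      # binary search inside the block
    if j < len(block) and block[j][0] == key:
        return block[j][1]
    return None

def lookup(memtable, layers, key):
    """Full read path: in-memory index first, then layer by layer."""
    if key in memtable:                        # first level: skip-list index
        return memtable[key]
    for layer in layers:                       # each layer: list of SortedFiles
        for f in layer:                        # stand-in for the binary search
            value = lookup_in_file(f, key)     # across files by key range
            if value is not None:
                return value
    return None                                # key absent from every layer
```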

As the in-memory index structure grows, key-value entries are gradually migrated to the second level by generating a new file containing the entries from the memory index. Similarly, when a layer becomes overly large, at least one file is picked from that layer and merged into the next, just as LSM-tree levels are merged. RocksDB strives to keep the size of each layer within a certain threshold, and layers closer to main memory are kept smaller than deeper layers to facilitate fast searches. Once key-value entries migrate to persistent storage, they become immutable; future updates to these entries are committed in a log-structured manner, and stale entries are naturally identified and removed during the merge process, as the sketch below illustrates.
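Below, `newer` is the sorted run coming from the upper (more recent) layer and `older` the run from the layer beneath it; the names and the list-based representation are illustrative:

```python
def merge_runs(newer, older):
    """Merge two sorted (key, value) runs; on duplicates the newer entry wins."""
    out, i, j = [], 0, 0
    while i < len(newer) and j < len(older):
        if newer[i][0] < older[j][0]:
            out.append(newer[i]); i += 1
        elif newer[i][0] > older[j][0]:
            out.append(older[j]); j += 1
        else:                                  # same key: drop the stale entry
            out.append(newer[i]); i += 1; j += 1
    out.extend(newer[i:]); out.extend(older[j:])
    return out

# e.g. merge_runs([("a", "2"), ("c", "9")], [("a", "1"), ("b", "5")])
# returns [("a", "2"), ("b", "5"), ("c", "9")]; the stale ("a", "1") is gone
```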

When RocksDB is deployed in production, DRAM is also used as a write-through block cache holding recently used data from the files at block granularity. However, as NVM devices emerge as a cheaper (in price per GB) but slower alternative to DRAM, it seems feasible to replace at least part of the DRAM cache with block-interface NVM devices serving as a second caching layer below DRAM. To study the feasibility of using NVM to build such a level-2 caching layer, the authors conducted a few experiments and identified several challenges. First, NVM devices, despite being much faster than SSDs, are still bandwidth-limited, offering around 2GB/s for reads, and the bandwidth drops further when reads and writes interleave, which aggravates the problem. This characteristic is detrimental to caching because RocksDB uses a relatively large block size (16KB): every hit in the NVM layer fetches an entire 16KB block, consuming excessive bandwidth and lowering performance. Second, NVM devices can only endure a limited number of writes before they become unreliable, yet a caching layer receives a high volume of eviction traffic from the DRAM cache, with many of the evicted blocks being cold. Storing cold blocks is unnecessary and shortens the lifetime of the device, so it should be avoided. Lastly, because NVM devices are much faster than disks and SSDs, the conventional interrupt-based OS interface for accessing them becomes the performance bottleneck due to the relatively high overhead of serving each interrupt.

To address these challenges, the paper proposes a few techniques, described below. First, to alleviate the bandwidth problem caused by large blocks, the paper proposes using smaller blocks for storing the key-value pairs. Reducing the block size, however, affects RocksDB's performance in several ways.

The first issue is index size. The per-file index stores the first key of each block in the file, so shrinking blocks from 16KB to 4KB quadruples the number of keys in the index, which increases both the storage cost and the runtime lookup overhead. To deal with this, the paper divides the index into multiple blocks that can be fetched into the caching layer separately and adds an extra level of indexing over them. This two-level indexing scheme allows the index to be partially cached and accelerates lookup operations.

The second issue is compression. RocksDB compresses blocks when they are stored in the second level, and shrinking the block size lowers the compression ratio because the compression algorithm must generate a separate dictionary for every block. Instead of a per-block dictionary, the paper generates a shared dictionary by sampling the contents of all blocks in a file when they are written back, after which each block is compressed using the shared dictionary.

The third issue is block alignment. Since blocks are stored in compressed form and their compressed sizes vary significantly, blocks need to be aligned to the physical page boundary of the NVM device to minimize read amplification, i.e., the phenomenon that multiple physical pages are accessed in order to serve a single misaligned block. The compressed size of a 4KB block, however, is usually well below 4KB, so storing blocks compactly would misalign many of them. The paper therefore uses 6KB blocks instead of the more natural 4KB: empirical results show that most 6KB blocks compress to around 4KB, so each compressed block can occupy its own physical page while the internal fragmentation caused by zero padding stays small. The two sketches below illustrate the two-level index and the shared-dictionary compression with aligned placement.
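The two-level index can be pictured with a small sketch. The partitioning scheme below is illustrative (the paper does not prescribe a partition size); each partition plays the role of an index block that can be cached on its own:

```python
import bisect

class TwoLevelIndex:
    """Illustrative partitioned index: a small top-level index over index blocks."""
    def __init__(self, first_keys, partition_size=256):
        self.partition_size = partition_size
        # Split the flat index (first key of every data block) into index blocks.
        self.partitions = [first_keys[i:i + partition_size]
                           for i in range(0, len(first_keys), partition_size)]
        self.top = [p[0] for p in self.partitions]   # first key per index block

    def find_block(self, key):
        """Return the number of the data block that may hold key."""
        p = max(bisect.bisect_right(self.top, key) - 1, 0)
        # Only self.partitions[p] needs to be resident in the cache here.
        b = max(bisect.bisect_right(self.partitions[p], key) - 1, 0)
        return p * self.partition_size + b
```

For the shared dictionary and aligned placement, here is a hedged sketch using the `zstandard` Python bindings' dictionary training as a stand-in for the paper's sampling-based dictionary. The 16KB dictionary size and the zero-padding scheme are assumptions, not figures from the paper, and dictionary training needs a reasonable number of sample blocks to succeed:

```python
import zstandard as zstd

PAGE_SIZE = 4096            # NVM physical page size
RAW_BLOCK = 6 * 1024        # 6KB logical blocks, per the paper's choice

def compress_and_place(blocks):
    """blocks: list of raw 6KB byte strings belonging to one file."""
    # Train one dictionary from samples of every block in the file, instead
    # of letting each small block carry its own dictionary.
    dictionary = zstd.train_dictionary(16 * 1024, blocks)
    compressor = zstd.ZstdCompressor(dict_data=dictionary)
    placed = []
    for raw in blocks:
        compressed = compressor.compress(raw)
        # Pad each compressed block to the next page boundary so that a read
        # never straddles two physical pages (no read amplification). Most
        # 6KB blocks compress to about 4KB, keeping the padding waste small.
        padded_len = -(-len(compressed) // PAGE_SIZE) * PAGE_SIZE
        placed.append(compressed.ljust(padded_len, b"\x00"))
    return dictionary, placed
```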

To solve the second challenge, i.e., the device lifetime issue caused by frequent evictions from the DRAM level, the paper adopts an admission policy that selectively caches only those evicted blocks that are likely to be accessed in the future. The admission policy is implemented as a simulated LRU list for the NVM caching layer, maintained over the access trace. When a block is evicted from the DRAM layer, the simulated LRU list determines whether the block should be inserted into the NVM layer: if the block is present in the list, it is inserted into the NVM cache; otherwise, it is dropped (recall that the caching layers are write-through, so no data is lost).
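A minimal sketch of this admission control, assuming the simulated LRU tracks block IDs only (no data) with a capacity matching the NVM cache; `SimulatedLRU` and its method names are illustrative:

```python
from collections import OrderedDict

class SimulatedLRU:
    """Tracks what an LRU cache of the NVM's size *would* contain."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lru = OrderedDict()               # block_id -> None, hottest last

    def record_access(self, block_id):
        """Called on every block access, whether it hits or misses."""
        self.lru.pop(block_id, None)
        self.lru[block_id] = None
        if len(self.lru) > self.capacity:
            self.lru.popitem(last=False)       # evict the simulated victim

    def should_admit(self, block_id):
        """On DRAM eviction: admit to NVM only if the simulation kept it."""
        return block_id in self.lru

# On eviction from DRAM (illustrative usage):
#   if sim.should_admit(block_id):
#       nvm_cache.insert(block_id, block)      # block is likely to be reused
#   # else: drop it; the write-through design keeps the flash copy intact
```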

To address the last challenge, the paper proposes using polling instead of interrupts when performing I/O on the NVM device. On one hand, polling minimizes the software overhead of interrupt handling and kernel traps. On the other hand, polling itself may waste processor cycles if conducted too eagerly. To prevent polling from becoming a performance burden, the paper proposes that polling should only start after half of the mean I/O latency has passed, which is generally several microseconds. RocksDB also monitors the latency of past I/O requests and updates the polling delay accordingly.
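The adaptive delay can be sketched at user level as follows. The issue side is omitted and `is_done` is a hypothetical completion test (real implementations live in the kernel NVMe path or a userspace driver), so this only illustrates the sleep-then-poll timing and the latency feedback loop:

```python
import time

class HybridPoller:
    """Sleep for half the mean I/O latency, then spin-poll for completion."""
    def __init__(self, initial_latency_us=10.0, alpha=0.1):
        self.mean_latency_us = initial_latency_us
        self.alpha = alpha                        # EWMA smoothing factor

    def wait(self, request, is_done):
        start = time.perf_counter()
        # Yield the CPU for roughly half the expected I/O time so the spin
        # loop below burns only a few microseconds of cycles per request.
        time.sleep(self.mean_latency_us / 2 / 1e6)
        while not is_done(request):               # poll: no interrupt, no trap
            pass
        observed_us = (time.perf_counter() - start) * 1e6
        # Feed the observed latency back so the delay tracks the device.
        self.mean_latency_us += self.alpha * (observed_us - self.mean_latency_us)
```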