Pacman: An Efficient Compaction Approach for Log-Structured Key-Value Store on Persistent Memory



Paper Title: Pacman: An Efficient Compaction Approach for Log-Structured Key-Value Store on Persistent Memory
Link: https://www.usenix.org/conference/atc22/presentation/wang-jing
Venue: USENIX ATC 2022
Keywords: NVM; Pacman; Log-structured NVM; Key-Value Store




Highlight:

  1. The bottleneck of a log-structured key-value store on NVM is the GC process, due to (1) small, random metadata reads and writes; (2) the use of persist barriers; and (3) excessive data copy.

  2. Metadata reads and writes during GC can be reduced by keeping shortcuts, i.e., direct pointers into the index structure, and by embedding size information into the unused bits of index pointers.

  3. The use of persist barriers can be reduced by a two-stage GC process, where data copy happens first and index updates happen next. This process preserves the consistency of the index structure because the index is updated using atomic CAS instructions.

  4. Excessive data copy can be alleviated by separating cold and hot data, such that log segments containing hot data will likely contain only a few valid entries when selected for GC. Hot data can be identified by sampling the access trace and sorting keys by access frequency.

This paper presents Pacman, a software solution aiming at reducing the overhead of log-structured key-value stores running on byte-addressable Non-Volatile Memory (NVM). The paper is motivated by the uniformly high overhead of garbage collection in log-structured key-value store designs when they are deployed on NVM. Pacman addresses the challenge with software techniques tailored to the performance characteristics of NVM devices. As a result, Pacman improves the operational throughput of key-value stores even at high storage utilization.

The Pacman design is based on log-structured key-value stores, where modifications to existing key-value entries are implemented as log appends of the newer key-value pair rather than as in-place updates. As a result, multiple versions of a key-value entry may co-exist on persistent storage. To maintain correct write semantics, i.e., that a read operation always accesses the most up-to-date write, an extra index is added to the key-value store which maps each key to its most recently appended key-value pair. Read operations must go through the index in order to access the correct instance of the value, and modification operations must also update the index to point to the newly appended key-value pair in order to commit the change.
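To make this structure concrete, below is a minimal sketch of such a store. All names are illustrative rather than taken from the paper, a volatile hash map stands in for the unspecified index, and a flat in-memory buffer stands in for a persistent log segment:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>

// Minimal sketch of a log-structured KV store; layout and names are
// illustrative, not the paper's implementation.
struct LogEntry {
    uint32_t key_len;
    uint32_t val_len;
    // key bytes, then value bytes, follow the header
};

class LogKVStore {
public:
    // An update appends a new version to the log, then redirects the index.
    void Put(const std::string& key, const std::string& val) {
        char* addr = AppendToLog(key, val);
        index_[key] = addr;  // index always maps to the newest version
    }

    // A read must go through the index to find the latest version.
    const char* Get(const std::string& key) const {
        auto it = index_.find(key);
        return it == index_.end() ? nullptr : it->second;
    }

private:
    char* AppendToLog(const std::string& key, const std::string& val) {
        // Bump allocation into a flat buffer stands in for appending to a
        // persistent log segment; older versions are left behind as garbage.
        char* addr = log_ + tail_;
        auto* e = reinterpret_cast<LogEntry*>(addr);
        e->key_len = static_cast<uint32_t>(key.size());
        e->val_len = static_cast<uint32_t>(val.size());
        std::memcpy(addr + sizeof(LogEntry), key.data(), key.size());
        std::memcpy(addr + sizeof(LogEntry) + key.size(), val.data(), val.size());
        tail_ += sizeof(LogEntry) + key.size() + val.size();
        return addr;
    }

    char log_[1 << 20] = {};                        // stand-in log segment
    std::size_t tail_ = 0;
    std::unordered_map<std::string, char*> index_;  // key -> newest entry
};
```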

While the paper does not specify a concrete implementation, and Pacman is generally applicable to virtually any log-structured key-value store design, the paper suggests implementing logs as per-thread memory segments of a few MBs in size, and the index as either a hash table or a tree structure. In addition, the index can be maintained in either volatile or persistent memory; both options work for a key-value store, and the choice barely affects the effectiveness of Pacman.

The paper notes that the operation most unique to log-structured key-value stores is garbage collection (GC). At a high level, GC compacts a log segment by copying key-value entries that are still alive (i.e., referenced by the index) to another log segment, freeing the storage occupied by entries that are dead (i.e., no longer referenced by the index). Dead entries in log-structured key-value stores are introduced either by explicitly removing an entry or by modifying an existing entry. Consequently, in write-dominant scenarios, GC is a vital process for recycling persistent storage and is likely to be invoked frequently. Unfortunately, the paper observes that when deployed on NVM, GC becomes a major bottleneck for key-value stores, especially when persistent storage utilization is high, as evidenced by the figure presented in the paper. In addition, adding more background GC threads does not alleviate the problem by much since, as the paper points out, the GC process is likely bound by NVM write bandwidth rather than by computation.

The paper lists several reasons for the observed effect. First, GC requires index probes to determine whether a key-value entry in the log segment is alive. After the data copy concludes, the GC process must also update the index to point to the new entry. Accessing the index incurs a large number of small random reads and writes, which increases the latency of GC operations. Second, current implementations use persist barriers, consisting of one or more cache line flushes followed by a memory fence, to maintain correct write ordering between the data and the metadata, which is crucial for crash consistency. However, persist barriers are expensive and severely affect performance. The paper also notes that on newer hardware, persist barriers are no longer strictly required, as the cache hierarchy is included in the persistence domain (a hardware feature called eADR). Even so, developers may still want to write dirty data back to NVM explicitly, because explicit write-back achieves better write throughput. Lastly, GC inevitably involves copying data from one log segment to another. The paper observes that most implementations do not separate cold and hot data. As a result, a log segment may contain a mix of cold and hot entries, which incurs unnecessary data copy: cold entries, if stored in dedicated log segments, would not need frequent GC.
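For reference, a persist barrier on x86 is typically built from cache-line write-backs followed by a store fence. The sketch below assumes a 64-byte cache line and a CPU with CLWB support (e.g., compiled with -mclwb):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Sketch of an x86 persist barrier: write back every cache line covering
// [addr, addr + len), then fence so later stores are ordered after the
// data reaches the persistence domain. With eADR the flush is not needed
// for crash consistency, but explicit write-back can still improve NVM
// write throughput, as the paper notes.
static inline void persist_barrier(const void* addr, std::size_t len) {
    constexpr std::uintptr_t kCacheLine = 64;
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(addr) & ~(kCacheLine - 1);
    const std::uintptr_t end = reinterpret_cast<std::uintptr_t>(addr) + len;
    for (; p < end; p += kCacheLine)
        _mm_clwb(reinterpret_cast<void*>(p));  // CLWB: write back, no eviction
    _mm_sfence();                              // SFENCE: order later stores
}
```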

Pacman addresses these issues with software additions to existing implementations, which we present as follows. First, to reduce the number of index probes for key-value entries during GC, the paper proposes adding a special per-entry “hint”, called the shortcut, that stores the location of the mapping entry in the index, allowing the GC process to bypass the regular index lookup and directly access the mapping entry at the leaf/hash-bucket level. The shortcut is initialized when the entry is first inserted into the log, and again when the entry is copied during GC. Note, however, that because index operations can change the internal node or bucket layout (e.g., B+Tree insertions), the shortcut is only treated as a soft hint and must be validated before use. If the shortcut is no longer valid, the GC process falls back to the regular lookup. Furthermore, to prevent the memory block pointed to by a shortcut from becoming invalid (e.g., the block being deallocated and reused for other purposes), the paper also suggests that blocks allocated for the index never be repurposed for other usages. Separate free lists must be maintained for these blocks so that accessing a shortcut pointer never induces undefined behavior.
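A sketch of how a shortcut might be validated during GC is shown below. The slot layout and the lookup callback are assumptions, since the paper leaves the concrete index structure open:

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <string>

// A shortcut is a soft hint: a direct pointer to the index slot (hash
// bucket entry or B+-tree leaf entry) expected to reference this log
// entry. Index reorganization may move slot contents, so GC validates
// the hint and falls back to a regular lookup when it is stale.
struct IndexSlot {
    std::atomic<std::uint64_t> entry_ptr;  // tagged pointer to live entry
};

IndexSlot* LocateSlot(
    IndexSlot* shortcut, std::uint64_t old_entry_addr, const std::string& key,
    const std::function<IndexSlot*(const std::string&)>& lookup_by_key) {
    if (shortcut != nullptr &&
        shortcut->entry_ptr.load(std::memory_order_acquire) == old_entry_addr)
        return shortcut;        // hint is still valid: skip the index walk
    return lookup_by_key(key);  // hint is stale: do the regular lookup
}
```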

To reduce the number of metadata reads during GC, the paper proposes special optimizations for two of the most important types of metadata: the key-value entry size and the “deleted” bit. The entry size simply indicates the number of bytes consumed by the entry, and the “deleted” bit indicates whether the entry is valid or stale. The paper observes that entry sizes are usually small and require only a few bits to encode. As a result, size information can be embedded into the unused higher bits of the index pointer that points to the entry in the log segment. As for the “deleted” bits, the paper proposes maintaining them in DRAM as a separate array. One bit is reserved in the array for every MIN_SIZE bytes of the log segment, and the paper assumes that entries are always placed on a MIN_SIZE boundary. The “deleted” bits are lost after a crash because they are stored in volatile DRAM. However, the paper indicates that the bits can be restored during recovery by checking against the index, or restored gradually on demand during regular operation.
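A sketch of both optimizations follows, assuming a 48-bit virtual address space (so the high 16 bits of a pointer are free) and an illustrative 64-byte MIN_SIZE; the field widths are assumptions, not the paper's exact encoding:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Embed the entry size in the otherwise-unused high 16 bits of an index
// pointer (x86-64 uses only the low 48 bits of a virtual address).
constexpr std::uint64_t kAddrMask = (1ULL << 48) - 1;
constexpr std::size_t   kMinSize  = 64;  // stand-in for MIN_SIZE

inline std::uint64_t PackIndexEntry(const void* entry, std::size_t bytes) {
    std::uint64_t size_units = bytes / kMinSize;  // size in MIN_SIZE units
    return (size_units << 48) |
           (reinterpret_cast<std::uint64_t>(entry) & kAddrMask);
}

inline void* EntryAddr(std::uint64_t packed) {
    return reinterpret_cast<void*>(packed & kAddrMask);
}

inline std::size_t EntryBytes(std::uint64_t packed) {
    return static_cast<std::size_t>(packed >> 48) * kMinSize;
}

// DRAM-resident "deleted" bitmap: one bit per MIN_SIZE bytes of log
// space. It is lost on a crash and rebuilt from the index afterwards.
struct DeletedBitmap {
    std::vector<bool> bits;
    explicit DeletedBitmap(std::size_t log_bytes) : bits(log_bytes / kMinSize) {}
    void MarkDeleted(std::size_t off) { bits[off / kMinSize] = true; }
    bool IsDeleted(std::size_t off) const { return bits[off / kMinSize]; }
};
```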

The next optimization Pacman introduces is to decouple data copy from index updates during the GC process. The paper points out that, conventionally, GC is performed by first copying an entry from the old log segment to the new segment and then updating the index for the entry just copied, repeating this for every entry in the old log segment. Tightly coupling data copy with index updates has two negative consequences. First, it lowers access locality, since NVM accesses alternate between data and index. Second, it incurs one persist barrier for every entry copied in order to preserve recoverability. To counter the problem, the paper proposes a two-stage process: the GC process first copies data from the old segment to the new segment, and then updates index entries to commit the changes. Only one persist barrier is needed, between the two stages. During the first stage, instead of using regular memory instructions, which would pollute the cache by bringing the data being copied into it, the paper proposes using non-temporal stores so that the copy is performed directly on NVM, bypassing the cache hierarchy entirely. During the second stage, index entry updates are performed using atomic Compare-and-Swap (CAS) instructions, so that the index is always in a consistent state, which preserves the atomicity (and hence the recoverability) of the operation.
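The following sketch illustrates the two-stage structure. The relocation bookkeeping and entry layout are assumptions, and pointer tagging is elided for brevity:

```cpp
#include <immintrin.h>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// One live entry scheduled for compaction (illustrative bookkeeping).
struct LiveEntry {
    std::atomic<std::uint64_t>* slot;  // index slot referencing the entry
    std::uint64_t old_packed;          // expected index value (old location)
    const char* src;                   // entry bytes in the victim segment
    std::size_t len;                   // entry length (8-byte multiple here)
};

// Copy with non-temporal 64-bit stores, bypassing the cache hierarchy;
// assumes 8-byte alignment and 8-byte-multiple lengths for brevity.
static void nt_copy(void* dst, const void* src, std::size_t len) {
    auto* d = static_cast<long long*>(dst);
    auto* s = static_cast<const long long*>(src);
    for (std::size_t i = 0; i < len / 8; ++i)
        _mm_stream_si64(d + i, s[i]);
}

void CompactSegment(std::vector<LiveEntry>& live, char* new_seg) {
    std::vector<std::uint64_t> new_packed(live.size());
    char* cur = new_seg;

    // Stage 1: copy every live entry; no index traffic is interleaved.
    for (std::size_t i = 0; i < live.size(); ++i) {
        nt_copy(cur, live[i].src, live[i].len);
        new_packed[i] = reinterpret_cast<std::uint64_t>(cur);  // tagging elided
        cur += live[i].len;
    }
    _mm_sfence();  // the single persist barrier between the two stages

    // Stage 2: commit each relocation with a CAS. A failed CAS means a
    // concurrent writer superseded the entry, so the copy is left dead.
    for (std::size_t i = 0; i < live.size(); ++i) {
        std::uint64_t expected = live[i].old_packed;
        live[i].slot->compare_exchange_strong(expected, new_packed[i],
                                              std::memory_order_release);
    }
}
```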

The last optimization proposed in the paper is to separate cold and hot data, such that frequently modified key-value entries are grouped into the same log segments. Doing so minimizes the amount of data that must be copied: when a segment qualifies for GC, it is likely that most of its entries have already become stale as a result of modifications. While prior works have exploited the same idea, the paper proposes a low-overhead approach for classifying hot data, which we present as follows. To find hot data, i.e., frequently modified entries, Pacman samples modification operations and keeps the trace in a local buffer. Periodically, Pacman sorts the trace by the number of key occurrences and identifies the “hot keys”. Pacman also maintains statistics on the hit ratio of the previously selected hot keys; if the hit ratio drops below a threshold, indicating that the hot key set may have gradually shifted, Pacman starts a new round of sampling. After hot keys are identified, future insertions and modifications of these keys are allocated in the dedicated hot key buffer.
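A sketch of such a classifier is shown below. The sampling rate, window size, hit-ratio threshold, and hot-set size are all assumed parameters, not values from the paper:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Illustrative sampling-based hot-key classifier.
class HotKeyTracker {
public:
    // Called on each write; samples 1 out of every kSampleRate operations.
    void RecordWrite(const std::string& key) {
        if (++op_count_ % kSampleRate == 0) trace_.push_back(key);
        if (hot_.count(key)) ++hot_hits_;
        ++writes_;
        MaybeRefresh();
    }

    bool IsHot(const std::string& key) const { return hot_.count(key) > 0; }

private:
    // At the end of each window, re-sample if the hot set stopped hitting.
    void MaybeRefresh() {
        if (writes_ < kWindow) return;
        double hit_ratio = double(hot_hits_) / double(writes_);
        if (hot_.empty() || hit_ratio < kThreshold) Reclassify();
        writes_ = hot_hits_ = 0;
    }

    // Sort sampled keys by frequency; keep the top kHotSetSize as hot.
    void Reclassify() {
        std::unordered_map<std::string, std::uint32_t> freq;
        for (const auto& k : trace_) ++freq[k];
        std::vector<std::pair<std::uint32_t, std::string>> sorted;
        for (const auto& [k, c] : freq) sorted.emplace_back(c, k);
        std::sort(sorted.rbegin(), sorted.rend());  // descending by count
        hot_.clear();
        for (std::size_t i = 0; i < sorted.size() && i < kHotSetSize; ++i)
            hot_.insert(sorted[i].second);
        trace_.clear();
    }

    static constexpr std::uint64_t kSampleRate = 100;
    static constexpr std::uint64_t kWindow     = 1000000;
    static constexpr double        kThreshold  = 0.5;
    static constexpr std::size_t   kHotSetSize = 1024;

    std::uint64_t op_count_ = 0, writes_ = 0, hot_hits_ = 0;
    std::vector<std::string> trace_;
    std::unordered_set<std::string> hot_;
};
```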