LB+-Trees: Optimizing Persistent Index Performance on 3DXPoint Memory

- Hide Paper Summary

Paper Title: LB+-Trees: Optimizing Persistent Index Performance on 3DXPoint Memory
Link: https://dl.acm.org/doi/10.14778/3384345.3384355
Year: VLDB 2020
Keyword: NVM; B+Tree; LB-Tree

Back

Highlight:

Generally speaking it is a really good design, with all factors balanced, and not much artifacts that requires special treatment and/or limit its usage. Another particularly nice thing is that there is no fatal bugs or unexplained details in the protocol. Everything is as clear as crystal, unlike some data structure papers. Some techniques have already been proposed, such as HTM-driven read, in-DRAM inner nodes, unordered node + header bitmap for update, and fingerprint for fast, SIMD hit prediction.
The actual contribution of the paper is the two optimizations: (1) Update threads proactivaly move elements from the first cache line to other lines to increase the opportunity of atomic metadata + data flush; (2) Using shadowing + sense bit on node metadata fields to allow atomic update of several distinct metadata fields.

Questions

Reader threads should turn into writer threads, if HTM aborts for a few times, to guarantee progress. Current TSX does not even guarantee the eventual commit of the transaction, so a software fall back is always required. This is only a minor bug though, since this paper is mainly about NVM, not synchronization.

This paper presents LB-Tree, or Line Write/Logless B+Tree, a NVM optimized B+Tree design for efficient updates. Although not explicitly stated, the paper identifies a few inefficiencies in conventional in-memory B+Tree designs when ported to the NVM, such as ordered nodes, the overhead of structural modifications, and maintaining consistent node states.

The paper makes three critical observations that enable extra optimizations on NVM. First, it is observed that the performance of NVM writes of a cache line does not depend on the number of dirty bits in the cache line. Theoretically speaking, NVM may adopt optimizations to minimize the number of flipping bits in each internal writes, to reduce the number of physical state changes. This optimization is not present in the current commercial product, most likely because the NVM device encrypts data stored in the physical array, in which case a single bit change in the input line will propagate to the remaining bits, offsetting the benefits of the optimization. This observation suggests that B+Tree designs may perform extra writes in addition to the requested one if these extra writes lower the overhead of the following updates. The second observation is that NVM internal units of reads and writes are 256 bytes, meaning that all reads and writes are performed internally with 256-byte persistent buffers despite the 64-byte memory interface. As a result, reads to aligned 256-byte data can achieve higher bandwidth, since the internal buffer can sustain higher bandwidth than accessing the physical array. Similarly, writes to aligned 256-byte data are also faster. The last observation is that the performance of NVM device also depends on storage utilization. The performance of a new device drops until the working set size grows exceed one eighth of the device capacity. Although this property does not affect the design directly, it indicates that testing should use at least one eighth of device storage in order to be cover all cases.

Overall, LB-Tree leverages the following techniques to make writes efficient. First, elements in leaf nodes are not sorted, since maintaining the ordering property of leaf nodes will involve shifting existing elements in the node, causing massive, multi-line updates. Instead, elements can be stored anywhere in the leaf node, even leaving gaps between two valid elements. A bitmap in the node header tracks which slots are occupied and which are empty. Insert operations, therefore, can just select an empty slot using the bitmap, and fill key and value into the slot. Search operations on leaf nodes, as a result, must scan the node linearly, skipping invalid slots using the bitmap.

The second technique is that inner nodes are maintained in the DRAM, and rebuilt at recovery time. This avoids complicated race conditions and corner cases when updating inner nodes on leaf splits, since inner nodes will be wiped out on crashes. The tree itself, however, can always be rebuilt, since B+Tree leaf nodes always maintain full knowledge of the state of the tree. As long as leaf nodes are in a consistent state, inner nodes can be rebuilt in a reasonable amount of time. It does not matter whether inner nodes are sorted or not, since they do not have atomicity requirements.

The third technique is read-only optimization using HTM support for thread synchronization. Reader threads do not acquire locks, and always proceed optimistically, assuming a consistent node image. Writers, however, acquire per-node lock as it traverses down the B+Tree by setting a lock bit in the node header. Readers will be notified when read-write conflicts occur via HTM, as we will see below. This allows fast and synchronization-free reads, compensating for the loss of efficiency due to unordered leaf nodes.

The fourth technique is element moving for less persistent barriers. This technique leverages the observation that if the element to be updated is in the same cache line as the header, which will always be updated on insert operations, the write operation that fills in the element does not need to be flushed and persisted before the header does, since it is guaranteed that a 64-byte line will be persisted atomically. In this case, the write operation simply fills in the element slot, updates the bitmap, and then executes a persistence barrier, reducing the number of barriers to one, instead of two. In addition, write operations to the remaining part of the node may also proactively move elements in the first cache line of the same node to the cache line it updates. According to the observation that the number of dirty bits does not affect NVM write performance, this essentially optimizes future write operations on the same node without incurring any NVM write overhead.

The fifth technique is logless split operations. LB-Tree optimizes node split with shadowed copies of the sibling pointer and a special “sense” bit indicating which sibling pointer is currently in-use. Element copy and sibling pointer updates can hence be performed atomically with a single write to the node header as we will see later.

We next describe leaf node layouts. Inner nodes can use arbitraty layout as long as it contains a lock bit for writers to acquire. Leaf nodes are always aligned, whose size is a multiple of 256 bytes, which matches the internal buffer size of NVM for maximum throughput. In the following discussion we assume 256 byte nodes. The node consists of a header, an array of key-value pairts as elements, and two sibling pointers at the end. The header is 16 bytes, which consists of a 14-bit bitmap for each of the 14 elements in the node (assuming 8-byte keys and values), one “lock” bit for writer synchronization, one “sense” bit for shadowing the two sibling pointers (see below), and a 14-byte fingerprint array. The fingerprint array has one byte hash value for each valid element in the node. Read operations will first hash the key into a byte value, and then test whether it matches any of the 14 values in the fingerprint array using SIMD comparison. A match indicates a possible hit, which should be confirmed with a full key comparison, and a miss indicates a definite miss. The next 14 16-byte key-value pairs are the body of the leaf node. The last 16-byte pair stores the two sibling pointers. The active pointer in-use is determined in the “sense” bit stored in the header.

Leaf node updates on an existing key is trivial, since the update can always be carried out by atomically writing the 8-byte value. Insert operations consist of several steps, but only the last step “commits” the updates in terms of crash recovery. The first step is to read the header’s bitmap, and find a vacant slot. Assuming the node is not empty, the next step is to fill in the slot with the key and value to be inserted. The fingerpring value is also updated. Till this point, none of these changes are logically visible, even if some dirty lines may have already been evicted back to the NVM, since the content of the node is described by the header bitmap. In the last step, the insert is committed by first evicting all modified cache lines, except the first line of the node, to the NVM using persist barriers, and then updating the bitmap and then evicting the first line which contains the header field to the NVM. The second barrier serves as the commit point of the insert operation. If the line was persisted before the crash, then the insertion is committed after crash recovery. Otherwise, it will be rolled back silently.

One special optimization can be applied if the element is inserted into the first cache line of the node. In this case, the element and header update can be flushed back atomically in a single persistence barrier. To increase the chance that this happens on an insert, other insert operations will proactively move elements in the first cache line to vacant slots of the line to be updated by the operation. Since the line to be updated is already brought into the cache, this does not incur extra NVM writes, but just a few more CPU cycles.

Threads synchronize on the lock bit in the node’s header. The paper assumes optimistic 2PL protocol where writer thread acquires locks on its path down the tree, while readers optimistically assume that the node will remain consistent during the read. To keep the assumption valid, reader threads will start a new HTM transaction before it starts tree traversal. An each node, it first checks the lock bit, and aborts the traversal if it is already set, since a writer is currently updating the node. If the lock bit is not set, it proceeds to access the node. If, in the meantime, a writer thread attempts to acquire the lock, the reader will be notified by HTM, and automatically abort, since the reader has already subscribed to the “lock” bit.

Leaf node splits are performed as follows. First, an empty node is allocated from the persistence heap. The upper half of the current node is copied to the empty node. To maximize the chance that future updates are performed on the first cache line of the new node, the elements will be copied to the upper half of the new node, rather than lower half as in a conventional B+Tree split scheme. Header bitmap and fingerprints are also initialized respectively. The next two steps must appear atomic with regard to failures. The first is to update the old node’s header bitmap to remove elements that have been copied to the new node. The second is to set the sibling pointer to the new node. Otherwise, if they are not atomic, the after-crash state may not contain both steps, and elements will either be duplicated or be lost if one of the two steps is not persisted. To solve this, LB+Tree shadow-updates the sibling pointer using one of the two sibling pointers, and then switches the pointer by flipping the “sense” bit. Since the “sense” bit and the 14-bit bitmaps are on the same 8-byte word, these two can be updated atomically.

LB+Tree also supports larger nodes, the size of which must be a multiple of 256 bytes. The paper, however, discourages the intuitive design where the header describes all elements in the node, since this will reduce the number of elements in the first cache line. Instead, headers of large nodes are partitioned into smaller headers, each responsible for a 256-byte chunk. Each 256-byte chunk will then have two instances of headers, one at the beginning, another at the end, just as if it were a separate node without sibling pointers. These headers, however, cannot be updated atomically with the master header, which sits at the first cache line of the node, and must be used for lock acquisition and “sense” bit. The solution is to shadow write all slave headers in the node between the two instances, and use the “sense” bit to atomically switch between shadow copies, which is performed atomically with master header update.