NV-Tree: Reducing Consistency Cost for NVM-based Single Level Systems



Paper Title: NV-Tree: Reducing Consistency Cost for NVM-based Single Level Systems
Link: https://www.usenix.org/conference/fast15/technical-sessions/presentation/yang
Year: FAST 2015
Keyword: NV-Tree; NVM; B+Tree




This paper proposes NV-Tree, a B+Tree designed for byte-addressable non-volatile memory. Unlike previous proposals, this paper targets battery-backed NVDIMM, which uses residual power from its backup power source to flush data in DRAM to on-DIMM flash on a power failure. The battery-backed NVDIMM offers non-volatile storage like SSD or PCM-based NVDIMM, while providing the bandwidth and latency of DRAM. The paper points out that previous designs, such as LCB+Tree and CDDS-Tree, suffer from excessive cache line flushes and memory fences. LCB+Tree uses redo write-ahead logging for all update operations to ensure failure atomicity: the log record is first written and flushed, data is then updated in place, and the dirty cache lines are flushed; the log entry is removed only after all dirty cache lines have been flushed. CDDS-Tree performs updates in place by creating new versions of entries. In addition, to keep keys sorted within a node, existing elements are shifted whenever a new key is inserted, which requires one cache line flush per shifted element and causes additional cache misses. Experiments in the paper show that both the extra cache line flushes and the extra node accesses incur non-negligible overhead.

NV-Tree reduces the number of cache line flushes and cache misses through the following design choices. First, only leaf nodes are persisted and maintained in a consistent state, while inner nodes are kept volatile; inner nodes are rebuilt from the leaf nodes after a crash or unexpected shutdown. Second, while previous proposals keep keys sorted within a node by shifting elements on every insert, NV-Tree only performs append-only updates to leaf nodes, which optimizes writes but requires a full node scan to read an element. Inner nodes, on the other hand, are updated as usual since they are never persisted to the NVDIMM. The last design choice is that inner nodes on the same level are maintained in one contiguous chunk of memory, which increases node utilization and reduces cache misses since no child pointers are stored in inner nodes except at the last inner level. When inner nodes split, however, the entire array must be rebuilt.

Updates to leaf nodes are performed in an append-only manner. Concurrent updating threads serialize on a per-node lock, while reader threads can access the node in a lock-free manner even while a concurrent update modifies it. Each leaf node begins with an element count field that records the number of valid entries in the node, followed by an array of key-value pairs forming the node body. Each key-value pair also carries a flag marking the pair as either inserted or deleted. A deleted pair indicates the removal of a previously inserted element within the same node, while an inserted pair indicates that the pair is currently live in the node and its value can be returned to a reading thread.
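
The node layout described above could look roughly like the following C sketch (field names and sizes are mine, not taken from the paper):

```c
#include <stdint.h>

#define LN_CAPACITY 56            /* illustrative capacity; the paper sizes nodes differently */

/* Flag attached to each appended key-value pair. */
enum ln_flag { LN_INSERT = 0, LN_DELETE = 1 };

typedef struct {
    uint64_t key;
    uint64_t value;               /* unused for delete entries */
    uint8_t  flag;                /* LN_INSERT or LN_DELETE */
} ln_entry_t;

typedef struct leaf_node {
    uint32_t          nelems;     /* count of valid entries; appending commits by bumping this */
    uint32_t          lock;       /* per-node lock serializing writers */
    struct leaf_node *next;       /* sibling pointer forming the persistent leaf chain */
    ln_entry_t        entries[LN_CAPACITY];   /* append-only body */
} leaf_node_t;
```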

We next describe the node update protocol. The updating thread first acquires the per-node lock before any node access. The thread then scans the node to determine whether the operation actually needs to modify it. For inserts, if the most recent entry for the key being inserted is a live (inserted) entry, the insert does not need to be performed, since the element is already present. Similarly, for deletes, if the most recent entry for the key is a delete entry, or the key does not appear in the node at all, the operation returns immediately since the entry does not exist. Otherwise, inserts and deletes are performed by first writing the key-value entry (deletes do not need the value) to the next unused slot (indicated by the current value of the counter), setting the flag accordingly, and then committing with an atomic increment of the element count in the node header. The operation becomes durable in the persistent image once the new counter value is flushed back to the NVDIMM. If a crash happens after the pair has been written but before the counter value is flushed, no recovery is needed, since node scans use the counter to delimit valid from invalid data in the node. Concurrent reader threads serialize with updating threads through the order in which they read and update the counter: if a reader reads the counter after it has been updated, the update is visible to that reader.
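
A minimal sketch of this append-and-commit protocol, reusing the `leaf_node_t` layout above and assuming a hypothetical `persist()` primitive standing in for cache line flush plus fence:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical persistence primitive: flush the cache lines covering
 * [addr, addr + len) and issue a fence (CLFLUSH/CLWB + SFENCE). */
void persist(const void *addr, size_t len);

/* Append-and-commit insert, following the protocol above.
 * The caller holds ln->lock. Returns false if the node is full and must be
 * consolidated or split before the insert can proceed. */
bool ln_insert(leaf_node_t *ln, uint64_t key, uint64_t value)
{
    if (ln->nelems == LN_CAPACITY)
        return false;

    /* Scan backwards: if the most recent entry for this key is live, the
     * key is already present and nothing needs to be written. */
    for (int i = (int)ln->nelems - 1; i >= 0; i--) {
        if (ln->entries[i].key == key) {
            if (ln->entries[i].flag == LN_INSERT)
                return true;      /* already present */
            break;                /* most recent entry is a delete: fall through */
        }
    }

    /* 1. Write the new entry into the first unused slot and flush it. */
    ln_entry_t *e = &ln->entries[ln->nelems];
    e->key   = key;
    e->value = value;
    e->flag  = LN_INSERT;
    persist(e, sizeof(*e));       /* durable but not yet visible */

    /* 2. Commit: atomically bump the counter, then flush it. A crash between
     *    the two steps is harmless because scans trust only the counter. */
    ln->nelems++;
    persist(&ln->nelems, sizeof(ln->nelems));
    return true;
}
```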

When there is no available entry in a leaf node, insert and delete operations must first consolidate or split the node, depending on the number of live entries it holds. The leaf node is split if the number of live entries (i.e., entries that are not invalidated by a later entry with the same key) exceeds a certain threshold; otherwise the node is consolidated. Node consolidation is performed by collecting all live entries in the node and inserting them into a new node allocated from the persistent memory address space. The new node’s next pointer points to the current sibling of the node being consolidated. The new node is then flushed back to the NVDIMM. The consolidation is committed by atomically changing the pointer in the previous sibling node from the old node to the new node. Note that after this step, both the new and old nodes are reachable: threads traversing down from inner nodes still reach the old node, while threads traversing from the previous sibling see the new node. This does not matter, however, because both the old and the new node are locked for exclusive write access, so no thread may insert into or delete from the old node anymore. If the system crashes at this moment, all inner nodes will be lost (since they are in volatile memory), but the image of the leaf nodes is still consistent, and the system will recover using the new node. As the last step of consolidation, the thread updates the key-pointer pair in the inner node, which can be done atomically with an 8-byte write and without locks.
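
Consolidation might then look like the following sketch, again with hypothetical helpers (`nvm_alloc_leaf`, `persist`) and the types from the earlier sketches:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical allocator returning a zeroed leaf node in NVM. */
leaf_node_t *nvm_alloc_leaf(void);

/* Consolidate a full leaf: copy live entries into a fresh node, link it into
 * the leaf chain, and commit by swinging the previous sibling's pointer.
 * The caller holds the locks on both `old` and `fresh`, so no writer can
 * append to the stale copy after the switch. */
leaf_node_t *ln_consolidate(leaf_node_t *old, leaf_node_t *prev)
{
    leaf_node_t *fresh = nvm_alloc_leaf();
    fresh->next = old->next;                    /* keep the chain intact */

    /* Keep an insert entry only if no later entry for the same key exists. */
    for (uint32_t i = 0; i < old->nelems; i++) {
        const ln_entry_t *e = &old->entries[i];
        if (e->flag != LN_INSERT)
            continue;
        bool superseded = false;
        for (uint32_t j = i + 1; j < old->nelems; j++) {
            if (old->entries[j].key == e->key) { superseded = true; break; }
        }
        if (!superseded)
            fresh->entries[fresh->nelems++] = *e;
    }
    persist(fresh, sizeof(*fresh));             /* new node fully durable first */

    prev->next = fresh;                         /* 8-byte atomic pointer swing... */
    persist(&prev->next, sizeof(prev->next));   /* ...commits the consolidation */

    /* Finally, repoint the volatile parent (last-level inner node) entry from
     * `old` to `fresh` with a plain 8-byte store; no flush is needed because
     * inner nodes are not persisted. */
    return fresh;
}
```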

Node splits follow a similar process: we first lock the old node as in node updates and consolidation. We then create two nodes, one containing the lower half of the old node and another containing the upper half. These two nodes are linked together via the sibling pointer, which is then linked to the next sibling of the old node. The split is committed by atomically updating and flushing the sibling pointer of the previous node. Note that threads traversing from upper levels will not see this intermediate state, since the upper level nodes still point to the old node. The splitting thread then installs the new key-pointer pair into the parent inner node and updates the parent node’s existing key-pointer pair. This process does not introduce temporarily inconsistent states, because after the new key-pointer pair is installed, threads accessing the upper half of the old node are redirected to the newly created split sibling, while threads accessing the lower half still reach the old node. Updates are blocked by the per-node lock as in node consolidation. Node merges work just like node splits, except that the memory of the merged node cannot be reclaimed immediately, since concurrent readers may still be accessing it.
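
A corresponding sketch of the split path, with a hypothetical `ln_collect_live_sorted` helper and the same assumed primitives as above:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the live entries of `ln` into `out` in key order
 * and return their count. */
uint32_t ln_collect_live_sorted(const leaf_node_t *ln, ln_entry_t *out);

/* Split a full leaf into two halves (caller holds old's lock). The returned
 * separator key is then installed into the volatile parent with a plain
 * 8-byte store, together with the pointer to the new sibling. */
uint64_t ln_split(leaf_node_t *old, leaf_node_t *prev)
{
    ln_entry_t live[LN_CAPACITY];
    uint32_t n = ln_collect_live_sorted(old, live);

    leaf_node_t *lo = nvm_alloc_leaf();
    leaf_node_t *hi = nvm_alloc_leaf();

    lo->nelems = n / 2;
    hi->nelems = n - lo->nelems;
    memcpy(lo->entries, live,              lo->nelems * sizeof(ln_entry_t));
    memcpy(hi->entries, live + lo->nelems, hi->nelems * sizeof(ln_entry_t));

    hi->next = old->next;             /* chain: ... -> lo -> hi -> old's sibling */
    lo->next = hi;
    persist(hi, sizeof(*hi));
    persist(lo, sizeof(*lo));

    prev->next = lo;                  /* atomic pointer swing commits the split */
    persist(&prev->next, sizeof(prev->next));

    return hi->entries[0].key;        /* separator key for the parent inner node */
}
```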

Inner nodes are further divided into two types. Last level inner nodes contain key-pointer pairs that point to leaf nodes, just like normal B+Tree nodes. All other inner nodes contain only keys and no pointers. Assuming that a last level inner node contains m keys and (m + 1) pointers, and that both keys and pointers are 8 bytes (the most common configuration for in-memory indexing), a pointer-free inner node can hold up to (2m + 1) keys in the same space, which uses node storage more efficiently. To locate the next level node without an explicit pointer, the paper proposes that all inner nodes on the same level be allocated as a single chunk of memory. The number of nodes in the chunk is determined by the number of nodes in the previous level and the fan-out of the tree. For example, assuming that the previous level has N nodes and the fan-out of inner levels is (2m + 1), the next level’s memory chunk must have (N * (2m + 1)) nodes in order to assign a child node to each element in the most extreme case. Note: What I don’t get is that if inner level nodes have (2m + 1) keys, then the fan-out should be (2m + 2)? In the paper the equation says (2m + 1). During tree traversal, if the thread has found its path using the j-th key in node i of the current level, the implicit next level node ID is calculated as ((2m + 1) * i + j + 1), given that the first key implicitly maps to the second child node (this is where the j + 1 comes from). The offset of the node within the chunk can then be calculated from the base address and the node size. The last level inner node contains both keys and pointers to the leaf nodes and is searched like a normal B+Tree node.
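
The implicit child computation boils down to a one-liner; the sketch below uses the (2m + 1) factor exactly as the paper states it (leaving my fan-out question above open):

```c
#include <stddef.h>

/* Implicit child addressing: all inner nodes of one level live in a single
 * contiguous array, so no child pointers are needed. If the search used the
 * j-th key of node i in the current level, the child index in the next
 * level's array follows the paper's formula. */
static inline size_t inner_child_index(size_t i, size_t j, size_t m)
{
    return (2 * m + 1) * i + j + 1;   /* (2m + 1) * i + j + 1 */
}

/* The child's address is then: next_level_base + inner_child_index(i, j, m) * node_size. */
```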

When an inner node becomes full, NV-Tree must rebuild the entire inner node hierarchy as described below. If a leaf node split needs to insert a new key-pointer pair into the last level inner node, but that node has no free slot, the rebuild process is triggered. The rebuild process allocates a larger chunk of memory such that each node in the new chunk holds only one element from the previous chunk, in order to amortize this expensive rebuild operation over as many inserts as possible. New entries are inserted into the upper levels to match the new layout of nodes. This rebuild process is performed recursively if upper levels also need to split, eventually propagating to the root node, in which case a new root node is allocated. The pointer to the root node is part of the tree metadata and is used as the starting point of traversal.
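
My reading of the rebuild sizing, as a sketch (this is an interpretation of the description above, not the paper's actual code):

```c
#include <stddef.h>

/* Size a rebuilt inner hierarchy: every key in the old last inner level gets
 * its own node in the new last level, and each level above is sized by the
 * fan-out. Returns the number of inner levels and fills sizes[] bottom-up. */
size_t rebuild_level_sizes(size_t keys_in_last_level, size_t m,
                           size_t sizes[], size_t max_levels)
{
    size_t fanout = 2 * m + 1;                  /* factor used by the paper */
    size_t levels = 0;
    size_t nodes  = keys_in_last_level;         /* one old key per new node */
    while (levels < max_levels) {
        sizes[levels++] = nodes;                /* bottom-up: last level first */
        if (nodes == 1)                         /* reached the new root */
            break;
        nodes = (nodes + fanout - 1) / fanout;  /* parents needed one level up */
    }
    return levels;
}
```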

To reduce the overhead of rebuilding the inner levels every time the tree is loaded, the paper suggests that the inner levels be written back to the NVDIMM on a normal shutdown. The root pointer should also be written back before a shutdown notice is written and flushed. If this shutdown notice is not found at startup, the inner levels are assumed to be corrupted and are rebuilt from scratch using the leaf node chain stored on the NVDIMM.
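
A sketch of the startup check this implies (field and function names are mine):

```c
#include <stdint.h>

/* Persistent tree metadata, as I imagine it from the description above. */
typedef struct {
    leaf_node_t *leaf_head;       /* head of the persistent leaf chain */
    void        *root;            /* root pointer, written back on clean shutdown */
    uint64_t     clean_shutdown;  /* the "shutdown notice": written and flushed last */
} nv_tree_meta_t;

void nv_tree_open(nv_tree_meta_t *meta)
{
    if (meta->clean_shutdown) {
        /* Inner levels and root pointer saved on NVDIMM are trusted; clear the
         * notice so an unexpected crash is detected on the next startup. */
        meta->clean_shutdown = 0;
        persist(&meta->clean_shutdown, sizeof(meta->clean_shutdown));
    } else {
        /* Inner levels are assumed corrupt: rebuild them by scanning the leaf
         * chain starting at meta->leaf_head. */
    }
}
```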