BzTree: A High-Performance Latch-Free Range Index for Non-Volatile Memory
Link: https://dl.acm.org/citation.cfm?id=3164147
Year: VLDB 2018
Keyword: MWCAS; NVM; BzTree
This paper presents BzTree, a lock-free and persistent B+Tree data structure designed for hybrid DRAM and NVM systems. This paper identifies two atomicity requirements for in-memory B+Trees typically used as database indices. First, multi-step updates should be atomic with regard to other threads. This avoids exposing inconsistent intermediate states to other threads, which may lead to data corruption or program failure. Second, multi-step updates should be atomic with regard to the NVM: at any point during execution, either all of the constituent writes or none of them should be reflected in the NVM image. The problem, however, is that current hardware primitives are insufficient to guarantee either of these with a reliable progress guarantee (HTM can satisfy the first requirement, but there is no progress guarantee unless a software fallback path is provided).
This paper presents the implementation of BzTree, which achieves atomic tree updates with regard to both other threads and the NVM. The atomicity guarantees of BzTree are delivered under a unified Multi-Word Compare-and-Swap (MWCAS) framework, in which multiple aligned machine words are tested for expected values and then updated to new values if all tests succeed. In addition, the MWCAS primitive is extended with logging support, which allows fast recovery upon failure: committed updates are reapplied, while partial changes are rolled back to restore the data structure to a consistent state. These two together ensure that the structure of the tree is consistent both in DRAM and on the NVM at any moment during execution.
The MWCAS operation proceeds in two stages. The MWCAS implementation used by BzTree assumes that the target words are either virtual address pointers, small integers (much smaller than 2^64), or bit fields. In all three cases, we can dedicate three control bits of the target word to defining the type of the word, which we discuss as follows. The first bit is a “dirty” bit, which indicates that the current DRAM image may differ from the NVM image, and the value may not be persisted when a crash happens. The MWCAS protocol dictates that a dirty value must be persisted to the NVM before it can be used. This prevents a thread accessing dirty data from making its own updates, which would have to be rolled back if a crash happens later and the dirty data is lost and cannot be recovered. (NOTE: I personally do not think this will happen.) After persisting the word, the dirty flag can be cleared in DRAM using a single-word CAS. Note that CAS is necessary in this case, since there can be concurrent modifications of the same word by other threads.
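To make this concrete, below is a minimal C++ sketch of the dirty-bit read protocol. The bit position and the persist() stub are my own assumptions for illustration, not the actual interface of the MWCAS library.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint64_t kDirtyBit = 1ull << 63;  // assumed bit position

// Stand-in for a cache-line write-back plus fence (e.g. CLWB + SFENCE on
// x86); the exact persistence instruction is platform-specific.
void persist(const void* addr) { (void)addr; }

// Read a word under the dirty-bit protocol: a dirty value must be made
// durable before any thread consumes it, so a crash can never lose a
// value that another thread has already built upon.
uint64_t read_word(std::atomic<uint64_t>& word) {
    uint64_t v = word.load(std::memory_order_acquire);
    if (v & kDirtyBit) {
        persist(&word);                  // flush the line to NVM first
        uint64_t clean = v & ~kDirtyBit;
        // CAS rather than a plain store: a concurrent thread may have
        // already cleared the flag or installed a different value.
        word.compare_exchange_strong(v, clean);
        return clean;
    }
    return v;
}
```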
The remaining two bits are used to indicate whether the value stored in the target word is a regular value, an MWCAS descriptor, or an RDCSS descriptor. RDCSS descriptors are used to implement Restricted Double-Compare Single-Swap. We use RDCSS in the first stage of MWCAS to avoid certain subtle race conditions that would break the atomicity of MWCAS. After reading a value, threads should always check these two bits to see whether the word is temporarily locked by another thread that has replaced the regular value with a descriptor. If so, the thread should “help along” and finish the incomplete RDCSS or MWCAS using the information recorded in the descriptor. This help-along protocol makes BzTree non-blocking, since threads can always make progress by helping other threads finish their operations, even when the initiating thread of the operation has been switched out by the OS.
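The following sketch shows the control flow a reader follows under these two bits. The bit layout is assumed and the help-along bodies are stubbed out; the point is that a reader never treats a descriptor pointer as application data.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint64_t kDirtyFlag = 1ull << 63;  // assumed layout of the three
constexpr uint64_t kMwcasFlag = 1ull << 62;  // control bits; the library's
constexpr uint64_t kRdcssFlag = 1ull << 61;  // actual encoding may differ

// Help-along bodies are elided; each would walk the descriptor the pointer
// refers to and complete the stalled operation on the owner's behalf.
void help_complete_rdcss(uint64_t descriptor_ptr) { (void)descriptor_ptr; }
void help_complete_mwcas(uint64_t descriptor_ptr) { (void)descriptor_ptr; }

// A reader that sees a descriptor helps finish the stalled operation and
// retries until it observes a regular value. This loop is what makes the
// scheme non-blocking.
uint64_t read_committed(std::atomic<uint64_t>& word) {
    for (;;) {
        uint64_t v = word.load(std::memory_order_acquire);
        if (v & kRdcssFlag)
            help_complete_rdcss(v & ~(kRdcssFlag | kDirtyFlag));
        else if (v & kMwcasFlag)
            help_complete_mwcas(v & ~(kMwcasFlag | kDirtyFlag));
        else
            return v;  // a regular, committed value
    }
}
```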
We then describe the node layout as follows. A node consists of four parts. The first part is a node header, which consists of a status word and two constant fields remembering the node block size and the number of sorted elements in the node (see below). The status word is updated via MWCAS, so its highest three bits are reserved. The status word is where updates to the node, structural modifications (split and merge), and node consolidation serialize against each other: each of these three types of operations first reads the status word, and either updates it or uses it as a “snapshot” of the state at the time of the read. The second part is an array of metadata entries, which contain pointers to keys and values in the data region. An entry is one word long, and also carries the three control bits plus a “visible” bit that determines whether the entry is visible to concurrent threads. The “offset” field points to the first byte of the key-value pair within the node. The third part is free space, which can be consumed from both ends to grow the metadata entry array and the key-value storage. The last part is the key-value pair storage, which is maintained as a stack growing from high addresses to low addresses. The next available byte is derived from fields in the status word (the “record count” on the metadata side), which are adjusted every time new entries are inserted.
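Translating this description into C structs might look as follows. The field widths are one plausible packing consistent with the description, not necessarily the paper's exact layout.

```cpp
#include <cstdint>

// Status word: exactly one machine word so it can be a MWCAS target.
struct StatusWord {
    uint64_t control      : 3;   // reserved for the MWCAS control bits
    uint64_t frozen       : 1;   // set when the node is about to be replaced
    uint64_t record_count : 16;  // metadata entries in use
    uint64_t block_size   : 22;  // bytes consumed by key-value storage
    uint64_t delete_size  : 22;  // bytes dead due to deletes (guides consolidation)
};

// One metadata entry, also exactly one word for the same reason.
struct MetadataEntry {
    uint64_t control      : 3;   // reserved for the MWCAS control bits
    uint64_t visible      : 1;   // entry visible to concurrent threads
    uint64_t offset       : 28;  // byte offset of the record within the node
    uint64_t key_length   : 16;  // key size in bytes
    uint64_t total_length : 16;  // key + payload size in bytes
};

struct NodeHeader {
    StatusWord status;        // the serialization point for all node operations
    uint32_t   node_size;     // constant: size of the node block
    uint32_t   sorted_count;  // constant: length of the sorted prefix
};

static_assert(sizeof(StatusWord) == 8 && sizeof(MetadataEntry) == 8,
              "both must stay MWCAS-able single words");
```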
Internal nodes of BzTree are simpler, as they do not support concurrent in-place insertion. An internal node simply consists of a key-pointer array, a constant field recording the size of the node, and a status word serving as the serialization point between node updates and node replacement (insert/delete/split/merge). Internal nodes are otherwise immutable: only the value (child pointer) fields can be updated in place, and such updates need to serialize against node replacement, as we will show below.
Entries in a leaf node are not always sorted. In fact, since BzTree performs log-structured updates to leaf nodes, entries are appended to the end of the free space when they are inserted. Given an initially sorted node (e.g. after consolidation), this essentially divides the node (and its metadata array) into two regions: a sorted region, which can be searched using binary search, and an unordered region, which stores entries out of order. If values are word-sized pointers, BzTree dictates that no key can exist in both regions, i.e. a key is either in the sorted region or in the unordered region. If the payload is embedded in the node as binary data, BzTree allows an update to store only the “delta” against the binary, in which case the same key can exist in both regions, and there can be multiple entries for the same key in the unordered region. Still, at most one entry per key may exist in the sorted region. In the following discussion we assume that values are fixed-size pointers to external objects to avoid confusion.
The same BzTree construct can be used to support various tree configurations. For example, the paper suggests that the slotted-page based node layout supports both fixed-size and variable-size keys. Similarly, values can be pointers to objects, or can be embedded as variable-size objects. In addition, the same code can be used for both a DRAM-only B+Tree and a hybrid DRAM-NVM B+Tree. The only difference is whether dirty values are flushed back to the NVM when they are set; for a DRAM-only B+Tree, the dirty bit is simply never set.
We next describe tree operations. Tree insert consists of two atomic actions. First, we read the status word and check whether the node has been frozen (the “freeze” bit is set, which indicates the node will soon be replaced). If so, we retry the traversal from the tree root to avoid inserting into a node that will be discarded. Next, we reserve space for the inserted key and value, and for the metadata entry, by performing a MWCAS on the status word, the “visible” bit of the metadata entry, and the “offset” field of the same entry (actually I think this should be a single-compare double-swap, as we do not care about the initial values of the entry fields before the MWCAS). If the allocation is successful, we then copy the content of the key and value into the space just allocated from the free chunk. After that, we perform another MWCAS on the status word and the metadata entry to set the visible bit of the metadata entry, while guaranteeing that the node has not been frozen in the meantime by another thread. (Again, I think this should be a double-compare single-swap, as we do not update the status word but merely “lock” it against concurrent modification.)
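Below is a control-flow sketch of this two-step insert. The pmwcas() stand-in is single-threaded and merely illustrates the all-or-nothing semantics; the bit positions and simplified encodings (record count bumped as +1, the entry's low bits holding either the epoch or the offset) are assumptions for illustration only.

```cpp
#include <cstdint>
#include <cstring>
#include <initializer_list>

constexpr uint64_t kFrozenBit  = 1ull << 60;  // assumed positions, just below
constexpr uint64_t kVisibleBit = 1ull << 60;  // the control bits of each word

struct Word { uint64_t* addr; uint64_t expected; uint64_t desired; };

// Single-threaded stand-in so the sketch is self-contained; the real
// primitive applies all words atomically and persistently, or none.
bool pmwcas(std::initializer_list<Word> words) {
    for (const Word& w : words)
        if (*w.addr != w.expected) return false;
    for (const Word& w : words)
        *w.addr = w.desired;
    return true;
}

bool leaf_insert(uint64_t* status, uint64_t* meta, char* record_space,
                 uint64_t record_offset, const char* key, uint32_t key_len,
                 uint64_t payload, uint64_t global_epoch) {
    uint64_t s = *status;
    if (s & kFrozenBit) return false;       // node being replaced: re-traverse

    // Step 1: reserve the metadata entry and record space. The offset field
    // temporarily holds the (non-zero) global epoch with the visible bit
    // clear, marking the entry "allocated but not yet initialized".
    uint64_t s_alloc = s + 1;               // stand-in for the real bookkeeping
    if (!pmwcas({{status, s, s_alloc}, {meta, 0, global_epoch}}))
        return false;                       // lost a race; caller retries

    // The entry is invisible, so the copy needs no synchronization.
    memcpy(record_space, key, key_len);
    memcpy(record_space + key_len, &payload, sizeof payload);

    // Step 2: publish. The status word is compared but left unchanged,
    // which only verifies the node was not frozen in the meantime.
    return pmwcas({{status, s_alloc, s_alloc},
                   {meta, global_epoch, kVisibleBit | record_offset}});
}
```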
Delete is simpler. After checking the freeze bit and locating the entry to delete in the node, we perform a MWCAS on both the status word and the metadata entry. We change the status word such that the delete has an observable effect, which determines its serialization point against concurrent inserts and deletes: the “delete size” field in the status word is therefore incremented using the MWCAS. As for the metadata entry, we clear the “visible” bit to indicate that the entry no longer exists.
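The matching delete, continuing the same sketch (same pmwcas() stand-in and bit positions as above):

```cpp
bool leaf_delete(uint64_t* status, uint64_t* meta, uint64_t visible_entry) {
    uint64_t s = *status;
    if (s & kFrozenBit) return false;   // node being replaced: re-traverse
    // Bumping the delete-size field (modeled here as +1) gives the delete
    // an observable effect on the status word, hence a serialization point.
    // Zeroing the entry clears the visible bit and sets offset == 0, which
    // marks "deleted" as opposed to "allocated but uninitialized".
    return pmwcas({{status, s, s + 1}, {meta, visible_entry, 0}});
}
```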
Note that when the “visible” bit is clear in an entry, there are two possibilities: the entry has either been allocated but not initialized, or it has just been deleted. To distinguish between these two cases, we perform an additional update on the “offset” field when inserting into the node: the “offset” field is set to a “global epoch number”, which is maintained as a non-zero counter. When the entry is deleted, we simply set the offset field to zero. This way, when the “visible” bit is clear, we can distinguish between an uninitialized entry and a deleted entry by comparing the offset field against zero. The offset field set to the global epoch number will also prove useful for recovery, as we will see later.
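In the simplified encoding of the sketches above, this disambiguation rule could be captured by a small helper like the one below; the stale-epoch case anticipates the recovery discussion at the end.

```cpp
// kVisibleBit as in the earlier sketches. With the visible bit clear, the
// offset field distinguishes the three invisible states.
enum class EntryState { Visible, Deleted, InFlight, DeadFromOldEpoch };

EntryState classify(uint64_t entry, uint64_t current_epoch) {
    if (entry & kVisibleBit) return EntryState::Visible;
    uint64_t offset_field = entry & ~kVisibleBit;
    if (offset_field == 0) return EntryState::Deleted;            // deleted entry
    if (offset_field == current_epoch) return EntryState::InFlight; // insert in progress
    return EntryState::DeadFromOldEpoch;  // pre-crash leftover; treat as deleted
}
```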
Key insert and key delete are serialized against each other on the status word. An insert performs MWCAS on the status word twice: once to allocate space, and once to make sure the status word has not changed since the allocation. Inserts and deletes are serialized against node consolidation, split and merge also on the status word, by checking the “freeze” bit before the operation and using the status word without the “freeze” bit in the MWCAS to ensure that no node replacement can happen during the insert/delete. One potential problem is that node insert is not atomic, since we need to copy the key and value into the allocated space, which cannot easily be done with one MWCAS. Node insert consists of two MWCAS, which exposes an intermediate state in which a metadata entry is allocated, but its content is not yet available for key comparison. This introduces a problem for both reads and concurrent inserts/deletes, since the key is unknown. In the simplest case, concurrent threads can just spin on the “visible” bit for a while to wait for the key-value copy. If spinning is infeasible, the thread can continue its operation optimistically, assuming the missing key does not conflict with its operation. To cover the unfortunate case where the missing key does incur a conflict, the thread must recheck after it has allocated space for its own operation (assuming it is an insert) but before it sets the “visible” bit of its own entry. If there is indeed a key conflict, the thread just deletes its own entry by clearing the offset field. Deletes and reads can safely ignore missing keys, as they simply serialize before the insert.
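The bounded-spin option could look like the following; the spin budget is an arbitrary choice for the sketch, and kVisibleBit is the same assumed bit as above.

```cpp
#include <atomic>
#include <cstdint>

// Spin briefly for an in-flight entry's visible bit; if it never appears,
// give up and take the optimistic path (proceed, then recheck for a key
// conflict before publishing your own entry).
bool wait_for_visible(const std::atomic<uint64_t>& meta, int max_spins) {
    for (int i = 0; i < max_spins; ++i)
        if (meta.load(std::memory_order_acquire) & kVisibleBit)
            return true;   // the racing insert finished its key-value copy
    return false;          // still invisible: treat the key as unknown
}
```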
Note: In fact, I do not think BzTree is truly non-blocking, due to the fact that inserts require two atomic MWCAS operations. If a thread never wakes up after allocating space but before initializing its key-value pair, then all following insert threads will be blocked by this thread, since they cannot continue their insertions until the missing key is filled in. On the other hand, they could choose to serialize before the potentially conflicting insert, as long as inserts use the new value of the status word from the first MWCAS as the expected old value for the second MWCAS. This guarantees that if an intervening insert/delete happens on the node, the second MWCAS must fail due to the updated status word. This workaround, however, is not in the paper.
Reading a leaf node consists of binary searching the sorted part, and scanning the unsorted part if the binary search does not find the entry. For embedded objects, if a delta is posted for each update, we need to conduct a full scan of the node regardless of whether the key is found by the binary search. The deltas are applied to the initial image of the embedded object in the order they were appended to the node.
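A sketch of leaf lookup under the pointer-payload assumption, with a simplified in-memory record representation (the real node stores offsets into its own storage area, as described earlier):

```cpp
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

struct Record { std::string_view key; uint64_t payload; bool visible; };

// records[0..sorted_count) is the binary-searchable prefix; the rest is
// the append-only unsorted region. Names and layout are illustrative.
std::optional<uint64_t> leaf_search(const std::vector<Record>& records,
                                    size_t sorted_count,
                                    std::string_view key) {
    // Scan the unsorted suffix newest-first, so the latest write wins.
    for (size_t i = records.size(); i-- > sorted_count; ) {
        if (records[i].key == key)
            return records[i].visible
                       ? std::optional<uint64_t>(records[i].payload)
                       : std::nullopt;  // most recent entry is a delete
    }
    // Fall back to binary search over the sorted prefix.
    size_t lo = 0, hi = sorted_count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (records[mid].key < key) lo = mid + 1; else hi = mid;
    }
    if (lo < sorted_count && records[lo].key == key && records[lo].visible)
        return records[lo].payload;
    return std::nullopt;
}
```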
As new entries are appended to a leaf node, searching can become a performance bottleneck, as the length of the unordered array to scan on each lookup keeps growing. To reduce this overhead, we periodically “consolidate” the node by sorting all live entries and copying them into a new node. The old node is then replaced by an atomic switch of the parent node's child pointer. We describe this process as follows. First, the freeze bit of the node is checked to ensure the node is still valid. Then, we perform a single-word CAS using the MWCAS primitive (for persistence) to set the freeze bit of the leaf node. This MWCAS serializes the node consolidation against other concurrent operations on the node; operations observing it must re-traverse the tree, since the node will soon be replaced. Next, a new leaf node is allocated, and entries in the old node are sorted and inserted into the new node. The last step is to install the new node into the parent node. We assume each tree traversal maintains a node stack that stores the nodes on the path. The parent node is first checked for validity, and then a MWCAS is performed on the parent node's status word and the corresponding child pointer to switch from the old child to the new child. Note that the MWCAS here is again a double-compare single-swap, whose purpose is just to ensure that the status word does not change during the node switch. After a successful MWCAS, the old node is moved onto a garbage chain, and is freed only after all threads that could have seen it have entered quiescence. Inner nodes of BzTree do not require consolidation, since they are immutable and do not support in-place inserts; the paper reports that keeping inner nodes fully sorted improves performance by avoiding unordered-region searches at intermediate levels.
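A control-flow sketch of consolidation, reusing the pmwcas() stand-in and kFrozenBit from the insert sketch; the node builder and the reclamation hook are assumed helpers, declared but not implemented.

```cpp
struct Node;
Node* build_sorted_copy(Node* old_node);  // assumed: copies live entries, sorted
void  retire(Node* old_node);             // assumed: freed after readers quiesce

bool consolidate(uint64_t* leaf_status, uint64_t* parent_status,
                 uint64_t* child_slot, Node* old_node) {
    // Step 1: freeze the leaf, so every concurrent operation that sees
    // the bit re-traverses from the root instead of touching a doomed node.
    uint64_t s = *leaf_status;
    if (s & kFrozenBit) return false;            // someone else got here first
    if (!pmwcas({{leaf_status, s, s | kFrozenBit}})) return false;

    // Step 2: build the replacement off to the side. The frozen node can
    // no longer change, so no further synchronization is needed here.
    Node* fresh = build_sorted_copy(old_node);

    // Step 3: install it. The parent's status word is compared but left
    // unchanged ("double compare, single swap"); if the parent was frozen
    // or modified meanwhile, the MWCAS fails and the caller retries after
    // a fresh traversal.
    uint64_t ps = *parent_status;
    if ((ps & kFrozenBit) ||
        !pmwcas({{parent_status, ps, ps},
                 {child_slot, (uint64_t)old_node, (uint64_t)fresh}}))
        return false;

    retire(old_node);
    return true;
}
```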
Node split and merge work similarly. Since inner nodes are immutable, a split or merge must be finalized by creating a new copy of the affected inner node, and then performing a MWCAS on the grandparent node in the same manner as node consolidation. For a split, a new sibling node is created and then inserted into the newly created copy of the inner node. For a merge, a sibling is chosen (either left or right), and a new node is created by merging the active entries of both nodes. During this process, we must ensure that the nodes under operation remain valid, by first checking their freeze bits and then using MWCAS to set the freeze bits. Replaced nodes are sent to the garbage collector for future reclamation.
On recovery, no BzTree-specific routine is required. The MWCAS library handles all unfinished operations by either rolling them back or reapplying their changes. Note that an insert operation consists of two atomic MWCAS, one to allocate space and another to activate the entry. If the system crashes between these two steps, a dead entry is left in the node, which may block future operations (since its key is left undetermined). To deal with this, on every crash recovery the global epoch counter is incremented by one. During future tree operations, when we see a suspected dead entry (i.e. with the visible bit clear), we compare its offset field with the current epoch counter. If the values do not match, we know this entry is a dead entry from an incomplete insert before the crash, in which case we simply zero out the key and value and mark the entry as deleted.