Endurable Transient Inconsistency in Byte-Addressable Persistent B+-Tree



Paper Title: Endurable Transient Inconsistency in Byte-Addressable Persistent B+-Tree
Link: https://www.usenix.org/conference/fast18/presentation/hwang
Venue: FAST 2018
Keyword: NVM; B+Tree; FAST; FAIR




Highlights:

  1. The key innovation of this paper is that atomic transformations between consistent states tie volatile (in-cache) consistency to non-volatile (NVM) consistency.

Questions:

  1. What if values are duplicated at the leaf level (e.g. if we store integers as values)? A reader cannot simply assume that a copy is in progress just because it sees duplicated values. The paper proposes using read locks at the leaf level to solve this value-conflict problem, but then FAST is only useful for NVM.
  2. I seriously doubt the correctness of the concurrency control in FAST & FAIR. The paper would be better if it simply assumed single-threaded access and left concurrency issues to future work.
  3. I also seriously doubt the paper's claim that threads conflict more at the inner levels than at the leaf level.
  4. Actually there is a concurrency bug as pointed out by a later paper (RECIPE).

This paper presents FAST and FAIR, a B+Tree implementation designed for byte-addressable NVM. The paper identifies two major challenges in designing a persistent B+Tree. The first challenge is to handle inserts and deletes efficiently without exposing intermediate states, especially when keys are longer than 8 bytes, which is the largest unit of failure-atomic update on NVM-mapped memory. A classical B+Tree maintains a sorted array of keys in both inner nodes and leaf nodes in order to perform binary search. Inserting an element into the sorted array therefore involves shifting some elements to higher addresses to open a new slot in the middle of the array. Without proper design, such element shifting may introduce temporary inconsistent states, such as lost keys or wrong child pointers being followed. If the system crashes at this point, the intermediate state may be unrecoverable. The second challenge is structural modification operations (SMOs), initiated by threads attempting to split or merge a node. A B+Tree SMO consists of several steps, each of which brings the affected nodes into an inconsistent intermediate state.

Prior research has proposed adapting logging concepts from database systems to B+Tree implementations. For example, in order to update a single node atomically, an extra level of indirection is added to each node, mapping logical locations of elements to physical locations in the node. The node is updated by first appending new elements to the end of the storage area, and then updating the mapping atomically, using either an 8-byte atomic update or more advanced techniques such as Hardware Transactional Memory (HTM). Since elements are not logically committed until the last atomic update step, intermediate results cannot be observed even if the system crashes in the middle of an update. This approach makes node updates easy, at the cost of restricting the node layout and node size, since the mapping field cannot exceed a cache line, which is the maximum unit of atomic persistence on the NVM side. The mapping layer can even be eliminated by maintaining an in-node log and a single log tail pointer. Threads only update the tail pointer after they have appended data to the end of the in-node log. This way, elements are no longer sorted in the node, and threads need to recover the current node content by scanning the log every time they traverse the node. As a compromise, some designs allow a node to be partially sorted: part of the elements are sorted as a result of log consolidation, while new elements are still appended in a log-structured fashion until the next consolidation. Although this seems promising, the paper identifies that in-node logging requires an excessive number of cache line flushes and memory fences, which must be inserted after appending data and after adjusting the tail pointer (or the mapping field). Multi-step SMOs are usually handled by lightweight logging or variants of it. A thread first creates a log record describing the SMO, including the operation type and arguments, and flushes the log record, which logically commits the SMO. The thread then locks the affected nodes (or uses a non-blocking help-along protocol) and performs the update in place. The log record is removed only after all dirty items have been flushed back to the NVM. On recovery, active log records are found and replayed by the recovery process. As noted by the paper, this approach is universally useful, not only for B+Trees but also for other common types of data structures. It is, however, inefficient due to the excessive use of cache line flushes and memory fences.
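
To make the flush-and-fence overhead of in-node logging concrete, below is a minimal C++ sketch of the append-plus-tail-pointer scheme described above. It is not the code of any cited system: the LogNode layout and the persist() helper (built on _mm_clflush/_mm_sfence) are illustrative assumptions. Note that every insert already needs two flush/fence rounds, one for the appended record and one for the tail pointer.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // _mm_clflush, _mm_sfence

constexpr size_t CACHELINE = 64;

// Flush every cache line covering [p, p + len) and issue a store fence.
inline void persist(const void* p, size_t len) {
    uintptr_t start = reinterpret_cast<uintptr_t>(p) & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end   = reinterpret_cast<uintptr_t>(p) + len;
    for (uintptr_t a = start; a < end; a += CACHELINE)
        _mm_clflush(reinterpret_cast<const void*>(a));
    _mm_sfence();
}

struct LogNode {
    uint64_t tail = 0;   // number of committed records; updated with one 8-byte store
    struct Record { uint64_t key; uint64_t value; } log[126];
};

// Append-only insert: the record is persisted first, then the tail pointer,
// which is the commit point. Two flush/fence rounds per insert -- exactly
// the overhead the paper argues against.
void append_insert(LogNode* n, uint64_t key, uint64_t value) {
    uint64_t pos = n->tail;
    n->log[pos] = {key, value};
    persist(&n->log[pos], sizeof(LogNode::Record));
    n->tail = pos + 1;
    persist(&n->tail, sizeof(n->tail));
}
```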

FAST & FAIR differs from previous proposals in that it does not attempt to avoid exposing intermediate inconsistent states to concurrent readers and to the NVM. Instead, node updates and SMOs follow a carefully designed protocol such that every intermediate state is still consistent, in the sense that the correct node image can always be recovered from it. Compared with previous approaches, FAST & FAIR does not double the write traffic to the NVM, does not limit the maximum node size, and requires no recovery-time processing. We present FAST and FAIR in the following sections.

FAST, or Failure-Atomic Shift, is a technique for shifting elements in a sorted node without losing consistency. The basic observation is that when elements are shifted from left to right (i.e. from lower to higher addresses), elements can only be duplicated, never lost, as long as we copy elements one by one from right to left. For example, given the key array | A | B | C | D | E | x | x | x | (x means NULL pointer, indicating the end of the search array), to insert between C and D we first copy element E to the next slot, resulting in a node with two identical ā€œEā€ keys, and repeat until all keys after the insertion point have been shifted. The NVM image, however, may be left in an inconsistent state if an element is copied across a cache line boundary. In the above example, if element E is copied into the next cache line, then element D is copied into E's old location, and then the first cache line is evicted back to the NVM, the NVM image becomes inconsistent: the first cache line reads | A | B | C | D | D | while the second cache line is still | x | x | x |, resulting in permanent loss of key E after a crash. To avoid this problem, the paper points out that whenever an element is copied across a cache line boundary, the copying thread should flush the higher-addressed cache line back to the NVM, persisting the key it just moved, before it overwrites the previous location of the moved element. In other words, the shifting protocol is designed such that at any given moment in time, there is at least one copy of each key in the NVM image of the node. In addition, among all elements being shifted, at most one element can have two copies, namely the element currently being copied. As long as these two rules are observed by the copying thread, the node image can always be recovered by scanning the two arrays (the key array and the value array) and ignoring the duplicated item. In practice, reading threads do not need to rebuild the entire node explicitly. The correct child node can be found by simply scanning the node while keeping track of the current key and value; the current key and value are advanced if and only if the next key or value differs from the current one. Since we assume keys and values can be copied atomically, both copies are valid during the shift.
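
The following is a minimal single-writer sketch of the FAST shift described above, assuming 8-byte keys and values, a NULL value marking the end of the arrays, and a persist() wrapper around CLFLUSH/SFENCE. The Node layout and names are illustrative, not the paper's code, and the caller is assumed to split the node first if it is full.

```cpp
#include <cstdint>
#include <emmintrin.h>  // _mm_clflush, _mm_sfence

constexpr uintptr_t CACHELINE = 64;

// Flush the cache line holding p and issue a store fence.
inline void persist(const void* p) { _mm_clflush(p); _mm_sfence(); }

// True if the two addresses fall in different cache lines.
inline bool different_lines(const void* a, const void* b) {
    return (reinterpret_cast<uintptr_t>(a) / CACHELINE) !=
           (reinterpret_cast<uintptr_t>(b) / CACHELINE);
}

struct Node {
    static constexpr int CARDINALITY = 30;
    uint64_t keys[CARDINALITY];    // 8-byte keys
    void*    values[CARDINALITY];  // nullptr marks the end of the array
};

// Single-writer FAST insertion: shift right to left, duplicating each element
// into its right neighbour before overwriting it, and flush the higher-addressed
// cache line whenever a copy crosses a line boundary, so the NVM image always
// holds at least one copy of every element.
void fast_insert(Node* n, uint64_t key, void* value) {
    int num = 0;
    while (num < Node::CARDINALITY && n->values[num] != nullptr) ++num;
    int pos = 0;
    while (pos < num && n->keys[pos] < key) ++pos;

    for (int i = num; i > pos; --i) {
        n->keys[i] = n->keys[i - 1];
        if (different_lines(&n->keys[i], &n->keys[i - 1]))
            persist(&n->keys[i]);
        n->values[i] = n->values[i - 1];
        if (different_lines(&n->values[i], &n->values[i - 1]))
            persist(&n->values[i]);
    }
    n->keys[pos]   = key;
    n->values[pos] = value;
    // For brevity only the lines holding the new element are flushed here; a
    // full implementation flushes every dirty line of the node.
    persist(&n->keys[pos]);
    persist(&n->values[pos]);
}
```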

The FAST algorithm works even on an architecture that reorders non-dependent loads and stores. Our previous discussion assumes Total Store Ordering (TSO), in which stores are not reordered with each other and dependent loads and stores are always committed in program order. Since each store to a key slot (except the first one, which overwrites the NULL field) depends on a previous load reading from that slot, the key array loads and stores are serialized in program order. Furthermore, since stores to key and value slots are always ordered, and value array stores are serialized after their corresponding loads just like key array stores, we can conclude that on a TSO architecture the key copy and the value copy proceed in lockstep: the dynamic instructions copying the next (lower-addressed) key-value pair will not be executed before those copying the current pair have committed. Under relaxed memory ordering, keys and values may fall completely out of sync, in the sense that the two instruction streams copying keys and values can be scheduled such that the key and the value currently being copied do not necessarily belong to the same pair. This relaxation, however, does not affect the correctness of reads, because we never assume that the duplicated key and the duplicated value are in the same slot. We simply assume that some slot may hold a duplicated key or value, in which case we ignore the duplicated item and keep scanning the node.
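
As a companion sketch, here is how a lock-free reader of the node above might tolerate these transient duplicates: a slot is skipped whenever its key or its value repeats the previously accepted slot, which covers both the TSO case and the relaxed-ordering case where the duplicated key and value sit in different slots. This is an illustration, not the paper's read routine; it assumes 0 is not a valid key and elides the volatile/atomic annotations a real concurrent implementation would need.

```cpp
#include <cstdint>

// Node layout as in the FAST sketch above (illustrative, not the paper's code).
struct Node {
    static constexpr int CARDINALITY = 30;
    uint64_t keys[CARDINALITY];
    void*    values[CARDINALITY];  // nullptr marks the end of the array
};

// Point lookup that ignores the (at most one) in-flight duplicate exposed by a
// concurrent FAST shift.
void* fast_lookup(const Node* n, uint64_t key) {
    uint64_t prev_key = 0;          // assumes 0 is never a valid key
    void*    prev_val = nullptr;
    for (int i = 0; i < Node::CARDINALITY && n->values[i] != nullptr; ++i) {
        if (n->keys[i] == prev_key || n->values[i] == prev_val)
            continue;               // duplicated key or value: skip this slot
        prev_key = n->keys[i];
        prev_val = n->values[i];
        if (prev_key == key)
            return prev_val;
    }
    return nullptr;  // not found; a concurrent writer may require a retry
}
```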

The above observation does not hold for keys that cannot be copied atomically with an 8-byte store instruction. In that case, threads may read partially copied keys and fail the comparison (either an invalid access, or a comparison giving the wrong result). One easy way of dealing with this is to force all keys to be 8 bytes in size through indirection, i.e. storing only pointers to the actual key objects. This simple solution hurts performance, as each key access becomes a pointer dereference and a potential cache miss. Another, more involved solution is to enforce TSO-like ordering by inserting memory barriers after copying each key-value pair. The memory barrier orders the key copy and the value copy so that they stay synchronized: at any moment in time, if we detect duplicated values, then the keys must also be duplicated, since the current key-value pair is always copied before the next pair can be. Since values in a B+Tree are usually just pointers to external objects or to the next level, we can still derive the consistent node image by checking whether two adjacent values are duplicated; if so, the keys are also duplicated, and the duplicated pair is ignored.
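
A hedged sketch of this barrier-based variant for wide keys follows: a full memory fence after each key-value pair copy keeps the two copy streams in lockstep, so duplicated (pointer-sized, hence atomically readable) values imply duplicated keys. The LongKey type and WideNode layout are illustrative assumptions, and the cache-line-boundary flushes from the FAST sketch are omitted for brevity.

```cpp
#include <cstring>
#include <emmintrin.h>  // _mm_mfence

// An illustrative key type wider than the 8-byte atomic store unit.
struct LongKey { char bytes[32]; };

struct WideNode {
    static constexpr int CARDINALITY = 16;
    LongKey keys[CARDINALITY];
    void*   values[CARDINALITY];  // nullptr marks the end of the array
};

// Shift loop for wide keys: the fence ensures the next (lower-addressed) pair
// is not copied before the current pair is globally visible, so a reader that
// observes duplicated values can safely treat the keys as duplicated too.
void shift_right_wide(WideNode* n, int insert_pos, int num) {
    for (int i = num; i > insert_pos; --i) {
        std::memcpy(&n->keys[i], &n->keys[i - 1], sizeof(LongKey));
        n->values[i] = n->values[i - 1];
        _mm_mfence();  // keep key copies and value copies in lockstep
    }
}
```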

The paper also proposes FAIR, Failure-Atomic In-place Rebalance, a lock-free protocol for conducting B+Tree SMOs while allowing concurrent readers. The idea of FAIR is based on the Blink-Tree, in which every node in the tree, not only the leaf nodes, has a sibling pointer. When a thread reaches the next level via a child pointer, it must compare the high key of the current node with the low key of the sibling node to decide whether it should move sideways to the sibling. Such a traversal may span several sibling nodes if contention on them is heavy (i.e. a node is split again before the thread that split it has had a chance to insert into the parent). By adding a sibling pointer, the partial state in which a sibling node has been created but not yet added to the parent becomes a valid state, since all elements of the original node remain reachable even if the system crashes at that point. Node split proceeds as follows. The first step is to create a new node whose sibling pointer points to the old node's next sibling and which holds the upper half of the old node's entries. We persist the new node before linking it to the old node. In the second step, the sibling pointer of the old node is redirected to the new node. This step does not change the logical node content, since the low key of the new node is still lower than the high key of the old node, in which case threads simply ignore the new node (and will try to fix it by completing the partial state during lazy recovery). In the third step, we activate the sibling node by storing a NULL pointer into the split point of the old node. This store is atomic as long as threads check the value for NULL first during a scan (alternatively, the node size field can be updated atomically). After this point, the new node is active and can be inserted into by threads reaching it via the sibling pointer. In the last step, we insert a new element into the parent node using FAST. Throughout this protocol, every change is flushed back to the NVM right after it is performed, to ensure that updates reach the NVM in the intended order.
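
The four split steps can be summarized in the sketch below, which reuses the Node layout and persist() wrapper from the FAST sketch and assumes a stand-in allocator; locking and the recursive case where the parent itself must split are elided, so this illustrates only the ordering of stores and flushes, not a complete implementation.

```cpp
#include <cstdint>
#include <emmintrin.h>  // _mm_clflush, _mm_sfence

constexpr uintptr_t CACHELINE = 64;
inline void persist(const void* p) { _mm_clflush(p); _mm_sfence(); }

struct Node {
    static constexpr int CARDINALITY = 30;
    uint64_t keys[CARDINALITY];
    void*    values[CARDINALITY];  // nullptr marks the end of the array
    Node*    sibling;
};

// Zero-initialized DRAM stand-in; a real tree uses a persistent NVM allocator.
Node* allocate_nvm_node() { return new Node(); }

// FAST insertion as sketched earlier (assumed helper).
void fast_insert(Node* n, uint64_t key, void* value);

void fair_split(Node* old_node, Node* parent) {
    constexpr int mid = Node::CARDINALITY / 2;

    // Step 1: build the sibling holding the upper half; it inherits the old
    // node's sibling pointer. Persist it before it becomes reachable.
    Node* new_node = allocate_nvm_node();
    for (int i = mid; i < Node::CARDINALITY; ++i) {
        new_node->keys[i - mid]   = old_node->keys[i];
        new_node->values[i - mid] = old_node->values[i];
    }
    new_node->sibling = old_node->sibling;
    for (uintptr_t a = reinterpret_cast<uintptr_t>(new_node);
         a < reinterpret_cast<uintptr_t>(new_node) + sizeof(Node); a += CACHELINE)
        persist(reinterpret_cast<const void*>(a));

    // Step 2: link the new node. Readers still ignore it, because the new
    // node's low key is below the old node's high key.
    old_node->sibling = new_node;
    persist(&old_node->sibling);

    // Step 3: activate the split by truncating the old node at the split
    // point with a single 8-byte store of NULL.
    old_node->values[mid] = nullptr;
    persist(&old_node->values[mid]);

    // Step 4: insert the separator key and a pointer to the new node into
    // the parent using FAST.
    fast_insert(parent, new_node->keys[0], new_node);
}
```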

When a node underflows, FAIR first removes its entry from the parent node, which logically merges the node with its left sibling. The merging thread then checks whether the left node has enough entries to spare. If so, the thread first inserts the borrowed entries into the underflowed node, and only then deletes them from the left node (by storing NULL into the value field). After the nodes have been balanced, we re-insert a separator key into the parent node using the current low key of the right node.
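
Below is a hedged sketch of this borrow-from-the-left-sibling path, again reusing the earlier Node layout and treating fast_insert/fast_delete as the FAST-style helpers from the previous sketches; the full-merge case (no entries to spare) and all locking are elided, and the function and parameter names are illustrative.

```cpp
#include <cstdint>
#include <emmintrin.h>  // _mm_clflush, _mm_sfence

inline void persist(const void* p) { _mm_clflush(p); _mm_sfence(); }

struct Node {
    static constexpr int CARDINALITY = 30;
    uint64_t keys[CARDINALITY];
    void*    values[CARDINALITY];  // nullptr marks the end of the array
    Node*    sibling;
};

// FAST-style shift routines assumed from the earlier sketches.
void fast_insert(Node* n, uint64_t key, void* value);
void fast_delete(Node* n, uint64_t key);

int num_entries(const Node* n) {
    int num = 0;
    while (num < Node::CARDINALITY && n->values[num] != nullptr) ++num;
    return num;
}

void fair_rebalance(Node* parent, Node* left, Node* right, uint64_t sep_key) {
    // Step 1: delete the separator from the parent, which logically merges
    // `right` into `left`; `right` stays reachable via the sibling pointer.
    fast_delete(parent, sep_key);

    // Step 2: migrate the largest entries of `left` into `right`, persisting
    // each inserted copy (inside fast_insert) before invalidating the source
    // slot with a single 8-byte NULL store.
    while (num_entries(left) > Node::CARDINALITY / 2) {
        int last = num_entries(left) - 1;
        fast_insert(right, left->keys[last], left->values[last]);
        left->values[last] = nullptr;
        persist(&left->values[last]);
    }

    // Step 3: re-insert a separator into the parent using the right node's
    // new low key, restoring the balanced state.
    fast_insert(parent, right->keys[0], right);
}
```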

On recovery, no special recovery routine is needed. Since the tree structure is always kept consistent, the NVM image of the tree can be loaded directly and used for searches and updates. A thread, however, needs to fix an inconsistency when it encounters one during normal execution, to ensure that later operations are conducted on a non-transient state.