WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems



- Hide Paper Summary
Paper Title: WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems
Link: https://www.usenix.org/conference/fast17/technical-sessions/presentation/lee-se-kwon
Year: FAST 2017
Keyword: ART; Radix Tree; WORT



Back

This paper proposes WORT, an algorithm for implementing radix tree and its variant, Adaptive Radix Tree (ART), on persistent memory. The paper identifies that it is challenging to make B+Trees persistent, since B+Tree maintains keys within a node in sorted order for efficient search. In order to update a node, either we no longer require such sorted property of nodes, and update nodes in append-only manner, or we add an extra level of indirection, and relies on atomic update of the indirection field to insert keys atomically while maintaining sorted order. Both methods have problems, as append-only updates hurts search performance for large nodes due to loss of sorted key property, while the indirection level incurs more memory accesses which results in longer critical path. In addition, structural modification operations (SMO) on B+Trees cannot be performed atomically with regard to NVM. Either we use redo or undo logging to ensure post-crash recovery always restoring the tree to a consistent state, which incurs excessive writes, or we design each atomic step of SMO in such a way that the tree is always transformed between consistent states. The latter scheme complicates the tree update protocol and may introduce subtle bugs.

This paper proposes making radix tree persistent as a replacement for B+Trees as in-memory indexing structure. Compared with B+Tree, radix trees have the following properties. First, radix trees only accepts fixed sized keys, while B+Tree accepts variably sized keys. Binary representations of keys are divided into fixed-sized slices, from MSB to LSB, the numeric value of which are then used to index child nodes for the next level search. This process is repeated for every key slice, until the bottom level is reached. The way that a key slice is mapped to the next level node is implementation dependent. In its simplest form, a node just consists of an array of child node pointers, and we directly use the numeric value of the key slice as the index into this array to locate the child node. In more complicated schemes, an indirection level is used to compress the array when the array is sparse (i.e. most entries are NULL, indicating that the corresponding key prefix does not exist in the tree). The second property of radix tree is that the maximum height of the tree is fixed, which is proportional to the size of the key. To reduce the height of the tree or the height of certain paths, path compression is often used to merge nodes that only have one child node. The third property is that when keys are inserted sparsely in the key space, radix tree demonstrates very low memory utlization, since only a few slots in each node will be filled with non-NULL entries. Again, we use path compression to remove nodes that only have one child node and store the aggregated prefix in the compressed node itself. In the paper, it is assumed that key slices are 4 bits, and that each node consists of 16 child node pointers. Key slices are mapped to child nodes using direct mapped array indexed by the numeric value of the four-bit slice.

Insert operations are easily made failure atomic by always updating the node in the tree in the last step. When a key is to be inserted, we first traverse the tree using prefix of the key, until leaf level is reached, which means that the key already exists, or until a NULL pointer is found in a non-leaf node. In the latter case, we create the rest part of the path by allocating new nodes and linking them together. After flushing these new nodes back to the NVM, we update the last node in the current path such that a child pointer is stored to the corresponding slot pointing to the newly allocated node. The insertion is completed by flushing the updated pointer back to the NVM.

Path compression reduces the length of a path by merging nodes that only have one child node. For example, given a path down the tree: u, v, w, x, y, z (u is not nucessarily the root node), in which every node except z has only one child. The letter represents the value of the key slice that is used to traverse to the current node from the parent node (e.g. node “u” is reached by using value “u” as the index to read the child pointer from the parent node). In this case, traversing through u to z is redundant, since the tree structure guarantees that any traversal starting from node u must reach node z, or stop half way if the key does not exist in the tree. Based on this observation, we can compress this path into a single node z’, which has the same content as node z, but also stores the implicit prefix “vwxyz”. Note that prefix u does not need to be stored, since it is already implied by the fact that the traversal uses “u” as the key to index the parent of node u. The compressed node z’ is connected to the parent node of u in index position “u”, such that threads traversing the tree will reach node z’ when it uses “u” as the key slice. To ensure correctness, a prefix comparison is performed to check whether the actual prefix in the search key matches the implied prefix, and if true, the prefix will be skipped without being used to index child nodes. Otherwise the key does not exist in the tree. To enable prefix comparison, each node has an 8-byte prefix header. The first two bytes are used to record the current level of the node and the number of bytes in the implied prefix. The remaining six bytes stores the actual prefix, enabling path compression of 12 nodes at a maximum (since this paper assumes 4 bit key slices).

When a key whose prefix does not equal the implied prefix is inserted, the traversal of the insert operation will stop at the compressed node z’ as a result of unmatching key prefix. A path expansion is performed to reduce the size of the implied prefix, and inserts the new node to store the newly inserted key’s prefix. The paper uses the following algorithm for path expansion. We assume that the newly inserted key has the same prefix until node x, i.e. the key has a prefix: u, v, w, a, b, c. In the first step, we create a new node that stores prefix “vw”, which has two child nodes: on slice “a”, the child node is newly allocated also, and initialized to store prefix “bc”; on slice “x”, the pointer to the old node z’ is stored. In the second step, we update the prefix word in node z’ to increment the node level by one, and to update the prefix size to 2 while storing “yz” as the prefix. Since the prefix word is only 8 bytes, it can be updated and flushed back to the NVM atomically. Note that at this stage, the two newly allocated nodes have not been linked into the tree, and the tree is in an inconsistent intermediate state. Fortunately, such intermediate state can be identified by a post-crash recovery process by comparing the node depth value stored in the prefix with the actual node depth. If these two do not match, there must have been an unfinished path compression which was interrupted by the crash. The state of the tree can be rolled back by simply decrementing the depth field, and re-computing the actual prefix using any of the two keys reachable from the current node. In the last step, the newly allocated node is linked into the parent node by updating value for key slice “u” to the new node. This update is simply an 8-byte aligned store. The path expansion is completed by flushing the updated pointer in parent node back to the NVM.

The paper also proposes adapting the above algorithm to support persistent Adaptive Radix Tree (ART). ART differs from naive radix tree that it uses multiple node sizes to save memory. There are four node size classes in ART: 4, 16, 48 and 256 (ART uses 8 bit key slice). For 4 and 16 nodes, the key slices for the current level are explicitly stored in arrays, which must be searched to find the pointer to the next level. Write-Optimized ART (WOART) uses append-only update to insert keys into the node, and adds an indirection layer to atomically enable or disable certain slots. Inserted key slices and values are only committed when the update is written back to the NVM. For 48 node, a 256 element key array is used as a direct map from key slices to value pointer array elements. The value pointer array consists of 48 child node pointers. Key slice insert operations on 48 nodes can be achieved by updating the key array atomically with 8 byte stores.