Wormhole: A Fast Ordered Index for In-memory Data Management



- Hide Paper Summary
Paper Title: Wormhole: A Fast Ordered Index for In-memory Data Management
Link: https://dl.acm.org/doi/10.1145/3302424.3303955
Year: EuroSys 2019
Keyword: B+Tree; Trie; Hash Table; Wormhole



Back

Highlights:

  1. B+Trees need lots of key comparisons during the lookup and take O(log(N)) comparisons per lookup. When the keys are lengthy, e.g., string keys, the key comparison can become a significant bottleneck. By contrast, radix trees do not need key comparisons and they take O(L) pointer chasing per lookup, which makes it a better candidate for indexing string keys. However, the high memory overhead of radix trees obstructs its adoption to some extent.

  2. We can combine B+Tree and radix tree by using the B+Tree sorted leaf node as the bottom level and a radix tree as the upper level. An anchor key is generated for each leaf node and then inserted into the radix tree such that the leaf node can be found by searching the radix tree. The anchor keys can be generated as a prefix (plus a possible end mark symbol) of the minimum key in the leaf node as long as anchor keys are not prefixes of each other.

  3. We can further accelerate radix tree lookup by storing the nodes in a hash table indexed by the partial keys of the nodes. Key lookups can then be implemented as a binary search to find the longest prefix that exists in the hash table.

This paper presents Wormhole, a hybrid trie-B+Tree index structure that enjoys the storage efficiency of B+Trees with the lower lookup cost of a radix tree (i.e., a trie). The paper is motivated by the seemingly complementary space-time trade-offs of B+Tree and tries and proposes to combine the two to form a new data structure using path-compressed tries as the upper levels and B+Tree sorted-leaf nodes as the bottommost leaf level. The resulting tree structure combines the storage efficiency of compact key storage at the leaf level and the time efficiency for lookups at the upper levels. Consequently, Wormhole outperforms the more conventional index structures such as tries and even demonstrates close performance to constant-time hash tables with range query support.

The paper begins by noticing the widening performance gap between ordered and unordered indexing structures. Unordered indexing structures, such as hash tables, enable amortized constant-time lookup, hence suitable for point queries but incapable of performing range queries. Ordered indexing structures, such as B+Trees and skip lists, are constructed based on the partial ordering between keys. Lookup operations in these structures require one key comparison at each level, and the number of levels is proportional to the logarithm of the total number of keys stored in the index, i.e., O(log(N)), which degrades as the index becomes larger. Even worse, since most implementations of ordered indexing structures use pointers to link “nodes” together in order to support mutation operations (i.e., insert, delete) efficiently, index traversal is essentially implemented as pointer-chasing, which is known for its bad locality and low memory throughput.

On the other hand, there also exists a type of indexing structure, called radix trees or tries, that gets rid of key comparisons at each level. Instead, in radix trees, search keys are regarded as consisting of a string of “tokens”. At each level during the traversal, the node of the next level is selected using the token as an index into the current node’s pointer array. Radix trees also support range queries just like B+Trees and skip lists if the lexical order of keys is consistent with the partial ordering. In practice, this condition is already true for common key types, such as string keys. Integer keys, however, need to be transformed to big-endian format on small-endian machines (e.g., x86) before they can be used as order-preserving search keys in radix trees. Compared with B+Trees and skip lists, radix trees enable O(L) lookup time where L is the number of tokens in the search keys. As the index grows, the time benefit of using radix trees gradually becomes obvious as O(log(N)), which is the time complexity of key lookups in B+Trees, will become bigger than O(L).

Wormhole combines a B+Tree and radix tree by borrowing the chained leaf structure from B+Tree and borrowing the trie structure for the upper levels. More specifically, in Wormhole, the bottommost level of the structure is identical to the leaf level of a B+Tree, i.e., a linked list of leaf nodes, in which key-value pairs are stored in sorted order. Meanwhile, the upper level of the structure is a radix tree that maps lookup keys (called anchor keys) to the leaf nodes, and the number of keys in the upper-level radix tree equals the number of leaf nodes. During a traversal, the search key is first used to traverse through the radix tree until a leaf node is reached. Then a regular B+Tree leaf search is performed on the leaf node to locate the key-value pair if the key exists.

One naive approach for generating the upper-level radix tree is to select the smallest key in each leaf node, and then construct a radix tree using these keys. However, this naive approach will likely generate a sub-optimal radix tree if the keys can be uniquely identified by only a prefix. To leverage key prefixes, instead of using the full keys to construct the upper-level radix tree, Wormhole proposes that we use the shortest prefix of the first key in each leaf node as the anchor key for constructing the radix tree. In addition, the prefix should be chosen such that none of them is a prefix to the other. The paper describes the algorithm as follows. Given three consecutive nodes at the leaf level, i.e., A, B and C, if the anchor keys are already selected for A and C, then the anchor key for B can be derived as the longest common prefix of the first key in node B and the last key in node A, plus the next token in the first key of node B. The derived anchor key is lexically larger than all keys in node A and is not a prefix of any keys in A. However, the anchor key generated this way can still be a prefix of the anchor key for node C. In this case, the anchor key of B is further appended with an “end mark”, which is conveniently chosen as the smallest token that the paper assumes to be never used. For string keys, the smallest token can be any ASCII control character smaller than any of the printable characters. After appending the end mark, the resulting anchor key for node B is smaller than the anchor key for node C, while not being a prefix of the latter.

Lookup operations on the radix tree are straightforward. The operation starts at the root of the tree, and for each token in the search key, the procedure selects the next node by indexing the pointer array on the current node. However, since the radix tree only stores prefixes (possibly with the end mark) of a subset of the actual keys stored in the leaf nodes, during the radix tree traversal, several cases can happen. First, even if there is no next node in the current level given the corresponding token, the search key may still be in one of the leaf nodes. In this case, the leaf node should be either the rightmost leaf node of the left subtree in the current node (i.e., the largest element smaller than the token) or the immediate left sibling of the leftmost leaf node of the right subtree in the current node (i.e., the smallest element larger than the token). If neither of the two nodes exists, then the search key does not exist, and the lookup returns a failure. Second, the search key may run out of tokens, but the lookup still has not reached a leaf node. In this case, the procedure traverses down the tree by using the end mark as the implicit token until a leaf node is reached. After a leaf node is reached, the lookup procedure performs a binary search on the leaf node as in a regular B+Tree. If the key is found, then the procedure returns the value. Otherwise, the procedure returns a failure.

The procedure for key insertion resembles key lookups in the first half. After a leaf node is reached, the procedure then inserts the key into the leaf node and sorts the node to maintain key order. If the size of the node exceeds a certain threshold, the node is split into two, and the anchor key for the new node is generated and then inserted into the upper-level radix tree. Key deletions are just the reverse of key insertions, i.e., after the leaf node is reached if the key exists in the leaf node, it is removed from the leaf. If the node size drops under a threshold, the node is merged into its neighbor node, after which the anchor key is removed from the radix tree.

To further reduce the number of serial memory accesses in the lookup process, which is O(L) where L is the number of tokens in the lookup key, the paper further proposes to maintain the internal nodes of the radix tree in a hash table. The hash table is indexed using the partial key of the node (i.e., the path from the root to the node), and each entry of the hash table consists of the following fields. First, the entry contains the partial key and the hash value for tag comparison. Second, the entry also contains the type of the node, which is either the leaf node of the radix tree or the inner node. The node type field is used to indicate whether tree traversal is completed or not. Next, if the entry represents an inner node, it also contains a bitmap indicating the status of child nodes, one bit for every possible child. If the bit is “1”, then the child node exists, and tree traversal can proceed down the path to the child node. Otherwise, the child node does not exist. Finally, the entry also contains the range of leaf nodes that the subtree covers. This field is used to quickly locate the leaf node in constant time when one of the two earlier situations occurs during lookups.

With the hashed radix tree representation, key lookups are modified as follows. Instead of following the child link of the current node to traverse down the tree, the lookup procedure now finds the longest prefix of the lookup key using the hash table with binary search, i.e., the lookup begins by probing the hash table with the full key, and if the probing fails, it probes the hash table with the first half of the key, and so on. The probing process continues until the longest prefix is found, after which either a leaf node is reached, or one of the two situations discussed above occurs. In the latter case, the leaf node can be located using the leaf node range indicators of the hash table entry.

To support concurrent operations on the Wormhole structure, the paper proposes a locking mechanism that works as follows. First, every leaf node is extended with a reader-writer lock. Local modifications to the leaf node such as insertion and deletion are performed on the node after locking it in writer mode, and leaf reads should also lock the node in reader modes, hence supporting multiple concurrent readers or an exclusive writer. Second, to support structural changes, i.e., anchor key insertion and deletion into or from the radix tree, the radix tree also has a global lock that synchronizes structural modifications with lookup operations. Operations on the hash table are synchronized with each other using RCU, i.e., every hash table operation generates a new instance of the hash table with the modifications applied. The old instance is replaced with the new one using atomic compare-and-swap, after which it is reclaimed shortly. To maintain consistency between the hash table and the radix tree, the paper proposes to use version numbers. The hash table is extended with a version number that is incremented by one for every modification operation. The version number is read before the traversal and validated against the most up-to-date value after the leaf node is reached. If the two numbers differ, the operation will be retried as the traversal has potentially used a stale version of the hash table.

A few optimizations can be applied to Wormhole to make it more efficient. First, during hash table lookups, tag comparison can be omitted until the last probe where both the has hand the partial key in the entry are checked. If the check passes, the probe succeeds. Otherwise, the probe should restart and always check the partial key on each probe. Second, to accelerate key lookup in the leaf node, which can potentially be quite costly as keys are variable-sized, the paper proposes to store an extra sorted array of hash values for each key in the leaf node. Key lookups can therefore be reduced to binary searches on the sorted hash value array plus a final key comparison. Additionally, newly inserted keys can simply be appended to the end of the node as long as the hash value array remains sorted. Lastly, if the key does not support an “end mark” symbol which is the minimum possible token that will never be used for regular keys, then Wormhole should refrain from splitting a node if the anchor key for the new node contains the “end mark”. In this scenario, the node will become a “fat node” whose size exceeds the threshold. This case, however, is relatively rare according to the paper and hence will not affect the overall performance.