Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections



Paper Title: Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections
Link: https://dl.acm.org/doi/10.1145/2814270.2814312
Year: OOPSLA 2015
Keyword: HAMT; CHAMP; Persistent Data Structure




Questions

  1. This paper is written badly, with lots of jargon and unexplained concepts (what are “elements”? It took me a while to figure out that the authors are referring to terminal-level values, as opposed to next-level child nodes). I inferred most parts by guessing instead of reading the actual text. Some terminology is not used properly, such as “memoization”, which usually refers to a programming technique used in top-down recursion. One example of bad grammar is on page 8, right below Listing 6: “With MEMCHAMP we will refer throughout the text to the variant of CHAMP that adds memoized element hash codes, but drops incremental collection hash codes.” Can you be more obscure with this sentence structure?

  2. I did not see how the memory footprint is reduced. In the HAMT design, each node must contain 32 slots (actually 64, since full keys are also stored). In CHAMP, although empty slots are removed using nodeMap and dataMap, in the worst case where the node is full of terminal values, 64 slots are still needed, unless the array can be extended dynamically, which is not mentioned at all in the paper. Cache locality is improved, though, since useful values are likely to be closer to each other in a sparse node.

This paper introduces Compressed Hash-Array Mapped Prefix-Trees (CHAMP), an improvement over the existing Hash-Array Mapped Tries (HAMT). The paper identifies four problems with a naive HAMT. The first problem is that the tries (radix trees) consume too much memory by maintaining a full-sized node even when most of the slots are empty. This both causes excessive memory to be allocated for storing meaningless NULL values and hurts cache performance, since the memory footprint of nodes becomes larger. The chance that a node access will hit the hardware-cached copy decreases compared with a more compact representation. The second problem is deletion. The original HAMT design lacks a proper deletion algorithm, so the radix tree cannot be restored to a canonical shape, resulting in sparse nodes and singleton paths that waste both memory and cycles. The third problem is the bad cache locality of iteration, due to the fact that child node pointers and elements can be stored in an interleaved manner in a direct-mapped node. The iterator has to traverse child nodes recursively before returning to the current node and continuing to iterate over it, resulting in poor cache locality. The last problem is equality checking, which involves iterating over all elements stored in one radix tree and comparing each element against the elements stored in the other tree. Equality checking is inefficient without a canonical representation of trees with the same content, and it is made worse by the poor cache locality of iteration.

The work on CHAMP builds on HAMT, a radix tree-based hash table implementation. Hash values of key objects are inserted into the radix tree, with the object itself being the mapped value. A HAMT can be used as both a set and a map. If only key objects are stored, the HAMT instance essentially serves as a set. Membership of an object can be checked by first hashing the object and then using the hash value to perform a tree lookup; set membership is indicated by a successful lookup. If an extra value object is stored with the key object, then the HAMT serves as a map, which returns the value object given the key object, following the same lookup protocol.

We next describe the baseline HAMT in detail. The hash of the key is divided into 5-bit slices, from the LSB to the MSB, which are used as search indices at each level. Given a fixed size of hash values, the maximum depth of the tree is also constant (⌈|H| / 5⌉, where |H| is the number of bits in the hash value). Each node in the baseline is a 32-slot direct-mapped array. The 5-bit key slice is used as an index into this array to fetch the element. HAMT supports eager termination of the traversal at level L (L < maximum depth) if the remaining slices are not needed for disambiguation. In other words, element insertion can stop at level L if the corresponding slot at level L is empty, and just store the object at level L without further extending the path to the leaf level. On future insertions, if a slot holds an element instead of a node, we lazily expand the path by allocating a new node and attempt to store both the new and the existing element in the new node. If this is impossible, indicated by the fact that their key slices for level (L + 1) have the same value, we further extend the path by allocating another new node at level (L + 2). This process is repeated until the leaf level is reached, or until both elements can be stored at different locations of the newly allocated node. In a moderately sparse HAMT, this feature greatly reduces the number of steps in a traversal, as well as the number of singleton nodes in the tree.
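To make the slicing concrete, here is a minimal sketch of the baseline HAMT lookup for a set of Object keys with 32-bit hash codes; the class and method names (HamtNode, contains) are hypothetical, not the paper's code.

```java
// A minimal sketch of the baseline HAMT lookup for a set. Names are illustrative.
final class HamtNode {
    static final int BITS = 5;            // hash bits consumed per level
    static final int WIDTH = 1 << BITS;   // 32 direct-mapped slots per node

    // Each slot is null (empty), a key stored early thanks to eager termination,
    // or a child HamtNode.
    final Object[] slots = new Object[WIDTH];

    boolean contains(Object key, int hash, int level) {
        int slice = (hash >>> (BITS * level)) & (WIDTH - 1);   // 5-bit slice for this level
        Object slot = slots[slice];
        if (slot == null) return false;                        // empty slot: key is absent
        if (slot instanceof HamtNode)                          // child node: descend one level
            return ((HamtNode) slot).contains(key, hash, level + 1);
        return slot.equals(key);                               // terminal element stored here
    }
}
```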

Path compression, however, is not performed during deletions. In the above example, if the second key inserted into the HAMT is deleted later, the structure of the tree will remain as it was before the deletion, instead of “shrinking” back to the pre-insertion state. This not only introduces unnecessary levels, but also creates difficulties for equality checking, since two HAMTs holding the same content may not have identical tree structures. In other words, the shape of the tree is a function of both the current content and the insert-delete sequence that has been applied.

Hash collisions between keys are rare, but possible. On a hash collision, multiple different keys are mapped to the same slot. The insertion function should always check whether the keys are identical and, if not, use a conflict chain of keys (and values, for maps) to resolve the conflict. We call a node that contains hash conflicts in its slots an overflow node. Note that overflow nodes can only occur at the leaf level, since at non-leaf levels, conflicts are resolved by creating a new level and differentiating the two conflicting key objects using the next key slice. A collision is inevitable only at the leaf level, when all slices have been exhausted.
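A hypothetical sketch of such an overflow node is below: every key stored in it shares the same full hash code, so membership falls back to a linear scan of the chain.

```java
// Illustrative leaf-level overflow node; a map would keep a parallel array of values.
final class CollisionNode {
    final int hash;       // the common hash code of every key in this node
    final Object[] keys;  // the conflicting keys

    CollisionNode(int hash, Object[] keys) {
        this.hash = hash;
        this.keys = keys;
    }

    boolean contains(Object key, int keyHash) {
        if (keyHash != hash) return false;   // different hash: cannot be in this node
        for (Object k : keys)                // same hash: compare the chained keys one by one
            if (k.equals(key)) return true;
        return false;
    }
}
```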

CHAMP optimizes HAMT in terms of node layout, deletion protocol, equality algorithm, and iteration speed. Many of these optimizations are not standalone, but rely on each other or share the same infrastructure. We next describe these optimizations in detail.

The HAMT node layout is a 32-slot static array to which key slices are directly mapped. This, however, always allocates 32 slots in the array, which can be more than necessary if the node is sparse. In addition, the locality of node traversal is affected by NULL values. CHAMP improves over the HAMT node layout with three techniques. First, NULL values are not stored, which reduces the size of the array. Two 32-bit integers, used as bitmaps, encode the logical layout of the array as follows. A nodeMap encodes slots that store a next-level child node pointer; each child node pointer takes only one slot. A dataMap encodes slots that store a terminal value, which can be either a single key object for sets, or a key and value object for maps. In the latter case, each logical slot in the dataMap actually requires two physical slots, which are allocated as two adjacent slots in the array. The invariant is that, for any given index, at most one of nodeMap and dataMap has the corresponding bit set. To avoid complicated offset computation, the array of slots in a CHAMP node is filled from both ends, one end for node pointers and the other for keys or key-value pairs. Elements in the array are compactly stored regardless of the logical NULL pointers in between. The offset of a logical element in nodeMap at logical index X is computed by masking off the bits in nodeMap at and above bit X and performing a popcount of the “1” bits in the resulting mask. The offset of elements in dataMap is computed similarly, but the result of the popcount is further subtracted from the size of the current array, since the physical data section starts at the end of the array and grows towards lower indices. The sizes of both logical sections are computed by counting the number of “1” bits in the masks. If, before an insertion, the physical array is full, indicated by the sum of the bit counts from both masks equalling the array size, then a reallocation takes place to extend the array before the insertion can be performed.
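A sketch of this bitmap indexing for a CHAMP set node is shown below, following the layout described above (child pointers packed from the front of the array, keys packed from the back); the mirror-image layout works the same way. Names such as ChampSetNode are illustrative, not the paper's code.

```java
// Illustrative CHAMP set node: no null slots; nodeMap/dataMap describe the logical
// 32-slot layout, and Integer.bitCount recovers the physical offsets.
final class ChampSetNode {
    final int nodeMap;       // bit i set => logical slot i holds a child node pointer
    final int dataMap;       // bit i set => logical slot i holds a terminal key
    final Object[] content;  // compact array: children from the front, keys from the back

    ChampSetNode(int nodeMap, int dataMap, Object[] content) {
        this.nodeMap = nodeMap;
        this.dataMap = dataMap;
        this.content = content;
    }

    static int bitpos(int slice) { return 1 << slice; }

    // Physical index of a child pointer: count the node bits strictly below this one.
    int nodeIndex(int bitpos) {
        return Integer.bitCount(nodeMap & (bitpos - 1));
    }

    // Physical index of a key: count the data bits strictly below this one,
    // then mirror from the end of the array, since keys grow from the back.
    int dataIndex(int bitpos) {
        return content.length - 1 - Integer.bitCount(dataMap & (bitpos - 1));
    }

    boolean contains(Object key, int hash, int level) {
        int bit = bitpos((hash >>> (5 * level)) & 0x1F);
        if ((dataMap & bit) != 0)            // a terminal key is stored at this slice
            return key.equals(content[dataIndex(bit)]);
        if ((nodeMap & bit) != 0)            // a child node: descend one level
            return ((ChampSetNode) content[nodeIndex(bit)]).contains(key, hash, level + 1);
        return false;                        // logically empty slice
    }
}
```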

The second improvement is the deletion protocol. HAMT deletions do not shrink the path, resulting in unnecessary intermediate nodes that do not contribute to disambiguation by key slices, wasting both storage and cycles. This paper proposes a deletion protocol that “folds” singleton paths from the closest non-singleton ancestor down to the bottommost node, if the deletion results in such a singleton path. To aid the explanation, the paper defines the arity and branchSize properties of a node as follows. The arity of a node is the logical number of entries, either data or next-level child pointers, stored in the node’s array. The branchSize of a node is the number of logical elements that can be reached transitively from that node. Although branchSize is not a local property, since computing it exactly requires recursively traversing the subtree, it can be approximated by counting only the entries in the node itself, assuming child nodes also satisfy this property. If there is at least one child node pointer, the branchSize must be more than one, since the assumption applies recursively to the child node. If there is no child node pointer but more than one element, the branchSize is also more than one. Otherwise, the branchSize is either one or zero, depending on the number of elements in the node. A sketch of this approximation is shown below.
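Under the same assumptions as the node sketch above, the arity/branchSize approximation reduces to a few bit counts; the paper expresses the same idea as a three-valued size predicate, and the constant names here are invented.

```java
// Approximate branchSize from purely local information, assuming children obey the
// canonical-form invariant (a child subtree always holds more than one element).
final class NodeSize {
    static final int EMPTY = 0, ONE = 1, MORE_THAN_ONE = 2;

    // Local arity: logical entries (keys plus child pointers) in this node alone.
    static int arity(int nodeMap, int dataMap) {
        return Integer.bitCount(nodeMap) + Integer.bitCount(dataMap);
    }

    static int branchSize(int nodeMap, int dataMap) {
        if (Integer.bitCount(nodeMap) > 0) return MORE_THAN_ONE;  // any child implies > 1
        int keys = Integer.bitCount(dataMap);
        return keys == 0 ? EMPTY : (keys == 1 ? ONE : MORE_THAN_ONE);
    }
}
```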

The deletion protocol works as follows. The process is naturally recursive. Recursion proceeds by checking whether the logical slot contains a next-level child pointer or an element. If it is a child pointer, the function recursively calls delete on the child node. Otherwise, the key object of the element is compared against the deletion key: if the keys match, the key is removed from the node; otherwise nothing happens. In either case, the recursion stops expanding and returns a pair of values to the caller. The return value consists of a boolean flag indicating whether the key was actually deleted, and the node, returned regardless of whether any deletion happened. The flag is checked first. If it is false, the upper-level recursion instances know the deletion did not happen and will not modify the tree structure. Otherwise, the function checks the arity of the current node and the branchSize of the returned node. If the current node’s arity is one and the returned node’s branchSize is also one, then we have identified a singleton path to a leaf node of length at least two, and the current node is removed (and deallocated) from the path by returning true together with the returned node to the upper level. If the current node’s arity is larger than one but the returned node’s branchSize is one, then we copy the only element from the returned node into the current node, replacing the child node pointer with the copied element. If a singleton path is being folded, this is the last step of the folding, as it takes the element from the singleton path and embeds it in the current node. In all other cases, the current node simply replaces the child pointer it descended through with the returned node, if it differs from the current one, and returns true and itself to the upper level.
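The parent's decision after a successful recursive delete can be distilled into the three cases just described; the sketch below only classifies the cases and leaves the copy-on-write array surgery out. All names are illustrative.

```java
// Which canonicalization case applies once a recursive delete below this node has
// succeeded. Arity and branchSize are as approximated above (branchSize == 1 means
// the child collapsed to a single key).
final class DeletionCases {
    enum Action { COLLAPSE_INTO_CHILD, INLINE_CHILDS_KEY, UPDATE_CHILD_POINTER }

    static Action decide(int parentArity, int childBranchSize) {
        if (childBranchSize == 1) {
            if (parentArity == 1)
                return Action.COLLAPSE_INTO_CHILD;   // parent held only this child: fold the path
            return Action.INLINE_CHILDS_KEY;         // replace the child pointer by its single key
        }
        return Action.UPDATE_CHILD_POINTER;          // just path-copy with the new child node
    }
}
```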

The deletion protocol creates a canonical form of CHAMP: no matter how elements are inserted and deleted, as long as the final contents of two trees are identical, their shapes must also be identical. This enables structure-based equality checking, which differs from content-based checking, where one tree is iterated over while the other is probed for each key. The structure-based equality checking algorithm is described as follows. For non-overflowing nodes, which are all inner nodes plus the leaf nodes that have not seen hash conflicts, the algorithm first compares nodeMap and dataMap. If either of these two differs, the two subtrees must be different. Next, all elements in the current node’s array are compared using key object comparison. In the last step, all child nodes are compared recursively. Equality checking passes if and only if all of these checks pass.
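A sketch of the structural check over the hypothetical ChampSetNode from above; it is only valid because the deletion protocol keeps equal contents in identical shapes.

```java
// Structure-based equality: equal bitmaps, equal local keys, then recurse into children.
final class ChampEquality {
    static boolean equivalent(ChampSetNode a, ChampSetNode b) {
        if (a == b) return true;
        if (a.nodeMap != b.nodeMap || a.dataMap != b.dataMap)
            return false;                            // canonical form: different shape => different content

        int keys = Integer.bitCount(a.dataMap);
        int children = Integer.bitCount(a.nodeMap);
        int len = a.content.length;

        for (int i = 0; i < keys; i++)               // terminal keys, packed from the back
            if (!a.content[len - 1 - i].equals(b.content[len - 1 - i])) return false;

        for (int i = 0; i < children; i++)           // child subtrees, packed from the front
            if (!equivalent((ChampSetNode) a.content[i], (ChampSetNode) b.content[i])) return false;

        return true;
    }
}
```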

For overflow nodes at the leaf level, equality checking first compares the per-node hash value. This value is updated whenever the node’s content is changed by inserts and deletes. The hash must be order agnostic and reversible, for example the XOR of the individual key objects’ hash codes. If the hash check passes, we then check the keys in the slots one by one against the keys stored at the same offsets in the other node; all chained key objects must be compared. Equality checking passes if and only if all of these checks pass.
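A hedged sketch of overflow-node comparison: an order-agnostic, reversible per-node hash (XOR here, as suggested above; a sum of hash codes would work as well) gives a cheap negative check before the key-by-key comparison. The sketch uses an order-independent containment check; a positional comparison, as described above, suffices when the chains are kept in the same order. Names are illustrative.

```java
// Overflow node with an incrementally maintained, order-agnostic content hash.
final class HashCollisionNode {
    final Object[] keys;    // keys whose full hash codes collide
    final int contentHash;  // XOR of the keys' hash codes, updated on every insert/delete

    HashCollisionNode(Object[] keys, int contentHash) {
        this.keys = keys;
        this.contentHash = contentHash;
    }

    boolean equivalent(HashCollisionNode other) {
        if (contentHash != other.contentHash) return false;   // cheap reject: hashes differ
        if (keys.length != other.keys.length) return false;
        for (Object k : keys) {                                // verify every chained key is present
            boolean found = false;
            for (Object o : other.keys)
                if (k.equals(o)) { found = true; break; }
            if (!found) return false;
        }
        return true;
    }
}
```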

The compact node structure also speeds up iteration, since iteration can first visit all values stored in the data section of a node’s array, and only then recursively traverse the child nodes. This way, the cache is better utilized, since the traversal localizes its accesses to one contiguous region of the array before it recurses downwards.
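A sketch of that traversal order over the hypothetical ChampSetNode layout: all locally stored keys are visited in one contiguous scan before any child is entered.

```java
import java.util.function.Consumer;

// Visit every key in the subtree: local keys first, then each child recursively.
final class ChampTraversal {
    static void forEachKey(ChampSetNode node, Consumer<Object> action) {
        int keys = Integer.bitCount(node.dataMap);
        int children = Integer.bitCount(node.nodeMap);
        int len = node.content.length;

        for (int i = 0; i < keys; i++)                 // contiguous scan of the data section
            action.accept(node.content[len - 1 - i]);

        for (int i = 0; i < children; i++)             // then descend into each child subtree
            forEachKey((ChampSetNode) node.content[i], action);
    }
}
```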

The paper also proposes storing hash values for the entire tree, for each node, and for each element in a node. As stated above, these hash codes must be order agnostic and reversible to handle deletions. Storing hash values has at least three advantages. First, comparing two nodes is faster, since different hash values immediately indicate different nodes. Second, key comparisons are faster at rejecting non-equivalent keys if the key comparison function is non-trivial. Lastly, when two keys collide in a non-leaf node, the hash value of the old key, which is needed for expanding the tree, does not need to be recomputed. Note that if value objects exist, the insert function should also compute the hash values of the value objects and incrementally update the per-node hash value, if applicable. The paper suggests storing the memoized hash values in a separate array in the node, which is extended whenever the content array is extended.
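A sketch of memoizing element hash codes in a parallel array next to the keys, in the spirit of MEMCHAMP; field and class names are invented. The stored hash lets lookups reject mismatches without calling hashCode() or a potentially expensive equals(), and it is already available when a data slot has to be expanded into a new sub-node.

```java
// CHAMP-style set node carrying memoized key hash codes in a parallel array.
final class MemoizedSetNode {
    final int nodeMap;
    final int dataMap;
    final Object[] content;  // children from the front, keys from the back
    final int[] keyHashes;   // keyHashes[i] is the memoized hash of the i-th key

    MemoizedSetNode(int nodeMap, int dataMap, Object[] content, int[] keyHashes) {
        this.nodeMap = nodeMap;
        this.dataMap = dataMap;
        this.content = content;
        this.keyHashes = keyHashes;
    }

    // Cheap hash pre-check before the (possibly expensive) equals() call.
    boolean keyMatches(int keyIndex, Object key, int keyHash) {
        return keyHashes[keyIndex] == keyHash
            && content[content.length - 1 - keyIndex].equals(key);
    }
}
```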