Write-Optimized Dynamic Hashing for Persistent Memory
- Hide Paper Summary
Link: https://dl.acm.org/citation.cfm?id=3323302
Year: FAST 2019
Keyword: NVM; Hash Table; CCEH; Extendible Hashing
Highlight:
-
Good application of extendible hashing to NVM
-
I appreciate the simple persistence protocol of updating from right-to-left. This reminds me of the common technique that many NVM malloc uses – storing size information in the block header, and walk the block headers in the recovery process.
Questions
- This paper is extremely badly written. Grammar and explanations are bad. Examples are difficult to understand. The auhthors may want to improve the explanation of segment split a little as well as the recovery process. Also the paper should make clear where the “local depth” is stored - directory or segment or per bucket?
This paper proposes Cache Line Conscious Extendible Hashing (CCEH), a dynamic hashing scheme for byte-addressable persistent memory. The paper first identifies the importance of dynamic hashing as a foundation of general hash tables, which allows the hash table to expand and shrink according to the workload. In contrast, static hashing does not support dynamically resizing the hash table, which limites its usage to only web caching (in which table entries can be evicted as conflict occurs) or similar applications. The paper then proceed to identify that one of the major difficulties of designing a dynamic hashing scheme is to support efficient resize operation of the hash table. Hash tables are expanded when a conflict occurs but cannot be resolved using the conflict resolution method (e.g. chaining, open addressing). Optionally, tables are shrinked to save storage when the number of elements falls below a certain threshold. In a naive resizing scheme, a larger hash table is built while blocking update operations (reads are not affected since they do not change the state of the table). All elements in the old table are rehashed to the new table, which is expensive in terms of both the number of stores and the computation complexity. When running on NVM, this problem can only be exacerbated due to the lower write bandwidth compared with DRAM.
CCEH is based on extendible hashing, which has been proposed to amortize the overhead of hash table resizing over potentially many operations that are performed after the resize. Elements are rehashed lazily rather than eagerly as in non-extendible hashing schemes in which all elements are rehashed and inserted into the resized table. We now introduce the basic form of extendible hashing as follows. The hash table consists of two parts: A directory array and a set of buckets mapped by the directory. The directory is simply an array of pointers to buckets, which is addressed using bits from the hash value. Conventionally, each bucket is a page allocated from the page buffer pool, but we assume that buckets can be arbitrarily sized based on system parameters such as cache line size and the key-value size. Buckets can be shared by different directory entries, which means that keys whose hash values are mapped to the corresponding directory entries can be stored in the same bucket in a fully-associative manner. Read operation to a bucket must check all non-empty slots in order to decide whether the search key exists or not. Two runtime parameters decide the behavior of operations. The first parameter is the global depth G, which indicates the number of bits we use from the hash value to determine the directory. For example, if the hash value is 0x12345679, and G equals 4, then we take four bits from the value (assuming we use LSB), which is nine, as the index of the directory entry we need to lookup. In addition, the the size of the directory array is also determined by the value of G, which is 2^G. In our example there will be 16 elements in the directory. When the hash table is expanded, we double the number of elements in the directory, and increment the value of G to make them consistent. The second parameter is l, the local depth of a bucket. Each bucket stores the value of l in its header, and this value encodes an important property of key-value pairs that can be stored in the current bucket. In order for a key-value pair to be stored in buck with depth l, the first l bits of the directory index must equal the bits in the key’s hash value on the same bit position. In other words, instead of allocating one bucket to each directory entry and only letting it store keys whose hash value is mapped to the directory entry pointing to the bucket, we allow multiple directory entries to point to the same bucket, and hash values of keys mapped to the bucket have a common “prefix” whose length is exactly l. The remaining (G - l) bits of the hash index can be arbitrary value.
It is quite obvious that directory entries cannot be mapped to buckets arbitrarily. In order for entries E1, E2, …, Ek to map to the same bucket whose local depth is l, the indices of these directory entries must have a common l bit prefix. When we split a bucket when it is full, the local depth of the bucket is incremented if it is less than G, and then we rehash elements in the bucket as follows. First, a new bucket is allocated to hold rehashed key-value pairs. Next, for all active slots in the old bucket, we divide them into two groups: The first group consists of keys whose (l + 1)-th bit equals 0, and the second group consists of the rest. We then move the second group to the new bucket. The directory entries E1, E2, …, Ek are updated, such that those whose indices has (l + 1)-th bit equalling zero still point to the old bucket (i.e. do not change since they already point to the old pre-split bucket), and the rest will be updated to point to the new bucket. This way, the invariant between local depth of buckets and the directory entries that point to them are maintained.
When inserting into the extendible hash table, we first compute the hash value of the key, and then extract the index of directory entry using G. We then find the bucket by following the pointer stored in the directory entry. If an empty slot can be found in the bucket, then we place the key-value pair in the bucket, and insertion succeeds. Otherwise, we check whether the value of l is smaller than G. If true, we split the bucket as described above, and retry the insertion on one of the two buckets depending on the (l + 1)-th bit of the hash value. Note that in the most extreme case, the bucket to be inserted can still be full after insertion (i.e. there is no element whose (l + 1)-th bit is 0 or 1), the split process may therefore be recursively called until at least one element is copied during the split. If, on the other hand, the local depth l of the full bucket equals G, then the bucket can no longer be splited since there is no sharer. In this case, we must resize the hash table by allocating a new directory array whose size is the double of the current one, and initialize the array by copying the current array twice into the new directory array (i.e. copy the old array into both the first half and the second half of the new array). Copying the current array twice essentially shares all buckets with newly added directory entries at the second half of the new directory array. This way, no rehashing is required, since the newly added directory entries simply share the same bucket with their buddies in the first half. The global depth G is also incremented to indicate that all buckets are now shared. The actual cost of rehashing is postponed to the point where a bucket is full.
This paper further extends the above baseline extendible hashing to NVM by adding a few optimizations and directory update protocols such that partial updates during a bucket split can be identified and fixed in the recovery process. We introduce them as follows. The first change between CEEH and baseline extendible hashing is that buckets are organized into segments, which is the basic unit of allocation and split. This saves memory space for the directory at the cost of copying buckets that are not full during a segment split. In the following discussion we use G to refer to the depth of the hash table with regard to segments. This paper suggests that we use G bits from the MSB in the hash value to address directory entries to make recovery simpler, as we will see below. In this scheme, given a hash value in binary as b1b2b3b4..bG…bH where H is the size of the hash value, we use b1b2…bG from the MSB as the index for the directory. Within a directory, we use bits from the LSB to address buckets.
The second change is that CEEH updates directory entries from right to left when splitting a segment. Given global depth G and local depth of a segment as l, we make the following observations. First, before the segment is split, directory entries that share the segment form a consecutive range in the array, namely from 00..00..0 to 11..10..0, in which the number of “1” bits in the higher end is l, and the total number of bits is G. This is a result of using MSBs of the hash values as directory indices. The second observation is that after the segment is split, we increment the local depth l for both the old and the new segments, and then update the upper half of this range in the directory to point to the new segment (assuming we copy entries in the buckets whose (l + 1)-th bit is “1” to the new segment). This process is similar to the buddy allocation system, in which a chunk of memory whose size is a power of two is broken down into two equally halfs until the desired size (rounded up to power of two) is reached. The last observation is that no matter how segments are split, the number of directory entries that point to the segment is always 2(G - l) (l <= G), and these entries are always consecutive in the directory array as we have explained above. This property is critical to the recovery from half-split state, as we will see below.
Insertion and deletion do not change much from the baseline system. For insertion, instead of only searching for empty slots in the current bucket indexed by the LSBs, the paper suggests that we may also try linear probing in the segment, i.e. we allow key-value pairs to overflow to adjacent buckets if the current bucket is full. Hash table lookups must also scan the next few buckets until the search key is found. In order for inserts and deletes to be atomic with regard to failures, inserting threads should first write the value, then the key, and finally use a per-item flag to indicate that the key-value pair has been inserted. This per-item flag could be the field that stores the hash value of the key. For deleles, we simply invalidate the per-item flag.
When spliting a segment, we copy all key-value pairs whose (l + 1)-th bit is “1” to the corresponding buckets in the new segment. Entries in the old segment are cleared lazily, as they do not affect correctness for reads. Updates will be redirected to the new segment after the directory is updated. When searching for an empty slot in a segment, those key-value pairs whose l MSB bits does not equal the l MSB bits in the hash value (i.e. index of the directory entry) will be considered as invalid, since according to the definition, only those items whose l MSB bits equal the l MSB bits of the entry index are valid. After creating the new segment, we first persist the segment after populating it. We then update the higher half of the directory range (see above) from right to left, i.e. from higher indices to lower indices. Cache line flushes and memory fences are inserted when directory entry updates cross cache lines, and at the end of the update. This ordering is extremely important to ensure that we can recover from a failure half way through the update process.
On recovery, we only need to fix potential inconsistencies caused by partially conducted segment split. All other operations, such as inserts, deletes, and hash table resize, can be rolled back automatically if the final step which exposes the changes to external world was not flushed back to the NVM before the crash. To identify a partial directory update, we perform a directory walk beginning from the first entry. Based on the observation that under a consistent state, a segment whose local depth is l will have a consecutive range of length 2G-l in the directory, we just verify that this invariant holds for all ranges that point to the same segment in the direcory. To elaborate: Recall that the update process will update from right to left. As long as the update is not fully completed, the leftmost entry of the new range still point to the old segment. To detect this, we begin from the first entry in the directory. For each range whose length is calculated using the segment local depth pointed to by the first entry of the range, we read the last entry in the range, and compare whether these two entries point to the same segment. If not, then we have identified an incomplete directory update as a result of a split, which is then completed by updating all entries in the newly created range.