Fine-Grain Checkpointing With In-Cache-Line Logging
Link: https://dl.acm.org/citation.cfm?doid=3297858.3304046
Year: ASPLOS 2019
Keywords: NVM; Undo Logging; Masstree
Highlights:
- Taking advantage of the fact that a pointer in x86-64 only has 48 effective bits.
- The InCLL logging scheme works equally well for sequential insert/delete and random update workloads, leveraging the fact that inserts and deletes (but not a mixture of the two) can be undone efficiently just by restoring the permutation field.
Questions:
- wbinvd is a privileged instruction, so on epoch boundaries a system call must be performed to flush the entire cache.
- When the high bits of the current epoch cannot be represented by nodeEpoch during value logging, we can simply perform a "no-op" permutation field logging first, which updates nodeEpoch to the current epoch; the value log entry can then borrow the higher bits from nodeEpoch.
This paper presents In-Cache-Line Logging (InCLL), a novel technique for accelerating undo logging in certain non-volatile data structures. InCLL is motivated by the fact that traditional undo-based logging has an important implication: the undo log entry must reach persistent NVM before the dirty data does, because otherwise, when the system crashes, all in-cache data is lost and there is no way to recover the system to its pre-update state using the undo image. With today's commercial hardware, the write ordering between the log entry and the dirty data (which can reside in different cache lines) must be enforced using cache flush and memory fence instructions, which can be inefficient if log flushes are frequent.
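As a point of reference, below is a minimal sketch (not the paper's code) of the ordering that conventional undo logging must enforce on x86 NVM; the clwb/sfence pair is one common way to make the log entry durable before the in-place update.

```cpp
// Minimal sketch of conventional per-store undo logging on x86 NVM.
// The log entry is flushed and fenced so it becomes durable before the
// in-place update can possibly reach NVM.
#include <immintrin.h>
#include <cstdint>

struct alignas(16) UndoEntry {
    uint64_t *addr;    // where the old value lived
    uint64_t  before;  // the old value itself
};

void logged_store(UndoEntry *log, uint64_t *addr, uint64_t new_val) {
    log->addr   = addr;
    log->before = *addr;
    _mm_clwb(log);       // write the log entry's cache line back to NVM
    _mm_sfence();        // order: the log must be durable before the update
    *addr = new_val;     // only now is the in-place update performed
}
```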
Instead of precise undo logging, InCLL adopts an epoch-based checkpointing model, in which the data structure is checkpointed every 50 ms. The interval between two consecutive checkpoints is called an "epoch", during which the data structure works exactly as the volatile version does. On the epoch boundary, all worker threads are blocked until the checkpointing is completed, leaving the data structure in a consistent state (i.e., no operation is in flight). A separate checkpointing thread then flushes the system cache (using the wbinvd instruction), after which all dirty cache lines are persistent in the NVM, making the NVM image consistent as well. After a system crash, it is guaranteed that the state of the data structure can be restored to the consistent state of the previously completed checkpoint by applying the undo log entries to objects that were modified in the failed epoch.
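A minimal sketch of this epoch loop is shown below; the helper names (block_workers, flush_entire_cache, ...) are hypothetical, and since wbinvd is privileged, flushing the cache would have to go through the kernel.

```cpp
// Sketch of the epoch-based checkpointing loop (helper names are hypothetical).
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

std::atomic<uint64_t> globalEpoch{1};

void block_workers()      { /* wait until no tree operation is in flight      */ }
void unblock_workers()    { /* let workers resume in the new epoch            */ }
void flush_entire_cache() { /* e.g. a system call that executes wbinvd        */ }

void checkpoint_thread() {
    for (;;) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        block_workers();           // tree reaches a consistent, quiescent state
        flush_entire_cache();      // all dirty lines reach NVM: consistent image
        globalEpoch.fetch_add(1);  // operations from now on belong to a new epoch
        unblock_workers();
    }
}
```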
This paper uses Masstree as the platform to demonstrate how InCLL works with common data structures. Masstree is a combination of a B+Tree and a radix tree. Instead of having "flat" nodes as in a regular radix tree, a B+Tree is used to support extremely large fanouts at each radix tree level. In Masstree, each radix tree level has a fanout of 2^64, mapping a 64-bit key slice to one of the next-level B+Trees. The B+Tree therefore has a fixed key length (8-byte binary data) and value length (8-byte pointers). To further reduce data movement when inserting or deleting elements, a permutation field is added to each node of the B+Tree, so that the elements need not be kept sorted. The permutation field is a 64-bit word mapping elements from their sorted positions to their physical locations within the node, and it can be modified atomically with a regular store instruction if properly aligned. Each node of the Masstree consists of 15 key-value pairs, plus the permutation field.
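A sketch of such a permutation word is given below. The exact bit layout is an assumption modeled on Masstree's permuter: the low 4 bits hold the key count and each subsequent 4-bit nibble holds the physical slot of one sorted position, so the whole mapping can be replaced with a single aligned 64-bit store.

```cpp
// Sketch of a Masstree-style permutation word (assumed layout: key count in
// bits 0-3, then one 4-bit physical-slot index per sorted position).
#include <cstdint>

inline int perm_size(uint64_t p)        { return p & 0xF; }
inline int perm_slot(uint64_t p, int i) { return (p >> (4 * (i + 1))) & 0xF; }

// Return a new permutation that maps sorted position i to physical slot
// phys_slot and increments the count; existing entries at positions >= i
// shift up by one nibble. The caller publishes the result with one store.
inline uint64_t perm_insert(uint64_t p, int i, int phys_slot) {
    uint64_t low_mask = (uint64_t(1) << (4 * (i + 1))) - 1;  // count + entries < i
    uint64_t kept     = p & low_mask;
    uint64_t shifted  = (p & ~low_mask) << 4;                // entries >= i
    uint64_t inserted = (uint64_t(phys_slot) & 0xF) << (4 * (i + 1));
    uint64_t merged   = kept | shifted | inserted;
    return (merged & ~uint64_t(0xF)) | uint64_t((perm_size(p) + 1) & 0xF);
}
```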
InCLL extends the non-volatile Masstree to support efficient crash consistency based on the following observations. First, if the undo log and the dirty data reside in the same cache line, no write ordering needs to be enforced, since existing hardware guarantees that a cache line is persisted atomically, i.e., either the entire cache line is persisted or none of it is. Placing the undo log entry in the same cache line as the dirty data therefore eliminates the cache flush and memory fence overhead: either the dirty data is written back to the NVM during execution, and so is the undo log entry, or both of them are wiped out by the crash, doing no harm to the NVM image. The second observation is that only the permutation field needs to be logged when inserting or deleting elements in a node, since a physical slot can be invalidated simply by writing a special value into the permutation field. Initially, all entries of the permutation field indicate an invalid mapping. When a new element is added, the key and value are appended to the physical array, and the permutation field is modified to map the key's logical slot to its physical location in the node. To undo this insert, restoring the permutation field to its previous state is sufficient, as the newly inserted key and value are demapped once the original permutation is restored. The same reasoning applies to key deletion. The last observation is that, for random update workloads, the chance that a node is updated more than once per epoch is rather low. InCLL can therefore optimize for one update per epoch with in-cache-line logging, and fall back to external, object-based undo logging when this does not hold.
InCLL changes the node layout of Masstree as follows. In the volatile version of Masstree, a node consists of a permutation field, a metadata field, an array of 15 keys, and an array of 15 values, fitting into four 64-byte cache lines. InCLL changes the node layout by adding extra logging-related fields as follows. First, all nodes must be allocated at 64-byte-aligned addresses. Second, an undo image of the permutation field and several control bits are added to the first cache line, which also stores the permutation field; a nodeEpoch field is added as well to record the most recent epoch in which the content of the permutation field was logged. Third, a Masstree node now stores only 14 key-value pairs. The 14 8-byte keys are stored in the two cache lines following the permutation and logging fields. The fourth cache line stores the values for keys 0 to 6 plus an 8-byte log entry for them, and the fifth cache line stores the values for keys 7 to 13 plus another 8-byte log entry for the last seven values. In this modified node layout, both the permutation log and the two value log entries reside in the same cache line as the data they protect, leveraging the first observation in the previous paragraph.
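A sketch of this layout is shown below; the field names and the exact split of the metadata bytes are assumptions, but the static assertions capture the property that matters: each log field shares a 64-byte cache line with the data it protects.

```cpp
// Sketch of the InCLL leaf layout described above (field names and metadata
// sizes are illustrative, not the paper's exact definitions).
#include <cstddef>
#include <cstdint>

struct alignas(64) InCLLLeaf {
    // cache line 0: permutation plus its in-cache-line log
    uint64_t permutation;      // live logical-to-physical slot mapping
    uint64_t permutationLog;   // undo image of the permutation
    uint64_t nodeEpoch;        // epoch in which permutationLog was taken
    uint64_t flags;            // isLogged, insAllowed, ... control bits
    uint64_t extLogEpoch;      // epoch in which the node was last externally logged
    uint8_t  metadata[24];     // remaining node metadata

    // cache lines 1-2: 14 fixed-size keys (plus leftover metadata space)
    uint64_t keys[14];
    uint8_t  keyMeta[16];

    // cache line 3: values 0-6 and their 8-byte value log
    uint64_t values_lo[7];
    uint64_t valueLog_lo;      // packed {old pointer, index, low epoch bits}

    // cache line 4: values 7-13 and their 8-byte value log
    uint64_t values_hi[7];
    uint64_t valueLog_hi;
};

static_assert(sizeof(InCLLLeaf) == 5 * 64, "leaf spans exactly five cache lines");
static_assert(offsetof(InCLLLeaf, permutationLog) / 64 == offsetof(InCLLLeaf, permutation) / 64,
              "permutation log shares a cache line with the permutation");
static_assert(offsetof(InCLLLeaf, valueLog_lo) / 64 == offsetof(InCLLLeaf, values_lo) / 64,
              "value log shares a cache line with the values it protects");
```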
When an element is inserted, assuming no node split happens, the permutation field is first logged, and then the new key and value are written into the physical slot. The nodeEpoch field is also updated to indicate that a log entry has been written for the permutation field. The actual permutation is only updated after all of the above have been written. Note that although a cache flush and memory fence are not needed here, this write ordering between the log entry and the in-place update is still critical to the correctness of the algorithm. In addition, nodeEpoch must only be updated after the log entry is written, because otherwise, if the system crashes between these two writes, nodeEpoch will indicate that a log entry is available while the entry has in fact not been written yet, causing recovery failure. The next time an element is inserted, we first check whether nodeEpoch equals the current system epoch. If it does, no extra logging is needed, since logging the permutation once per epoch is sufficient.
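A sketch of this fast path is given below, reusing InCLLLeaf, perm_insert, and globalEpoch from the sketches above; the release fences stand in for whatever store ordering the real implementation relies on, and they compile to nothing on x86.

```cpp
// Sketch of the insert fast path with in-cache-line permutation logging
// (slot allocation and node-split handling omitted).
#include <atomic>
#include <cstdint>

extern std::atomic<uint64_t> globalEpoch;   // advanced at each checkpoint

inline uint64_t &value_ref(InCLLLeaf *n, int slot) {
    return slot < 7 ? n->values_lo[slot] : n->values_hi[slot - 7];
}

void insert_fast_path(InCLLLeaf *n, int sorted_pos, int phys_slot,
                      uint64_t key, uint64_t value) {
    uint64_t epoch = globalEpoch.load(std::memory_order_relaxed);
    if (n->nodeEpoch != epoch) {
        // First modification of this node in the current epoch: take an
        // in-cache-line undo image of the permutation, then mark the epoch.
        n->permutationLog = n->permutation;
        std::atomic_thread_fence(std::memory_order_release);
        n->nodeEpoch = epoch;   // must follow the log write
    }
    n->keys[phys_slot]      = key;    // fill the free physical slot
    value_ref(n, phys_slot) = value;
    std::atomic_thread_fence(std::memory_order_release);
    // Publish last: one aligned store maps the new element into the node.
    n->permutation = perm_insert(n->permutation, sorted_pos, phys_slot);
}
```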
Element deletion is logged in a similar manner. The only corner case that requires special handling is when an element is first deleted and a new element is then inserted into the same physical slot. In this case, simply restoring the permutation field after a crash is not correct, since the content of the key-value array has also been changed and that change is not rolled back (the deleted pair would be lost, as its slot was recycled and allocated to a new pair). If this happens, the paper suggests that the node be externally logged to avoid data corruption. Two extra control bits, isLogged and insAllowed, record whether the node has already been externally logged (in which case no further logging is needed for the entire node) and whether an element deletion has happened, in which case all future inserts should trigger external logging.
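A small sketch of this check is shown below; the bit positions are assumptions, and both flags are assumed to be reset when the node is first touched in a new epoch.

```cpp
// Sketch of the delete-then-insert corner-case check.
#include <cstdint>

bool needs_external_log_for_insert(const InCLLLeaf *n) {
    bool isLogged   = n->flags & 0x1;  // node already externally logged this epoch
    bool insAllowed = n->flags & 0x2;  // still set if no delete recycled a slot this epoch
    // External logging is needed only if a delete already happened this epoch
    // and the node has not been externally logged yet.
    return !isLogged && !insAllowed;
}
```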
Logging node updates requires more delicate handling of the log entry, as we will see below. Since the log entry only has 8 bytes, it would be difficult to squeeze the before image of the modified value, the index of the value, and the epoch in which the log is written into it without taking advantage of a special property of the values stored in Masstree. Fortunately, Masstree values are always pointers. On the x86-64 platform, a virtual address only has 48 effective bits: bits 48 to 63 are not used and simply duplicate bit 47. Furthermore, since most allocators nowadays align memory blocks to a 16-byte boundary, the lowest four bits are always zero. The logging process is as follows. First, the before value is copied into the log entry with bits 0 to 3 and bits 48 to 63 masked out. Then, the index of the element and the low 16 bits of the epoch are written into bits 0 to 3 and bits 48 to 63 of the log entry field, respectively. The same write ordering applies as in element insertion: the actual update is only performed after the log entry is complete. Note that in this scheme, the logged epoch number is only the low 16 bits of the current epoch in which the value log is generated; we assume that the higher bits of the current epoch match the higher bits of nodeEpoch. If this is not true, or if the log entry in the same cache line as the value to be updated has already been used, external logging is used instead (if not already done).
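A sketch of this bit packing is shown below (function names are mine; the bit positions follow the description above).

```cpp
// Sketch of packing a value undo entry into 8 bytes: pointer bits 4-47 are
// kept, the value index goes into bits 0-3, and the low 16 bits of the epoch
// go into bits 48-63.
#include <cstdint>

inline uint64_t pack_value_log(uint64_t old_ptr, int index, uint64_t epoch) {
    uint64_t entry = old_ptr & 0x0000FFFFFFFFFFF0ULL;   // keep pointer bits 4-47
    entry |= uint64_t(index) & 0xF;                     // value index in bits 0-3
    entry |= (epoch & 0xFFFF) << 48;                    // low 16 epoch bits in 48-63
    return entry;
}

inline void unpack_value_log(uint64_t entry, uint64_t *old_ptr,
                             int *index, uint16_t *epoch_lo) {
    *index    = int(entry & 0xF);
    *epoch_lo = uint16_t(entry >> 48);
    uint64_t ptr = entry & 0x0000FFFFFFFFFFF0ULL;
    if (ptr & (1ULL << 47))            // restore the canonical upper bits
        ptr |= 0xFFFF000000000000ULL;  // by sign-extending bit 47
    *old_ptr = ptr;
}
```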
Each node is externally logged at most once during an epoch, even if it is updated multiple times. This is achieved by adding another external-log epoch number to the node. Before a node is to be externally logged, we check whether this epoch number equals the current epoch; if it does, the logging is skipped, since a before image of the node already exists. Logging each node only once per epoch also helps recovery, since restoration can then be done in parallel; otherwise, if there were multiple undo images for the same node, they would have to be applied in the reverse order of their creation. The paper also suggests that only leaf node updates be logged with InCLL. All other modifications, such as node splits/merges or internal node updates, are externally logged, since these modifications are relatively infrequent.
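A minimal sketch of this check follows; the ExternalLog interface is hypothetical, and extLogEpoch refers to the field added in the layout sketch above.

```cpp
// Sketch of once-per-epoch external logging of a whole node.
#include <cstddef>
#include <cstdint>

struct ExternalLog {
    // Hypothetical interface: persist a before-image of an object of n bytes.
    void append(const void *obj, std::size_t n);
};

void external_log_node(InCLLLeaf *n, uint64_t epoch, ExternalLog *log) {
    if (n->extLogEpoch == epoch)
        return;                      // before image for this epoch already captured
    log->append(n, sizeof(*n));      // copy the whole node as the undo image
    n->extLogEpoch = epoch;          // later updates in this epoch skip logging
}
```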
During recovery, the external log is first copied back to its original locations in parallel by a few recovery threads. InCLL logs could also be applied at this stage, but the paper points out that since there can be a significant number of leaf nodes, scanning all of them can be time consuming while only a small subset of leaves actually have a log entry. It is therefore recommended that leaf nodes be restored lazily during normal execution after the crash. The tree traversal routine first compares the epoch in which the log entry was written against the failed epoch, and if they match, the content of the node is first recovered by copying from the in-node log.
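A sketch of this lazy per-node recovery is shown below, reusing the field names and unpack_value_log from the earlier sketches; failedEpoch denotes the epoch that did not complete.

```cpp
// Sketch of lazy leaf recovery performed on first traversal after a crash.
#include <cstdint>

void maybe_recover_leaf(InCLLLeaf *n, uint64_t failedEpoch) {
    if (n->nodeEpoch == failedEpoch) {
        // The permutation was modified in the failed epoch: roll it back,
        // which demaps any key inserted (or remaps any key deleted) there.
        n->permutation = n->permutationLog;
    }
    // Value logs carry only the low 16 epoch bits; the high bits are checked
    // against nodeEpoch as described above (omitted here for brevity).
    if (uint16_t(n->valueLog_lo >> 48) == uint16_t(failedEpoch)) {
        uint64_t old_ptr; int idx; uint16_t epoch_lo;
        unpack_value_log(n->valueLog_lo, &old_ptr, &idx, &epoch_lo);
        n->values_lo[idx] = old_ptr;   // restore the overwritten pointer
    }
    // The same check applies symmetrically to valueLog_hi / values_hi.
}
```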