Recipe: Converting Concurrent DRAM Indexes to Persistent-Memory Indexes



Paper Title: Recipe: Converting Concurrent DRAM Indexes to Persistent-Memory Indexes
Link: https://dl.acm.org/citation.cfm?id=3359635
Year: SOSP 2019
Keyword: NVM; B+Tree




In short: the elegant definition of “non-blocking” already covers the case where threads crash in the middle of an operation. Non-blocking data structures typically require some sort of help-along protocol that moves the data structure from one consistent state to the next. This protocol can be reused without any change for post-crash recovery, since a system crash is just all threads suddenly exiting the data structure at once. If the data structure is truly non-blocking, this is simply a normal case.

Important observation: recovering a data structure means recovering control-flow information and execution context, provided that data are persisted as they are written (i.e., the NVM image is kept identical to the DRAM image).

This paper proposes Recipe, a code transformation technique for deriving failure-atomic persistent data structures from their concurrent but non-persistent counterparts. The paper is motivated by the observation that persistence and isolation are similar: both require that the data structure be able to recognize partially updated states and fix them cooperatively during execution. In other words, most data structures require that operations be atomic with regard to concurrent operations, which is achieved by making all changes publicly visible in one atomic step. Likewise, failure atomicity requires that all changes be committed to the NVM and made available to post-crash recovery routines in one atomic step, such that after recovery either all information necessary to replay the changes is on the NVM, or none of it is.
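As a concrete illustration of “one atomic step,” here is a minimal DRAM-only sketch of a lock-free linked-list insert (the node type and names are hypothetical, not from the paper): the node is fully built in private memory, and a single CAS makes it visible to everyone at once.

```cpp
#include <atomic>

// Hypothetical node type for a lock-free singly linked list.
struct Node {
    int key;
    std::atomic<Node*> next;
};

// Insert n after pred: the new node is fully initialized in private memory
// first, then published with a single CAS. Concurrent readers either see the
// list without n, or with n fully formed -- never a partial update.
bool insert_after(Node* pred, Node* n) {
    Node* succ = pred->next.load(std::memory_order_acquire);
    n->next.store(succ, std::memory_order_relaxed);  // private; not yet visible
    return pred->next.compare_exchange_strong(       // the single atomic step
        succ, n, std::memory_order_release);
}
```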

The paper makes the following assumptions. First, it assumes that either an NVM-specific memory allocator replaces the volatile allocator, or the language runtime performs garbage collection. In either case, the allocator or collector must keep its own metadata consistent internally, so that it neither corrupts that metadata nor hands out the same block twice. If a memory block has been allocated (i.e., removed from the allocator’s free list) but not yet linked into the data structure, it is leaked after a crash, since the ownership transfer is non-atomic. The garbage collector should find such leaked blocks either by an offline scan before restart or by online background GC. The second assumption is that read operations must be non-blocking; writes can be blocking, but there must be a way to transform them into atomic updates between consistent states. As we will see later, this assumption forms the basis of the code transformation. The last assumption is that the data structure’s concurrency control is correct: a buggy parallel implementation will yield a buggy persistent implementation.
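A minimal sketch of the offline leak scan, under an entirely assumed allocator interface (allocated_blocks and free_block are hypothetical, as is the node layout): mark everything reachable from the root, then sweep blocks the allocator considers live but that nothing reaches.

```cpp
#include <unordered_set>
#include <vector>

// Hypothetical persistent node and allocator interface; none of these names
// come from the paper.
struct PNode { PNode* left; PNode* right; /* payload ... */ };

std::vector<PNode*> allocated_blocks();  // assumed: blocks the allocator thinks are live
void free_block(PNode* b);               // assumed: return a block to the free list

// Offline scan before restart: a block carved out of the free list but never
// linked into the structure (crash between allocation and the publishing
// store) is unreachable from the root and can safely be reclaimed.
void reclaim_leaks(PNode* root) {
    std::unordered_set<PNode*> reachable;
    std::vector<PNode*> stack;
    if (root) stack.push_back(root);
    while (!stack.empty()) {                        // mark: DFS from the root
        PNode* n = stack.back();
        stack.pop_back();
        if (!reachable.insert(n).second) continue;  // already visited
        if (n->left)  stack.push_back(n->left);
        if (n->right) stack.push_back(n->right);
    }
    for (PNode* b : allocated_blocks())             // sweep: unreachable => leaked
        if (!reachable.count(b)) free_block(b);
}
```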

On a high level, Recipe works by allowing the intermediate states created by modification operations, both data updates and structural updates, to be exposed to concurrent threads. Without proper synchronization, such intermediate states lead to erroneous behavior: information may be lost (e.g., some of it temporarily lives only on the local stack of the modifying thread), metadata may become contradictory (e.g., a non-atomic metadata update), or readers may observe corrupt data (e.g., a non-atomic data update exposes partially written data). To ensure correctness, updates must satisfy the following requirements.

First, each step of an update operation atomically transforms the publicly visible state from one consistent state to another. This can always be achieved with log-structured updates: a log tail pointer or “validity bitmap” is maintained for each concurrently updated object, marking the end of the currently visible log or the “active slots” of the object (if it consists of slots, as a B+Tree node does). To perform an atomic update, a thread first allocates space at the end of the log or acquires a free slot. It then updates the object by appending to the log or writing into the unused slot; during this process, the appended or written data is invisible to other readers. Finally, it commits the update by atomically advancing the log tail pointer or setting the bit in the validity bitmap with a CAS (a sketch of this pattern follows the requirements below).

The second requirement is that threads must recognize such intermediate states and help along when they observe an incomplete update. To achieve this, either update operations explicitly identify themselves, as in the BwTree and BzTree (PMwCAS), by posting a record that describes the incomplete operation (its type and the arguments needed to finish it), or update operations are carefully designed to create unique intermediate states that carry sufficient information to complete or roll back the operation. This property makes crash recovery trivial: after a crash, threads fix any data structure found in an intermediate state using the same logic as during normal execution.

The last requirement is that data structure operations must be non-blocking, or at least transformable to non-blocking with reasonable effort. When readers observe intermediate states created by concurrent updates, they must be able to fix those states themselves, rather than retrying or blocking on the temporary state (e.g., spinning on a lock). A blocking read protocol expects some other thread to finish the operation and restore consistency within a finite number of steps; after a crash, however, no thread is working on restoring the state, so threads must help each other along.
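Here is a minimal sketch of the slot-plus-validity-bitmap pattern from the first requirement, with hypothetical names (persist stands in for a cache-line flush plus fence): a slot is first reserved, then filled and persisted while still invisible, and finally committed by flipping one bit with a CAS.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

void persist(const void* p, std::size_t n);  // assumed: cache-line flush + fence

constexpr int SLOTS = 32;

struct Node {
    std::atomic<uint32_t> alloc;  // bit i set => slot i reserved by some writer
    std::atomic<uint32_t> valid;  // bit i set => slot i holds committed data
    uint64_t slots[SLOTS];
};

bool insert(Node* node, uint64_t value) {
    int i;
    for (;;) {                                          // step 1: reserve a free slot
        uint32_t a = node->alloc.load(std::memory_order_acquire);
        int pos = __builtin_ffs(static_cast<int>(~a));  // first zero bit, 1-based
        if (pos == 0) return false;                     // node full: caller must split
        i = pos - 1;
        if (node->alloc.compare_exchange_weak(a, a | (1u << i),
                                              std::memory_order_acq_rel))
            break;
    }
    node->slots[i] = value;                             // step 2: write, still invisible
    persist(&node->slots[i], sizeof node->slots[i]);

    uint32_t v = node->valid.load(std::memory_order_acquire);
    while (!node->valid.compare_exchange_weak(v, v | (1u << i),
                                              std::memory_order_release))
        ;                                               // step 3: commit in one atomic step
    persist(&node->valid, sizeof(uint32_t));
    return true;
    // After a crash, alloc bits without matching valid bits are leaked slots;
    // a recovery pass can simply reset alloc to valid. Readers consult only valid.
}
```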

The paper enumerates three types of data-structure synchronization for which the transformation is straightforward. In the first type, every update (data or structural) becomes visible via a single atomic store or CAS. In this case, persisting the last store with a persistence barrier suffices, since the visible state of the data structure changes only at these persistence points. A crash between two persistence points automatically rolls back the incomplete update. On recovery, the visible state is still consistent, while the “invisible” state (such as log entries beyond the tail pointer, or data in inactive slots) may be inconsistent; the data structure must tolerate such inconsistency anyway, since it also occurs during normal operation.
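For this first category, the earlier DRAM linked-list sketch needs only two additions, shown below: persist the node before publishing it, and persist the cache line holding the link after the CAS. The persist implementation here uses clflush/sfence for portability; real code would prefer clwb where available.

```cpp
#include <atomic>
#include <cstddef>
#include <emmintrin.h>   // _mm_clflush, _mm_sfence

// Flush every cache line in [p, p+n) and fence: one possible implementation
// of the persistence barrier assumed by the other sketches. (A production
// version would align p down to a cache-line boundary first.)
inline void persist(const void* p, std::size_t n) {
    const char* c = static_cast<const char*>(p);
    for (std::size_t i = 0; i < n; i += 64) _mm_clflush(c + i);
    _mm_sfence();
}

struct Node { int key; std::atomic<Node*> next; };

bool persistent_insert_after(Node* pred, Node* n) {
    Node* succ = pred->next.load(std::memory_order_acquire);
    n->next.store(succ, std::memory_order_relaxed);
    persist(n, sizeof(Node));                 // node is durable before it is visible
    if (!pred->next.compare_exchange_strong(succ, n,
                                            std::memory_order_release))
        return false;                         // lost the race; caller retries
    persist(&pred->next, sizeof(Node*));      // barrier after the publishing store
    return true;
}
```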

In the second type, updates are not atomic but consist of several atomic steps, and each step transforms the data structure into a consistent state that can be recognized as the result of a pending update. A typical example is the BwTree, in which node splits and merges are not atomic, but a delta record is posted on the node being split or merged to announce an ongoing structural modification (SMO). Concurrent threads help complete the SMO before proceeding with their own work, serializing operations by the order in which their delta records are posted via CAS. For data structures of this type, the paper suggests issuing a persistence barrier on dirty data after each atomic step. On recovery, the data structure is always in one of a few well-defined intermediate states, which threads can identify and complete after restart exactly as they would during normal execution.
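A minimal sketch of this second category, using the classic mark-then-unlink delete of a lock-free list (a simplification of the Harris algorithm, not the BwTree itself): each atomic step is followed by a persistence barrier, and the marked pointer is the recognizable intermediate state that any thread, including the recovery pass, can complete.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

void persist(const void* p, std::size_t n);   // assumed flush + fence

// The low bit of next doubles as a delete mark; a set bit is the
// recognizable intermediate state of a pending delete.
struct Node { int key; std::atomic<std::uintptr_t> next; };

bool remove(Node* pred, Node* victim) {
    std::uintptr_t succ = victim->next.load(std::memory_order_acquire);
    if (succ & 1) return false;               // someone else is already deleting
    // Step 1 (logical delete): mark the victim, then persist the mark.
    if (!victim->next.compare_exchange_strong(succ, succ | 1,
                                              std::memory_order_acq_rel))
        return false;
    persist(&victim->next, sizeof(std::uintptr_t));
    // Step 2 (physical delete): unlink, then persist the link. A crash between
    // the two steps leaves the mark behind, and any thread that observes it
    // simply re-executes this step -- the same help-along logic recovery uses.
    std::uintptr_t expect = reinterpret_cast<std::uintptr_t>(victim);
    pred->next.compare_exchange_strong(expect, succ,
                                       std::memory_order_acq_rel);
    persist(&pred->next, sizeof(std::uintptr_t));
    return true;
}
```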

In the third type, updates are non-atomic and do not preserve the consistency of the data structure; to compensate, writers synchronize against each other and against readers with locks. Transforming data structures of this type is not straightforward: consistency is maintained with respect to concurrent threads, but not with respect to failures. In other words, state changes inside critical sections are unconstrained and may leave the data structure in a state that is difficult or impossible to identify. To solve this problem, the paper proposes first transforming the data structure into the second type, and then making it persistent by adding extra persistence barriers. The essence of the transformation is that control-flow information can always be inferred from the post-crash state: by reading an inconsistent state, a thread can identify the exact point where the crash happened, and hence complete the operation from that point. If the exact point cannot be identified, and we only know the crash happened between two persistence points, we must further guarantee that every step between those points is either a thread-local operation or idempotent, so that re-executing it during recovery yields the same state. With these guarantees, the recovery thread first recovers control-flow information by identifying the exact point (or range) where the crash happened, then transfers control to that point (or the start of the range) to complete the operation, as if the crash had never happened and it were the original thread finishing its work.
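To make the idea concrete, here is a sketch of a leaf split rewritten in this style, with entirely hypothetical structure and helper names: each step ends with a persistence barrier, each step's effect is observable in the persisted state, and every step is idempotent, so recovery simply re-runs the operation from the first step whose effect is missing.

```cpp
#include <cstddef>
#include <cstdint>

void persist(const void* p, std::size_t n);  // assumed flush + fence
struct Leaf;
void fix_parent(Leaf* left, Leaf* right);    // assumed, idempotent: (re)insert separator key

struct Leaf {
    Leaf*         sibling;   // step 2 publishes the new leaf here
    std::uint32_t count;     // step 3 truncates the old leaf to SPLIT_POINT
    std::uint64_t keys[64];
};
constexpr std::uint32_t SPLIT_POINT = 32;

// Split steps during normal execution: (1) allocate and fill the new leaf,
// persist; (2) set leaf->sibling, persist; (3) truncate leaf->count, persist;
// (4) insert the separator into the parent, persist. Recovery infers from the
// persisted state which step the crash interrupted and resumes from there,
// exactly as the original thread would have.
void recover_split(Leaf* leaf) {
    Leaf* s = leaf->sibling;
    if (s == nullptr) return;                // crashed before step 2: nothing is
                                             // visible; the half-built sibling is a leak
    if (leaf->count > SPLIT_POINT) {         // crashed before step 3:
        leaf->count = SPLIT_POINT;           // redo the idempotent truncation
        persist(&leaf->count, sizeof leaf->count);
    }
    fix_parent(leaf, s);                     // crashed before step 4 (or not at all):
                                             // re-inserting the separator is idempotent
}
```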

In addition, the paper proposes a better way of testing NVM programs. Prior work either interrupts the program at random points during execution, or tries to enumerate all possible contexts and test each of them. This paper observes that the globally visible state only changes when an atomic update is committed to the NVM, and therefore proposes interrupting the program only immediately before and after each persistence barrier. A crash is simulated by performing a longjmp from the current context back to a previously saved checkpoint, leaving the data structure in an intermediate state, at which point crash recovery can be invoked.
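A minimal sketch of that testing loop under hypothetical harness names (run_workload, run_recovery, check_invariants are assumed hooks, not RECIPE's actual harness): run k simulates a crash at the k-th persistence point by longjmping back to a checkpoint, after which recovery runs on the half-updated structure.

```cpp
#include <csetjmp>
#include <cstddef>

void persist(const void* p, std::size_t n);  // real flush + fence, assumed

static std::jmp_buf checkpoint;
static long barrier_seen;        // barriers hit so far in the current run
static long crash_at   = -1;     // which barrier to "crash" at (-1: never)
static bool crash_before;        // simulate the crash before or after the flush

// Drop-in replacement for persist() while testing.
void test_barrier(const void* p, std::size_t n) {
    long k = barrier_seen++;
    if (k == crash_at && crash_before)  std::longjmp(checkpoint, 1);
    persist(p, n);
    if (k == crash_at && !crash_before) std::longjmp(checkpoint, 1);
}

// One run per persistence point: the workload "crashes" at barrier crash_at,
// recovery repairs the intermediate state, and invariants are checked.
void run_crash_tests(bool before_flush) {
    crash_before = before_flush;
    for (crash_at = 0;; ++crash_at) {
        barrier_seen = 0;
        if (std::setjmp(checkpoint) == 0) {
            // run_workload();                     // assumed: exercises the index
            break;                                 // finished: every point tested
        }
        // run_recovery(); check_invariants();     // assumed harness hooks
    }
}
```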