Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory



Paper Title: Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory
Link: https://dl.acm.org/citation.cfm?id=1960480
Year: FAST 2011
Keyword: B+Tree; NVM; Versioning




Highlight:

  1. Novel adoption of versioning and multiversioned data structures in the realm of persistent data structures.

Questions

  1. The assumption that flush() can only ensure the atomicity of 8 byte stores is both overly restrictive and incorrect. It is overly restrictive because on x86 and most other architectures, cache lines are written back to the NVM at the granularity of 64 byte blocks (or some other fixed granularity), whose atomicity is guaranteed by the memory controller. It is incorrect because the paper does not explicitly state that even for 8 byte pointers/integers, their addresses must be aligned to a multiple of 8. If the compiler only aligns them to a multiple of 4, an 8 byte memory object may straddle a cache line boundary.

  2. The paper cites an external technical report on the HP research website, but unfortunately this report was never published.

  3. This paper does not cover multi-word updates of an object (e.g. a key object) in CDDS. Multi-word updates can be tricky: while an object is being updated, it is still visible to the current operation (otherwise how would the operation find it?), and hence the flush must be atomic, which is not guaranteed. One safe way is to decompose the update into two steps, first a delete and then an insert. This, however, introduces a new problem: the atomicity of these two steps.

    Combined with the discussion in the next point, I truly believe this paper should simply state that flush() is 64-byte atomic and that objects must be mapped to a single cache line, which guarantees the atomicity of single-object updates. For multi-step operations such as a tree structural modification, versioning is still the right choice, which is exactly the merit of this paper.

  4. I don’t believe the algorithm given in the paper is correct. For example, in Algorithm 2, where an existing item is reused in the node, lines 7 to 10 simply assume that we can update the timestamps and keys and then issue a flush. This, however, does not guarantee the atomicity of the three updates (i.e. they can be partially written back to the NVM). Imagine the system crashes after “n[entry_num].end = 0” is written back but before the other two fields are (this is possible since the cache can evict any block at any time). In this case the reused entry becomes visible again with the old key and old start timestamp, which is indistinguishable from an originally inserted entry.

    It would be better for the paper to assume that flush() guarantees 64 byte atomicity and to force every object to be mapped into a single cache line. This way, either all field updates become visible atomically, or none of them does (see the sketch after this list).
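As an illustration of that suggestion, the sketch below pads and aligns a hypothetical B+Tree entry so that it always occupies exactly one cache line; the field names mirror Algorithm 2, but the exact layout is an assumption, not the paper's:

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 64

/* Hypothetical CDDS B+Tree entry, padded and aligned so the whole
 * entry always lives in a single 64-byte cache line. If flush() is
 * 64-byte atomic, the key and both timestamps then become persistent
 * together, or not at all. */
struct cdds_entry {
    uint64_t key;
    uint64_t start;   /* begin timestamp (bts) */
    uint64_t end;     /* end timestamp (ets), 0 = "infinity" */
    uint8_t  pad[CACHE_LINE_SIZE - 3 * sizeof(uint64_t)];
} __attribute__((aligned(CACHE_LINE_SIZE)));
```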

This paper presents Consistent and Durable Data Structures (CDDS), a software technique for ensuring the consistency of data structures on Non-Volatile Memory (NVM). As NVM devices are directly attached to the memory bus, it is difficult to reason about persistence state changes, because memory writes can become persistent on the device in arbitrary order. Even when certain memory orderings are enforced, performing a data structure operation often requires several writes to non-contiguous locations, making it non-trivial to guarantee atomicity, the lack of which can introduce inconsistency into the data structure.

This paper proposes two levels of abstraction for implementing a CDDS. The first level implements an interface that flushes a series of cache lines back to the NVM, acting as a persistence barrier. All instructions executed after the persistence barrier can safely assume that previous writes have reached the NVM. The interface, named flush(), however, only guarantees 8 byte atomic writes to persistent storage. In other words, if the system crashes while flush() is being executed by the processor, some cache blocks may not reach the NVM before the crash, and their content will be lost.

The flush() interface consists of a memory barrier, mfence, a series of cache line flushes, and another memory barrier (the paper does not mention pcommit, largely because pcommit had not yet been proposed at the time of writing; the instruction was later deprecated anyway). The first memory fence stalls the processor until the write buffer is drained, which ensures that all previous memory writes have reached the cache by the time the cache flushes execute. This is necessary because otherwise the actual cache write may be reordered after the cache flush under relaxed memory consistency. The clflush instructions in the middle invalidate the addressed cache lines throughout the hierarchy, stalling the processor until the write back finishes if a line is dirty. Note that at the time this paper was written, clflushopt had not yet been proposed as an optimized version of clflush. The clflush instruction may hurt performance because it invalidates cache lines, while all we actually need is a write back (and a coherence state change). The last memory fence orders the flush sequence with the instructions that follow, making sure no later memory operation is performed (and then evicted by the cache) before the current flush sequence completes.
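A minimal sketch of such a barrier in C for x86, using SSE2 intrinsics; the paper does not give this exact code, so the function body below is an assumption based on its description:

```c
#include <emmintrin.h>   /* _mm_mfence, _mm_clflush */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64

/* Write every cache line in [addr, addr + len) back to NVM and wait
 * for completion. The paper predates clflushopt/clwb, so plain
 * clflush is used here. */
void flush(const void *addr, size_t len) {
    const char *p   = (const char *)((uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1));
    const char *end = (const char *)addr + len;

    _mm_mfence();                     /* drain prior stores into the cache */
    for (; p < end; p += CACHE_LINE_SIZE)
        _mm_clflush(p);               /* write back (and invalidate) each line */
    _mm_mfence();                     /* flushes complete before later accesses */
}
```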

The second level of abstraction leverages versioning to guarantee that all changes become visible atomically, even when an operation involves several memory updates to different cache lines, which we explain as follows. CDDS maintains a 64 bit integer, mapped to the NVM, as the “current time” timestamp. This timestamp stores the current logical version of the data structure. Every “object” in a CDDS has two fields: a begin timestamp (bts), the minimum timestamp required to access the object, and an end timestamp (ets), at and after which the object is no longer accessible. Every operation on the data structure first reads the current logical timestamp, and then uses it to determine the visibility of objects within the data structure. To elaborate: if the current timestamp is ts, then an object is accessible to the current operation if and only if ts lies within the interval [bts, ets), where bts and ets are the begin and end timestamps of the object to be accessed (to represent an “infinitely large” time, the paper reserves timestamp zero as a special value that may appear only in ets).
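In code, the visibility rule could look like the sketch below; the struct layout and the names (cdds_object, is_visible) are ours, not the paper's:

```c
#include <stdint.h>
#include <stdbool.h>

/* Versioned object header: [bts, ets) is the half-open interval of
 * logical timestamps that can see this object. ets == 0 stands for
 * "infinity", i.e. the object has not been deleted yet. */
struct cdds_object {
    uint64_t bts;   /* begin timestamp */
    uint64_t ets;   /* end timestamp, 0 = still live */
    /* ... payload ... */
};

/* An operation running at logical time ts sees the object iff
 * bts <= ts and (the object is not deleted, or ts < ets). */
bool is_visible(const struct cdds_object *o, uint64_t ts) {
    return o->bts <= ts && (o->ets == 0 || ts < o->ets);
}
```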

When a new object is added, we first create it with bts equal to ts + 1 and ets equal to 0. The newly created object is hence invisible until we advance the current time. After the objects are added (there can be more than one), we issue a flush() command to force them back to the NVM, then increment the current time counter, and finally flush the counter back to the NVM. The last flush operation is atomic, since it only writes back the 8-byte current time counter; it is also the persistence point of the newly added objects. Once the last flush() completes, all new objects are visible to later operations, since their bts now equals the current logical timestamp. If, on the other hand, the system crashes before the last flush() completes, the increment of the logical time is not reflected on the NVM. After the system reboots, the timestamp read from the NVM is still bts - 1, meaning that the newly added objects are invisible to the post-crash recovery routine.
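A sketch of this insertion protocol, under the same assumed definitions (the flush() barrier above, the hypothetical cdds_object, and an 8-byte current_time counter mapped on NVM):

```c
#include <stddef.h>
#include <stdint.h>

extern uint64_t current_time;   /* 8-byte logical clock, mapped on NVM */
extern void flush(const void *addr, size_t len);

struct cdds_object { uint64_t bts, ets; /* ... payload ... */ };

/* Make a batch of newly created objects durable and visible in one
 * atomic step: the 8-byte flush of current_time is the commit point. */
void commit_insert(struct cdds_object **objs, int n) {
    uint64_t ts = current_time;
    for (int i = 0; i < n; i++) {
        objs[i]->bts = ts + 1;          /* invisible at current time ts */
        objs[i]->ets = 0;               /* "infinity": not deleted */
        flush(objs[i], sizeof(*objs[i]));
    }
    current_time = ts + 1;              /* commit: new objects become visible */
    flush(&current_time, sizeof(current_time));
}
```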

Similarly, when an object is to be deleted, we set its end timestamp to ts + 1, flush the object, increment the current timestamp counter, and flush the counter. If the counter made it into the NVM, then according to the version visibility rule the object is no longer accessible, because any later operation’s timestamp will be larger than or equal to the ets. If the system crashes first, these objects are still valid, because the current timestamp still lies in the [bts, ets) interval.
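A matching sketch for deletion; again, the names and the exact batching are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

extern uint64_t current_time;   /* 8-byte logical clock, mapped on NVM */
extern void flush(const void *addr, size_t len);

struct cdds_object { uint64_t bts, ets; /* ... payload ... */ };

/* The commit point is again the 8-byte flush of current_time. If we
 * crash before it, the stale ets = ts + 1 is repaired during recovery. */
void commit_delete(struct cdds_object **objs, int n) {
    uint64_t ts = current_time;
    for (int i = 0; i < n; i++) {
        objs[i]->ets = ts + 1;          /* invisible once time reaches ts + 1 */
        flush(objs[i], sizeof(*objs[i]));
    }
    current_time = ts + 1;              /* commit: deleted objects disappear */
    flush(&current_time, sizeof(current_time));
}
```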

Objects can be reused once they have been made invisible by a previous increment of the current timestamp (so a reusable object must be a deleted one). A deleted object is reclaimed when an operation needs to create a new object (e.g. when inserting into a full B+Tree node would trigger a node split). We overwrite the fields of the deleted object, setting its bts and ets as described above, and then flush the object, making it live again. Please see the “Questions” comments at the beginning for a discussion of the problems with this scheme.
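For concreteness, a sketch of the reuse step under the same assumed layout; the comment records the crash-atomicity caveat raised in Question 4:

```c
#include <stddef.h>
#include <stdint.h>

extern void flush(const void *addr, size_t len);

struct cdds_entry { uint64_t key, bts, ets; };

/* Overwrite a dead entry (ets != 0, ets <= current time) in place.
 * NOTE: per Question 4 above, the three stores below are only crash-safe
 * if the entry sits in a single cache line and flush() writes that line
 * back atomically; otherwise a crash may expose a partial update. */
void reuse_entry(struct cdds_entry *e, uint64_t key, uint64_t ts) {
    e->key = key;
    e->bts = ts + 1;     /* visible only after current time is bumped */
    e->ets = 0;          /* live again */
    flush(e, sizeof(*e));
    /* ... the caller then increments and flushes the current time counter ... */
}
```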

After a crash, the recovery routine scans all objects in a CDDS and checks their timestamps. Some objects need to be “repaired” in order to function properly during normal operation. For example, if a delete operation is interrupted by the crash after it flushes all (or some) of the objects but before the current timestamp counter is flushed, these objects will have an ets of ts + 1. Although the delete operation was not committed, thanks to the versioning protocol, if these objects’ ets fields are not reset to zero, then after the next insert or delete on any object in the CDDS, all or some of these objects would suddenly disappear, because the current timestamp counter would then be incremented to a value equal to the objects’ ets. Similarly, unsuccessful inserts and reuses of deleted objects are cleaned up.
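A sketch of such a repair pass; the paper does not spell out here how uncommitted inserts are discarded, so the empty-interval trick below is our assumption:

```c
#include <stddef.h>
#include <stdint.h>

extern void flush(const void *addr, size_t len);

struct cdds_entry { uint64_t key, bts, ets; };

/* Post-crash repair: ts is the current time read back from NVM. Any
 * entry stamped with a timestamp greater than ts belongs to an
 * operation that never committed, so its effect is rolled back. */
void recover(struct cdds_entry *entries, size_t n, uint64_t ts) {
    for (size_t i = 0; i < n; i++) {
        struct cdds_entry *e = &entries[i];
        if (e->ets > ts) {
            e->ets = 0;          /* uncommitted delete: make live again */
            flush(e, sizeof(*e));
        } else if (e->bts > ts) {
            e->ets = e->bts;     /* uncommitted insert/reuse: the empty
                                    interval [bts, bts) is never visible */
            flush(e, sizeof(*e));
        }
    }
}
```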