MOD: Minimally Ordered Durable Datastructures for Persistent Memory
- Hide Paper Summary
Link: https://dl.acm.org/doi/10.1145/3373376.3378472
Year: ASPLOS 2020
Keyword: NVM; MOD; Data Structure; Shadow Paging
Highlight:
-
Combining fine-grained shadow-paging with tree structures is a convenient optimization over shadow paging, since no special mapping table is needed, and an atomic pointer swing suffices to update the consistent image.
-
Many common data structures, such as vectors, maps, sets, lists, can all be represented with a uniform persistent data structure. This simplifies the implementation of algorithms for these data structures, since it is sufficient just to implement the algorithm for the underlying tree structure.
-
Write cascading only incurs write traffic proportional to the tree depth. The tree depth is logarithm of the total amount of data, indicating small extra memory usage (not write amplification!).
Questions
-
Not really a low light, but I am just thinking, is it possible to compose logging-based operations by passing a flag to disable log commit at the end of the operation (i.e. does not write undo commit record), and continue with the next operation which is supposed to be atomic with the first one. This can continue until we finished all operations, at which time a commit record can be written. Is there any difficulty implementing this, instead of shadow paging-based composition?
-
The introduction of pure functions to explain pure data structures actually over-complicates things. Instead of using pure functions, you can just say shadow paging does not alter public states while all updates are performed on the private copy that has not yet been made public, which will be made public using an atomic pointer swing. This rings better for system people, as most people are unfamiliar with PL concepts.
-
Although the total amount of extra storage for shadowing is small compared with total amount of data used, the paper did not measure write amplification to the NVM. Given large nodes (32 slots, for example, if radix trees are used), a single word update will suffer from a write amplification of at least 32x, not counting upper level propagations. This is really bad even compared with logging, but the paper did not evaluate write amplification to the NVM.
This paper introduces Minimally Ordered Durable Data Structures (MOD), a software library of persistent data structures that features fast persistence and composibility. The paper identifies that conventional logging-based persistence data structures that implement failure-atomicity have two issues. The first is excessive write orderings, which is caused by logging. For example, in undo logging, each log entry write must be persisted before the cache line is updated in-place to avoid dirty data being accidentally evicted back, polluting the consistent image before the log entry does. This limits the parallelism of cache line eviction on a single core to one, since the next eviction cannot start before the current one completes, while the actual hardware is capable of parallel eviction. The second issue is usability. Logging-based data structure operations are hard to compose, due to the fact that the undo log must be committed by writing a end-of-transaction mark at the end of the operation, limiting the scope of the failure-atomic region.
This paper also makes two important observations on the degree of parallelism on NVM flushing. The experiments involves issuing flush instructions to a certain number of randomly chosen dirty cache lines, and then issue a memory fence. Latency of the fence is measured as the overhead of write barriers of a certain write size. The first observation is that no matter how many flushes are issued in parallel, it seems that at most 16 of them can be overlapped without significantly increasing latency, but still at a lower cost. This might be an indication that the internal eviction buffer or MSHR for flushes have a maximum capacity of 16. When parallelism exceeds this value, the flush instruction will stall the pipeline until one of them is released. The second observation is that approximately 82% of all flushes can be parallel, while 18% of them are serial. This ratio remains pretty much consistent as the parallelism of experiments change, indicating that the performance of flushes are mainly determined by the 18% non-parallel writes to the NVM. Personally, I would believe that this is caused by random addresses hitting the same bank or persistent buffer within the NVM device, serializing accesses to these banks or buffers. Overall, these experiments confirm that the logging-based persistence approaches under-utilize cache line flush and NVM parallelism, resulting in sub-optimal performance.
Based on the above observations, the paper proposes that shadow paging be used instead of logging to implement failure-atomic persistent data structures. Shadow is made possible by using persistent data structures, which are naturally implemented with multi-versioning. Note that the word “persistent” here does not imply durability. Instead, it reflects the fact that the data structure will retain a previous version after being updated to a new version. Both versions can be accessed after the update, as if they are individually maintained.
The persistent data structures, within its internal, are usually implemented with tree structures, such as radix trees. When part of the structure is updated, a new node N is allocated, with the update applied. The upper level node M is also duplicated, with the pointer to the old N updated to point to the new N. Node M’s parent should also be updated similarly. This process propagates up until we reach the root of the tree, after which a new root is allocated. To access the newly updated tree, a new instance of the data structure wrapper, which consists of a pointer to the root node, the size, and other metadata, is allocated. The new instance points to the new root node, with metadata fields modified to reflect the update.
To incorporate the above model into the persistence (durable) framework, extra cache line flushes and persistence barriers are inserted to ensure write ordering. Thanks to the write pattern where no in-place updates are conducted, the consistency of the pre-update state remains intact, while node duplication, updates to the duplication, wrapper dunplication, and metadata updates are performed. Cache line flushes are inserted for new nodes and the new wrapper after updates completes, enabling parallel flushes. A persist barrier is then issued to ensure these updates are persisted before the next step. In the next step, the original data structure pointer is updated to the new wrapper object. This update can be performed with only a word granularity store, which is always atomic given that the word is properly aligned. The second persistence barrier is then issued to ensure that the pointer change is persistent as well, before returning to the caller.
Note that although the shadow update model essentially generates a new instance of the data structure under updates, the above description encapsulates this with a second barrier, which updates caller’s pointer with an atomic store with the new instance. This usage model also slightly changes the programming paradigm: In addition to passing all arguments for the update operation, the address of the pointer to the data structure instance should also be passed as an extra argument, which is updated with the address of the new instance on completion of the update.
In addition to the single operation usage model, the paper also provides two extra interfaces for operation composition.
The first, more general form is CommitUnrelated()
. Users should directly call update functions which return new instance
pointers (rather than wrapper versions which updates the pointer atomically) on objects. Then the user calls
CommitUnrelated()
on multiple old instance pointer addresses, paired with the new instance pointers. This function
essentially starts a software transaction (with conventional approach) to update these pointer addresses atomically,
achieving operation composition, since the effect of these operations are only persisted after all pointer swings are
completed.
The second interface is CommitParent()
, which is an optimization over CommitUnrelated()
. While the latter operates
on any arbitrary composition of objects, with objects connecting with each other in arbitrary topology, the former
commit function specifically optimizes the case where the objects to be updated all share a common parent node, which is
also given in the argument list. In this case, in order to commit all updates atomically, a new instance of the parent
object is created, with all fields that point to updated objects set to their new instances.
The parent object is then persisted atomically using a pointer swing and a persist barrier.
MOD constantly abandons one instance of the data structure wrapper and a few nodes that are “shadowed”, leaving them inaccessible. To prevent memory leak, the paper suggests that each node maintains a reference counter, which is incremented when a new reference is established on the node, and decremented when un-referenced. The node can be reclaimed when the counter drops to zero. The wrapper instances can be freed instantly, since they will not be shared.