Pronto: Easy and Fast Persistence for Volatile Data Structures



- Hide Paper Summary
Paper Title: Pronto: Easy and Fast Persistence for Volatile Data Structures
Link: https://dl.acm.org/doi/10.1145/3373376.3378485
Year: ASPLOS 2020
Keyword: NVM; Pronto; Logical Logging



Back

Highlights:

  1. Semantics logging, or logical logging, has the benefit of very little logging data overhead, and very weak write ordering. Only the method identifier and arguments are saved to the log, instead of every data update. Persistence can also be moved to the background, since data updates will never have write ordering with logs (only control path is ordered with log persistence).

  2. The observation on linearizability is interesting, because log entries are stored linearlly. If a method requires time travel in serialization order, then the method’s log entry will have to be inserted into the stream of the log, requiring shifting of existing entries.

Questions

  1. op_commit() should be the point that log entries are generated and persisted, rather than op_begin(), since log entries must be written in one of the the valid serialization orders. The order can only be determined when the serialization point is attained, e.g. after all locks are acquired in 2PL or after the write set is locked in OCC. If persistence begins at op_begin(), it is still unknown the ordering of the log entry, since two concurrent operations sharing at least one lock in the 2PL example will race on op_begin(), but not race on op_commit(). Even in the example of concurrent unordered_set given in the paper, it is possible that operation A executes op_begin() first, but acquires lock L later, while operation B executes op_begin() after A, but acquires the lock earlier. In this example, the ordering of the two log entries (A > B) are, in fact, inconsistent with their logical ordering (B > A).

  2. How does background thread synchronize with each other? Do they acquire a global monolithic timestamp using fetch-and-increment? Background threads may parallel append to the log, but the paper does not give any detail on how race condition is avoided in this case.

  3. The paper mentions “commit number” in section 4.1, without ever explaining it. Is it the global counter in the previous point?

  4. How does Pronto, as a user level library, handle page faults during checkpointing? Is a user level page fault handler be employed (I know there is no technical difficulty in doing so, but the paper should explicitly point it out).

This paper presents Pronto, a software library for transforming existing data structures implemented on volatile memory to their non-volatile counterparts. The paper points out that the current trend of implementing non-volatile data structures either use libraries to denote failure-atomic regions, or develop ad-hoc algorithms to support persistence. In order to denote failure-atomic regions, the programmer needs to take an existing data structure implementation, and instrument all memory writes with library calls. This can be tedious, errorneous, and most importantly, it is not always possible to instrument all memory writes, due to the usage of third-party libraries. On the other hand, designing an ad-hoc data structure requires large amount of work on debugging and verification. In practice, neither approach works very well.

Pronto solves the above issue with a combination of semantics logging, background persisting threads, and periodic checkpointing. Instead of logging every data update within the failure-atomic region, requiring persistence barriers for each update, Pronto only persists the logical operation and their parameters to the NVM. Recovery is performed by re-executing these invocations in the logical serialization order, bringing the memory state of the data structure to the pre-crash state. To avoid executing the persisting barrier for the per-invocation log entry, Pronto proposes that background threads be used to overlap persistence with execution. Thanks to semantic logging, the only write ordering that must be observed by the persisting thread and the executing thread is when the operation completes, the function must return after the log entry is persisted. This enables the background thread to persist the log entry at any point during execution, unlike undo or logging, where a per-store write ordering must be observed, blocking store operations until a log entry is persisted. Checkpointing is also employed to reduce recovery overhead and perform garbage collection on the semantic log. Periodically, the background checkpointing thread saves the current consistent image of the working set to the NVM. Recovery can simply start from the most recent checkpointed state, and only re-execute log entries generated after the checkpoint.

Pronto is implemented as a C++ class, PersistentObject, serving as the base class of data structure types. The two most important methods of the class are op_begin() and op_commit(), which generates a semantic log entry and persists the entry to the NVM respectively. Programmers can just take an existing data structure class, make it a subclass of PersistentObject, override methods whose effects should be persistent with a wrapper function, and call op_begin(), op_commnt() in in wrapper function. In all cases, op_begin() is inserted at the beginning of the function, taking the function pointer and arguments to the wrapper as parameters. Internally, a semantic log entry with the value of the method pointer and its arguments are generated, and dispatched to the background thread for persistence. op_commit(), on the contrary, synchronizes the execution thread with the persisting thread, the effect of which is similar to how write ordering is enforced using persistenct barriers. Serial operations can just insert op_commit() at the end of the wrapper before it returns, while parallel data structures should make sure that the order of entries in the semantic log is consistent with the logical ordering of the real-time execution. For example, the paper suggests that, for data structures using 2PL, op_commit() should be called right after all locks have been acquired, when 2PL attains the serialization point.

In order for re-execution to be feasible, Pronto requires that all observable states produced by methods from the volatile data structure must be a function of their arguments. In other words, methods must not access any global state to ensure reliable re-execution.

Furthermore, in the case of parallel data structures, Pronto also requires that all methods must be linerrizable, i.e., in addition to being serializable, the logical serialization point of each method must also be within the real time execution of the method, not time-travel to the past, or be postponed into the future after the method returns. If this is not observed, it is difficult to guarantee that the log ordering is consistent with serialization order, since the log entry must be persisted before the method is invoked, which requires knowledge into the future, or after the invocation returns, which requires non-trivial bookkeeping.

Pronton spawns one background persistence thread for each foreground thread. Foreground and background threads communicate via op_begin() and op_commit() calls. On op_begin(), the foreground thread passes the identify of the method being called as a function pointer, as well as the arguments, to the background thread. Persistence of the log entry is performed by the background thread while the foreground thread executes. A flag is set after persistence completes. On op_commit(), the foreground thread repeatedly checks the flag until it is set. After op_commit() returns, the operation is logically persisted, which is guaranteed to be recoverd after a crash.

The semantic log is maintained as a fixed sized circular buffer on the NVM address space. Log entries consist of the method pointer, method arguments, and a commit number (although not explained in the paper). Log entries should be garbage collected regularly by writing checkpoints to the NVM. A checkpoint consists of a consistent image of the data structure, with no active operations half way during execution. Pronton should be able to restore the checkpoint to the exact same virtual address when it was taken on recovery, as all pointer values are saved as-is without relocation.

We elaborate the checkpoint process as follows. When checkpoint starts, the background thread first avoids new operations from being started by blocking op_begin(). Then it waits for all existing operations to complete for the data structure to enter a consistent state. It then changes the access permission of all pages used by the current working set to read-only, after which op_begin() is unblocked. The checkpoint thread then streams all pages in the working set to the NVM in the background. The permission of a page is reverted back to read-write after its checkpoint has been taken. If a foreground thread updates a page before the permission is reverted, a page fault will be raied by the OS. The paper suggests that Pronton installs a user-level page fault handler, which gives priority to the faulting page by checkpointing it first, after which the permission of the page is reverted. Note that unlike an OS fork() system call, copy-on-write is unnecessary in this case, since the checkpoint process is read-only. After a checkpoint is completed, all log entries generated before the checkpoint is taken can be safely recycled.

To help tracking the working set at page granularity, the paper also proposes that a specialized DRAM allocator be used to replace the default allocator. Pronton’s allocator reserves a range of consecutive virtual pages, and tracks the current usage status of these pages for checkpointing purposes. The consecutive range can also be extended dynamically, if allowed by the OS. Note that each instance of the data structure must have a separate instance of the allocator of its own consecutive page range. The paper also suggests that huge pages can be used to reduce TLB misses, if the working set is large.

On recovery, the recovery handler first locates the checkpoint on the NVM, if any, and loads the memory image back to the DRAM. The handler must ensure that the image can be loaded to the same virtual address as before the crash. The handler then scans the semantic log for the first entry whose commit number is larger than the commit number of the checkpoint. Log entries are replayed by re-executing them in the logical serialization order.