Lazy Release Persistency



Paper Title: Lazy Release Persistency
Link: https://dl.acm.org/doi/10.1145/3373376.3378481
Year: ASPLOS 2020
Keyword: NVM; Persistency Model; LRP; Release Persistency




Highlight:

  1. Log-free data structures do not need a full barrier, since they perform mutations via CAS, which always transitions the data structure from one legal state to another. As long as the contents are persisted before the CAS, an LFD is always in a consistent state.

  2. Multiple LFD state mutations can be persisted in parallel as long as they do not form a data dependency, i.e., as long as they are not related by release-acquire.

  3. Release-acquire tags are used to indicate data dependency (see the sketch after this list). For example, if you CAS a delta node onto a B+Tree, the CAS is tagged as a release, meaning the delta node's contents should be persisted before the CAS is persisted. If another thread wants to read this delta node, it must access it via the just-CAS'ed pointer using acquire semantics, such that before the read can take place, all writes in the delta node and the CAS itself must be persisted, in case the read feeds another mutation.

  4. LRP honors two rules: (1) when a release operation is synchronized with an acquire, all dirty lines written before the release must be persisted; (2) when the L1 is about to lose track of a released line, the L1 controller should treat it as a synchronization, although the released line itself can be persisted in the background.
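
To make Highlights 3 and 4 concrete, here is a minimal sketch in C++11 atomics notation, used as a stand-in for the hardware release/acquire tags; DeltaNode, head, publish, and observe are illustrative names, not from the paper.

```cpp
#include <atomic>

struct DeltaNode {
    int        payload;  // private writes: must persist before the CAS
    DeltaNode* next;
};

std::atomic<DeltaNode*> head{nullptr};

void publish(DeltaNode* d, int value) {
    d->payload = value;                      // private write
    DeltaNode* old = head.load(std::memory_order_relaxed);
    do {
        d->next = old;                       // still private
        // Release CAS: LRP persists the node's dirty lines before the
        // CAS itself is allowed to persist.
    } while (!head.compare_exchange_weak(old, d,
                                         std::memory_order_release,
                                         std::memory_order_relaxed));
}

DeltaNode* observe() {
    // Acquire load: if it synchronizes with the release above, LRP first
    // persists the CAS and everything before it, in case this read feeds
    // another mutation.
    return head.load(std::memory_order_acquire);
}
```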

Questions

  1. This paper is poorly motivated: (1) it does not state why LFDs need no recovery (because state changes are made with CAS, maintaining the invariant that the data structure is in a legal state after every CAS mutation); (2) it does not state why ARP is insufficient for LFDs (because the CAS might be made persistent before the writes to private nodes, i.e., the CAS should be treated as a single-direction barrier, enforced lazily); (3) it does not even state why the proposed RP is critical for LFDs.

  2. Although not the fault of this paper, ARP should not really be called ARP, since it has little to do with the acquire-release consistency model. Conventionally, acquire-release means that memory operations should not be reordered before an acquire or after a release, so that critical sections do not "leak". The ARP discussed in this paper simply enforces ordering between data-dependent critical sections, and ARP is a poor choice of name since reordering with the release operation is actually allowed.

  3. Writing to an already-dirty line from a larger epoch should be undefined behavior, since in LFDs such writes are expected to take place via an acquire operation first (e.g., acquiring a lock, or reading a pointer). If no acquire operation is used, and a dirty line from an older epoch is read or written, then a data dependency can still form, which leads to incorrect recovery.

  4. Despite all the above issues, the biggest problem I see in this paper is: are the epoch counter and ACK counter part of the thread context? How do you deal with context switches between threads on the same core (all using LRP)? In that case, when an ACK arrives, how do you know which thread's ACK counter should be decremented, if the counter is part of the context? If the counter is not part of the context, then a context switch may cause problems, since epochs will no longer be monotonically increasing, and there are lots of corner cases.

This paper proposes Lazy Release Persistency (LRP) to optimize specifically for log-free data structures (LFDs). The paper starts by observing that current persistency models all face similar problems. First, most previous proposals are overly restrictive when it comes to persistence ordering barriers in the form of cache flushes and memory fences. These proposals either fail to exploit the inherent parallelism in most NVM workloads by stalling the pipeline on persistence operations, fully coupling memory consistency with persistence, or they insert full barriers regardless of the semantics actually required; even when persistence is conducted in the background, this remains sub-optimal, since not all legal reorderings are possible. Second, some proposals are overly relaxed, failing to enforce certain orderings, which makes it non-trivial for LFDs to perform crash recovery. These models, although still useful for other types of tasks, require extra software support in order for LFDs to work properly.

The paper gives detailed overviews of three possible persistency models. The first, Epoch Persistency (EP), although not directly mentioned in the paper, serves as the foundation of the other two, and is the only one currently implemented on mainstream x86 hardware. In epoch persistency, the dynamic instruction flow is divided into contiguous pieces, called epochs. The persistency model enforces the rule that write operations must be persisted in the order of epochs, while allowing arbitrary ordering of writes within the same epoch. Application programs start a new epoch by issuing cache flushes to all dirty cache lines that are to be written into the NVM, followed by a store fence. The store fence blocks the store buffer until all previous cache line write-backs have completed, after which later stores can proceed to the L1 cache and become visible. As discussed above, epoch persistency couples persistence with consistency, meaning that a memory operation must not become visible to other cores before stores in earlier epochs become persistent, which limits the parallelism of persistence despite the fact that both the device and the application have abundant parallelism.
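
For concreteness, here is a minimal sketch of how software draws an epoch boundary on today's x86, assuming CLWB support (compile with -mclwb; clflushopt or clflush would take its place on older parts). The function name and constants are illustrative.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

constexpr size_t kLineSize = 64;

void epoch_boundary(const void* obj, size_t size) {
    auto p   = reinterpret_cast<uintptr_t>(obj) & ~(kLineSize - 1);
    auto end = reinterpret_cast<uintptr_t>(obj) + size;
    for (; p < end; p += kLineSize)
        _mm_clwb(reinterpret_cast<void*>(p));  // write back one dirty line
    _mm_sfence();  // fence: later stores wait for the write-backs,
                   // so they belong to the next epoch
}
```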

The paper then discusses Buffered Epoch Persistency (BEP), which improves over epoch persistency by decoupling persistence from consistency. Writes in the store buffer are no longer blocked by non-persistent stores in the previous epoch. Instead, these writes are injected into the L1 cache as soon as they are ready, and the cache hierarchy relies on extra mechanisms to track the epoch of each write, persisting them in the background while epoch ordering is still honored. Although BEP moves persistence off the critical path of stores, it is still sub-optimal, since writes in later epochs cannot bypass writes in earlier epochs, implying that when a write from a later epoch is to be evicted out of the hierarchy (or, most likely, out of the L1, since this is where the tracking mechanism is implemented), the eviction must be blocked until all previous epochs have been persisted.

The paper next introduces Acquire-Release Persistency (ARP), which relaxes the epoch property of EP and BEP by allowing epochs to be discontiguous in the dynamic trace, and hence to be persisted lazily. ARP delays the insertion of a persistence barrier to the beginning of the next epoch, instead of placing it right after the end of the current epoch, overlapping execution with persistence. In ARP, memory operations are optionally tagged as acquire and release, roughly corresponding to lock acquire and lock release in explicit synchronization code. ARP mandates that if an acquire operation synchronizes with a release by operating on the same data item (typically an unlock-lock pair), then all writes performed before the release must be persisted before all writes after the acquire. In the hardware implementation, this is achieved by lazily inserting a full persistence barrier when an acquire synchronizes with a previous release. The release operation itself, as well as all writes before it, are written back to the NVM before the acquire operation takes place. A sketch of this ordering follows.
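
The sketch below renders the ordering ARP enforces in C++ atomics notation; the spin lock and the two shared fields are hypothetical.

```cpp
#include <atomic>

std::atomic<int> lock{0};
int shared_a = 0, shared_b = 0;

void thread1() {
    while (lock.exchange(1, std::memory_order_acquire)) {}  // acquire
    shared_a = 1;                               // write before the release
    lock.store(0, std::memory_order_release);   // release
}

void thread2() {
    // If this acquire synchronizes with thread1's release, ARP lazily
    // inserts a persist barrier here: shared_a (and the release itself)
    // must persist before any write after this acquire persists.
    while (lock.exchange(1, std::memory_order_acquire)) {}
    shared_b = shared_a + 1;                    // write after the acquire
    lock.store(0, std::memory_order_release);
}
```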

The paper then argues that Log-Free Data Structures (LFDs) can benefit from a well-designed persistency model. LFDs differ from conventional data structures in that they do not require any explicit crash recovery, unlike the latter, where logging or shadowing must be applied to roll back partial changes or redo committed updates. In LFDs, all state mutations are achieved by first performing private writes (e.g., allocating a node, and then initializing it), and then publishing the state change using an atomic CAS or a release write. Multi-step mutations always use CAS for each state transition, and each intermediate state is well-defined as a legal state of the data structure. On crash recovery, since the data structure's state only transits between these legal states, no recovery is required, and the LFD is guaranteed to be consistent. Generally speaking, the most critical write ordering for an LFD is that private writes must be ordered before the CAS, meaning that the content of the mutation must be in a recoverable state before being published to other threads. Otherwise, the mutation may be only partially persisted after a crash, rendering the LFD inconsistent. The pattern is sketched below.
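
Here is a minimal log-free linked-list insert; Node, list_head, and insert are illustrative names. The comments mark why each crash point is harmless, assuming the private writes persist before the CAS does.

```cpp
#include <atomic>

struct Node { int key; Node* next; };
std::atomic<Node*> list_head{nullptr};

void insert(int key) {
    Node* n = new Node;   // crash here: node unreachable, list still legal
    n->key = key;         // private write
    Node* old = list_head.load(std::memory_order_relaxed);
    do {
        n->next = old;    // still private
    } while (!list_head.compare_exchange_weak(
                 old, n,
                 std::memory_order_release,    // the publishing CAS
                 std::memory_order_relaxed));
    // Crash before the CAS persists: the list is unchanged. Crash after:
    // a fully initialized node is reachable. Both are legal states, so
    // recovery is a no-op.
}
```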

Although one may feel tempted to just tag the CAS operation as a release and rely on hardware to control the persist ordering, the paper points out that ARP is insufficient for implementing LFDs. ARP, as discussed above, only controls write ordering between two acquire-release enclosed critical sections when the acquire and release access the same data item. The ordering between the release operation itself and the writes before it, however, is not enforced, which is fatal for the LFD's consistency guarantee.

On the other hand, EP and BEP are sufficient in terms of correctness, but rather inefficient, since they require issuing two full barriers: one before the release, the other after it. The second full barrier is important for the integrity of data flow after a crash, since any other thread accessing the CAS'ed data item can potentially establish a data dependency with the CAS'ing thread. If the CAS'ed value is not persisted before the content it points to is used to conduct other mutations, it is possible that, after a crash, mutations based on the pointer's value have been persisted while the pointer itself, and hence the contents it points to, are not. This violates the integrity of data dependencies: even if the LFD is in a consistent state, some of its states depend on states that no longer exist after the crash. The hazard is sketched below.
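
The following sketch renders the hazard with hypothetical head and index pointers published by two threads.

```cpp
#include <atomic>

struct Node { int val; Node* next; };
std::atomic<Node*> head{nullptr};   // published by thread A
std::atomic<Node*> index{nullptr};  // published by thread B

void thread_A(Node* n) {
    n->val = 1;                                // private write
    head.store(n, std::memory_order_release);  // publish n
}

void thread_B(Node* m) {
    Node* p = head.load(std::memory_order_acquire);  // observes n
    m->next = p;                                     // depends on A's publish
    index.store(m, std::memory_order_release);
    // If `index` persists but `head` does not, the post-crash image
    // contains m pointing at a node that was never published: a persisted
    // state depends on a state that no longer exists. The second barrier
    // (and LRP's acquire rule) rules this out.
}
```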

The paper therefore proposes Lazy Release Persistency (LRP), a persistency model and an architectural implementation that fill the performance and semantics gap between EP/BEP and ARP. Overall, LRP consists of two rules, which correspond to the two important ordering constraints of LFDs discussed above. The first rule is that all writes must be persisted before the release operation, such that if the release operation is persistent, all prior writes must also be persistent. Second, data dependencies must be honored, such that if an acquire operation synchronizes with a prior release, writes before the release must also be ordered before writes after the acquire. Note that, just like ARP and unlike EP and BEP, the second rule only enforces write ordering between different acquire-release pairs when there is an actual data dependency. When no data dependency is present, writes from different acquire-release pairs can be persisted in arbitrary order, granting more freedom of reordering and hence a higher degree of parallelism.

The LRP hardware is implemented as follows. First, each L1 cache controller has a current-epoch counter for identifying writes from different acquire-release pairs. Each pair represents an epoch (not necessarily contiguous in the dynamic trace), and all writes within the epoch are tagged with the epoch counter's value. The epoch counter is incremented whenever a release is issued to the cache. The cache controller also has a register counting the number of outstanding persistence requests; it is incremented for every persistence request issued to the NVM, and decremented when the NVM controller acknowledges the completion of a request.

To remember the earliest epoch in which a cache line was written, each cache line is extended with an extra epoch field, which is updated when the line is first written in an epoch. This field is set only once before it is cleared. In addition, each line has a bit indicating whether it was written by a release operation. To track cache lines written by a release operation, the L1 controller also maintains a release epoch table (RET), a CAM array storing cache line addresses and the epochs in which they were released (the release epoch is the new epoch after the counter is incremented). The RET is queried when a cache line with the released bit set is evicted or downgraded by coherence, in order to obtain the release epoch of the line. The paper suggests that 16 entries are sufficient for the RET, meaning the system can support at most 16 outstanding mutations that have not been fully persisted. This grants far more parallelism than EP and BEP, since dirty lines from these outstanding mutations can be written back at any moment, in parallel, without any write ordering enforcement. A sketch of this per-controller state follows.
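
Below is a minimal software rendering of the per-L1 metadata described above; the 16-entry RET size follows the paper, while the struct layout and field widths are assumptions.

```cpp
#include <cstdint>

struct ReleaseEpochEntry {
    uint64_t line_addr;      // address of a released line still in L1
    uint32_t release_epoch;  // epoch value after the release's increment
    bool     valid;
};

struct L1PersistState {
    uint32_t current_epoch;     // incremented on every release
    uint32_t outstanding_acks;  // persist requests not yet ACKed by NVM
    ReleaseEpochEntry ret[16];  // release epoch table (a CAM in hardware)
};

struct LineMeta {               // per-line tag extension
    uint32_t min_epoch;         // epoch of the first write; set once
    bool     released;          // line was written by a release
};
```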

We next describe the operation of LRP. Similar to ARP, LRP lazily persists dirty cache lines written in previous epochs when a data dependency is about to form, i.e., when a released line is acquired. This translates to the rule that when a released cache line is invalidated or downgraded by a coherence request, the L1 controller should delay the response to that request and immediately start an L1 tag walk to persist all lines written before the epoch in which the release was performed. The release epoch is obtained by a RET lookup. In addition, since release and epoch information is only tracked at the L1 level, when a released cache line is about to be evicted, the L1 controller must treat this as a potential data dependency as well (after the eviction, the L1 controller loses track of the line's status and has no way of knowing whether it is acquired). The same tag walk is initiated in this case.

The L1 tag walker is a state machine that scans the tags of all L1 cache lines and evicts those whose write epoch is smaller than the given epoch. The tag walker is invoked when a released cache line is about to be acquired or evicted, using the release epoch as the given value. Note that LRP requires released cache lines to be persisted after all other dirty data. Also note that when a tag walk happens, there can be several epochs to persist in the L1, meaning there can be multiple released lines besides the one that triggered the walk. The paper proposes that the controller perform a two-pass walk. In the first pass, only ordinary cache lines with smaller epochs are evicted; this is achieved by checking the per-line released bit. In the second pass, the remaining lines with a smaller epoch and the released bit set are persisted. To monitor persistence progress, the cache controller increments the request counter for each persistence request sent to the NVM, and decrements it for every ACK received. The second pass only starts when the request counter drops to zero after the first pass, and the walk completes only after the counter drops to zero again after the second pass. A sketch of the walk follows.
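
Here is a software rendering of the two-pass walk, reusing the L1PersistState and LineMeta types from the sketch above. persist() and drain() are stand-ins for the hardware actions: issuing a write-back to the NVM controller (which increments outstanding_acks) and waiting for the counter to drain to zero (each NVM ACK decrements it).

```cpp
#include <cstdint>

void persist(int line, L1PersistState& st);  // issue write-back; ++acks
void drain(L1PersistState& st);              // wait: outstanding_acks == 0

void tag_walk(L1PersistState& st, LineMeta meta[], int nlines,
              uint32_t release_epoch) {
    // Pass 1: ordinary dirty lines written before the release epoch.
    for (int i = 0; i < nlines; i++)
        if (!meta[i].released && meta[i].min_epoch < release_epoch)
            persist(i, st);
    drain(st);   // ordinary data must be persistent first

    // Pass 2: the released lines themselves (the line that triggered the
    // walk is handled separately by the downgrade/eviction path).
    for (int i = 0; i < nlines; i++)
        if (meta[i].released && meta[i].min_epoch < release_epoch)
            persist(i, st);
    drain(st);   // the walk completes once the counter drains again
}
```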

When a non-released dirty line is evicted out of the L1, it is immediately persisted by the lower-level controller. Such persistence does not trigger any tag walk, since LRP allows a younger epoch's dirty line to be persisted before an older epoch's, as long as the two epochs do not form a data dependency via release-acquire. Similarly, when a younger epoch writes a line that is already dirty from an older epoch, the operation is performed normally except that the per-line min-epoch tag is not updated. This is also allowed as long as the two epochs do not form a data dependency, which must be true, since otherwise the acquire operation would have triggered the persistence of all lines in the older epoch. (Note: in the paper this issue is called a "conflict", and is discussed in Sec. 4.2 rather than Sec. 5.) A sketch follows.
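
A short sketch of this conflict rule, again reusing the types from the earlier sketch; on_store and line_was_clean are illustrative.

```cpp
void on_store(LineMeta& line, L1PersistState& st, bool line_was_clean) {
    if (line_was_clean)
        line.min_epoch = st.current_epoch;  // first write: record epoch
    // Otherwise keep the older min_epoch: the line will persist with the
    // older epoch during a tag walk. This is safe because a genuine
    // dependency would have gone through an acquire first, which already
    // persisted all of the older epoch's lines.
}
```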

There is a slight difference between coherence downgrades/invalidations and evictions. In the former case, the persist of the released line must stall the response to the coherence request, since otherwise a data dependency could still violate persist ordering (the value would be visible before it is persisted). In the latter case, by contrast, the evicted line is likely not immediately acquired by a remote core, so the cache controller need not put the persist operation on the critical path of eviction. Instead, the lower-level controller temporarily blocks all coherence requests to the evicted line until the persistence has completed.