Vorpal: Vector Clock Ordering for Large Persistent Memory Systems



Paper Title: Vorpal: Vector Clock Ordering for Large Persistent Memory Systems
Link: https://dl.acm.org/citation.cfm?doid=3293611.3331598
Year: PODC 2019
Keyword: NVM; Vorpal; Vector clock; FASE




This paper proposes Vorpal, a novel vector-clock-based scheme for ensuring memory persistence ordering on byte-addressable NVM. The paper assumes a programming model called Failure-Atomic Sections (FASEs): critical sections synchronized using locks whose modifications are expected to be persisted to the NVM. A FASE must observe two properties. First, all modifications within a FASE should be atomic, i.e. either all stores are persisted, or none of them is. Second, during recovery, if a FASE is rolled back, then none of the FASEs that depend on it should be persisted. In other words, data-flow ordering must also be obeyed when deciding whether a FASE should persist or be rolled back during recovery. In the past, researchers proposed to use runtime libraries to instrument lock and unlock operations to delineate the boundaries of FASEs, and then use a background thread to compute the data-flow ordering. The background thread computes the most recent consistent image by performing BFS on the graph induced by unlock-lock relations, advancing the “frontier” to determine whether a FASE should persist or be rolled back. FASEs within the frontier can be persisted safely (since the FASEs they depend on have also committed), while those outside the frontier must wait.
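As a minimal illustration (my own sketch, not the cited system's code) of how such a background pass might advance the frontier: the version below uses a simple fixed-point iteration over unlock-to-lock dependencies rather than an explicit BFS, and all names (`Fase`, `compute_frontier`) are assumptions.

```cpp
// Sketch of a frontier pass over FASEs. fase.deps lists the FASEs this one
// depends on via an unlock->lock edge; "complete" means all of its stores have
// reached NVM. A FASE joins the frontier only when it is complete and every
// FASE it depends on is already in the frontier.
#include <cstdio>
#include <vector>

struct Fase {
    bool complete;               // all stores of this FASE durably written
    std::vector<int> deps;       // FASEs this one depends on (unlock -> lock)
};

std::vector<bool> compute_frontier(const std::vector<Fase>& fases) {
    std::vector<bool> committed(fases.size(), false);
    bool changed = true;
    // Repeatedly advance the frontier until no more FASEs qualify.
    while (changed) {
        changed = false;
        for (size_t i = 0; i < fases.size(); ++i) {
            if (committed[i] || !fases[i].complete) continue;
            bool ready = true;
            for (int d : fases[i].deps)
                if (!committed[d]) { ready = false; break; }
            if (ready) { committed[i] = true; changed = true; }
        }
    }
    return committed;
}

int main() {
    // FASE 1 depends on FASE 0; FASE 2 depends on FASE 1 but is incomplete.
    std::vector<Fase> fases = { {true, {}}, {true, {0}}, {false, {1}} };
    auto frontier = compute_frontier(fases);
    for (size_t i = 0; i < frontier.size(); ++i)
        std::printf("FASE %zu: %s\n", i, frontier[i] ? "persist" : "wait");
}
```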

Vorpal is a hardware scheme that uses vector clocks to encode the ordering of critical sections and store operations. The paper assumes a system consisting of multiple cores, possibly spread across several sockets. Each socket has a few memory controllers that process read and write requests from the LLC. Since the physical address space may be partitioned between sockets, a store request generated at a core might be sent to a remote socket and written into remote memory; local writes are processed by one of the local memory controllers on the chip. The paper assumes a fixed address-to-controller mapping, i.e. a hash function maps each physical address to a controller, and this mapping never changes. Each socket also has an interfacing component, called the gateway, which is responsible for forwarding local reads and writes to remote sockets. All remote traffic must go through this component. As we will see later, the gateway is critical to reducing storage overhead when the system scales up.
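As a tiny, purely illustrative sketch of such a fixed mapping (the hash function below is an assumption, not the one the paper uses):

```cpp
// A fixed address-to-controller mapping: a physical address always hashes to
// the same memory controller. Here the "hash" is just the cache-line index
// modulo the controller count, chosen only for illustration.
#include <cstdint>
#include <cstdio>

constexpr uint64_t kCacheLineBits  = 6;   // 64-byte cache lines
constexpr uint64_t kNumControllers = 4;   // controllers in the system (assumed)

uint64_t controller_of(uint64_t phys_addr) {
    // Drop the cache-line offset, then spread lines across controllers.
    return (phys_addr >> kCacheLineBits) % kNumControllers;
}

int main() {
    std::printf("addr 0x1000 -> controller %llu\n",
                (unsigned long long)controller_of(0x1000));
}
```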

Having multiple memory controllers can introduce problems for write ordering. On today’s architectures, we enforce persist ordering using memory barriers, which consist of a sequence of cache line flush or write-back instructions followed by a store fence. The store fence retires from the pipeline only after the data written back to the NVM has been acknowledged by the memory controller; the controller guarantees the persistence of a store once it has been enqueued, even across power failures (e.g. using ADR). This ordering guarantee rests on two implicit assumptions: (1) memory controllers persist memory writes in strict FIFO order; otherwise a store executed after the sfence may be persisted before one that was enqueued before the sfence, breaking the ordering; (2) if there are multiple memory controllers, writes from the same thread must always be mapped to the same controller, because otherwise two writes separated by a persist barrier might be written back in the wrong order by two different controllers, even if each of them processes its own queue in strict FIFO order.
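For concreteness, the barrier sequence described above looks roughly like the following on x86 (CLWB plus SFENCE via compiler intrinsics; this is the conventional persist idiom, not anything Vorpal-specific, and `persist_store` is a name I made up):

```cpp
// Persist-barrier idiom on x86 (requires a CPU with CLWB support and
// compilation with -mclwb). The store is written back toward the memory
// controller with CLWB, and the SFENCE orders it before any later persists;
// with ADR, acceptance at the controller implies persistence.
#include <immintrin.h>
#include <cstdint>

void persist_store(uint64_t* pmem_field, uint64_t value) {
    *pmem_field = value;      // regular store into a pmem-mapped cache line
    _mm_clwb(pmem_field);     // write the cache line back toward NVM
    _mm_sfence();             // order it before anything persisted later
}

int main() {
    uint64_t field = 0;
    persist_store(&field, 42);   // on real persistent memory this would be durable
    return 0;
}
```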

Vorpal uses vector clocks to address the problem that hardware cannot resolve dependent writes without additional ordering information. In the baseline design of Vorpal, each core has a vector clock of size C, where C is the number of cores in the system. On core j, element i (for i != j) records the latest timestamp of core i that j has observed, i.e. core i’s timestamp the last time i and j communicated; element j is core j’s own local timestamp. The local timestamp is incremented to indicate that all stores issued after the increment may be persisted only if the stores issued before it have been. We later show how this is enforced by letting the memory controller compare timestamps.
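A minimal sketch of this per-core clock, assuming C = 8 cores (all names here are illustrative, not the paper's hardware interface):

```cpp
// Per-core vector clock: C entries, element-wise merge, and a local tick.
#include <algorithm>
#include <array>
#include <cstdint>

constexpr int C = 8;                       // number of cores (assumed)

struct VectorClock {
    std::array<uint32_t, C> ts{};          // ts[i]: latest known timestamp of core i

    // Element-wise maximum with another clock (e.g. a lock's clock).
    void merge(const VectorClock& other) {
        for (int i = 0; i < C; ++i) ts[i] = std::max(ts[i], other.ts[i]);
    }
    // Local increment on core `self`: stores issued afterwards may persist
    // only after stores issued before the tick.
    void tick(int self) { ++ts[self]; }
};
```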

Store instructions are tagged with a timestamp, namely the current vector clock of the processor that issues the store. Memory controllers read and compare these timestamps to determine which store should persist first. Every memory controller also maintains a vector clock, which is updated (merged) with the timestamps of the stores it has processed. Memory controllers periodically broadcast their local vector clocks to other controllers in order to guarantee progress. This is necessary because each memory controller only sees a subset of all stores generated by the processors; its local vector clock is therefore a merge of only the timestamps in that subset, and its value might be smaller than the “logical” vector clock that would result from merging the timestamps of all writes issued by all processors. Locks used as synchronization variables are also tagged with vector clocks, which are updated to the vector clock of the processor that releases them.

In general, the local timestamp of a processor j is updated in two cases. (1) An acquire operation (i.e. lock) successfully enters the critical section protected by lock L. In this case, the processor’s clock is merged with the vector clock stored in the lock variable, and the processor then increments its own counter (i.e. element j) by one. This essentially orders all following memory writes after the previously unlocked critical section, which we conservatively assume to carry a data dependency (even though a dynamic data flow may not actually exist). (2) The clock is also updated when a persist barrier is executed. In this case, the processor simply increments its own counter in the local clock, which orders all writes before the barrier against all writes after it.

On the memory controller side, once a store has been processed, the controller updates its clock by merging the store’s clock with its own local clock. The controller then selects the next store to process (issued by, say, core i) based on two rules: (1) every element except element i is smaller than or equal to the corresponding element in the controller’s local clock; (2) element i equals the corresponding element in the controller’s clock plus one. This guarantees that no pending store has a timestamp that is smaller than the selected store’s timestamp while still being larger than the “abstract” timestamp, i.e. the merge of the timestamps of all stores that have already been processed. In practice, since the controller only maintains the merge of a subset of all timestamps, the actual value of the controller’s local clock might be smaller than this “abstract” timestamp. This is exactly why memory controllers need to broadcast their local clocks to each other: on receiving such a broadcast, a controller merges the broadcast value into its own clock, after which the selection process is re-attempted.
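A hedged sketch of these rules, with the clock type repeated so the snippet stands alone; `acquire`, `persist_barrier`, and `is_next` are my names for the rule-(1)/rule-(2) updates and the controller-side selection test, not the paper's interface:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

constexpr int C = 8;                      // number of cores (assumed)
using Clock = std::array<uint32_t, C>;

void merge(Clock& into, const Clock& from) {
    for (int i = 0; i < C; ++i) into[i] = std::max(into[i], from[i]);
}

struct Core {
    int id;
    Clock clock{};
    // Rule (1): on acquiring lock L, merge with L's clock, then tick element id.
    void acquire(const Clock& lock_clock) { merge(clock, lock_clock); ++clock[id]; }
    // Rule (2): a persist barrier just ticks the local element.
    void persist_barrier() { ++clock[id]; }
};

// Controller-side selection test for a store tagged `store_ts`, issued by
// core i, against the controller's local clock `ctrl`: every element other
// than i must be <= the controller's, and element i must be exactly one more.
bool is_next(const Clock& store_ts, const Clock& ctrl, int i) {
    for (int k = 0; k < C; ++k) {
        if (k == i) { if (store_ts[k] != ctrl[k] + 1) return false; }
        else        { if (store_ts[k] >  ctrl[k])     return false; }
    }
    return true;
}
```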

Vector clocks guarantee that if a clock-incrementing event A happens before another event B, then the vector clock value at A is smaller than the value at B (the reverse, however, is not necessarily true). In our case, if critical section X acquires a lock released by another critical section Y, then all stores in X have vector clock values larger than those of stores in Y. For vector clocks a and b, we say a <= b if and only if every element of a is smaller than or equal to the element of b at the same index, and a < b if in addition a != b. Merging two clocks simply takes the larger of the two elements at each index and puts it into that index of the output vector.
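In code form, the ordering and merge just described might look like this (assumed names; `C` is again the core count, and clocks ordered in neither direction are concurrent):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

constexpr int C = 8;
using Clock = std::array<uint32_t, C>;

// a <= b: every element of a is <= the matching element of b.
bool leq(const Clock& a, const Clock& b) {
    for (int i = 0; i < C; ++i) if (a[i] > b[i]) return false;
    return true;
}
// a < b: a <= b and the clocks differ.
bool less(const Clock& a, const Clock& b) { return leq(a, b) && a != b; }

// Merge: element-wise maximum.
Clock merge(const Clock& a, const Clock& b) {
    Clock out{};
    for (int i = 0; i < C; ++i) out[i] = std::max(a[i], b[i]);
    return out;
}
```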

The paper proposes three slightly different Vorpal schemes based on the above baseline. The first scheme, Vorpal0, requires the least hardware change but introduces unnecessary write orderings. In Vorpal0, in addition to the two local-clock-increment rules above, we add a third: (3) the local clock is also incremented on every persistent store instruction, serializing the persistence of all stores even within a critical section. Despite the unnecessary serialization, this is the only way to ensure persist correctness without extra timestamp-tracking logic. Imagine the case where timestamps were not incremented for stores within the same critical section: all stores in the critical section would then carry the same timestamp. If one of these stores is delayed in the cache hierarchy (e.g. due to cache coherence), while a later store with a larger timestamp (from another critical section) arrives at the memory controller, the controller is not aware of the delayed store and will proceed to persist the store from the next critical section, violating the ordering guarantee provided by vector clocks.
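A tiny sketch of rule (3), assuming the core tags each persistent store after bumping its own element (types and names are illustrative):

```cpp
#include <array>
#include <cstdint>

constexpr int C = 8;

struct TaggedStore {
    uint64_t addr;
    uint64_t value;
    std::array<uint32_t, C> ts;      // timestamp carried with the store
};

TaggedStore issue_persistent_store(std::array<uint32_t, C>& core_clock,
                                   int core_id, uint64_t addr, uint64_t value) {
    ++core_clock[core_id];           // rule (3): tick on every persistent store
    return TaggedStore{addr, value, core_clock};
}
```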

A slightly improved design, Vorpal-chunk, adds extra hardware to ensure that the above scenario does not happen. In this scheme, processors divide stores into “chunks”. A chunk boundary is inserted when: (1) the current store will be processed by a different controller than the previous store; (2) on a successful lock acquire or a persist barrier; or (3) the processor’s store buffer is full. The local clock is also incremented when a new chunk is created. All stores within a chunk carry the same timestamp. The processor buffers the chunk and keeps appending stores to the buffer until a boundary is inserted, at which time the chunk is sent to the memory controller. Each chunk also carries metadata: the vector clock of the chunk, the number of stores it contains, the next chunk’s vector clock, and the ID of the controller that will process the next chunk (these fields are filled in only when the current chunk terminates; their uses are explained later). This metadata is sent to the memory controller along with the chunk. On receiving a chunk, the controller inserts the chunk and its metadata into a lookup table; then, for every store processed under the chunk’s vector clock value, the store counter is decremented. The controller’s local clock, however, is left unchanged (to avoid selecting stores from after the critical section / barrier). Stores within a chunk can be scheduled freely to best exploit the hardware. When the counter reaches zero, the controller removes the metadata entry from its lookup table and updates its local clock.
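A sketch of the per-chunk bookkeeping on the controller side, under the assumption that chunks are identified by some ID and the metadata fields match the list above (all structure and function names are mine):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <unordered_map>

constexpr int C = 8;
using Clock = std::array<uint32_t, C>;

struct ChunkMeta {
    Clock clock;            // timestamp shared by every store in the chunk
    uint32_t num_stores;    // how many stores the chunk contains
    Clock next_clock;       // clock of the chunk that follows this one
    int next_controller;    // controller that will process the next chunk
};

struct Controller {
    Clock clock{};
    std::unordered_map<uint64_t, ChunkMeta> open_chunks;   // chunk ID -> metadata

    void receive_chunk(uint64_t chunk_id, const ChunkMeta& meta) {
        open_chunks[chunk_id] = meta;
    }
    // Called once per store drained for this chunk.
    void store_drained(uint64_t chunk_id) {
        ChunkMeta& m = open_chunks[chunk_id];
        if (--m.num_stores == 0) {
            // Chunk complete: only now fold the chunk's clock into the local
            // clock, so no store from a later critical section / epoch is
            // selected early.
            for (int i = 0; i < C; ++i) clock[i] = std::max(clock[i], m.clock[i]);
            open_chunks.erase(chunk_id);
        }
    }
};
```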

The last scheme, Vorpal-chunk+, further optimizes over Vorpal-chunk to enable even higher parallelism. Instead of creating a new chunk on every controller switch, which still effectively serializes chunks sent to different controllers, the processor may now create chunks whose processing is distributed across controllers. In this scheme, no chunk boundary is inserted when we switch to a new controller, and the vector clock is not incremented. The metadata and buffered stores, however, are sent to the controller when a switch happens, and the metadata (e.g. the next controller ID) is updated after flushing the buffer. The last chunk of a critical section / persist epoch is also marked. When a controller receives a chunk, it processes it as described in the previous paragraph. When the chunk is finished, instead of updating its local vector clock, the controller sends a signal to the controller indicated by the metadata field “next controller ID” as a notification that the previous chunk has committed. On receiving this commit message, a controller commits its local chunk and then forwards the same message to the next controller in the chain. After the last controller in the chain receives the message, the entire chain of chunks is committed. All controllers in this process need to update their local clocks by merging in the current chunk’s timestamp.
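A simplified sketch of the commit chain, modeling the controller-to-controller message as a direct walk over the chain and omitting the handshake that waits for each controller's own stores to drain (all names are assumptions):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

constexpr int C = 8;
using Clock = std::array<uint32_t, C>;

struct ChainController {
    Clock clock{};
    int next = -1;                   // next controller in the chain, -1 if last
};

// Deliver the commit message starting at controller `id`: each controller in
// turn commits its piece (merges the chunk clock) and forwards the message.
void propagate_commit(std::vector<ChainController>& ctrls, int id,
                      const Clock& chunk_clock) {
    while (id != -1) {
        ChainController& c = ctrls[id];
        for (int i = 0; i < C; ++i) c.clock[i] = std::max(c.clock[i], chunk_clock[i]);
        id = c.next;                 // forward to the next controller in the chain
    }
}

int main() {
    std::vector<ChainController> ctrls(3);
    ctrls[0].next = 1; ctrls[1].next = 2;          // chain: 0 -> 1 -> 2
    Clock chunk_clock{}; chunk_clock[0] = 5;
    propagate_commit(ctrls, 0, chunk_clock);       // all three merge the clock
}
```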