PMEM-Spec: Persistent Memory Speculation



Paper Title: PMEM-Spec: Persistent Memory Speculation
Link: https://dl.acm.org/doi/abs/10.1145/3445814.3446698
Year: ASPLOS 2021
Keyword: NVM; PMEM-Spec; Speculation




Highlights:

  1. Programming would be much easier if persistency order and consistency order agreed. This can be achieved by making the hierarchy write-through: a copy of every write is sent to the NVM controller at the same time the write is applied to the L1 cache. The LLC then drops the data of dirty evictions to the NVM, since the data has already been sent.

  2. Since writes are write-through, in most cases the consistency and persistency orders are identical without any special enforcement. We can optimistically assume this, since it covers the majority of cases. A special mechanism, meanwhile, is added to detect anomalies when the two disagree (i.e., memory consistency suggests one ordering, but the NVM sees another).

  3. During the window between the store buffer draining a write and the NVM controller receiving it, both read-after-write and write-after-write violations may occur. The former is detected by monitoring a special access pattern, and the latter by ordering writes with critical sections' serial numbers (speculation IDs).

Comments:

  1. The paper does not say how write-write ordering violations are detected. Does the controller maintain per-block metadata on the last received speculation ID (unlikely)? Or is there one global speculation ID? If it is tracked as a single ID, the chance of false positives would be huge, since it essentially assumes data dependencies between all writes in different critical sections.

  2. If the speculation ID is maintained per-process, what happens on a context switch? Should the NVM controller be aware of the switch and load a new "last-seen" speculation ID for the process just switched in? On the other hand, if there is only one global speculation ID (hard for the compiler to maintain), a context switch can still be a problem, since the ID after switching back may not be the same as the one before the switch.

This paper proposes PMEM-Spec, a hardware mechanism for enforcing write ordering between the memory consistency order and the order seen by the NVM device. The paper points out that write ordering is critical for NVM programming, since the order of memory operations performed on the hierarchy often differs from the order in which the NVM device sees them. This is caused by the fact that the cache hierarchy can arbitrarily evict blocks back to the NVM, while programmers have only very limited control over the lifespan of blocks in the hierarchy. Current x86 platforms provide two primitives: clflush, which flushes a dirty block back to the backing memory, and sfence, which orders cache flushes. A software persist barrier is formed by issuing clflush for each dirty line followed by an sfence, establishing write ordering between writes before and after the barrier.
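
For concreteness, here is a minimal sketch of such a software persist barrier using the standard x86 intrinsics; the 64-byte cache-line size and the helper name are assumptions for illustration.

```cpp
#include <emmintrin.h>  // _mm_clflush, _mm_sfence
#include <cstddef>
#include <cstdint>

// Flush every cache line covering [addr, addr + size) back to persistent
// memory, then fence so that later stores are ordered after the flushes.
inline void persist_barrier(const void *addr, std::size_t size) {
    constexpr std::size_t kLineSize = 64;  // assumed cache-line size
    auto p = reinterpret_cast<std::uintptr_t>(addr);
    for (std::uintptr_t line = p & ~(kLineSize - 1); line < p + size;
         line += kLineSize) {
        _mm_clflush(reinterpret_cast<const void *>(line));
    }
    _mm_sfence();  // order subsequent stores after the flushes above
}
```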

Software barriers, however, are far from optimal. Being implemented in software, they stall the processor with fence instructions while previous stores are being persisted. They also restrict parallelism, since a thread can have at most one outstanding barrier, limiting how much of the NVM device's hardware-level parallelism can be exploited.

Previous hardware proposals, on the other hand, solve this issue by allowing multiple "strands" of persist barriers to be active for a single thread at the same time, or by decoupling memory consistency (i.e., the visibility of a modification from the perspective of other threads) from NVM consistency (i.e., when the NVM device sees a dirty block), such that persistence is moved to the background without stalling the core pipeline. These proposals optimize for efficiency, but the paper points out that they often bring excessively complicated hardware and are intrusive to the existing hierarchy, modifying the coherence protocol. The paper also points out that more complicated persistency models require programmers to annotate the source code, which in turn requires programmers to learn these models and understand their subtleties.

PMEM-Spec adopts a different approach: instead of proactively enforcing agreement between memory consistency and NVM consistency, it optimistically assumes that the two will be equivalent in the majority of cases. In other words, it speculates under the assumption that the two orderings are always identical. In the rare case where they are not, the hardware detects the violation and interrupts the offending process to report the ordering issue. The process then rolls back the current transaction or FASE (failure-atomic section) and restarts.

We next introduce the details of the design. PMEM-Spec adds one more data path from the processor to the NVM device. On a conventional platform, cache blocks are brought into the L1 cache on write misses, updated, and then written back via either natural evictions or clflush instructions. PMEM-Spec, on the contrary, demands that any write performed at the L1 level be sent to the NVM controller immediately, without using the ordinary data path. This is done by sending the write data to the controller via the on-chip network when the write retires from the store buffer (i.e., when it is drained into the L1 cache). Dirty evictions from the LLC will then be ignored by the NVM controller (more precisely, only the data is discarded; the eviction message itself serves an important purpose and must still be sent by the LLC). There are two important things to note in this process. First, this essentially makes the hierarchy write-through, since all writes are directly sent to the NVM controller. The dirty bit, however, must still be preserved, as it is used to track the lifespan of dirty blocks, as we will see later. Second, although a store is drained from the store buffer to the L1 cache and to the NVM controller at the same time, this process is not atomic, i.e., other cores may observe the short window of inconsistency in which the update has been applied to the L1 but has not yet reached the controller. This opens the possibility of mis-speculation where memory consistency differs from NVM consistency, which is the topic of the following paragraphs.
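
The following is a schematic software model of this extra data path; the paper describes the mechanism only at the microarchitecture level, so all structure and function names here are illustrative. A store drain updates the L1 and forwards a copy to the controller, while an LLC eviction delivers only the address.

```cpp
#include <cstdint>
#include <cstdio>

// Schematic model of the NVM controller's two incoming message types
// (illustrative names, not taken from the paper).
struct NvmController {
    // Direct-write message: carries the data and is the only path that
    // actually updates NVM contents.
    void on_direct_write(std::uint64_t addr, std::uint64_t data) {
        std::printf("NVM write  addr=%#llx data=%#llx\n",
                    (unsigned long long)addr, (unsigned long long)data);
    }
    // LLC eviction: the data payload is dropped, only the address is kept
    // so the controller can watch for load-after-eviction violations.
    void on_llc_eviction(std::uint64_t addr) {
        std::printf("NVM evict  addr=%#llx (data discarded)\n",
                    (unsigned long long)addr);
    }
};

// Called when a store drains from the store buffer into the L1 cache:
// the write is applied to L1 as usual, and a copy is forwarded over the
// NoC to the NVM controller, making the hierarchy effectively write-through.
void drain_store(NvmController &nvm, std::uint64_t addr, std::uint64_t data) {
    // l1_cache.write(addr, data);      // normal L1 update (sets dirty bit)
    nvm.on_direct_write(addr, data);    // extra direct-write message
}
```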

The first type of consistency problem occurs when a load from another processor misses the entire hierarchy and accesses the NVM. If the load happens after the store is applied to the L1, but before the corresponding direct-write message is received by the NVM controller, the load reads a stale value from the NVM device while the most up-to-date value is still in flight to the controller. Although this is extremely unlikely, since the block must also be evicted from the entire hierarchy before the direct write arrives at the NVM controller, the chance is still not zero. In this case, the load from the other core observes a stale version of the data: according to memory consistency the load happens after the store, but the NVM access ordering disagrees.

The second type of inconsistency occurs when two writes to the same address are performed by different cores (thus forming a write-after-write dependency, which is serialized by the coherence protocol), but their corresponding direct-write messages arrive at the NVM controller in the wrong order. In this case, memory consistency sees one order but the NVM controller sees the other, resulting in a lost write, since the write that is earlier in memory consistency order is performed after the later one on the NVM.

The first type of problem is solved by monitoring dirty block evictions at the controller. For such a rare case to happen, an eviction must occur, then a load to the same address must miss the hierarchy, and only after that does the direct-write message to the same address arrive. The NVM controller therefore monitors all eviction events from the LLC (recall that an eviction only sends the message to the controller without data, since the data is delivered by the direct-write message). If a load later accesses the same address, followed by a direct write, the NVM controller decides that a violation has occurred and informs the processor via a hardware interrupt. This is implemented by a small buffer on the controller that remembers recent eviction addresses. Each entry of the buffer maintains a state variable that recognizes the pattern "eviction, read, write"; an interrupt is fired when the pattern is matched.

Entries in the controller buffer do not stay indefinitely, to avoid false positives, since not all traces matching the pattern are violations. The design further exploits the observation that there is a maximum NoC latency between the store buffer and the controller. When an entry is created, the controller only needs to monitor for the pattern for this amount of time, since after that the violation can no longer occur; the entry then expires and is removed.
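
A possible software sketch of this detector is shown below. The field names and the timeout constant are assumptions for illustration; the paper only specifies the eviction-read-write pattern and the bounded monitoring window derived from the maximum NoC latency.

```cpp
#include <cstdint>
#include <unordered_map>

// Per-address violation detector at the NVM controller (illustrative).
class LoadViolationDetector {
    enum class State { Evicted, EvictedThenRead };
    struct Entry { State state; std::uint64_t expires_at; };
    std::unordered_map<std::uint64_t, Entry> entries_;   // keyed by block address
    static constexpr std::uint64_t kMaxNocLatency = 200; // cycles, assumed

public:
    // A dirty LLC eviction arrives (address only): start watching this block.
    void on_eviction(std::uint64_t addr, std::uint64_t now) {
        entries_[addr] = {State::Evicted, now + kMaxNocLatency};
    }
    // An NVM read that missed the whole hierarchy: advance the pattern.
    void on_nvm_read(std::uint64_t addr, std::uint64_t now) {
        auto it = entries_.find(addr);
        if (it != entries_.end() && now < it->second.expires_at &&
            it->second.state == State::Evicted)
            it->second.state = State::EvictedThenRead;
    }
    // A direct-write message arrives; returns true if the full
    // eviction-read-write pattern matched, i.e., the earlier read was stale.
    bool on_direct_write(std::uint64_t addr, std::uint64_t now) {
        auto it = entries_.find(addr);
        bool violation = it != entries_.end() && now < it->second.expires_at &&
                         it->second.state == State::EvictedThenRead;
        if (it != entries_.end()) entries_.erase(it);  // entry consumed or expired
        return violation;  // caller raises a hardware interrupt on true
    }
};
```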

The second type of problem is solved by tagging writes with a version number consistent with the data dependency order. The paper assumes that the application is data-race free and synchronized with exclusive locks: writes to the same data item are always mutually exclusive, i.e., wrapped within the same lock. This way, the NVM controller can infer the ordering of the writes it receives from the ordering of lock acquisitions. To this end, the paper proposes that the compiler maintain a per-process global version number (called the speculation ID) shared by all critical sections in the application, which is incremented every time a critical section is entered. The version number is also loaded into a special register by a new instruction after lock acquisition. Every store operation is tagged with the version number before being sent to the NVM controller. The NVM controller remembers the version of the last write received, and if a newly received write message carries a lower version number, an interrupt is raised to the processor that sent it, since the write arrived late and would otherwise overwrite a logically later write.
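
Below is a rough sketch of how the speculation-ID tagging and the controller-side check could work, under the assumed interpretation that the controller keeps a single last-seen ID; as noted in the comments above, the paper leaves this detail open, and all names here are illustrative.

```cpp
#include <atomic>
#include <cstdint>

// Per-process speculation ID, bumped on every critical-section entry.
std::atomic<std::uint32_t> g_spec_id{0};
thread_local std::uint32_t tl_spec_id = 0;

void on_lock_acquire() {
    // In PMEM-Spec this value would be written into a special core register
    // by a new instruction; here it is modeled as a thread-local variable.
    tl_spec_id = g_spec_id.fetch_add(1) + 1;
}

// Controller-side write-after-write check (single global last-seen ID,
// which is only one possible reading of the paper).
struct NvmControllerWawCheck {
    std::uint32_t last_seen_id = 0;
    // Returns true if the write arrived out of order (potential lost update);
    // the caller would raise an interrupt to the issuing core.
    bool on_direct_write(std::uint32_t spec_id) {
        if (spec_id < last_seen_id) return true;
        last_seen_id = spec_id;
        return false;
    }
};
```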

On receiving the interrupt indicating an ordering violation, the OS forwards a signal to the violating process based on the physical address mapping (to multiple processes if they share physical pages and the violation happens on a shared address). The paper assumes that the threads within the violating process have a way of coordinating with each other when a violation occurs. For example, the thread that handles the signal may simply set a flag that all other threads check at commit time (lazy resolution). Threads then roll back the transaction or FASE with runtime support (most runtimes already provide such support for recovery or transactional abort). Alternatively, the signal-handling thread may broadcast the signal to all other threads, forcing the entire application to stop and recover, which fully reuses the existing recovery mechanism at the cost of higher overhead. This approach is called a "virtual crash": the application behaves as if a real crash had happened and performs crash recovery with existing runtime support.
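
A minimal sketch of the lazy-resolution option, assuming a POSIX signal and illustrative names (the paper only requires that some coordination mechanism exists between the threads):

```cpp
#include <atomic>
#include <csignal>

// Set by the signal handler when the OS reports a mis-speculation.
std::atomic<bool> g_violation_flag{false};

void violation_signal_handler(int /*signo*/) {
    // Async-signal-safe: just record that a violation was detected.
    g_violation_flag.store(true, std::memory_order_relaxed);
}

// Checked by every thread when it tries to commit its transaction/FASE;
// on a detected violation the thread aborts and retries instead.
bool try_commit_transaction() {
    if (g_violation_flag.load(std::memory_order_relaxed)) {
        // Roll back with existing runtime support, then retry the FASE.
        return false;
    }
    // ... normal commit path ...
    return true;
}

int main() {
    std::signal(SIGUSR1, violation_signal_handler);  // signal number is an assumption
}
```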