Efficient Persist Barriers for Multicores



Paper Title: Efficient Persist Barriers for Multicores
Link: http://homepages.inf.ed.ac.uk/vnagaraj/papers/micro2015.pdf
Year: MICRO 2015
Keyword: NVM; persist barrier; persistency model




This paper proposes a new implementation of the persist barrier, a common programming construct for NVM programming. Prior work has proposed four persistency models with different semantics and constraints.

The first and most straightforward persistency model is strict persistency, in which the order that memory operations persist must follow the order in which they become visible; in other words, the persistency model follows the consistency model. To achieve this, processors must not make store operations visible to other processors via coherence before those operations are persisted to the NVM, which essentially makes caches write-through. This model suffers from performance issues for three reasons. First, writes must traverse the entire multi-level cache hierarchy, or bypass it and write directly into the NVM, before they can become visible. Second, NVM writes are typically much slower than local cache writes; the latency of store operations is long even compared with uncached memory writes. Third, since the pipeline must not reorder persistent stores with other stores, there is little chance that NVM writes can be coalesced or batched to take advantage of the internal scheduling of the memory controller.

The second model is epoch persistency, which relaxes strict persistency by allowing store operations to persist in a different order than the one in which they become visible. A special memory ordering primitive, the epoch barrier, is inserted into the instruction stream, with the requirement that store operations before the barrier persist before any store operation after the barrier. The easiest implementation of epoch persistency is to track the dirty cache lines accessed within an epoch; at the end of the epoch, these dirty lines are forced back to the NVM using special instructions such as clwb. Regular store barriers may also need to be inserted to enforce correct ordering between epochs. The processor meanwhile must stall and wait for persistence to complete before executing past the barrier. Epoch persistency overcomes some of the undesirable properties of strict persistency, such as long latency on the store critical path. It is, however, still inefficient, since the processor remains idle while a finished epoch is being persisted.

The third persistency model is buffered epoch persistency (BEP), which is essentially the same as epoch persistency, except that processors can continue executing the next epoch while the previous epoch is being written back to the NVM. BEP completely decouples memory persistence from write visibility: a memory operation can become visible to other processors arbitrarily earlier than it becomes persistent. In the non-buffered epoch persistency model this is impossible, because a memory operation is guaranteed to be persistent by the time the processor executes the next epoch barrier. Although epochs are no longer persisted immediately when the corresponding epoch barriers are executed, the correct ordering must still be maintained; this paper identifies two sources of epoch ordering, which are discussed later.

The last persistency model is buffered strict persistency (BSP). BSP differs from all of the above in that it requires atomic execution of epochs. Execution is divided into epochs, each of which is the atomic unit at which memory instructions become both visible and persistent. Within an epoch, no coherence messages are sent on writes, implying that both the visibility and the persistence of those writes are postponed to the end of the epoch.
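To make the epoch persistency mechanism concrete, here is a minimal sketch of a software persist barrier on x86, assuming a CLWB-capable CPU (compiled with, e.g., -mclwb) and a buffer mapped in a persistent memory region; persist_barrier and append_record are illustrative names, not an API from the paper.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

constexpr size_t kLine = 64;  // cache line size

// Write back the lines covering [addr, addr+len) and order later stores
// after the write-backs: stores before this call persist before any after.
inline void persist_barrier(const void* addr, size_t len) {
    uintptr_t p   = reinterpret_cast<uintptr_t>(addr) & ~(kLine - 1);
    uintptr_t end = reinterpret_cast<uintptr_t>(addr) + len;
    for (; p < end; p += kLine)
        _mm_clwb(reinterpret_cast<void*>(p));  // write line back toward NVM
    _mm_sfence();  // stall until flushes are ordered; the epoch is durable
}

void append_record(uint64_t* pmem, uint64_t value) {
    pmem[1] = value;                // epoch A: write the payload
    persist_barrier(&pmem[1], 8);   // A must persist before B starts
    pmem[0] = 1;                    // epoch B: publish the commit flag
    persist_barrier(&pmem[0], 8);
}
```

Each sfence stalls the core until the flushed lines are durable, which is exactly the idle time that buffered epoch persistency tries to eliminate.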

This paper focuses on providing an efficient hardware implementation of buffered epoch persistency. Traditionally, BEP is implemented using a lazy scheme: dirty cache lines are not evicted from the cache when an epoch completes. These dirty lines only leave the LLC for two reasons: either the LLC chooses the line for eviction as per its eviction policy, or the line observes an epoch conflict and must be forced back to NVM. The paper identifies two cases of epoch conflicts: intra-thread conflicts and inter-thread conflicts. An intra-thread conflict occurs when a later epoch in a thread writes, for the first time, to a dirty cache line produced by an earlier epoch of the same thread. If epoch B overwrites epoch A's dirty data that has not yet been persisted, the store instruction cannot proceed before all cache lines of epoch A are persisted. To see why, assume the store in epoch B overwrites the line, and the system then crashes before epoch A is fully persisted. In the post-crash image, a store from epoch B can be observed, which violates the ordering constraint imposed by the persist barrier: stores from epoch B must not persist until all cache lines of epoch A have been persisted. Similarly, an inter-thread conflict occurs when an epoch B on thread Y reads from or writes to a dirty cache line produced by an epoch A on thread X. The semantics of the persist barrier require that epoch B persist after epoch A. Imagine that epoch B persists before epoch A and the system crashes before A is persisted. In the post-crash image, epoch B contains a value generated or derived from epoch A, but the corresponding line in epoch A was never persisted and is lost in the crash. This outcome is inconsistent with any normal execution, because epoch B contains a value that comes from nowhere. This scheme is called "online flushing", because an epoch's cache lines are only flushed back to the NVM on demand, and the processor must stall until the flush completes.
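As a concrete illustration of the inter-thread case, consider two threads sharing persistent variables; the epoch_begin()/epoch_end() markers in the comments below are hypothetical stand-ins for persist barriers, not the paper's interface.

```cpp
#include <atomic>

// Both variables live in NVM. Epoch B on thread Y consumes a value produced
// by epoch A on thread X, so B must not persist before A: a crash after B
// persists but before A does would leave `ready` set while the `data` it
// was derived from is lost.
std::atomic<int> data{0};
std::atomic<int> ready{0};

void thread_X() {
    // epoch_begin();  (hypothetical marker)
    data.store(42, std::memory_order_release);  // epoch A dirties this line
    // epoch_end();    A's lines persist before X's next epoch
}

void thread_Y() {
    // epoch_begin();
    if (data.load(std::memory_order_acquire) == 42)  // reads A's dirty line
        ready.store(1, std::memory_order_relaxed);   // epoch B now depends on A
    // epoch_end();    B's lines must persist only after A's do
}
```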

In order to enforce epoch dependencies caused by inter- or intra-thread conflicts, traditional implementations stall the destination processor when such a dependency is about to form, and initiate an epoch flush on the source processor. This adds a significant number of cycles to the critical path of the destination processor if the source epoch has a large working set. This paper addresses the problem using a technique called proactive flush: instead of flushing epochs only on demand, the epoch flushing process is initiated as soon as an epoch completes. To support this, every cache line in the hierarchy is extended with a global epoch ID, which consists of two fields: a local epoch ID and a thread ID. The local epoch ID is assigned in a thread-local manner by the thread that executes an application using epoch barriers; it is implicitly assumed that epochs within the same thread commit in the numeric order of their epoch IDs. Thread IDs are unique identifiers of threads assigned by the operating system. The global epoch ID of the currently executing epoch is specified as an operand to the epoch barrier primitive, and can be generated and maintained by the NVM library. During the execution of an epoch, if a "fresh" cache line (i.e., one without an assigned global epoch ID) is modified, the line's global epoch ID is set to that of the writing thread's current epoch. After a write-back, i.e., once the content of the cache line has been written back to the NVM and the line becomes clean, the global epoch tag is cleared, because dependencies on an already committed epoch no longer need to be tracked. The next paragraph focuses on detecting and tracking the two kinds of conflicts discussed above.
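A minimal sketch of the per-line tag and its lifetime follows; field widths and names here are illustrative, not taken from the paper.

```cpp
#include <cstdint>

// Global epoch ID = (thread ID, thread-local epoch ID), as described above.
struct GlobalEpochID {
    uint16_t thread_id;       // unique ID assigned by the OS
    uint16_t local_epoch_id;  // assigned per thread, commits in numeric order
};

// Per-cache-line metadata added by the scheme.
struct CacheLineMeta {
    bool          dirty  = false;
    bool          tagged = false;  // has a global epoch ID been assigned?
    GlobalEpochID epoch{};         // meaningful only while dirty and tagged
};

// On a store: a "fresh" line (no assigned ID) takes the writer's epoch ID.
void on_store(CacheLineMeta& line, GlobalEpochID writer) {
    line.dirty = true;
    if (!line.tagged) {
        line.epoch  = writer;
        line.tagged = true;
    }
}

// On write-back to NVM: the line becomes clean, so the tag is cleared;
// dependencies on an already committed epoch need no further tracking.
void on_writeback(CacheLineMeta& line) {
    line.dirty  = false;
    line.tagged = false;
}
```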

In order to track conflicts, each physical core has two tables: a dependent table and a source table. The source table stores the global epoch IDs of dependency sources, i.e., epochs after which the current epoch must commit. Symmetrically, the dependent table stores the global epoch IDs of dependents, i.e., epochs that must commit after the current epoch. Each table has only a limited number of entries; if an entry must be allocated while the table is full, the processor falls back to the online flushing scheme described above. To detect both types of conflicts, it suffices for the processor to check the global epoch ID tag of a cache line on certain accesses. For intra-thread conflicts, on a store instruction the processor checks whether the line's thread ID is identical to the current thread ID but its epoch ID is smaller than the current epoch's (it should never be larger in normal execution). If so, an intra-thread dependency is formed, and the newer epoch must stall and wait for the older epoch to be flushed. For inter-thread conflicts, the processor checks whether the thread IDs differ. If so, a dependency from the older epoch to the newer epoch is formed by inserting an entry into the dependent table and another entry into the source table, both entries recording the newly added dependency edge. Note that although the paper does not spell out how the source processor is located when the two conflicting threads run on different cores, my best guess is that the core ID of the writing processor should also be tracked in the cache line tags. The core ID would not be part of the global epoch ID, but when an inter-thread conflict is detected, it could be used to identify the source processor of the dependency.
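The following sketch shows the conflict classification and a fixed-capacity table with the online-flushing fallback; the structure and function names are illustrative, and the capacity is an assumption, not the paper's.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct GlobalEpochID { uint16_t thread_id; uint16_t local_epoch_id; };

enum class Conflict { None, IntraThread, InterThread };

// Classify a store by `writer` against the tag of a dirty line: same thread
// plus an older line epoch is an intra-thread conflict; a different thread
// is an inter-thread conflict.
Conflict classify(GlobalEpochID line, GlobalEpochID writer) {
    if (line.thread_id == writer.thread_id)
        return line.local_epoch_id < writer.local_epoch_id
                   ? Conflict::IntraThread : Conflict::None;
    return Conflict::InterThread;
}

// One of the two per-core tables; each dependency edge puts one entry in the
// source table of the dependent core and one in the dependent table of the
// source core.
struct EpochTable {
    static constexpr size_t kCapacity = 8;  // illustrative size
    std::vector<GlobalEpochID> entries;

    // Returns false when full, in which case the core falls back to the
    // online flushing scheme described above.
    bool try_insert(GlobalEpochID e) {
        if (entries.size() >= kCapacity) return false;
        entries.push_back(e);
        return true;
    }
};
```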

After dependencies between epochs have been added, the L1 cache should next flush epochs in an order consistent with these dependencies. This paper proposes proactive flushing, meaning that an epoch is flushed as soon as it completes. The process works as follows. First, every L1 cache controller tracks a FIFO list of global epoch IDs in the order the epochs started on that core. For the epoch at the head of the list, it checks the source table to see whether the epoch still depends on an uncommitted source epoch. If so, the L1 cache waits until the source of the dependency commits. Otherwise, it writes back all dirty cache lines of that epoch and notifies the LLC controller. On receiving the notification, the LLC initiates a write-back to the NVM. After writing back all dirty lines to the NVM, the LLC controller sends an ACK message back to the L1. On receiving the ACK, the L1 knows that the epoch has been persisted, and continues flushing the next epoch in the queue.
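A sketch of this proactive flush loop at one L1 controller follows, assuming the hardware hooks below (stubbed here) and the FIFO of completed epochs described above; all names are mine.

```cpp
#include <cstdint>
#include <deque>

struct GlobalEpochID { uint16_t thread_id; uint16_t local_epoch_id; };

struct L1Controller {
    std::deque<GlobalEpochID> completed;  // epochs finished on this core, FIFO

    // Stubs standing in for real hardware state and messages.
    bool has_unresolved_source(GlobalEpochID) { return false; }  // source table
    void writeback_epoch_dirty_lines(GlobalEpochID) {}           // L1 -> LLC
    void notify_llc_and_wait_ack(GlobalEpochID) {}               // LLC -> NVM

    // Flush completed epochs in order; stall the head epoch while any of its
    // dependency sources is still unpersisted.
    void drain() {
        while (!completed.empty()) {
            GlobalEpochID head = completed.front();
            if (has_unresolved_source(head))
                return;                     // wait for the source to commit
            writeback_epoch_dirty_lines(head);
            notify_llc_and_wait_ack(head);  // ACK => epoch persisted in NVM
            completed.pop_front();          // proceed to the next epoch
        }
    }
};
```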

If the LLC is partitioned across cores, as is common in modern architectures, the epoch flushing process becomes more complicated, because each LLC partition only maintains cache lines within a certain address range and is not aware of the flushing progress of the other partitions. In this case, the L1 controller initiating an epoch flush broadcasts the flush message to all LLC partitions, waits for an ACK from each of them once they have persisted all dirty lines of the epoch, and then broadcasts a second ACK back to all partitions. On receiving the second ACK, the partitions know that the epoch indicated by the message has ended, and that dirty lines from later epochs can be safely evicted or flushed. Each LLC partition also keeps track of the most recent epoch from each L1 that has been persisted. On completing an epoch flush, the L1 controller checks its dependent table to see whether the just-persisted epoch is the source of any dependency edge. If so, it sends messages to the other L1 controllers to notify them of the completion of the source epoch. On receiving such a notification, the receiving L1 controller removes the corresponding entry from its source table and unblocks any epoch flushes waiting on the completed epoch.
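With a banked LLC, the two-phase handshake might look roughly like the sketch below; the broadcast and ACK primitives are stubs, and their names are my own, not the paper's.

```cpp
#include <cstdint>
#include <vector>

struct GlobalEpochID { uint16_t thread_id; uint16_t local_epoch_id; };

struct L1FlushInitiator {
    std::vector<int> llc_banks;  // IDs of all LLC partitions

    // Message stubs; a real design sends these over the on-chip network.
    void send_flush(int bank, GlobalEpochID e) {}  // "persist e's dirty lines"
    void wait_ack(int bank) {}                     // bank finished persisting
    void send_done(int bank, GlobalEpochID e) {}   // second ACK: e has ended

    void flush_epoch(GlobalEpochID e) {
        for (int b : llc_banks) send_flush(b, e);  // phase 1: broadcast flush
        for (int b : llc_banks) wait_ack(b);       // all banks persisted e
        for (int b : llc_banks) send_done(b, e);   // phase 2: e is fully done
        // Banks may now evict or flush dirty lines of later epochs, and this
        // L1 notifies other L1s whose source tables list e as a dependency.
    }
};
```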

If an LLC partition is about to evict a dirty cache line, it must be guaranteed that all epochs before the epoch in which the dirty line was written have been persisted. This is easy to check: recall that we require all LLC partitions to keep track of the most recent persisted epoch of each L1 controller. If there are unflushed epochs, the LLC controller sends a message up to the L1 controller to request a flush. Note that if the epochs to be flushed themselves have inter-thread dependencies on other L1 controllers, this flushing chain can become quite long and time-consuming, which may stall the LLC controller for a while. The paper does not appear to address this problem.
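The eviction-time check might look like the following sketch, assuming each bank records, per core, the newest persisted local epoch ID (with -1 as a "none yet" sentinel in this sketch); all names are illustrative.

```cpp
#include <unordered_map>

struct LLCBank {
    // core id -> local epoch ID of the newest epoch that core has persisted,
    // or -1 if none.
    std::unordered_map<int, int> last_persisted;

    void request_flush(int core, int up_to_epoch) {}  // message to L1; stub

    // A dirty line tagged with epoch `line_epoch` may leave the LLC for NVM
    // only once every older epoch of its core is persistent.
    void before_evict_dirty(int core, int line_epoch) {
        auto it = last_persisted.find(core);
        int persisted = (it == last_persisted.end()) ? -1 : it->second;
        if (persisted < line_epoch - 1)
            request_flush(core, line_epoch - 1);  // may trigger a long chain
    }
};
```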

Deadlock may occur if we allow dependencies to form without constraint. For example, suppose epoch A reads a cache line generated by a concurrent epoch B on another core, and later epoch B reads a cache line from A. Each epoch then records the other as both a source and a dependent, which prevents either from completing. To prevent dependency cycles from forming (in theory, a cycle can involve far more than two epochs, which makes it harder to detect), an epoch is broken into two sub-epochs: on receiving a conflicting access from another epoch, the source epoch immediately ends, and a new epoch is started. The original epoch semantics are preserved, because breaking one epoch into two does not weaken the ordering guarantees. In addition, no dependency cycle can form, because each epoch is then guaranteed to have at most one outgoing edge, located at its end.
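Finally, a minimal sketch of the sub-epoch split, assuming a per-core epoch counter; the invariant that rules out cycles is stated in the comment.

```cpp
#include <cstdint>

struct CoreEpochState {
    uint32_t current_local_epoch = 0;

    // Invoked when a remote epoch reads or writes one of this core's dirty
    // lines. Ending the source epoch at the first conflicting access means
    // each (sub-)epoch has at most one outgoing dependency edge, placed at
    // its very end, so no dependency cycle can ever close through it.
    void on_incoming_conflict() {
        ++current_local_epoch;  // execution continues in a fresh sub-epoch
    }
};
```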