(Almost) Fence-Less Persistent Ordering



Paper Title: (Almost) Fence-Less Persistent Ordering
Link: https://www.microarch.org/micro53/papers/738300a539.pdf
Year: MICRO 2020
Keyword: NVM; Persistent Barrier; Themis




Highlight:

  1. Use non-temporal stores to write undo log entries; this avoids cache pollution and shortens latency, since only one round trip is required instead of two.

  2. Non-temporal stores have a shorter data path than a temporal store followed by clflush. This can be exploited to partially overlap the data path latency difference with the persistence latency.

  3. Most importantly, write ordering enforcement should not stall the pipeline, i.e., it is fine to enforce it in the background, but blocking stores from committing in the ROB is definitely not a good place to enforce such ordering.

This paper proposes Themis, a lightweight persistent ordering architecture for NVM applications. The paper begins with the observation that most NVM applications, if not all, use persistence barriers to guarantee write ordering. Using undo logging as an example, the log entry must reach the NVM before the dirty data does, because otherwise, if the system crashes between the two persistence operations, the dirty data cannot be rolled back to the pre-transaction image, corrupting program data.
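To make the ordering requirement concrete, below is a minimal C sketch of the conventional persist-barrier pattern. The function and variable names are my own illustration, and clwb is used as a representative write-back instruction; this is not code from the paper.

```c
#include <immintrin.h>
#include <stdint.h>

/* Illustrative undo-logging sequence with conventional persist barriers.
 * The log entry must be durable in NVM before the in-place update is;
 * otherwise a crash between the two leaves unrecoverable dirty data. */
void undo_log_then_update(uint64_t *log_entry, uint64_t *data, uint64_t new_val)
{
    *log_entry = *data;      /* record the pre-image in the undo log          */
    _mm_clwb(log_entry);     /* push the log entry toward NVM                 */
    _mm_sfence();            /* persist barrier: log before data              */

    *data = new_val;         /* overwrite the data in place                   */
    _mm_clwb(data);
    _mm_sfence();            /* second barrier before the transaction commits */
}
```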

The paper then claims that persistent barriers are detrimental to performance, mainly because: (1) They require a cache line flush followed by a memory fence, which prevents any possible coalescing and reordering of NVM writes; (2) The sfence prevents instructions after it from committing in the ROB, which stalls the pipeline given the relatively long latency of NVM writes.

The paper also makes a few observations on common usage patterns with persistence barriers. First, non-temporal stores (e.g., movnt) bypass the cache hierarchy, and can therefore be used to write log entries to the NVM. The advantage of using non-temporal stores is that the logging addresses are never brought into the cache, avoiding cache pollution, since the log buffer will never be read again during normal operation. Second, the persistence barrier works equally for both temporal stores (i.e., regular mov) and non-temporal stores, except that the cache line flush is unnecessary for non-temporal stores; the sfence instruction orders non-temporal and temporal stores without distinction. Third, non-temporal stores typically reach the memory controller faster, since they are generated earlier and have a shorter data path. To elaborate: a non-temporal store is sent directly to the memory controller after it is generated, while a temporal store followed by a cache line flush must traverse the entire cache hierarchy, which costs more cycles. Lastly, modern memory controllers are in the persistence domain, thanks to the Asynchronous DRAM Refresh (ADR) feature. Once a memory request reaches the memory controller's queue, it is guaranteed to be persistent even on power loss. It is, therefore, sufficient to declare an NVM write persisted once the processor receives the ACK packet from the memory controller.
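As a hedged illustration of the non-temporal logging pattern described above (names are mine, and I assume the log entry fits in a few 8-byte stores; the paper's own code may differ):

```c
#include <immintrin.h>
#include <stdint.h>

/* Illustrative: write the undo-log entry with a non-temporal store so the log
 * address never enters the cache, then order it against the temporal update
 * with sfence, which orders temporal and non-temporal stores alike. No cache
 * line flush is needed for the log, since movnt already bypasses the caches. */
void nt_log_then_update(long long *log_entry, uint64_t *data, uint64_t new_val)
{
    _mm_stream_si64(log_entry, (long long)*data);  /* pre-image via movnt      */
    _mm_sfence();                                  /* log ordered before data  */

    *data = new_val;                               /* regular temporal store   */
    _mm_clwb(data);                                /* flush data toward NVM    */
    _mm_sfence();
}
```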

The paper assumes the following architecture. Non-temporal stores are handled by a write-combining buffer (WCB) connected to the Load-Store Unit (LSU). The LSU puts the request directly into the WCB without invoking coherence (it is unclear, however, whether the L1 cache will be updated if the block to be written is already in L1 with sufficient permission; this does not affect the correctness of the design, though). Write requests in the WCB are coalesced (writes to different locations in the same cache line are merged into a single request), combined (writes to the same location in the same line are also merged), and reordered for better write performance. Once the arbitrator grants bus access, and the memory controller indicates that it can accept more requests, the WCB controller sends a request to the controller and waits for the ACK. The WCB maintains three pointers: a tail pointer, where new requests are added; a head pointer, where existing requests are removed and sent on the bus; and an ACK-head pointer, which lags behind the head pointer and points to the oldest request, in queue order, that has not yet been ACK'ed by the memory controller. The paper notes that the ACK-head pointer, in fact, indicates the current progress of persistence in the WCB, since any write that has not been ACK'ed by the memory controller is either still in the queue, or still being transferred by the on-chip network.
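A simplified software model of the WCB state described above might look as follows (the size and all field names are my own guesses, not taken from the paper):

```c
#include <stdint.h>
#include <stdbool.h>

#define WCB_ENTRIES 16   /* illustrative size; indices then need 4 bits */

/* Simplified model of the write-combining buffer. Pointer values are taken
 * modulo WCB_ENTRIES when indexing; their relative order tracks persistence
 * progress of the non-temporal writes. */
struct wcb {
    struct {
        uint64_t addr;
        uint8_t  data[64];
        bool     valid;
    } entry[WCB_ENTRIES];
    unsigned tail;       /* new non-temporal requests are appended here         */
    unsigned head;       /* next request to be sent to the memory controller    */
    unsigned ack_head;   /* oldest request not yet ACK'ed; everything behind it */
                         /* is guaranteed persistent (reached the ADR domain)   */
};
```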

Temporal stores are handled by the cache hierarchy and coherence protocol as usual. When a clflush instruction hits a line, the line is scheduled for eviction and sent to a write-back buffer (WBB). Each level of the cache has a WBB, but this paper focuses on the WBB at the L1 cache in particular. Lines in the WBB are queued until the next-level cache is able to handle the eviction.

The Themis design is based on the observation that, as long as a temporal write leaves the WBB only after a non-temporal write has been ACK'ed and removed from the WCB, the former is guaranteed to be ordered after the latter. In the context of undo logging, this implies that, as long as the undo log entry is persisted using non-temporal writes, and the data is persisted using temporal writes and clflush, the write ordering between these two can be enforced with little performance overhead, since (1) at this point, both writes have already been committed in the pipeline, and therefore no pipeline stall occurs; (2) the temporal write naturally traverses a longer data path than the non-temporal write, which overlaps the data path difference with the persistence delay of the non-temporal write.
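If the hardware enforces this ordering as described, the software-visible effect is that the fence between the log write and the data update could in principle be dropped. Below is a hedged sketch of what such a transaction body might look like; this is my reading of the design, not code from the paper.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical Themis-style transaction body: the undo log entry goes through
 * a non-temporal store (into the WCB), the data update goes through a temporal
 * store plus clwb (into the cache and WBB), and the log-before-data ordering
 * is enforced by the hardware instead of an intervening sfence. */
void themis_log_then_update(long long *log_entry, uint64_t *data, uint64_t new_val)
{
    _mm_stream_si64(log_entry, (long long)*data);  /* log entry via the WCB     */
    /* no sfence here: Themis orders the flushed data line after the NT write   */
    *data = new_val;
    _mm_clwb(data);
}
```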

The Themis design modifies cache hierarchy components as follows. The three pointers in the WCB are all extended with two extra high bits, which track the number of wrap-arounds since the last reset. For example, for a 16-entry WCB, the original pointers have four bits, and Themis adds two extra bits on top. During execution, the lowest four bits are still used to index into the WCB, but when the pointer value wraps around (i.e., when the pointer advances from entry 15 to entry 0), the carry propagates into the high two bits. This way, the pointer value does not suffer from aliasing problems when it wraps around, since these pointers will be used as "snapshots" to track the progress of non-temporal writes. Imagine that a temporal store is inserted into the WBB when the WCB tail pointer is 10 and the ACK-head pointer is 6, and later on the WCB advances such that both the ACK-head and the tail wrap around and become 1. In this case, the WBB entry can, in fact, be safely written into the L2, since the write ordering has already been resolved. It is, however, unknown to the hardware whether this is the result of a wrap-around, or whether the ACK-head pointer has simply not yet reached the tail location. With the two extra bits, the ACK-head pointer will have the value 17, which is larger than the tail value snapshot of 10. The hardware then knows that all write orderings have been resolved, and can safely write the WBB entry to the L2.
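A small sketch of the extended-pointer arithmetic, with bit widths following the 16-entry example above (the code and names are illustrative):

```c
#define WCB_IDX_BITS   4                    /* 16-entry WCB -> 4 index bits   */
#define WCB_WRAP_BITS  2                    /* wrap-around counter on top     */
#define WCB_PTR_MASK  ((1u << (WCB_IDX_BITS + WCB_WRAP_BITS)) - 1)

/* Advance an extended pointer: the carry out of the index bits naturally
 * increments the wrap-around counter held in the high two bits. */
static unsigned wcb_ptr_advance(unsigned ptr)
{
    return (ptr + 1) & WCB_PTR_MASK;
}

/* Only the low bits are used to index into the WCB itself. */
static unsigned wcb_ptr_index(unsigned ptr)
{
    return ptr & ((1u << WCB_IDX_BITS) - 1);
}

/* Ordering test on the extended values: in the example above, an ACK-head of
 * 17 (index 1, one wrap) compares greater than a tail snapshot of 10, so the
 * WBB line may be written back even though the raw index has wrapped around. */
static int ordering_resolved(unsigned ack_head_ext, unsigned tail_snapshot_ext)
{
    return ack_head_ext >= tail_snapshot_ext;
}
```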

The L1 data cache is extended with an extra small tag storing the value of the WCB tail pointer. This tag is updated whenever the L1 cache line is written, and it serves as a "snapshot" of the WCB, remembering the newest write currently in the buffer. In order to enforce a write ordering between the cache line and the WCB entries, the cache line must not be written back to the L2 before all entries older than the recorded tail location have been drained to the memory controller and ACK'ed. The tag is also sent to the WBB when the line is evicted, either naturally or by a cache line flush instruction.
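A hedged sketch of the per-line tagging (structure layout and names are mine):

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified L1 line with the Themis extension: every store into the line
 * snapshots the current WCB tail, recording the newest non-temporal write
 * that this line must be ordered after. */
struct l1_line {
    uint64_t tag;
    uint8_t  data[64];
    bool     dirty;
    unsigned wcb_tail_snapshot;   /* extra small tag added by Themis */
};

static void l1_store(struct l1_line *line, unsigned cur_wcb_tail)
{
    line->dirty = true;
    line->wcb_tail_snapshot = cur_wcb_tail;  /* refreshed on every write; the */
                                             /* snapshot travels with the line */
                                             /* into the WBB upon eviction     */
}
```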

The WBB periodically reads the value of the ACK-head pointer from the WCB. Each entry of the WBB is extended with a tag field storing the tail pointer snapshot evicted from the L1 data cache together with the cache line data. Cache lines that must follow the write ordering are not allowed to leave the WBB once they enter it (which is indicated by a special bit), while normal cache lines are unaffected (the tag field for these lines is ignored). When the value of the ACK-head changes in the WCB, the new value is sent to the WBB, and the WBB controller scans all its entries and marks those lines whose tail snapshot is smaller than or equal to the ACK-head as eligible for removal.
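Putting the WBB side together, here is an illustrative sketch of the scan triggered by an ACK-head update (sizes and field names are again my own):

```c
#include <stdint.h>
#include <stdbool.h>

#define WBB_ENTRIES 8    /* illustrative size */

struct wbb_entry {
    uint64_t addr;
    uint8_t  data[64];
    bool     valid;
    bool     ordered;         /* set for lines that carry an ordering constraint */
    bool     ready;           /* eligible to be written back to L2               */
    unsigned tail_snapshot;   /* WCB tail captured when the line was written     */
};

/* When the WCB's ACK-head advances, the new (extended) value is pushed to the
 * WBB, which scans its entries and releases every ordered line whose tail
 * snapshot is smaller than or equal to the ACK-head. */
static void wbb_on_ack_head_update(struct wbb_entry wbb[WBB_ENTRIES], unsigned ack_head)
{
    for (int i = 0; i < WBB_ENTRIES; i++) {
        if (wbb[i].valid && wbb[i].ordered && wbb[i].tail_snapshot <= ack_head)
            wbb[i].ready = true;   /* ordering resolved; may leave the WBB */
    }
}
```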

When the WCB pointer wraps around (in fact, the tail pointer will always wrap around first), the WCB is first fully drained to avoid any aliasing problem when the tail pointer is reset to zero. Otherwise, L1 cache lines could be tagged with a tail pointer of zero or a very small value, violating the write ordering: such a value is likely smaller than the current ACK-head, which would immediately grant the lines permission to be written back to the L2, even though non-temporal writes ordered before them may not yet have been fully persisted. Once the WCB is drained, all WCB pointers are set to zero. The L1 data cache is also notified of the wrap-around event; the L1 cache controller flash-clears all tail pointer tags, since all write orderings have been resolved at this point. The WBB is not affected by the wrap-around, since it keeps monitoring the ACK-head.
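Finally, a sketch of the wrap-around handling; the drain itself is abstracted into the stated precondition, and all names are mine:

```c
/* Reset after a wrap-around. Precondition: the WCB has been fully drained,
 * i.e. every outstanding non-temporal write has been ACK'ed, so no ordering
 * constraint can still be pending anywhere in the system. */
static void wcb_on_wraparound(unsigned *tail, unsigned *head, unsigned *ack_head,
                              unsigned l1_tail_tags[], int n_l1_lines)
{
    *tail = *head = *ack_head = 0;          /* reset all three WCB pointers    */
    for (int i = 0; i < n_l1_lines; i++)    /* flash-clear the per-line tail   */
        l1_tail_tags[i] = 0;                /* snapshots held in the L1 cache  */
}
```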