Ribbon: High-Performance Cache Line Flush for Persistent Memory
Link: https://dl.acm.org/doi/pdf/10.1145/3410463.3414625
Year: PACT 2020
Keyword: NVM; CLF; Ribbon
Highlight:
- Degree of parallelism affects NVM throughput, as too much contention on the WPQ is harmful (the paper does not explain why; perhaps it hurts locality or adds NoC traffic). The solution is to delegate CLFs to helper threads in the background and adjust the number of helper threads to throttle parallelism. Note that this does not optimize out memory fences: a fence still blocks execution to avoid out-of-order stores (if fences were also inserted into the buffer, stores after the fence would execute immediately while the fence itself might still be sitting in the buffer, so the stores would effectively complete before the fence).
- The overhead of a CLF can be reduced by proactively performing it before the program issues the actual CLF. This has no effect on correctness, since a dirty block is prone to eviction from the moment it is made dirty.
- Objects that are semantically related to each other should be allocated together at cache-line-aligned addresses to minimize the number of flush instructions.
Comments:
- Blocking execution on store fences is relatively heavyweight compared with stalling the pipeline: the latter only waits for the memory controllers to acknowledge CLFs, while the former waits for an asynchronous thread to drain the buffer.
This paper presents Ribbon, a software framework for optimizing cache line flush (CLF) operations on Non-Volatile Memory (NVM). CLFs are critical in NVM programming, as they control the order in which writes are persisted on the NVM device. Semantically, a CLF instruction evicts the cache block at the given address from the memory hierarchy, and it has global effect (i.e., the instruction is cache-coherent). If the cache block is dirty, it is also written back to the underlying backing store. On x86, CLF takes several forms: clflush, clflushopt, and clwb. clflushopt and clwb are weakly ordered, meaning that CLFs to different cache lines can complete in arbitrary order (the older clflush is serialized with respect to other flushes). In typical usage, one or more store fences (sfence) are inserted to enforce ordering: CLFs after the fence do not execute until all CLFs before it have completed.
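The conventional pattern looks like the following minimal sketch (a generic persist helper illustrating the idiom, not Ribbon's code):

```c
#include <immintrin.h>   /* _mm_clwb, _mm_sfence; compile with -mclwb */
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64

/* Flush every cache line covering [addr, addr + len), then fence so that
 * later stores are ordered after these flushes complete. */
static inline void persist(const void *addr, size_t len) {
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHE_LINE)
        _mm_clwb((void *)p);   /* weakly ordered write-back of one line */
    _mm_sfence();              /* CLFs before the fence complete first */
}
```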
The paper is motivated by the observation that CLFs significantly affect overall system performance: enforcing write orderings imposes back-pressure on the pipeline and causes stalls, especially when the Write Pending Queue (WPQ) at the memory controller is full (since it then takes longer for the WPQ to accept a new write-back block). The degree of write-back parallelism can also hurt persistence throughput: too many threads contending for the WPQ congest the memory controller and the NoC, while too few threads leave the device's throughput underutilized, which is also sub-optimal. Lastly, the paper observes that the dirtiness of flushed blocks plays an important role in NVM resource utilization. If a cache block is only lightly modified, i.e., most of its contents are unchanged compared with the NVM image, writing back the entire block wastes bandwidth, since only a small portion of the data needs to be updated.
The paper also notes that previous works tend to address these issues by proposing new persistence models and/or new hardware, which changes the usage model of NVM and is not backward compatible with existing binaries. Ribbon, on the contrary, is implemented purely in software and is compiled with the application source code as a library. Ribbon also preserves backward compatibility by requiring only minimal changes to the source code.
At a high level, Ribbon is implemented as a software library that wraps the CLF and memory fence primitives. Application programmers use these wrappers to invoke Ribbon's instrumentation. Ribbon also provides initialization and cleanup functions that should be called at the beginning and the end of the application, respectively; these two functions manage Ribbon's data structures and helper threads, as we will see below.
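The interface might look roughly like the following; all names and signatures are illustrative guesses, not the paper's actual API:

```c
/* Illustrative shape of Ribbon's wrapper interface (hypothetical names). */
void ribbon_init(void);            /* allocate per-thread buffers, spawn helpers */
void ribbon_clf(const void *addr); /* enqueue a CLF instead of executing it */
void ribbon_fence(void);           /* emulated sfence: block until drained */
void ribbon_cleanup(void);         /* drain buffers, join helper threads */
```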
We next discuss Ribbon's operational details. The first feature of Ribbon is CLF decoupling, which decouples CLFs from program execution, allowing them to complete in the background. To achieve this, each user thread is allocated a FIFO CLF buffer that tracks the CLF operations issued by that thread. The CLF wrapper, instead of executing the CLF primitive, simply enqueues an entry describing the operation into the per-thread CLF buffer. To perform background persistence, Ribbon spawns several background CLF helper threads that monitor these per-thread buffers and execute the actual CLF primitives on behalf of the foreground user threads.
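A minimal sketch of this decoupling, assuming a single-producer/single-consumer ring buffer per user thread (buffer sizing, batching, and thread-to-buffer assignment are guesses; the public wrapper would locate the calling thread's buffer, which is passed explicitly here for brevity):

```c
#include <immintrin.h>   /* _mm_clwb; compile with -mclwb */
#include <stdatomic.h>
#include <stddef.h>

#define BUF_SLOTS 1024

typedef struct {
    void *_Atomic slot[BUF_SLOTS];
    atomic_size_t head;   /* consumer index (helper thread) */
    atomic_size_t tail;   /* producer index (user thread) */
} clf_buf_t;

/* Foreground wrapper: record the CLF instead of executing it. */
static void ribbon_clf(clf_buf_t *b, void *line) {
    size_t t = atomic_load(&b->tail);
    while (t - atomic_load(&b->head) == BUF_SLOTS)
        ;  /* buffer full: back-pressure the user thread */
    atomic_store(&b->slot[t % BUF_SLOTS], line);
    atomic_store(&b->tail, t + 1);
}

/* Helper thread body: drain the buffer and execute the real flushes. */
static void clf_helper(clf_buf_t *b) {
    for (;;) {
        size_t h = atomic_load(&b->head);
        if (h == atomic_load(&b->tail))
            continue;  /* empty: spin (a real implementation would back off) */
        _mm_clwb(atomic_load(&b->slot[h % BUF_SLOTS]));
        atomic_store(&b->head, h + 1);
    }
}
```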
The number of helper threads determines the degree of parallelism for persistence. As discussed earlier, too many or too few concurrent flushing threads both harm performance, but somewhere in between there is an optimal value that achieves peak NVM bandwidth. Ribbon therefore determines the optimal number of threads by trial and error, as follows. Ribbon assumes that the number of helper threads lies between a lower bound (for example, 4) and the number of physical cores in the system. The runtime measures the throughput at both bounds, denoted P1 and P4 respectively, and then measures throughput with one thread more than the lower bound and one thread fewer than the upper bound, denoted P2 and P3. The estimated optimal thread count lies at the intersection of the two lines connecting (P1, P2) and (P3, P4) on the thread count-bandwidth graph. This sampling process is repeated every second to adapt to potentially changing workloads.
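In code, the estimate might look like this; measure_throughput() is a hypothetical stand-in for Ribbon's sampling phase:

```c
/* Sketch of the thread-count search. measure_throughput() is hypothetical:
 * it would run the helpers with n threads for a short window and report
 * the observed NVM write bandwidth. */
extern double measure_throughput(int nthreads);

static int optimal_thread_count(int lo, int hi) {
    double p1 = measure_throughput(lo);
    double p2 = measure_throughput(lo + 1);
    double p3 = measure_throughput(hi - 1);
    double p4 = measure_throughput(hi);

    /* Line A through (lo, p1) with slope s1; line B through (hi, p4)
     * with slope s2. Their intersection estimates the peak. */
    double s1 = p2 - p1, s2 = p4 - p3;
    if (s1 == s2)
        return (lo + hi) / 2;            /* parallel lines: fall back */
    double x = ((p4 - p1) + s1 * lo - s2 * hi) / (s1 - s2);

    int n = (int)(x + 0.5);              /* round and clamp to [lo, hi] */
    return n < lo ? lo : (n > hi ? hi : n);
}
```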
One thing worth noting is that memory fences are not inserted into the FIFO buffer; this preserves the ordering between a fence and the stores that follow it. If fences were buffered, stores after a fence would execute immediately while the fence might still sit in the buffer waiting for a helper thread to drain it. This would not only violate the semantics of memory fences but also threaten the correctness of the program, since a dirty block updated by a post-fence store could be evicted earlier than the blocks that CLFs before the fence are supposed to flush. To this end, store fences are emulated in the foreground on the CPU by blocking execution until the per-thread buffer has been fully drained.
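Under the same assumptions as the decoupling sketch above, the emulated fence could be as simple as:

```c
/* Emulated store fence, reusing clf_buf_t from the sketch above: block
 * the user thread until its queued CLFs have all been executed. */
static void ribbon_fence(clf_buf_t *b) {
    while (atomic_load(&b->head) != atomic_load(&b->tail))
        ;              /* wait for the helper to drain this thread's buffer */
    _mm_sfence();      /* order later stores after the completed flushes */
}
```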
The second feature of Ribbon is proactive CLF, which is based on the observation that frequently modified cache blocks can be flushed to NVM even before the CLF primitive is issued, overlapping execution with persistence even further. The paper proposes using precise hardware event sampling on modern platforms to monitor frequently executed store instructions and their exact addresses. These addresses are read by Ribbon's helper threads, which issue CLF instructions for them. The paper argues that issuing extra CLF instructions does not change the semantics of the program: even without proactive CLFs, a dirty cache block can be evicted at any time after it is updated. However, if a block is updated frequently, evicting it too early may incur extra cache misses and fetch/evict traffic on the memory bus, so Ribbon simply disables this feature if NVM throughput is harmed.
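Conceptually, the helper's proactive loop reduces to the following; next_sampled_store() is a hypothetical feed of sampled store addresses:

```c
/* Illustrative proactive-flush loop. next_sampled_store() is hypothetical;
 * in the paper the addresses come from precise hardware event samples. */
extern int next_sampled_store(void **addr);

static void proactive_flusher(void) {
    void *addr;
    while (next_sampled_store(&addr))
        _mm_clwb(addr);  /* flushing early is safe: the block is dirty and
                            could have been evicted at any time anyway */
}
```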
The last feature is coalesced CLF. The paper observes that the dirtiness of written-back blocks is generally low in some workloads, either because objects are not aligned to cache block boundaries, or because objects that are semantically updated and flushed together are not allocated together.
To address the first issue, Ribbon modifies the memory allocator to always align certain objects on cache line boundaries. For the second issue, Ribbon proposes that programmers use a special memory allocation interface to allocate semantically related objects together, such that both can be written back with fewer cache flushes. The paper uses key and value objects in Redis as an example. If these two objects are allocated separately, each occupies one or more blocks that need separate CLFs to write back. If, however, both objects are allocated at aligned, consecutive addresses, a few CLFs can be saved, as the objects are stored compactly.
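A minimal sketch of such coalesced, aligned allocation; alloc_together() is an illustrative name and layout, not Ribbon's actual interface:

```c
#include <stdlib.h>   /* posix_memalign */
#include <string.h>

#define CACHE_LINE 64

/* Place a key and its value in one cache-line-aligned region so that a
 * single pass of flushes covers both (fewer CLFs than two separate
 * allocations that each straddle their own lines). */
static void *alloc_together(const void *key, size_t klen,
                            const void *val, size_t vlen) {
    void *p;
    size_t sz = (klen + vlen + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    if (posix_memalign(&p, CACHE_LINE, sz) != 0)
        return NULL;
    memcpy(p, key, klen);                 /* key and value share lines */
    memcpy((char *)p + klen, val, vlen);
    return p;
}
```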