COSPlay: Leveraging Task-Level Parallelism for High-Throughput Synchronous Persistence



Paper Title: COSPlay: Leveraging Task-Level Parallelism for High-Throughput Synchronous Persistence
Link: https://dl.acm.org/doi/10.1145/3466752.3480075
Year: MICRO 2021
Keyword: NVM; Persist Barrier; COSPlay




Highlight:

  1. The latency overhead of sfence can be hidden with a technique similar to how the OS hides I/O latency, i.e., by scheduling another logical execution flow. This requires fast, fine-grained switches to a different flow, which can be achieved with coroutines.

  2. sfence instructions have a global effect, i.e., each one blocks store queue allocation for all future ordered operations (stores, clwb, clflush). When multiple coroutines issue their own stores and flushes to the store queue, this execution model is overly strong: the coroutines are supposed to be independent of each other, and hence their flushes need not be ordered with respect to one another.

  3. Adding context IDs to coroutines and store queue entries enables the fencing logic to fence only on a particular context ID without affecting other IDs. This essentially virtualizes the store buffer: it appears as multiple independent buffers, where ordering is enforced within each buffer but not across them.

Comments:

  1. My biggest concern is that this mechanism requires a few coroutine tasks to be assigned to the same core for fast switching (because otherwise they would not share the same store buffer anyway). This has serious performance implications due to resource contention, locality problems, etc. Servers typically schedule only one or two threads per core for this exact reason. Although the paper uses a different switching mechanism (coroutines), the resource contention problem is no different from that of the conventional thread model.

  2. This is like hyper-threading, so why not just extend hyper-threading such that hardware threads switch whenever an sfence is active? Also, how many coroutines does it take to fully saturate the core pipeline and avoid stalls? Surely this number cannot be very large, given the short latency of NVM?

  3. I do not quite get why the authors assume that clflush instructions are first queued in the store queue / write buffer and then moved to the Write Back Buffer (WBB). Why not just add them to the WBB at decode? I understand this is because a flush and earlier writes to the same address must still be ordered, and keeping them in the same FIFO store queue simplifies the ordering, as it is trivially implied by queue order. But if you add flushes directly into the WBB without setting their “ready” bit, and then, when a store retires, check the WBB and set the “ready” bit, wouldn’t that save a few store queue entries? OK, what I just described requires an associative lookup on every store queue retire, but it allows flushes that have no dependencies to execute as soon as they are ready, and it saves store queue entries.

  4. It makes perfect sense to call it COSP rather than COSPlay.

This paper introduces Coroutines for Synchronous Persistence (COSPlay), a hardware-software co-design that exploits task-level parallelism on the software side and a virtualized write buffer on the hardware side for improved persistence throughput. The paper is motivated by the fact that current support for persistence ordering on commercial processors is limited to per-core barriers that block all future store and flush operations while outstanding flushes are being served. This coarse-grained mechanism guarantees safety, but causes unnecessary pipeline stalls, since not all store and flush operations waiting to be executed depend on the current persist operation. This paper proposes that: (1) stores and flushes be context-tagged, and the store buffer holding committed stores be virtualized such that entries from different contexts can be distinguished and processed independently, which maximizes throughput; (2) application tasks be switched out while a persist barrier is in effect, overlapping the persistence overhead with the execution of another task.

The paper first briefly discusses the x86 persistency model. Due to the write-back cache hierarchy, the ordering of NVM writes to cacheable memory on the x86 platform is decoupled from the regular consistency order. Programmers must manually issue cache line flush or write-back instructions (without loss of generality, we use clflush to refer to both in the rest of this article) to force cached content to be written back to the NVM, and the order of these write-backs actually observed on the NVM side is not guaranteed to be consistent with the program order of the flush instructions, for two reasons. First, the processor does not enforce ordering between flush instructions to different addresses, meaning that they can be reordered when the processor deems it beneficial. Second, even if flush instructions were strongly ordered, the NoC connecting the store queue and the NVM controller may still deliver messages out of order, making it impossible to infer write ordering without proper fencing.
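
To make the hazard concrete, here is a small illustration of my own (not from the paper): a log append that flushes a data record and a commit flag with no fence in between. Since nothing orders the two write-backs, a crash can leave a persisted commit flag pointing at unpersisted data. The names are illustrative, plain globals stand in for NVM-backed memory, and a CLWB-capable toolchain (e.g., -mclwb) is assumed.

```c
#include <immintrin.h>

long data, commit;   /* pretend these live on NVM */

void unsafe_log_append(long v) {
    data = v;
    _mm_clwb(&data);     /* request write-back of the data line... */
    commit = 1;
    _mm_clwb(&commit);   /* ...and of the commit flag: nothing orders the two */
    _mm_sfence();        /* a single trailing fence cannot restore the order */
}
```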

Luckily, x86 provides a store fence instruction, namely sfence, to order flushes and stores. When an sfence is decoded in the frontend, it blocks future stores and flushes from being allocated into the store queue until the queue has been drained. This conservative implementation essentially stalls the pipeline while outstanding clflush operations and store instructions are being serviced, and the performance impact depends on the aggregate latency of these operations. The paper conducts studies on the slowdown caused by the store fence, and unsurprisingly, as the memory access intensity of the application increases, performance degrades further relative to non-fenced execution. In addition, the paper points out that Backend Memory Operations (BMOs) further aggravate this problem, since BMOs are on the critical path of writes and lengthen their latency. Evaluation shows that as BMO latency grows, overall performance degrades proportionally.
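
For reference, the conventional persist-barrier pattern the paper starts from looks roughly like the sketch below (my own illustrative helper, assuming a CLWB-capable CPU): flush every line of the region, then stall on sfence. The fixed version of the log append above simply places a barrier between the data and the commit flag.

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

#define CACHELINE 64

/* Conventional persist barrier: write back every line of [addr, addr+len),
 * then stall on sfence until the earlier stores/flushes have drained. */
static void persist_barrier(const void *addr, size_t len) {
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHELINE)
        _mm_clwb((void *)p);
    _mm_sfence();
}

extern long data, commit;

void safe_log_append(long v) {
    data = v;
    persist_barrier(&data, sizeof data);      /* data is durable before... */
    commit = 1;
    persist_barrier(&commit, sizeof commit);  /* ...the commit flag is flushed */
}
```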

The paper observes two types of inefficiency in the above persistence model. First, the processor pipeline stalls while waiting for the store buffer to drain, even though other logical tasks may be ready to execute. This is analogous to the classic I/O latency-hiding problem, which modern systems solve by switching to a different thread and overlapping that thread's execution with the I/O. Performing thread switches at long-latency sfence instructions, however, is unrealistic, both because there is no notification mechanism for sfences (unlike blocking I/O operations, where the OS issues the I/O and switches out the thread), and because the switching overhead is a few orders of magnitude larger than the NVM latency. To enable fast task switches without significant overhead, the paper proposes a lightweight task-based execution model within a single thread, implemented as software coroutines. In the proposed execution model, tasks are self-contained logical executions of a single operation, which may perform NVM writes and use persist barriers to commit them. Tasks do not communicate with each other, to avoid the complication of synchronization, and each task can be scheduled and run to completion. Three utility functions control the lifetime and scheduling of tasks: TASK_START(), TASK_END(), and YIELD(). The first two create and terminate tasks, respectively, while the last relinquishes control to the scheduler; it is called right after the flush instructions in the persist barrier but before the sfence. On a YIELD() call, control transfers to the scheduler, which selects another waiting coroutine in a round-robin manner to execute. Note that the sfence is still necessary, even though it only executes once the coroutine is scheduled again after issuing its flushes and calling YIELD(), because at that moment it is still unknown whether the earlier flushes have completed; the sfence is a conservative measure to ensure safety.
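
Below is a hedged, self-contained sketch of this execution model, using POSIX ucontext as the coroutine substrate. TASK_START() and YIELD() here are my own stand-ins for the utility functions the paper names, not its actual implementation; the point to notice is that YIELD() sits between the clwb and the sfence of the persist barrier, so another task runs while the flush drains.

```c
/* Compile with: cc -mclwb cosplay_sketch.c (Linux; ucontext is POSIX) */
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <ucontext.h>
#include <immintrin.h>

#define MAX_TASKS 4
#define STACK_SZ  (64 * 1024)

static ucontext_t sched_ctx, task_ctx[MAX_TASKS];
static char       stacks[MAX_TASKS][STACK_SZ];
static int        live[MAX_TASKS], ntasks, cur;
static long       data[MAX_TASKS];        /* stand-in for NVM-resident data */

static void YIELD(void) {                 /* switch back to the scheduler */
    swapcontext(&task_ctx[cur], &sched_ctx);
}

static void persist_barrier_yield(void *line) {
    _mm_clwb(line);   /* issue the write-back... */
    YIELD();          /* ...run another task while it (hopefully) completes */
    _mm_sfence();     /* on resumption, conservatively wait for completion */
}

static void task_body(void) {
    int id = cur;
    data[id] = 42 + id;                   /* NVM store */
    persist_barrier_yield(&data[id]);     /* flush, yield, fence */
    live[id] = 0;                         /* TASK_END: fall through to uc_link */
}

static void TASK_START(void) {
    int id = ntasks++;
    getcontext(&task_ctx[id]);
    task_ctx[id].uc_stack.ss_sp   = stacks[id];
    task_ctx[id].uc_stack.ss_size = STACK_SZ;
    task_ctx[id].uc_link          = &sched_ctx;  /* return here on task exit */
    makecontext(&task_ctx[id], task_body, 0);
    live[id] = 1;
}

int main(void) {
    for (int i = 0; i < MAX_TASKS; i++) TASK_START();
    int remaining = ntasks;
    while (remaining > 0)                 /* round-robin scheduler */
        for (cur = 0; cur < ntasks; cur++)
            if (live[cur]) {
                swapcontext(&sched_ctx, &task_ctx[cur]);
                if (!live[cur]) remaining--;
            }
    puts("all tasks persisted");
    return 0;
}
```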

The second type of inefficiency is that the effect of sfence is global: it blocks the pipeline until all operations in the store buffer have drained, regardless of whether those operations are actually related to the flushes the sfence is intended for, or are just innocent clflush operations from another logical task with no dependency on the current one. The negative impact is even more pronounced with coroutines, where multiple tasks may each have issued their own clflush instructions and been switched out and in again before an sfence executes (since YIELD() returns right before the sfence of the persist barrier). This sfence still stalls the core until all earlier flushes complete, while in theory only the flushes issued by the coroutine executing the sfence need to drain. To address this issue, the paper proposes that each coroutine be assigned a unique context ID representing its logical flow. Stores, flushes, and sfences issued by a coroutine are tagged with this context ID and carry it to the store queue, which is virtualized by adding a “context ID” field to each entry. An sfence then waits only for stores and flushes with the same ID to drain from the store queue before it retires and unblocks the pipeline. The context ID is dispensed by either software or hardware, and is part of the coroutine context that is switched in and out as coroutines are scheduled. The paper also notes that aliasing of context IDs, which happens when the number of concurrent coroutines scheduled on the same core exceeds the maximum number of IDs, does not matter: a context ID only denotes the dependency relation between logical tasks, so aliasing merely adds unnecessary dependencies, which preserves correctness and incurs only a slight performance overhead.
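
A software model of the virtualized store queue might look like the following sketch (my own, with illustrative sizes): each entry carries a context ID, and an sfence tagged with context c may retire only when no valid entry with the same tag remains. The drain check below is the naive associative scan; the counter table described in the next paragraph exists precisely to remove it.

```c
#include <stdbool.h>
#include <stdint.h>

#define SQ_ENTRIES 56            /* illustrative store-queue size */

typedef struct {
    bool    valid;               /* entry holds an outstanding store/flush */
    uint8_t ctx;                 /* context ID of the issuing coroutine */
    /* address/data fields omitted */
} sq_entry_t;

static sq_entry_t sq[SQ_ENTRIES];

/* May an sfence tagged with `ctx` retire? Only same-context entries block
 * it; other contexts' stores and flushes are ignored. */
static bool sfence_may_retire(uint8_t ctx) {
    for (int i = 0; i < SQ_ENTRIES; i++)
        if (sq[i].valid && sq[i].ctx == ctx)
            return false;        /* a same-context op is still outstanding */
    return true;
}
```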

To avoid an associative lookup on every store buffer retire (to check whether all operations with the same context ID have drained), the paper also proposes adding a direct-mapped table from context IDs to the number of outstanding operations with that ID in the store queue. Every time an operation retires, the table is indexed with the operation's context ID and the counter is decremented (the counter having been incremented when the operation was allocated). Once the counter reaches zero, the pipeline is notified that the corresponding context ID has drained, and the pending sfence retires as well.
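
A minimal sketch of that counter table, again with illustrative names and sizes: allocation increments the issuing context's counter, retirement decrements it, and a pending same-context sfence may retire once the counter reaches zero. Aliasing by masking the ID only adds false dependencies, matching the paper's argument.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_CTX 16

static uint16_t outstanding[NUM_CTX];   /* per-context ops still in the queue */

static void on_sq_allocate(uint8_t ctx) {
    outstanding[ctx % NUM_CTX]++;       /* new store/flush enters the queue */
}

/* Called when a store/flush leaves the store queue; returns true once the
 * context has fully drained, letting a pending same-context sfence retire. */
static bool on_sq_retire(uint8_t ctx) {
    return --outstanding[ctx % NUM_CTX] == 0;
}
```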