Read-Log-Update: A Lightweight Synchronization Mechanism for Concurrent Programming



Paper Title: Read-Log-Update: A Lightweight Synchronization Mechanism for Concurrent Programming
Link: https://dl.acm.org/doi/10.1145/2815400.2815406
Year: SOSP 2015
Keyword: RCU; RLU; STM; Read-Log-Update




Highlights:

  1. The conventional RCU paradigm has limited writer parallelism, supports only a single atomic pointer swing per update, and needs to wait for grace periods.

  2. We can improve RCU by using a two-version MVCC scheme where reader-writer synchronization is realized using a global timestamp clock, and writer-writer synchronization is realized with regular per-object locks. The grace period can also be overlapped with regular execution by lazily performing GC on the older of the two versions.

Comments:

  1. On page 8, top left, second line, “since this clock only gets updated after RLU synchronize calls”. I could not see how this is the case. In the pseudo-code the clock gets updated before the synchronize call. Earlier in the same section, we also have “only in this case, the writer sends a ‘sync request’ to the conflicting thread to force it to release its locks, by making the thread increment the global clock, execute RLU synchronize, write back, and unlock”.

This paper presents Read-Log-Update (RLU), a parallel programming paradigm that replaces Read-Copy-Update (RCU) as a lightweight synchronization primitive. RLU overcomes the limitations of RCU by providing a higher degree of writer parallelism and greater flexibility. To achieve this, RLU adopts the timestamp-based synchronization algorithm from Software Transactional Memory (STM) designs and tailors it to the specific problems that RLU aims to address. Compared with RCU, RLU demonstrates higher operation throughput and better scalability on several commonly used data structures, making it a feasible alternative to RCU for manipulating concurrent data structures.

RCU is a lightweight synchronization primitive that has been gaining popularity in the Linux kernel for manipulating pointer-based kernel data structures. Compared with conventional lock-based synchronization primitives, RCU liberates readers of the data structure from acquiring locks. Instead, readers can directly access the data structure within a declared read critical section (using the API calls rcu_read_lock and rcu_read_unlock). Writers, however, must first copy the object they intend to modify into a newly allocated private object, then apply the writes to the private copy, and finally publish the new object to future readers by atomically swinging the pointer that referenced the old object over to the new one. In addition, a writer must wait for all readers to give up their references to the old object after it is unlinked from the data structure before the old object can be reclaimed. This wait is implemented in the API call rcu_synchronize, which blocks until every reader that was active when it was called has passed through a quiescent period in which it holds no reference to any object in the structure. The wait time is called a grace period and is essential for the functional correctness of RCU.
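As a concrete illustration of this pattern, the sketch below shows a lock-free reader and a single-node update on a linked list, written against the userspace RCU library (liburcu); the list layout and the writer lock are illustrative choices, not the paper's code, and in-kernel code would use the equivalent kernel API.

```c
/* Illustrative RCU usage on a singly linked list, assuming the userspace RCU
 * library (liburcu). Each thread is assumed to have called
 * rcu_register_thread() beforehand (omitted here). */
#include <urcu.h>          /* rcu_read_lock/unlock, rcu_dereference, rcu_assign_pointer, synchronize_rcu */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct node {
    int key;
    int value;
    struct node *next;
};

static struct node *head;                                       /* shared list head */
static pthread_mutex_t writer_lock = PTHREAD_MUTEX_INITIALIZER; /* RCU itself does not synchronize writers */

/* Reader: traverse without locks inside a read-side critical section. */
int lookup(int key, int *value_out)
{
    int found = 0;
    rcu_read_lock();
    for (struct node *n = rcu_dereference(head); n != NULL; n = rcu_dereference(n->next)) {
        if (n->key == key) {
            *value_out = n->value;
            found = 1;
            break;
        }
    }
    rcu_read_unlock();
    return found;
}

/* Writer: copy the node, modify the copy, publish it with one pointer swing,
 * then wait for a grace period before reclaiming the old copy. */
void update(struct node **prev_next_ptr, struct node *old, int new_value)
{
    pthread_mutex_lock(&writer_lock);

    struct node *copy = malloc(sizeof(*copy));
    memcpy(copy, old, sizeof(*copy));           /* copy the object ... */
    copy->value = new_value;                    /* ... apply the write to the private copy ... */
    rcu_assign_pointer(*prev_next_ptr, copy);   /* ... publish atomically to future readers */

    synchronize_rcu();                          /* grace period: wait for pre-existing readers */
    free(old);                                  /* now no reader can still reference 'old' */

    pthread_mutex_unlock(&writer_lock);
}
```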

The paper identifies three problems with the existing RCU implementation in the Linux kernel. First, RCU does not provide any mechanism to synchronize between writers, so writers must be synchronized via some other mechanism such as a global spin lock, which limits RCU's writer parallelism. Second, RCU only supports a single pointer swing, since the “update” step must be performed atomically. This constraint greatly limits the flexibility of RCU, as many common data structures must be updated with multiple pointer swings. For example, in the Linux kernel, doubly linked lists used with RCU may only be traversed in one direction to avoid observing inconsistent backward pointers. Lastly, the grace period forces writers to wait for readers, which further degrades write performance. Moreover, as the number of cores increases, the grace period generally becomes longer, because the writer must wait for every reader core to go through at least one quiescent period.

To address the above problems, the paper proposes RLU, which (1) enables multiple writers to proceed as long as they operate on different data items; (2) allows a writer to commit multiple updates atomically even if these updates require more than one pointer swing; and (3) naturally supports deferring the memory reclamation of unlinked objects, removing the grace period from the critical path. Overall, RLU is potentially more efficient than RCU, especially on write-dominant workloads, due to its higher degree of writer parallelism and lower memory-reclamation overhead.

We next describe the design of RLU in more detail. In RLU, every thread has a private log buffer that holds copies of shared objects when they are first modified within a critical section. Objects in the log buffer are also published to other readers when the writer completes, as we show later. Each shared object has a header that contains a pointer to the object's private copy in a writer's log buffer. This pointer also serves as a lock: if the pointer is NULL, the object has no private copy in any writer's log buffer, and readers can access the object directly without any synchronization. In order to support atomic commits of multiple written objects, RLU also maintains a global clock, implemented as a shared timestamp counter. As we will see later, the counter identifies the most up-to-date committed logical snapshot. Correspondingly, every writer thread has a writer timestamp, which indicates the snapshot that the writer generates. This writer timestamp is initialized to infinity when the thread is created, meaning that the thread does not hold any memory snapshot in its private log buffer. Reader threads, on the other hand, keep a reader timestamp to indicate the memory snapshot they are bound to access.
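The sketch below lays out this metadata in C, assuming a fixed-size per-thread log; all type and field names are illustrative approximations of the paper's pseudocode, not a real API.

```c
/* Illustrative RLU metadata: per-object header, per-copy log header,
 * per-thread state, and the global clock. */
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define RLU_MAX_LOG_BYTES (1 << 16)
#define RLU_INF UINT64_MAX                      /* "infinity" writer timestamp */

struct rlu_thread;                              /* forward declaration */

/* Header placed in front of every RLU-managed shared object. A NULL copy
 * pointer means "unlocked: read the object in place"; a non-NULL pointer
 * both locks the object and locates the writer's private copy. */
struct rlu_obj_header {
    struct rlu_ws_header *_Atomic copy;
};

/* Header stored with each copied object inside a writer's log, so a reader
 * following the copy pointer can find the owning thread and the original. */
struct rlu_ws_header {
    struct rlu_thread *owner;                   /* thread whose log holds this copy */
    struct rlu_obj_header *actual;              /* the original shared object */
    size_t size;                                /* size of the object body; the body follows this header */
};

/* Per-thread RLU state. */
struct rlu_thread {
    uint64_t local_clock;                       /* reader timestamp of the current critical section */
    uint64_t write_clock;                       /* RLU_INF while no snapshot is pending */
    atomic_uint run_counter;                    /* odd while inside a critical section */
    atomic_int sync_request;                    /* set by a conflicting writer (deferral, see below) */
    unsigned char log[RLU_MAX_LOG_BYTES];       /* private write log holding object copies */
    size_t log_pos;
};

/* Global clock: the id of the most recently committed logical snapshot. */
atomic_uint_fast64_t rlu_global_clock;
```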

Reader threads in RLU delimit the read critical section using the same API as in RCU. At the beginning of the read critical section, the thread reads the value of the global timestamp counter and saves it as its reader timestamp. When the reader accesses an object, the access wrapper function first reads the pointer value from the object header. If the pointer is NULL, the reader accesses the object directly, as the object is the most up-to-date version. If the pointer is not NULL, indicating that the object is currently locked by a writer thread, the reader follows the pointer to locate that writer (using the log buffer's address) and compares its reader timestamp with the writer's writer timestamp. If the writer timestamp is larger, the writes logically happen after the reader, so the reader still accesses the original object; this case also covers writers whose writer timestamp is still infinity. Otherwise, if the reader timestamp is greater than or equal to the writer timestamp, the reader reads from the writer's private log buffer, because the reader logically started after the writer committed its modifications.
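Continuing the structures above, the reader-side logic might look roughly as follows; memory-ordering subtleties and the paper's exact checks are elided.

```c
/* Reader-side sketch, continuing the illustrative structures above. */
void rlu_reader_lock(struct rlu_thread *self)
{
    atomic_fetch_add(&self->run_counter, 1);            /* counter becomes odd: inside a critical section */
    self->local_clock = atomic_load(&rlu_global_clock); /* pin the snapshot this reader observes */
}

void rlu_reader_unlock(struct rlu_thread *self)
{
    atomic_fetch_add(&self->run_counter, 1);            /* counter becomes even: outside the critical section */
}

/* Every object access goes through this wrapper; it returns a pointer to the
 * version of the object this reader is allowed to see. */
void *rlu_dereference(struct rlu_thread *self, struct rlu_obj_header *obj)
{
    struct rlu_ws_header *copy = atomic_load(&obj->copy);
    if (copy == NULL)
        return obj + 1;          /* unlocked: the shared object is the latest version */

    if (copy->owner == self)
        return copy + 1;         /* our own pending copy: always read from our log */

    /* Locked by another writer. Steal its copy only if that writer's snapshot
     * is already committed from this reader's point of view; an uncommitted
     * writer has write_clock == RLU_INF, so this test fails for it. */
    if (self->local_clock >= copy->owner->write_clock)
        return copy + 1;         /* read the new version from the writer's log */

    return obj + 1;              /* writer is logically after us: read the old version */
}
```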

When a writer thread modifies an object, it first copies the object into its private log buffer and then locks the object by updating its header pointer field. Unlike in RCU, multiple objects can be modified and added to the log buffer. To lock an object, the writer uses an atomic Compare-and-Swap instruction to install the pointer to the private copy in the header field. If locking fails, the writer relinquishes all previously locked objects and retries. After the writer thread completes its critical section, it atomically commits all its updates to concurrent and future readers as follows. First, the writer sets its writer timestamp to one plus the current global timestamp. At this moment, the updates remain invisible to concurrent readers, because those readers hold reader timestamps smaller than the writer timestamp. Next, the writer atomically increments the global timestamp counter, publishing the writes in its private log buffer to future readers. The atomic increment serves as the serialization point between readers and the writer. Readers that obtained their timestamps before this point are logically ordered before the writer and therefore keep reading the original objects even if the objects they access are locked. Readers starting after this point, on the contrary, are considered logically after the writer; when they access a locked object they find their reader timestamps to be greater than or equal to the writer timestamp and therefore read from the writer's private log buffer.
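A writer-side sketch, again continuing the illustrative structures above: copy-and-lock an object, and the commit step that publishes all copies at once. Error handling and log-capacity checks are omitted.

```c
/* Writer-side sketch, continuing the illustrative structures above. */
#include <string.h>

void rlu_synchronize(struct rlu_thread *self);   /* grace-period wait, sketched below */

/* Copy 'obj' into our log and lock it by installing the copy pointer with a
 * CAS. Returns a pointer to the private copy, or NULL if the object is
 * already locked, in which case the caller releases its locks and retries. */
void *rlu_try_lock(struct rlu_thread *self, struct rlu_obj_header *obj, size_t size)
{
    if (atomic_load(&obj->copy) != NULL)
        return NULL;                                    /* already locked by some writer */

    struct rlu_ws_header *ws = (void *)&self->log[self->log_pos];
    ws->owner  = self;
    ws->actual = obj;
    ws->size   = size;
    memcpy(ws + 1, obj + 1, size);                      /* private copy of the object body */

    struct rlu_ws_header *expected = NULL;
    if (!atomic_compare_exchange_strong(&obj->copy, &expected, ws))
        return NULL;                                    /* lost the race: abort and retry */

    self->log_pos += sizeof(*ws) + size;
    return ws + 1;                                      /* caller applies its writes here */
}

/* Commit: make every copy in the log visible to new readers in one step. */
void rlu_commit(struct rlu_thread *self)
{
    self->write_clock = atomic_load(&rlu_global_clock) + 1; /* still invisible: readers hold smaller clocks */
    atomic_fetch_add(&rlu_global_clock, 1);                 /* serialization point: new readers steal our copies */

    rlu_synchronize(self);   /* wait out readers that started before the increment (next sketch) */
    /* ...then write back, unlock, and reset write_clock, as described below. */
}
```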

After atomically committing the writes, writer threads, just like in RCU, need to wait for a grace period to pass before they write back the committed objects. The grace period in this scenario guarantees that no reader will be accessing the original objects when the write back is performed, hence preserving a consistent view of the shared memory. The writer waits for the grace period by repeatedly checking all other threads until each of them is either outside a read critical section or holds a reader timestamp at least as large as the writer timestamp (such readers read from the log buffer rather than the original objects). After the write back concludes, the writer clears the header field of every object it has modified and resets its writer timestamp to infinity, so that no future reader will take a reference to objects in the private log buffer. Finally, after a second grace period, the writer reclaims the private object copies in the log buffer. The second grace period is implemented in the same way as in RCU (i.e., waiting for every thread to experience at least one quiescent state).
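A rough sketch of the grace-period wait and write-back, continuing the structures above; the thread registry is a hypothetical stand-in for however threads are actually tracked, and proper memory ordering is elided.

```c
/* Grace-period wait and write-back sketch, continuing the structures above. */
extern struct rlu_thread *rlu_threads[];    /* hypothetical registry of all RLU threads */
extern int rlu_num_threads;

void rlu_synchronize(struct rlu_thread *self)
{
    for (int i = 0; i < rlu_num_threads; i++) {
        struct rlu_thread *other = rlu_threads[i];
        if (other == self)
            continue;
        unsigned run = atomic_load(&other->run_counter);
        if ((run & 1) == 0)
            continue;                        /* not inside a critical section: cannot hold old versions */
        /* Spin until the reader leaves that critical section or is known to
         * run on a snapshot at least as new as our commit (a real
         * implementation would use proper memory ordering and back off). */
        while (atomic_load(&other->run_counter) == run &&
               other->local_clock < self->write_clock)
            ;
    }
}

/* After the grace period: write the logged copies back into the shared
 * objects, release the per-object locks, and drop the snapshot. */
void rlu_writeback_and_unlock(struct rlu_thread *self)
{
    size_t pos = 0;
    while (pos < self->log_pos) {
        struct rlu_ws_header *ws = (void *)&self->log[pos];
        memcpy(ws->actual + 1, ws + 1, ws->size);   /* copy the new body over the shared object */
        atomic_store(&ws->actual->copy, NULL);      /* unlock the object */
        pos += sizeof(*ws) + ws->size;
    }
    self->write_clock = RLU_INF;                    /* no pending snapshot anymore */
    /* The log itself is reclaimed or reused only after a second grace period
     * (or swapped for a fresh log, per the optimization below). */
}
```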

To reduce the overhead of the writer's commit protocol, most notably the two grace periods, the paper proposes two optimizations. First, the second grace period can overlap with the writer's regular execution if the writer simply leaves the old log buffer alone and switches to a new log buffer each time it starts a write critical section. Old log buffers are periodically checked and reclaimed only after at least one grace period has passed. Second, the first grace period can also be at least partially avoided by not eagerly performing the write back after the atomic commit. Instead, the writer can simply let the modified objects stay in its private log buffer indefinitely until another writer attempts to lock one of them for writing. In that case, the second writer forces the first writer to perform the write back after a grace period has passed. Since the first writer only performs the write back when necessary, the cycles wasted waiting for the grace period are likely to decrease dramatically, as the grace period will usually overlap with the writer's execution after the critical section.
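A rough sketch of this deferral, continuing the earlier structures: the conflicting writer sets a hypothetical sync_request flag on the lock holder, which then performs the sequence quoted in the comment section above (advance the clock, synchronize, write back, unlock). This is only an illustration of the idea, not the paper's exact protocol.

```c
/* Deferral sketch, continuing the illustrative structures above;
 * 'sync_request' is the hypothetical per-thread flag declared earlier. */

/* Locking path under deferral: if the object is held by another writer that
 * has deferred its write-back, ask it to flush, then abort and retry. */
void *rlu_try_lock_deferred(struct rlu_thread *self, struct rlu_obj_header *obj, size_t size)
{
    struct rlu_ws_header *copy = atomic_load(&obj->copy);
    if (copy != NULL) {
        if (copy->owner == self)
            return copy + 1;                           /* we already hold it: reuse our own copy */
        atomic_store(&copy->owner->sync_request, 1);   /* "please write back and unlock" */
        return NULL;                                   /* abort this operation and retry later */
    }
    return rlu_try_lock(self, obj, size);              /* otherwise lock as before */
}

/* Each thread checks its flag between operations and flushes when asked. */
void rlu_maybe_flush(struct rlu_thread *self)
{
    if (!atomic_load(&self->sync_request))
        return;
    self->write_clock = atomic_load(&rlu_global_clock) + 1;
    atomic_fetch_add(&rlu_global_clock, 1);            /* publish the deferred snapshot */
    rlu_synchronize(self);                             /* wait out readers of the old versions */
    rlu_writeback_and_unlock(self);                    /* flush the copies and release the locks */
    atomic_store(&self->sync_request, 0);
}
```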