Predictive Log-Synchronization
Link: https://dl.acm.org/ft_gateway.cfm?id=1217965&type=pdf
Year: EuroSys 2006
Keyword: Log-Structured; STM; Lock-Free
Designing concurrent data structures is, in general, a difficult task. For small-scale applications, a single lock that blocks access to the entire object suffices. On large-scale data processing platforms with modern multicore architectures, however, the overhead of inter-thread communication and the bottleneck of blocking render this simple approach unattractive. Writing efficient concurrent data structures thus requires a thorough understanding of the internals of the data structure, as well as an awareness of the intrinsic level of parallelism that a particular implementation admits.
This paper, on the other hand, proposes Predictive Log-Synchronization (PLS) for parallelizing existing serial implementations of data structures without having to fully understand their internals (though it helps if one does have a solid understanding). PLS extends an existing sequential version of a data structure with a log, which records the logical operations that worker threads have performed. One of the worker threads modifying the data structure periodically acquires a lock before it “replays” the log to apply all pending changes. While the log is being replayed and that instance is unstable, the other worker threads predict the results of their operations. The prediction process works by reading from a duplicate of the last stable instance and then traversing the read-only log, with the help of a few substate functions. Substate functions apply transformations to the information gathered from the stable instance according to the operations performed by all threads. Once the predicting thread finishes traversing the log, the final result is returned as if it came from the object with all log entries applied. After the log-replaying thread applies all entries to the instance, an instance switch happens. The instance switch swaps the statuses of the two instances, making the one with all log entries applied the stable instance and the other one writable. This is realized using a global version counter: the replaying thread increments the counter atomically, and other threads choose which instance to read from according to the least significant bit of the version counter.
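A minimal sketch of the instance-switch idea, assuming a C++-style implementation; the names (`PlsPair`, `switch_instances`) and the memory orderings are illustrative, not taken from the paper:

```cpp
#include <atomic>

// Sketch of the two-instance switch: the least significant bit of the
// global version counter selects which copy is the stable, read-only one.
template <typename DS>
struct PlsPair {
    DS instances[2];
    std::atomic<unsigned> version{0};

    // Readers consult the stable (read-only) copy.
    DS& stable()   { return instances[version.load(std::memory_order_acquire) & 1]; }

    // The lock-holding replayer writes to the other copy.
    DS& unstable() { return instances[(version.load(std::memory_order_relaxed) + 1) & 1]; }

    // Called by the replaying thread once all log entries are applied:
    // a single atomic increment swaps the roles of the two instances.
    void switch_instances() { version.fetch_add(1, std::memory_order_release); }
};
```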
On initialization, PLS creates two empty instances of the data structure and sets the version counter to zero. PLS also initializes two log objects, which are simply arrays of operations that can be traversed from head to tail and appended to at the tail atomically. Operations are inserted by threads that modify the data structure. One of the two instances is called the stable read-only instance; the other is the unstable instance. Each instance is also associated with a spin lock. Threads try to acquire the spin lock after they have appended their operations to the end of the log. The winning thread then proceeds to replay the log onto the unstable instance, while the other threads simply read the stable read-only instance and predict the results of their operations. We elaborate on this part in detail below.
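The log could look roughly like the following sketch. `OpRecord`, `PlsLog`, and the fixed capacity are assumptions for illustration, and the per-entry `ready` flag stands in for whatever publication scheme the paper actually uses:

```cpp
#include <atomic>
#include <cstddef>

// Sketch of the log: a fixed-size array of operation records whose tail
// index is advanced atomically, so modifying threads can append
// concurrently while the replayer scans from head to tail.
struct OpRecord {
    int  opcode = 0;                 // which logical operation (e.g. insert, remove)
    long arg    = 0;                 // its argument
    std::atomic<bool> ready{false};  // set once the record is fully written
};

struct PlsLog {
    static const std::size_t kCapacity = 1 << 16;  // overflow handling omitted
    OpRecord entries[kCapacity];
    std::atomic<std::size_t> tail{0};

    // Reserve a slot at the tail, fill it in, then publish it.
    std::size_t append(int opcode, long arg) {
        std::size_t slot = tail.fetch_add(1, std::memory_order_relaxed);
        entries[slot].opcode = opcode;
        entries[slot].arg    = arg;
        entries[slot].ready.store(true, std::memory_order_release);
        return slot;   // the thread's own position, used later for prediction
    }
};
```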
Operations are categorized into three types: write-only, read-modify-write (RMW), and read-only. Write-only operations modify the content of the data structure without carrying a return value; their effect can only be observed by other threads through the internal change to the data structure. Read-only operations, on the other side of the spectrum, only return a value to the caller and do not modify the data structure. RMW operations both modify the data structure and carry a return value. A thread executing a write-only operation inserts the operation descriptor and argument into the log and then returns; the actual execution of the operation does not take place until a thread replays the log. For RMW operations, the thread inserts the operation at the end of the log as with write-only operations. Then, in order to compute the return value, the thread must partially replay the log as if all operations before its own had been applied. The partial log replay is realized using a mechanism called a “substate”. In the context of PLS, a substate is the subset of the data structure’s state that is relevant to the operation of interest. The thread first queries the read-only copy using the argument of the operation to extract the initial substate. It then traverses the log, checks whether each logged operation is relevant to the substate, and if so updates the substate according to that operation’s semantics. Finally, the thread stops traversing the log once it reaches its own operation and returns (a sketch follows this paragraph).

Read-only operations also rely on substates. Unlike RMW operations, read-only operations are not inserted into the log, and the substate replay can stop early, at the last RMW or write-only operation inserted by the same thread. This constraint is added to guarantee sequential consistency, because otherwise the read-only operation could be fulfilled simply by reading the stable copy, ignoring all modifying operations that precede it in program order. The problem, however, is that the implementation is not linearizable, which is a stronger property than sequential consistency. Linearizable objects are composable, meaning that a system consisting of multiple linearizable objects is itself linearizable. The same does not hold for sequentially consistent objects: a system built from sequentially consistent objects may still exhibit executions that deviate from program order.
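As an illustration, consider predicting the return value of an `insert(key)` on a set whose assumed semantics are “return true iff the key was absent”. `StableSet`, `OP_INSERT`, and `OP_REMOVE` are hypothetical names; `PlsLog` and `OpRecord` refer to the sketch above:

```cpp
// Substate prediction for insert(key): the substate is a single bit,
// "is key currently present?", seeded from the stable copy and updated
// by every relevant log entry that precedes our own operation.
struct StableSet { bool contains(long key) const; };  // stand-in interface only
enum { OP_INSERT = 1, OP_REMOVE = 2 };

bool predict_insert(const StableSet& stable, const PlsLog& log,
                    long key, std::size_t my_slot) {
    // 1. Initial substate, extracted from the stable read-only copy.
    bool present = stable.contains(key);

    // 2. Partial log replay: walk every entry before our own slot and
    //    apply those relevant to this key to the substate.
    for (std::size_t i = 0; i < my_slot; ++i) {
        const OpRecord& op = log.entries[i];
        while (!op.ready.load(std::memory_order_acquire)) { /* spin */ }
        if (op.arg != key) continue;                 // irrelevant to the substate
        if      (op.opcode == OP_INSERT) present = true;
        else if (op.opcode == OP_REMOVE) present = false;
    }

    // 3. The predicted return value of insert(key).
    return !present;
}
```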
While the stable instance is being read, it is possible that the log-replaying thread finishes and then switches instances by incrementing the version counter. If the read operation has still not finished when the next round of log replay begins, this time on the previously stable and now unstable instance, inconsistency might occur as a result of concurrent reads and writes. To prevent reading threads from accessing inconsistent substates, a reference counter is maintained for the stable copy. Every reading thread increments the reference counter atomically before it accesses the stable instance, and a log-replaying thread does not begin until the reference counter drops to zero.
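A hedged sketch of this reader/replayer handshake, with one reference counter per instance indexed like the version counter’s LSB above; the retry loop and memory orderings are assumptions rather than the paper’s exact scheme:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> refcount[2];   // one counter per instance

// Reader: pin the current stable instance before reading it.
int pin_stable(const std::atomic<unsigned>& version) {
    for (;;) {
        unsigned v = version.load(std::memory_order_acquire);
        int idx = v & 1;
        refcount[idx].fetch_add(1, std::memory_order_acq_rel);
        if (version.load(std::memory_order_acquire) == v)
            return idx;                             // pinned the right copy
        // An instance switch raced with us: release and retry.
        refcount[idx].fetch_sub(1, std::memory_order_release);
    }
}

void unpin(int idx) { refcount[idx].fetch_sub(1, std::memory_order_release); }

// Replayer: before mutating the previously stable copy, wait until no
// reader still holds a reference to it.
void wait_for_readers(int idx) {
    while (refcount[idx].load(std::memory_order_acquire) != 0)
        std::this_thread::yield();
}
```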
When a log-replaying thread finishes applying all log entries to the unstable instance, an instance switch takes place. As described earlier, the instance switch is made atomic with respect to all threads by using the version counter. After the switch, the two instances exchange their roles and the spin lock is released. In the next round of log replay, the winning thread must first replay the previous round’s log onto the current unstable instance. This stage is called “adjustment”, as it adjusts the current unstable instance to catch up with the latest updates. One subtlety is that adjustments can usually be carried out more efficiently than a full log replay, because the same entries have already been executed by the previous log-replaying thread on the other copy. It is thus possible for the previous thread to leave “hints” for the adjusting thread, accelerating the adjustment process.
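A rough outline of one replay round, tying the sketches above together; `apply_all` is a hypothetical helper that walks a log and applies each entry, and the real scheme interleaves these steps more carefully and adds the hint-based fast path for adjustment:

```cpp
// Forward declaration of the hypothetical helper used below.
template <typename DS> void apply_all(DS& ds, const PlsLog& log);

template <typename DS>
void replay_round(PlsPair<DS>& pair, const PlsLog& prev_log, const PlsLog& cur_log) {
    DS& target = pair.unstable();

    // Adjustment: bring this copy up to date with the entries that the
    // previous replayer already applied to the other copy.
    apply_all(target, prev_log);

    // Replay: apply the operations logged during the current round.
    apply_all(target, cur_log);

    // Publish: one atomic increment makes 'target' the new stable copy.
    pair.switch_instances();
}
```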