Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures
Link: https://dl.acm.org/citation.cfm?id=2694359
Year: ASPLOS 2015
Keyword: Concurrency; Hash Table; ASCY
This paper proposes Asynchronized Concurrency (ASCY), a programming paradigm for designing concurrent search data structures. The paper first identifies that the major source of scalability issues in concurrent data structures is store instructions to shared data, which are serialized by the cache coherence protocol. In addition, without careful evaluation of these stores, they may also incur longer-than-usual dynamic code paths (e.g. in the form of retries), which negatively affect both overall performance and power dissipation. The paper also points out that optimizations to concurrent data structures are often limited to certain architectures and certain workloads. If the optimized data structure is evaluated on a different platform or with a different set of workloads, it may even perform worse than the unoptimized version. One of the examples given by the paper is Read-Copy-Update (RCU), a concurrency technique commonly used in the Linux kernel. The paper points out that RCU is designed for read-dominant workloads, in which writes are very rare compared with reads. If the ratio of writes increases, the performance of RCU drops sharply.
The paper proposes that concurrent data structures be evaluated on multiple platforms, with multiple workloads, and on several metrics. Metrics used by this paper include the latency of operations, the throughput of operations, power consumption, and the distribution of latencies (a tighter distribution implies more predictable performance). Furthermore, the paper proposes that the non-concurrent version of the data structure under evaluation be used as an upper bound of performance, to demonstrate the cost of synchronization. As reported by the paper, a “good” implementation is able to achieve less than 10% performance loss compared with its non-concurrent counterpart.
The paper then presents the four rules of ASCY, using common data structures as examples. The first rule of ASCY suggests that operations that are semantically read-only (i.e. key lookup, range scan, etc.) should neither write to shared memory, nor block on a lock, nor retry. Here “write” includes the acquisition and release of a lock, helping other threads along to complete an operation, and updating state in shared objects. This rule implies that reads should be lock-free and non-blocking, and should not help other threads, even if the concurrency protocol allows transient states to be fixed by any thread that observes them. To achieve this, updates on data structure objects must only transform the object from one consistent state to another, without the possibility of readers observing inconsistent, partially updated values. In addition, reader threads should be able to recognize such transient states when they see one, and recover the consistent image of the object locally, without helping along. In practice, some data structures, such as the Harris linked list as pointed out by the paper, require reader threads to fix transient states by physically removing logically deleted nodes from the list using CAS; if the removal fails, the thread must retry the traversal from the list head. The paper reports a 10% – 30% performance improvement after optimizing out this help-along for reader threads; a sketch of such a write-free search is given below.
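The following C sketch illustrates rule 1 on a Harris-style sorted linked list: the lookup reads past logically deleted (marked) nodes instead of unlinking them, so it performs no shared-memory writes, takes no locks, and never retries. The node layout, macros, and names are illustrative assumptions, not code from the paper.

    #include <stdint.h>
    #include <stdatomic.h>

    /* Hypothetical node layout: the low bit of `next` serves as the
     * Harris-style "logically deleted" mark. */
    typedef struct node {
        uint64_t key;
        uint64_t value;
        _Atomic(uintptr_t) next;        /* low bit = deletion mark */
    } node_t;

    #define GET_PTR(p)   ((node_t *)((p) & ~(uintptr_t)1))
    #define IS_MARKED(p) ((p) & (uintptr_t)1)

    /* ASCY rule 1: the search neither writes to shared memory nor helps
     * unlink marked nodes; it simply skips over them locally. */
    int list_search(node_t *head, uint64_t key, uint64_t *value_out)
    {
        node_t *curr = GET_PTR(atomic_load(&head->next));
        while (curr != NULL && curr->key < key)
            curr = GET_PTR(atomic_load(&curr->next));
        if (curr != NULL && curr->key == key &&
            !IS_MARKED(atomic_load(&curr->next))) {
            *value_out = curr->value;   /* present and unmarked */
            return 1;
        }
        return 0;                       /* absent or logically deleted */
    }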
The second rule of ASCY extends the first rule by requiring that traversals in the data structure also not perform any global writes, block, or retry. Here “traversal” (the original term in the paper is “parse”, but I think traversal is more understandable and is actually commonly used for tree structures) is defined as the process of finding the object to update. Often, search data structure inserts and deletes are implemented as two-phase operations. In the first phase, the thread traverses the data structure to locate the object where the search key is stored. This process is identical to, or very similar to, a read-only operation. In the second phase, the object located in the first phase is updated. ASCY rule #2 simply requires that the first phase of insert and delete operations behave as if it were a read-only operation, without acquiring locks or being blocked. Of course, some data structures inevitably involve some form of “help-along”, such as letting threads complete an operation collaboratively or perform cleanups to guarantee progress. In such cases, stores to shared memory during the traversal are allowed, but the number of writes should be kept to an absolute minimum. A sketch of such a two-phase insert follows.
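Building on the node layout from the previous sketch, the following C code separates an insert into a write-free parse phase and an update phase that performs a single CAS; this is my illustration of rule 2, not the paper's code.

    /* Phase 1: locate the insertion window (pred, curr) for `key`.
     * Identical in behavior to a read-only search: no stores, no locks. */
    static void list_parse(node_t *head, uint64_t key,
                           node_t **pred_out, node_t **curr_out)
    {
        node_t *pred = head;
        node_t *curr = GET_PTR(atomic_load(&head->next));
        while (curr != NULL && curr->key < key) {
            pred = curr;
            curr = GET_PTR(atomic_load(&curr->next));
        }
        *pred_out = pred;
        *curr_out = curr;
    }

    int list_insert(node_t *head, node_t *new_node)
    {
        for (;;) {
            node_t *pred, *curr;
            list_parse(head, new_node->key, &pred, &curr);
            if (curr != NULL && curr->key == new_node->key)
                return 0;                       /* key already present */
            /* Phase 2: a single CAS links the node in. The expected value
             * is an unmarked pred->next, so the CAS fails if pred was
             * deleted or the window changed concurrently. */
            atomic_store(&new_node->next, (uintptr_t)curr);
            uintptr_t expected = (uintptr_t)curr;
            if (atomic_compare_exchange_strong(&pred->next, &expected,
                                               (uintptr_t)new_node))
                return 1;                       /* success */
            /* CAS failed: retry the parse from the head */
        }
    }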
The third rule of ASCY further extends the second rule: if an update operation cannot proceed because certain checks fail (e.g. the key cannot be found for a delete, or the key already exists for an insert), the update operation should not perform any write on shared data, except those absolutely necessary to ensure progress. This implies that the condition check on objects must be lock-free regardless of concurrent writer threads, just like read-only operations. Once the optimistic condition check passes, the object is locked and the thread verifies that the previous reads are still valid (i.e. their values have not changed since the optimistic read). The update can then proceed in a single-writer multiple-reader manner. This resembles the Optimistic Concurrency Control (OCC) method used in software transactional memory (STM), in which the write set is locked before the read set is validated. A sketch of this check-lock-validate pattern is given below.
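The following C sketch shows rule 3 on a hypothetical lock-based hash bucket (all types and helpers here are illustrative, and a real implementation would use atomic loads in the scan): the existence check runs lock-free first, so an unsuccessful delete performs no shared-memory writes at all, and the lock is taken only on the likely-success path, followed by re-validation.

    #include <stdint.h>
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        uint64_t key[3];
        uint64_t val[3];
    } hbucket_t;

    /* Lock-free scan for the slot holding `key`; -1 if absent. */
    static int slot_of(hbucket_t *b, uint64_t key)
    {
        for (int i = 0; i < 3; i++)
            if (b->key[i] == key)
                return i;
        return -1;
    }

    int hbucket_delete(hbucket_t *b, uint64_t key)
    {
        if (slot_of(b, key) < 0)       /* optimistic check: no lock, no writes */
            return 0;
        pthread_mutex_lock(&b->lock);  /* writes begin only when success is likely */
        int i = slot_of(b, key);       /* validate: the optimistic read may be stale */
        if (i < 0) {
            pthread_mutex_unlock(&b->lock);
            return 0;                  /* a concurrent delete won the race */
        }
        b->val[i] = 0;                 /* single-writer update under the lock */
        b->key[i] = 0;
        pthread_mutex_unlock(&b->lock);
        return 1;
    }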
The last ASCY rule dictates that data structure updates (i.e. the second phase from the above paragraphs) should perform writes to a shared variable only if synchronization is truly needed on that variable. Independent updates to semantically different objects should therefore not have any conflicting writes to the same address, since the coherence protocol would force these writes to be serialized, resulting in lower parallelism. Ideally, the number of writes performed on shared data should be identical to that of the non-concurrent data structure, possibly at the cost of performing more reads for validation. In practice, since it is often impossible to perform lock-free updates using hardware primitives, locks are still needed for proper synchronization. Even in this case, the paper suggests that the number of writes to shared data be reduced as much as possible, to minimize unnecessary synchronization incurred by cache coherence; the sketch below gives a concrete example.
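As a concrete illustration of rule 4 (my example, not one from the paper): a single shared element counter would make every insert and delete write the same cache line, serializing otherwise independent updates. Striping the counter per thread removes the conflicting writes entirely.

    #include <stdatomic.h>

    #define MAX_THREADS 64

    /* One counter per thread, each padded to its own 64-byte cache line
     * so increments by different threads never conflict. */
    typedef struct {
        _Atomic long n;
        char pad[56];
    } padded_count_t;

    static padded_count_t counts[MAX_THREADS];

    void count_inc(int tid) { atomic_fetch_add(&counts[tid].n, 1); }

    long count_read(void)              /* readers sum the per-thread stripes */
    {
        long total = 0;
        for (int i = 0; i < MAX_THREADS; i++)
            total += atomic_load(&counts[i].n);
        return total;
    }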
Using these four rules, the paper proposes a cache-efficient data structure, the Cache-Line Hash Table (CLHT). CLHT is a hash table that uses chaining as its collision resolution method. Elements with the same hash form a linked list of buckets, which store keys and values. The cache-efficient design of CLHT lies in the fact that each bucket is a cache-line-sized object aligned to a cache line boundary. This guarantees that any operation on a bucket requires exactly one cache line fetch. A bucket consists of three parts. The first part is an 8-byte word used as a synchronization object, the details of which are discussed later. The second part is the key-value array, which stores the keys and values of the bucket. Assuming 8-byte keys and values, three key-value pairs can be stored within a bucket, with each key and value aligned to an 8-byte boundary. The last 8-byte word is the “next” pointer that chains buckets under the same hash value. A sketch of this layout is shown below.
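In C, the bucket layout described above can be sketched as follows (the field names are mine; CLHT's actual definitions may differ), assuming 8-byte keys and values and a 64-byte cache line:

    #include <stdint.h>

    #define KVS_PER_BUCKET 3

    typedef struct bucket {
        volatile uint64_t lock;                 /* 8B synchronization word   */
        volatile uint64_t key[KVS_PER_BUCKET];  /* three 8B keys             */
        volatile uint64_t val[KVS_PER_BUCKET];  /* three 8B values           */
        struct bucket *next;                    /* 8B overflow-chain pointer */
    } __attribute__((aligned(64))) bucket_t;

    /* 8 + 24 + 24 + 8 = 64 bytes: every bucket operation touches
     * exactly one cache line. */
    _Static_assert(sizeof(bucket_t) == 64, "bucket must be one cache line");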
The item search and condition check protocols of CLHT are lock-free. Updates can be lock-based or lock-free. We first introduce the lock-based protocol, followed by the lock-free one. In the lock-based protocol, the synchronization object is simply used as a lock to synchronize writers on the bucket. Readers are neither synchronized with writers, nor with each other. Writers on a bucket acquire the lock before they can update a key and value. According to ASCY rules 1 – 3, read-only threads and updating threads must first obtain an atomic image of a key-value pair in the bucket before they complete the operation or decide whether to perform the update. This atomic key-value pair read is achieved by first reading the value field, then reading the key, and finally reading the value field again for validation. An atomic snapshot is obtained if both value reads match. Since each update operation only updates one key-value pair, a read is guaranteed to always serialize either before or after a concurrent update. To elaborate: the validation succeeds iff the value is not updated between the two value reads. This implies that, as long as validation succeeds, for any concurrent key-value pair update, either both the key and value updates happen before the first value read, or the value update happens after the second value read. In the first case, the read is serialized after the update by observing its writes. In the second case, if the key is read before a key insert, then the slot is still empty, and the search will not succeed. If the key is read before a key delete, the read is serialized before the delete, and returns the value of the key before it is deleted. If, however, the key is read after the key update: for an insert, the value will still be NULL, so the reader knows this is an incomplete snapshot and retries; for a delete, the key will be NULL, and the reader is serialized after the delete despite having read a non-NULL value. Inserts and deletes simply acquire the lock and update the key-value pair accordingly. Deletes must clear both the key and value fields to ensure readers can always identify an incomplete delete and retry. The snapshot protocol is sketched below.
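Here is a sketch of the value-key-value snapshot read described above, using the bucket_t layout sketched earlier; the retry policy (restarting from the head of the chain) is my simplification.

    /* Lock-free search: read value, then key, then value again; matching
     * value reads plus a non-NULL value form a consistent snapshot. */
    int bucket_search(bucket_t *b, uint64_t key, uint64_t *value_out)
    {
    retry:
        for (bucket_t *cur = b; cur != NULL; cur = cur->next) {
            for (int i = 0; i < KVS_PER_BUCKET; i++) {
                uint64_t v1 = cur->val[i];        /* first value read      */
                if (cur->key[i] == key) {         /* key read in between   */
                    uint64_t v2 = cur->val[i];    /* validating value read */
                    if (v1 != 0 && v1 == v2) {
                        *value_out = v1;          /* atomic snapshot       */
                        return 1;
                    }
                    goto retry;                   /* torn or in-progress   */
                }
            }
        }
        return 0;                                 /* key absent            */
    }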
In the lock-free version, the synchronization object at the head of the cache line is divided into a version number, which is incremented every time an update is performed, and three status fields that indicate the status of the three key-value pairs (free, occupied, under modification). Reads are performed in the same way as described above. Update threads must guarantee that the condition checked on the bucket still holds when the update finishes. Two CAS operations are required to perform an update. After checking the condition, if the operation can proceed, the thread first uses a CAS to allocate a slot (or invalidate the slot, in the case of a delete) in the three status fields. For deletes, the thread also increments the version number in the same CAS, and then returns. For inserts, a second CAS is performed after the key-value pair has been written. The second CAS increments the version number, using the version number observed when the condition was checked as the expected old value. If the second CAS fails, implying that the bucket has been updated by a concurrent thread, the inserting thread must release the slot, recheck the condition, and try again. A sketch of the insert path's two CAS steps follows.
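The two-CAS insert can be sketched as follows; the paper specifies the fields of the synchronization word but not its bit layout, so the packing and names below are my assumptions.

    #include <stdint.h>
    #include <stdatomic.h>

    enum { SLOT_FREE = 0, SLOT_OCCUPIED = 1, SLOT_INSERTING = 2 };

    /* The 8-byte synchronization word: a version number plus one status
     * field per key-value slot, CAS'd as a single unit. */
    typedef union {
        struct {
            uint32_t version;     /* bumped on every completed update     */
            uint8_t  status[3];   /* free / occupied / under modification */
            uint8_t  unused;
        } s;
        uint64_t word;
    } header_t;

    /* Insert, step 1: one CAS claims a free slot against the header value
     * observed during the condition check. */
    int claim_slot(_Atomic uint64_t *hdr, header_t seen, int slot,
                   header_t *claimed)
    {
        header_t upd = seen;
        upd.s.status[slot] = SLOT_INSERTING;
        *claimed = upd;
        return atomic_compare_exchange_strong(hdr, &seen.word, upd.word);
    }

    /* Insert, step 2: after the key-value pair is written, a second CAS
     * marks the slot occupied and increments the version. Failure means a
     * concurrent update intervened: release the slot, recheck the
     * condition, and try again. */
    int publish_slot(_Atomic uint64_t *hdr, header_t claimed, int slot)
    {
        header_t upd = claimed;
        upd.s.status[slot] = SLOT_OCCUPIED;
        upd.s.version = claimed.s.version + 1;
        return atomic_compare_exchange_strong(hdr, &claimed.word, upd.word);
    }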