CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization
Link: https://dl.acm.org/citation.cfm?id=2872400
Year: ASPLOS 2016
Keyword: CAS; Cache Coherence; CASPAR; Synchronization; Lock-Free
I don’t quite buy the argument that forwarding to a CAS whose new value depends on the old value will squash the previous CAS. If the new value depends on the old value, the second CAS-pattern will read through the forwarded old pointer; in the context of the first CAS, this means the second CAS-pattern reads through the first CAS's new pointer. I cannot see how this can introduce data conflicts: speculation begins only after the TL, but the new node is allocated before the TL is executed in the case of a push, and in the case of a pop the new value is derived from the old pointer and the node is not read or written at all. In neither case will the speculation of the first CAS-pattern conflict with the second CAS.
This paper proposes CASPAR, a novel cache coherence extension that supports efficient serialization of Compare-And-Swap (CAS) primitives. CAS is often used as the synchronization primitive in lock-free programming. Threads access an object through a pointer P optimistically, assuming that no interleaving thread will change the state halfway through the read. Updates are made by creating a new object and then using CAS, with the old value of P as the expected value, to atomically swap the new object into P. The paper identifies that lock-free programs written this way often suffer from limited scalability for two reasons. First, when contention is high, the CAS is likely to fail because another thread's intervening update changes the variable between the optimistic read and the CAS. To deal with such failures, programmers typically wrap the read-and-update sequence in a CAS loop, which only worsens contention on the variable, since threads will likely spin for multiple rounds before the CAS eventually succeeds. Second, because a CAS acquires the cache line holding the variable in exclusive ownership, acquisition of that line is essentially serialized globally: a CAS cannot complete before the previous owner of the line finishes its own CAS and releases the line.
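To make the retry pattern concrete, here is a minimal sketch of a CAS loop in C11 (my own illustration, not code from the paper), using a shared counter as a stand-in for any optimistically updated variable:

```c
#include <stdatomic.h>

/* Minimal sketch of the CAS retry loop described above (C11 atomics).
 * The shared counter stands in for any variable that is read optimistically
 * and then updated with a CAS. */
static _Atomic long counter;

void increment(void)
{
    long old_val, new_val;
    do {
        old_val = atomic_load(&counter);   /* optimistic read           */
        new_val = old_val + 1;             /* compute the updated value */
        /* The CAS fails if another thread changed `counter` in between;
         * the loop then spins and retries, worsening contention. */
    } while (!atomic_compare_exchange_weak(&counter, &old_val, new_val));
}
```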
This paper proposes a hardware architecture to accelerate CAS execution on directory-based multicore systems. The proposal relies on the fact that in some commonly used lock-free data structures, such as stacks and queues, the following pattern is used to insert a new node into the structure. First, a new node is allocated and its content is initialized. The partial state of this step is not visible to other cores, since it happens in memory that is still private to the current thread. Second, the head pointer of the structure is read into a local variable, H. Third, the “next” pointer of the newly allocated node is set to the value stored in H. In the last step, a CAS is executed using H as the old value and the pointer to the newly allocated node as the new value, which (hopefully) installs the new node as the head, in front of the previous head node. If an intervening thread executes its CAS before the current thread does, changing the value of the head pointer, the CAS of the current thread fails, which results in one or more retries.
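The push pattern above maps to roughly the following C11 sketch (again my own illustration; the `node` type and `head` variable are hypothetical):

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* Head pointer of the lock-free stack. */
static _Atomic(struct node *) head;

void push(int value)
{
    /* Step 1: allocate and initialize the new node; this state is still
     * private to the current thread. */
    struct node *n = malloc(sizeof *n);
    n->value = value;

    struct node *h;
    do {
        /* Step 2: read the head pointer into local variable H. */
        h = atomic_load(&head);
        /* Step 3: link the new node in front of the current head. */
        n->next = h;
        /* Step 4: CAS with old = H and new = n; if another thread's CAS
         * changed `head` in the meantime, this fails and we retry. */
    } while (!atomic_compare_exchange_weak(&head, &h, n));
}
```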
The paper optimizes the above process based on the observation that the “new” value is in fact independent of the “old” value of the CAS, and is available even before the old value is read from the head pointer. The second important observation is that, if all CASes are serialized by the hardware, then the “new” value of the current CAS instance will be the “old” value of the next CAS. On the other hand, even if CASes are serialized by hardware, the next CAS cannot learn its “old” value before the previous CAS finishes, at which time the after-value of the previous CAS is released to the cache hierarchy and acquired by the core executing the next CAS.
This paper breaks the serialization effect of CAS instructions described above using two techniques. First, a special piece of hardware is added per core and per directory to serialize the execution of CAS instructions globally, reducing the chance of CAS failures due to contending threads. Second, instead of using cache-coherent memory as the only channel for passing values from one core to another (in particular, the after-value of a CAS), a new eager forwarding channel is added, whose purpose is to pass the after-value of a previous CAS to the next core executing a CAS before that previous CAS has even executed. By doing so, the second CAS-pattern (i.e., from the read of the head variable to the CAS) can be executed speculatively and validated later, after the first CAS instruction has committed with its non-speculative after-value. Overlapping CAS-patterns in this way increases parallelism. We describe these two mechanisms in detail below.
The first component of CASPAR is efficient hardware detection and serialization of contending CAS instructions. To achieve this, every core is extended with a small fully associative buffer that records the operand addresses of failed CAS instructions. A new address is inserted into this buffer when the number of failed CAS instructions on this core exceeds a certain threshold within a time period. Entries are removed if they have not been accessed by a CAS for a while. Then, for every load instruction, this buffer is searched to check whether the load accesses an address recorded in the buffer. If so, the load is identified as the beginning of a CAS-pattern, called a triggering load (TL), and the address is identified as the target address of the ongoing CAS-pattern, called the Active CAS (AC), which is stored in a register. The AC register is cleared after the commit of a CAS instruction on that address; there can be only one AC at any moment during execution. The purpose of identifying the TL is that the core will acquire the cache line accessed by the TL in exclusive mode, expecting a CAS to update the line later. Until the AC is cleared, the line brought in by the TL is held in the cache for an extended period, and the cache controller rejects all requests for the line from other cores. To avoid deadlock, the cache controller only holds the line for a limited number of cycles; the line is released after a timeout to allow other processors to proceed with the line.
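To make the detection mechanism concrete, here is a rough software model of the per-core state as I understand it; the structure names, buffer size, and policy details are my own inventions, not the paper's:

```c
#include <stdbool.h>
#include <stdint.h>

#define FAILED_CAS_BUF_SIZE 8    /* small fully associative buffer (size is a guess) */

/* Buffer of operand addresses of recently failed CAS instructions. */
struct failed_cas_buffer {
    uint64_t addr[FAILED_CAS_BUF_SIZE];
    bool     valid[FAILED_CAS_BUF_SIZE];
};

/* Per-core state: the buffer plus the Active CAS (AC) register. */
struct core_state {
    struct failed_cas_buffer buf;
    uint64_t ac_addr;            /* target address of the ongoing CAS-pattern */
    bool     ac_valid;           /* at most one AC at any moment */
};

/* On every load: if its address hits in the buffer and no CAS-pattern is
 * ongoing, the load becomes the triggering load (TL) and the AC is set.
 * The core would then acquire the TL's line exclusively and hold it
 * (with a timeout) until the AC is cleared. */
bool on_load(struct core_state *c, uint64_t addr)
{
    if (c->ac_valid)
        return false;
    for (int i = 0; i < FAILED_CAS_BUF_SIZE; i++) {
        if (c->buf.valid[i] && c->buf.addr[i] == addr) {
            c->ac_addr  = addr;
            c->ac_valid = true;
            return true;         /* this load is a TL */
        }
    }
    return false;
}

/* On commit of a CAS to the AC address: clear the AC, releasing the line. */
void on_cas_commit(struct core_state *c, uint64_t addr)
{
    if (c->ac_valid && c->ac_addr == addr)
        c->ac_valid = false;
}
```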
On the directory side, when a read-exclusive request is received, the directory adds the identity of the requesting core to a hardware queue that buffers all current requestors of the same line (there can be multiple queues, one per tracked address). Requests are not served before the current owner of the line completes its CAS and releases ownership of the line, at which point the directory removes the current owner from the head of the queue and forwards the line to the next processor, which becomes the new queue head.
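A similarly rough model of the per-line directory queue (all names and the fixed size are my own simplifications):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_WAITERS 16           /* arbitrary bound for this sketch */

/* One queue per tracked line; the head entry is the current owner. */
struct dir_queue {
    uint64_t line_addr;
    int      cores[MAX_WAITERS];
    int      head, count;
};

/* A read-exclusive request is enqueued; it is served only when the
 * requesting core reaches the head of the queue. Returns true if the
 * queue was empty, i.e. the requester becomes the owner immediately. */
bool on_read_exclusive(struct dir_queue *q, int core)
{
    q->cores[(q->head + q->count) % MAX_WAITERS] = core;
    return q->count++ == 0;
}

/* When the current owner commits its CAS and releases the line, pop it
 * and return the next core to forward the line to (-1 if none). */
int on_line_release(struct dir_queue *q)
{
    q->head = (q->head + 1) % MAX_WAITERS;
    q->count--;
    return (q->count > 0) ? q->cores[q->head] : -1;
}
```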
CASPAR further extends the above architecture to support speculative execution of CAS-patterns by adding a second hardware component for eager forwarding of CAS after-values and post-CAS validation. When a CAS instruction is inserted into the ROB, its operands are collected as soon as they become ready, and the “new” operand of the CAS is sent to the directory via a special forwarding message. The directory saves this forwarded value (at word granularity) into the corresponding entry of the hardware queue, and in the meantime also passes the value on to the next entry in the queue, if there is one. On receiving the forwarded value from the directory, the core stalling on the acquisition of the variable (since it is waiting in the hardware queue) unblocks and enters speculative mode, using the forwarded value as the value of the CAS variable. During speculation, any data conflict, cache overflow, exception, or other unrecoverable event causes speculation to fail, after which the core falls back to serial execution of the CAS. The speculation commits when the previous core’s CAS completes, at which time the cache line is released and the directory forwards it to the next core in the queue. The receiver of the cache line validates its execution by comparing the earlier forwarded value with the actual value in the cache line (the word offset is contained in the forwarding message). If the two values match, speculation commits; otherwise, it aborts and re-executes from the TL instruction. Note that although the execution of CAS is serialized, which might make it seem unnecessary to validate the actual CAS’ed value against the forwarded value, the forwarded value does not always match the final CAS’ed result, due to factors such as branch misspeculation. The paper notes that in most cases the forwarded value passes validation, resulting in a low abort rate and increased throughput.
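The forwarding-and-validation flow might be modeled roughly like this (continuing the directory sketch; `send_speculative_value` is a hypothetical placeholder for the forwarding message, and the word-granularity details are simplified):

```c
#include <stdbool.h>
#include <stdint.h>

/* One queue entry per waiting core, now also holding the value that core
 * eagerly forwarded as the "new" operand of its pending CAS. */
struct spec_entry {
    int      core;
    uint64_t fwd_value;
    bool     has_fwd;
};

/* Hypothetical placeholder for the message that delivers a speculative
 * value to a waiting core. */
static void send_speculative_value(int core, uint64_t value)
{
    (void)core;
    (void)value;
}

/* A core eagerly forwards the "new" operand of its CAS as soon as it is
 * ready in the ROB. The directory records it in that core's entry and
 * passes it on to the next core in the queue, which can then execute its
 * own CAS-pattern speculatively using this value as its "old". */
void on_forward(struct spec_entry q[], int n, int idx, uint64_t new_value)
{
    q[idx].fwd_value = new_value;
    q[idx].has_fwd   = true;
    if (idx + 1 < n)
        send_speculative_value(q[idx + 1].core, new_value);
}

/* When the previous CAS commits and the line arrives, the receiver compares
 * the value it speculated on against the word actually in the line; a
 * mismatch (e.g. caused by branch misspeculation upstream) aborts the
 * speculation and re-executes from the TL. */
bool validate_speculation(uint64_t forwarded_value, uint64_t actual_word)
{
    return forwarded_value == actual_word;
}
```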
At the end, the paper also proposes parallel validation, which enables multiple pending speculations to be validated in parallel rather than having the previous CAS send the non-speculative value to the next CAS, which would unnecessarily serialize the validation process. Parallel validation works as follows. First, recall that when a core forwards the after-value of its CAS to the directory, the value is saved into the corresponding entry of the queue. Second, the necessary condition for a speculation to commit is that: (1) the previous CAS also commits, because otherwise the forwarded after-value was never actually produced; and (2) the current CAS must have used the correct value, i.e., the forwarded value it speculated on must agree with the value actually in memory after the previous CAS has completed. Instead of letting the receiver of the cache line validate its own speculation, the parallel validation algorithm follows a two-phase commit protocol to validate on the sender side in a distributed manner. To elaborate: since the directory knows the speculative after-value of every core in the queue, it sends a “prepare” message to each core in the queue carrying that core’s speculative after-value. The core performs a local validation by checking (1) whether its CAS has completed, and (2) whether the after-value of the CAS matches the value in the “prepare” message. Cores that pass the check stop processing instructions and coherence messages and enter a quiescent state to avoid an unexpected abort; they then reply to the directory indicating that they have passed the test. Cores that fail the test continue to execute normally and simply reply that the check failed. The directory, upon receiving all replies, commits core P(i+1) if P(i) has passed validation and P(i+1) itself has finished its CAS. Other cores are not affected (the paper does not mention how to abort in this case) except that they unblock (if blocked) and resume execution; they will be validated either serially or in later batches.
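Finally, the commit rule of the parallel validation scheme can be summarized with a small sketch (my reading of the protocol; the treatment of the queue head is my interpretation, and all names are mine):

```c
#include <stdbool.h>

/* passed[i]:   core P(i) reported that its CAS completed AND its after-value
 *              matches the value in the directory's "prepare" message.
 * finished[i]: core P(i) has finished its CAS (possibly without the value
 *              check passing).
 * commit[i]:   the directory's phase-2 decision for P(i). */
void decide_commits(const bool passed[], const bool finished[],
                    bool commit[], int n)
{
    /* The head of the queue owns the line non-speculatively, so it simply
     * commits once its own CAS has finished (my interpretation). */
    if (n > 0)
        commit[0] = finished[0];

    for (int i = 1; i < n; i++) {
        /* P(i) commits if P(i-1) passed validation (so the value P(i)
         * speculated on is what was actually written) and P(i) itself has
         * finished its CAS. Cores that are not committed simply resume and
         * are validated serially or in a later batch. */
        commit[i] = passed[i - 1] && finished[i];
    }
}
```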