A Practical Multi-Word Compare-And-Swap Operation

- Hide Paper Summary

Paper Title: A Practical Multi-Word Compare-And-Swap Operation
Link: https://dl.acm.org/citation.cfm?id=676137
Year: DISC 2002
Keyword: MWCAS

Back

**Highlights: **

I like it when the paper mentions that the RDCSS and MWCAS are basically non-blocking variants of 2PL. This helps a lot in comprehending the algorithm at a high level.
The “reserve bit in a pointer” thing is not invented by this paper, but the paper did give a strong argument on why this is applicable for both pointers and numerical valued variables.

**Questionss: **

Correctness proof is somehow overcomplicated. I prefer the one I used in this summary which reduces the paper’s algorithm to 2PL.
The paper does not give a concrete reason (e.g. an example) why RDCSS should be used instead of single word CAS.

This paper presents an algorithm for implementing Multi-Word Compare-and-Swap (MWCAS). MWCAS is useful as a basic primitive in constructing lock-free algorithms, as an enhancement to the already-present single word CAS on most platforms. Compared with previous proposals, this paper’s MWCAS implementation has the following features that make it attractive to data structure designers. First, this paper’s design is non-blocking, which means that threads will always make progress by helping each other when they observe a partially finished MVCAS. This can help avoid certain pathologies with lock-based programming. Second, MVCAS in this paper is disjoint-access-parallel (D-A Parallel), meaning that accesses on disjoint data can proceed in parallel with blocking each other. This avoids unnecessary serialization of operations when they are not operating on the same data items. Third, the MWCAS requires no extra metadata except a pool of descriptors, and only reserves a few bits per word. This makes it generally applicable to pointer-based data structures as most pointers nowadays have a few redundant bits that can be used to hold the metadata. Lastly, the MWCAS design only relies on a single hardware primitive, Compare-and-Swap, which makes it portable among different architectures.

The MWCAS algorithm is based on a more basic primitive, restricted double compare and single swap (RDCSS). RDCSS is an atomic primitive that checks two locations for expected values, and swaps the second location with the new value if the comparison results in success. Otherwise, the location is not updated as if the operation has not been executed. RDCSS can be expressed in C language in the following way:

uint64_t RDCSS(uint64_t *a1, uint64_t *a2, uint64_t o1, uint64_t o2, uint64_t n2) {
  uint64_t a2_local = *a2;
  if(*a1 == o1 && a2_local == o2) *a2 = n2;
  return a2_local;
}

This procedure, however, is non-atomic. Imagine what will happen if another thread writes a1 and reads a2 after the “if” condition check, and after the branch is executed (assuming that the condition is true). In this case, the RDCSS sequence is serialized before the “write a1” operation by not observing its update, and in the meantime serialized after “read a2” by overwriting the value it just observed. This implies that the RDCSS sequence must not be atomic, since otherwise, the atomic point will be both before “write a1” and “read a2”, which is impossible, since “write a1” happens before “read a2” in the program order. In other words, the other thread executing the write and read on a1, a2 observes partially updated state by the RDCSS function (such observation does not necessarily involve reading a value; writing into its read set is also an implicit way of synchronization, i.e. anti-dependency), which violates the definition of atomicity.

To ensure atomicity of RDCSS, we may use an existing algorithm, Two-Phase Locking, that generally works for any read-write sequence. The 2PL scheme works by acquiring locks before accessing (reading and/writing) data items. The access must be wrapped between the lock and unlock pair. One additional requirement of 2PL is that once the thread starts releasing a lock, no new lock shall be acquired. We prove the atomicity of 2PL by contradiction. Assume another transaction T2 is serialized both before and after the 2PL transaction T1. Then it must be serialized both before and after the transaction by any of the RAW, WAR or WAW. We can then deduce that the transaction T2 must have accessed a data item A before T1 does, and another data item B (which can be the same item or a different one) after T1 does. According to the locking principle, if T2 accesses A, and T1 accesses it later, then it must be that T2 releases the lock on A, and T1 acquires the lock. We can then deduce that before T1 accesses A, T2 must have already entered the shrinking phase of 2PL. If we look at the after sequence, however, T2 accessed B after T1 does. This implies that when T1 accesses B, it is unlocked, since otherwise T1 will have to wait for the lock. Then we know that T2 acquired the lock on B at some time point after T1 accesses B. We have the following ordering:

(1) T2 unlock A < T1 lock A < T1 access A < T1 unlock A;

(2) T1 lock B < T1 access B < T1 unlock B < T2 lock B.

Since T1 obeys 2PL protocol, which demands that T1 must not lock any item after any item is unlocked, we know that T1 lock A must happen before T1 unlock B, and here concludes out proof, since by adding T1 lock A < T1 unlock B, we have derived T2 unlock A < T2 lock B, which violates the 2PL rule that no lock release shall happen before a lock acquisition. A contradiction!

By proving that 2PL is correct, we now can rewrite the RDCSS in the following manner:

uint64_t RDCSS(uint64_t *a1, uint64_t *a2, uint64_t o1, uint64_t o2, uint64_t n2) {
  lock(a2);
  uint64_t a2_local = *a2;  // Access protected by lock on a2
  lock(a1);
  uint64_t a1_local = *a1;  // Access protected by lock on a1
  unlock(a1);
  if(a1_local == o1 && a2_local == o2) *a2 = n2; // Protected by lock on a2
  unlock(a2);
  return a2_local;
}

Correspondingly, all reads and writes to address a1 and a2 must be synchronized using the same locks on both addresses. Since 2PL guarantees atomicity of transactions, the implementation of RDCSS is atomic.

Using locks solves the correctness problem, but has a large performance overhead, due to the fact that locks are not scalable in a multicore environment. The algorithm seeks to optimize this base algorithm as follows. First, the lock and unlock around a1 read can be removed, given that reading a single word is always atomic (which is always the case on x86 for word aligned read). This lock-unlock pair is unnecessary, since any operation can still occur right before and after the lock and unlock, respectively, which cannot be distinguished by the version with lock (i.e. write a2, lock a2, read a2, unlock a2 is exactly the same as write a2, read a2; Similar reasoning applies to writing a2 after unlocking it). Similarly, reading and writing a1 no longer needs locking and unlocking around the access.

After removing the lock on a1, we now have a single lock-unlock pair wrapping around this function body. To get rid of this lock and unlock pair, the paper proposes using “descriptors” to avoid blocking threads trying to access the data item under update. This mechanism works as follows. First, in order to notify threads that a RDCSS is being conducted on the data item, the RDCSS procedure first prepares a descriptor, which is a small memory block that records the type of the descriptor, the two addresses, two old values, and the new value to be swapped if comparisons are successful. The descriptor is allocated and initialized with the metadata of the RDCSS before it is linked into the data structure. Next, the thread initiating the RDCSS installs the pointer to the descriptor to the target word (address a2 in our example) using a single word CAS against the old value (i.e. o2 in the example). This is the serialization point between threads conducting the RDCSS and threads reading the value of the target word if the RDCSS succeeds (if it fails then there is no serialization point, since the failed RDCSS does not change global state). The paper assumes that there is some way for threads to distinguish between a regular value in the data structure from a pointer to the descriptor. On x86 platforms, due to the fact that pointers have many redundant bits because of address alignment and 48-bit virtual address, we can reserve a few bits in these pointers to indicate whether the pointer value represents a normal value or descriptor’s address. For numerical values and bit fields, we may reserve a few higher order bits explicitly (most numerical values will not have a chance to use up all 64 bits).

The RDCSS descriptor posted to the target word serves two purposes. First, threads accessing the word will not be able to proceed, because the descriptor is essentially an exclusive lock on the target word indicating an ongoing RDCSS, as shown in the previous example. Second, instead of letting worker threads wait on this RDCSS record, the presence of metadata in the descriptor block further allows threads to “help-along” each other by completing the RDCSS operation when they observe this descriptor. The RDCSS operation is completed by simply comparing the value of address a1 against the value of o1 in the descriptor, and CAS’ing the new value n2 into address a2 if comparison succeeds, or CAS’ing the old value o2 into a2 if comparison fails. Note that we know the old value on a2 must be o2, since otherwise, the previous CAS will not succeed.

The “help-along” protocol described in the previous paragraph introduces a data race when multiple threads (including the initiator thread) attempt to complete the RDCSS operation. This data race, however, does not affect correctness, since as soon as the first thread CAS’ed out the descriptor pointer from the target word, as a result of a successful or failed comparison, all later threads attempting to complete the RDCSS opreation would fail the CAS, which introduces no global state change and is therefore safe. In other words, the “help-along” protocol in this case ensures that the global state change happens exactly once by using CAS to switching out the descriptor pointer in the last step of the RDCSS operation. This “excat once” semantics is critical to the correctness of “help-along” protocol, as we will see in the full MWCAS protocol.

The full MWCAS protocol works in a similar way as the RDCSS protocol. It follows the strict 2PL protocol which we describe as follows. First, a descriptor is installed to each of the target locations to be compared and swapped. This process is akin to locking these locations for update in the 2PL protocol. The descriptor pointer is installed using RDCSS on both the target word (for CAS) and the status word within the descriptor (explained later), with the expected old value as old, and descriptor pointer as new. Similar to the RDCSS protocol, we need to dedicate another bit in the target word to indicate that the word stores a MWCAS pointer rather than a regular value. This process is repeated for every target word in the descriptor with the old values. Should any of these CAS fail, the MWCAS is considered to be failed, and the contents of the target words will be restored to their original values using CAS. Note that using CAS for value restoration is mandatory, as other threads may have completed the value restoration process.

The MWCAS descriptor consists of an array of CAS entries and a status word. The CAS entry contains the target word address and the old and new values for the single word CAS. The status word can take three values: Undetermined, Failed, and Completed. When it is first initialized, the status is set to Undetermined to indicate that the first stage installation is still under progress, and the completion of the MWCAS is not guaranteed. After the first stage installation (i.e. after all CASes have completed successfully), this word is switched to Completed using a CAS. Note again that using CAS is mandatory since multiple threads may attempt to change the status word concurrently.

One subtle data race might happen during the first stage if care is not taken. When multiple threads are attempting to install the descriptor on a target word a1 with old value o1. If only single word CAS is used to check whether the value of a1 is currenrly o1, data corruption might occur, if another thread has already finished the current MWCAS1, initiated another MWCAS (MWCAS2) that changes the value of a1 back to o1. In this case, even though MWCAS1 has been finished by the aforementioned thread, other threads are not aware of this, and will keep posting the descriptor pointer of MWCAS1 on address a1, potentially making MWCAS1 non-atomic (if MWCAS2 reads the after-value of MWCAS1, then MWCAS1 is serialized both before MWCAS2 by letting MWCAS2 observing its value, and after MWCAS2 by overwriting a value written by MWCAS2). This is essentially the well-known ABA problem, which can be prevented on pointers by delaying memory reclaimation (i.e. delay reusing a memory address) until all threads that can possibly observe that pointer enters quiescent state. For numerical variables, however, this is inevitable as numerical variables can take any value at any moment. As a universal solution, the paper proposed using RDCSS to install the descriptor, rather than using single word CAS. The RDCSS checks two conditions: (1) Whether the current status of the MWCAS descriptor is still Undetermined; and (2) whether the current value at address ax matches the old value ox. By using RDCSS, if the MWCAS has been finished by another thread, the RDCSS would fail for all threads attempting to install the descriptor on the target word, since the status word has already been changed to Completed. Note that the same ABA problem will not occur for the second stage in which the descriptor pointer on each target word is restored to the old value or updated to the new value. This is because in the second stage, the CAS for switching values uses the descriptor pointer as old value, which is guaranteed to be unique during the current session (i.e. the reclamation of the descriptor is delayed).