Double compare-and-swap

Last updated May 26, 2025

Double compare-and-swap (DCAS or CAS2) is an atomic primitive proposed to support certain concurrent programming techniques. DCAS takes two not necessarily contiguous memory locations and writes new values into them only if they match pre-supplied "expected" values; as such, it is an extension of the much more popular compare-and-swap (CAS) operation.

DCAS is sometimes confused with the double-width compare-and-swap (DWCAS) implemented by instructions such as x86 CMPXCHG16B. DCAS, as discussed here, handles two discontiguous memory locations, typically of pointer size, whereas DWCAS handles two adjacent pointer-sized memory locations.

In his doctoral thesis, Michael Greenwald recommended adding DCAS to modern hardware, showing it could be used to create easy-to-apply yet efficient software transactional memory (STM). Greenwald points out that an advantage of DCAS vs CAS is that higher-order (multiple item) CASn can be implemented in O(n) with DCAS,^[1] but algorithms for DCAS that use only unary single-word atomic operations are sensitive to the number of contending processes.^[2]

One of the advantages of DCAS is the ability to implement atomic deques (i.e. doubly linked lists) with relative ease.^[3] More recently, however, it has been shown that an STM can be implemented with comparable properties^{[ clarification needed ]} using only CAS.^[4] An lock-free deque using hazard pointers and requiring only DWCAS rather than full DCAS was proposed by Maged Michael in 2003.^[5] In general however, DCAS is not a silver bullet: implementing lock-free and wait-free algorithms using it can be just as complex and error-prone as for CAS.^[6]

Motorola at one point included DCAS in the instruction set for its 68k series;^[7] however, the slowness of DCAS relative to other primitives (apparently due to cache handling issues) led to its avoidance in practical contexts.^[8]As of 2025^[update], DCAS is not natively supported by any widespread CPUs in production.

The generalization of DCAS to more than two addresses is sometimes called MCAS (multi-word CAS); MCAS can be implemented by a nestable LL/SC, but such a primitive is not directly available in hardware.^[4] MCAS can be implemented in software in terms of DCAS, in various ways.^[9] In 2013, Trevor Brown, Faith Ellen, and Eric Ruppert have implemented in software a multi-address LL/SC extension (which they call LLX/SCX) that while being more restrictive than MCAS^[10] enabled them, via some automated code generation, to implement one of the best performing concurrent binary search tree (actually a chromatic tree), slightly beating the JDK CAS-based skip list implementation.^[11]

In general, DCAS can be provided by a more expressive hardware transactional memory.^[12] IBM POWER8 and Intel Intel TSX provide working implementations of transactional memory. Sun's cancelled Rock processor would have supported it as well.

References

↑ M. Greenwald. "Non-Blocking Synchronization and System Design". Stanford University Technical Report STAN-CS-TR-99-1624 . (p. 10 in particular)
↑ Attiya, Hagit; Dagan, Eyal (September 2001). "Improved implementations of binary universal operations" . Journal of the ACM. 48 (5). ACM: 1013–1037. doi:10.1145/502102.502105.
↑ Ole Agesen, David L. Detlefs, Christine H. Flood, Alexander T. Garthwaite, Paul A. Martin, Mark Moir, Nir N. Shavit, and Guy L. Steele Jr. "DCAS-Based Concurrent Deques." Theory of Computing Systems 35, no. 3 (2002): 349-386.
1 2 Keir Fraser (2004), "Practical lock-freedom" UCAM-CL-TR-579.pdf
↑ Maged M. Michael. Cas-based lock-free algorithm for shared deques. In Harald Kosch, László Böszörményi, and Hermann Hellwagner, editors, Euro-Par, volume 2790 of Lecture Notes in Computer Science, pages 651–660.Springer, 2003.
↑ Simon Doherty et al., "DCAS is not a silver bullet for nonblocking algorithm design". 16th annual ACM symposium on Parallelism in algorithms and architectures, 2004, pp. 216–224 .
↑ CAS2
↑ Greenwald, Michael, and David Cheriton. "The synergy between non-blocking synchronization and operating system structure." OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation (1996): 123-136. (particularly section 7.1 "Experimental Implementation")
↑ Harris, Timothy L.; Fraser, Keir; Pratt, Ian A. (2002). A Practical Multi-Word Compare-And-Swap Operation. Proc. Int'l Symp. Distributed Computing. CiteSeerX 10.1.1.13.7938 .
↑ Trevor Brown, Faith Ellen, and Eric Ruppert. "Pragmatic primitives for non-blocking data structures." In Proceedings of the 2013 ACM symposium on Principles of distributed computing, pp. 13-22. ACM, 2013.
↑ Trevor Brown, Faith Ellen, and Eric Ruppert. "A general technique for non-blocking trees." In Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 329-342. ACM, 2014.
↑ Dave Dice, Yossi Lev, Mark Moir, Dan Nussbaum, and Marek Olszewski. (2009) "Early experience with a commercial hardware transactional memory implementation." Sun Microsystems technical report (60 pp.) SMLI TR-2009-180. A short version appeared at ASPLOS’09 doi : 10.1145/1508244.1508263. The full-length report discusses how to implement DCAS using HTM in section 5.

External links

US Patent 4584640 Method and apparatus for a compare and swap instruction

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] M. Greenwald. "Non-Blocking Synchronization and System Design". Stanford University Technical Report STAN-CS-TR-99-1624 . (p. 10 in particular)

[2] Attiya, Hagit; Dagan, Eyal (September 2001). "Improved implementations of binary universal operations" . Journal of the ACM. 48 (5). ACM: 1013–1037. doi:10.1145/502102.502105.

[3] Ole Agesen, David L. Detlefs, Christine H. Flood, Alexander T. Garthwaite, Paul A. Martin, Mark Moir, Nir N. Shavit, and Guy L. Steele Jr. "DCAS-Based Concurrent Deques." Theory of Computing Systems 35, no. 3 (2002): 349-386.

[Fraser2004-4] 1 2 Keir Fraser (2004), "Practical lock-freedom" UCAM-CL-TR-579.pdf

[Maged2006-5] Maged M. Michael. Cas-based lock-free algorithm for shared deques. In Harald Kosch, László Böszörményi, and Hermann Hellwagner, editors, Euro-Par, volume 2790 of Lecture Notes in Computer Science, pages 651–660.Springer, 2003.

[6] Simon Doherty et al., "DCAS is not a silver bullet for nonblocking algorithm design". 16th annual ACM symposium on Parallelism in algorithms and architectures, 2004, pp. 216–224 .

[7] CAS2

[8] Greenwald, Michael, and David Cheriton. "The synergy between non-blocking synchronization and operating system structure." OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation (1996): 123-136. (particularly section 7.1 "Experimental Implementation")

[9] Harris, Timothy L.; Fraser, Keir; Pratt, Ian A. (2002). A Practical Multi-Word Compare-And-Swap Operation. Proc. Int'l Symp. Distributed Computing. CiteSeerX 10.1.1.13.7938 .

[10] Trevor Brown, Faith Ellen, and Eric Ruppert. "Pragmatic primitives for non-blocking data structures." In Proceedings of the 2013 ACM symposium on Principles of distributed computing, pp. 13-22. ACM, 2013.

[11] Trevor Brown, Faith Ellen, and Eric Ruppert. "A general technique for non-blocking trees." In Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 329-342. ACM, 2014.

[12] Dave Dice, Yossi Lev, Mark Moir, Dan Nussbaum, and Marek Olszewski. (2009) "Early experience with a commercial hardware transactional memory implementation." Sun Microsystems technical report (60 pp.) SMLI TR-2009-180. A short version appeared at ASPLOS’09 doi : 10.1145/1508244.1508263. The full-length report discusses how to implement DCAS using HTM in section 5.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]