Load-link/store-conditional

In computer science, load-linked/store-conditional [1] (LL/SC), sometimes known as load-reserved/store-conditional [2] (LR/SC), are a pair of instructions used in multithreading to achieve synchronization. Load-link returns the current value of a memory location, while a subsequent store-conditional to the same memory location will store a new value only if no updates have occurred to that location since the load-link. Together, this implements a lock-free, atomic, read-modify-write operation.
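
As an illustration, the following C sketch builds an atomic increment from an LL/SC retry loop. The load_linked and store_conditional functions are hypothetical intrinsics standing in for the hardware instructions; real code would use the platform's own mnemonics or compiler-provided atomics.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical intrinsics standing in for the hardware instructions;
       real code would use the platform's own mnemonics, e.g. lwarx/stwcx.
       on PowerPC or lr.w/sc.w on RISC-V. */
    uint32_t load_linked(volatile uint32_t *addr);
    bool store_conditional(volatile uint32_t *addr, uint32_t value);

    /* Atomic increment as an LL/SC retry loop: if any other agent writes
       the location between the load-link and the store-conditional, the
       store-conditional fails and the loop re-reads and tries again. */
    uint32_t atomic_increment(volatile uint32_t *counter)
    {
        uint32_t old;
        do {
            old = load_linked(counter);                  /* read and reserve */
        } while (!store_conditional(counter, old + 1));  /* retry on failure */
        return old + 1;
    }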

"Load-linked" is also known as load-link, [3] load-reserved, [2] and load-locked.[ citation needed ]

LL/SC was originally [4] proposed by Jensen, Hagensen, and Broughton for the S-1 AAP multiprocessor [1] at Lawrence Livermore National Laboratory.

Comparison of LL/SC and compare-and-swap

If any updates have occurred, the store-conditional is guaranteed to fail, even if the value read by the load-link has since been restored. As such, an LL/SC pair is stronger than a read followed by a compare-and-swap (CAS), which will not detect updates if the old value has been restored (see ABA problem).
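
A common way to see the difference is a lock-free stack pop written as a plain read followed by a CAS. The following C11 sketch (with an illustrative node layout) shows the ABA-prone pattern; with ideal LL/SC semantics, a store-conditional paired with a load-link of the head would fail instead, because the location was written in the interim.

    #include <stdatomic.h>
    #include <stddef.h>

    struct node { struct node *next; int value; };

    /* ABA-prone pop using read + CAS on the head pointer.  If another
       thread pops node A and node B and then pushes A back while this
       thread is preempted at (*), the head again equals `old`, the CAS
       succeeds, and the stack head is set to a stale `next` pointer. */
    struct node *pop(_Atomic(struct node *) *head)
    {
        struct node *old, *next;
        do {
            old = atomic_load(head);
            if (old == NULL)
                return NULL;
            next = old->next;          /* (*) preemption window */
        } while (!atomic_compare_exchange_weak(head, &old, next));
        return old;
    }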

Real implementations of LL/SC do not always succeed even if there are no concurrent updates to the memory location in question. Any exceptional events between the two operations, such as a context switch, another load-link, or even (on many platforms) another load or store operation, will cause the store-conditional to spuriously fail. Older implementations will fail if there are any updates broadcast over the memory bus. This is called weak LL/SC by researchers, as it breaks many theoretical LL/SC algorithms. [5] Weakness is relative, and some weak implementations can be used for some algorithms.

LL/SC is more difficult to emulate than CAS. Additionally, stopping running code between paired LL/SC instructions, such as when single-stepping through code, can prevent forward progress, making debugging tricky. [6]

Nevertheless, LL/SC is equivalent to CAS in the sense that either primitive can be implemented in terms of the other, in O(1) and in a wait-free manner. [7]
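
One direction of this equivalence is what LL/SC hardware does in practice: CAS can be written as a short LL/SC loop. A minimal sketch, reusing the hypothetical load_linked/store_conditional intrinsics (and headers) from the earlier example; on weak LL/SC the retry on spurious failure makes this lock-free rather than strictly wait-free.

    /* CAS emulated with an LL/SC pair (hypothetical intrinsics as above). */
    bool compare_and_swap(volatile uint32_t *addr,
                          uint32_t expected, uint32_t desired)
    {
        uint32_t current;
        do {
            current = load_linked(addr);
            if (current != expected)
                return false;                        /* values differ: fail */
        } while (!store_conditional(addr, desired)); /* retry on spurious failure */
        return true;
    }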

Implementations

LL/SC instructions are supported by several instruction set architectures, including Alpha (ldl_l/stl_c and ldq_l/stq_c), PowerPC (lwarx/stwcx. and ldarx/stdcx.), [8] [9] MIPS (ll/sc and lld/scd), [10] [11] ARM (ldrex/strex on ARMv6 and later, [12] [13] ldxr/stxr on ARMv8-A [14]), and RISC-V (lr/sc). [2]

Some CPUs require that the location accessed exclusively via LL/SC be configured as write-through memory.

Typically, CPUs track the load-linked address at a cache-line or other granularity, such that any modification to any portion of the cache line (whether via another core's store-conditional or merely by an ordinary store) is sufficient to cause the store-conditional to fail.
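
One practical consequence is that unrelated data sharing a reservation granule can cause spurious store-conditional failures. A hedged C11 layout sketch, assuming a 64-byte cache line:

    #include <stdint.h>
    #include <stdalign.h>

    /* Two counters updated by different threads with LL/SC loops.  On an
       implementation that tracks the reservation per cache line, an
       ordinary store to `b` can spuriously fail a store-conditional on
       `a` if both fields share a line.  Aligning each field to its own
       line (64 bytes is an assumed size) avoids that interference. */
    struct counters {
        alignas(64) uint32_t a;   /* own cache line */
        alignas(64) uint32_t b;   /* own cache line */
    };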

All of these platforms provide weak LL/SC. The PowerPC implementation allows an LL/SC pair to wrap loads and even stores to other cache lines (although this approach is vulnerable to false cache line sharing). This allows it to implement, for example, lock-free reference counting in the face of changing object graphs with arbitrary counter reuse (which otherwise requires double compare-and-swap, DCAS). RISC-V provides an architectural guarantee of eventual progress for LL/SC sequences of limited length.

Some ARM implementations define platform-dependent reservation blocks, ranging from 8 bytes to 2048 bytes, and an LL/SC attempt in any given block fails if there is a normal memory access inside the same block between the LL and the SC. Other ARM implementations fail if there is a modification anywhere in the whole address space. The former implementation is the stronger and more practical of the two.

LL/SC has two advantages over CAS when designing a load–store architecture: reads and writes are separate instructions, as required by the design philosophy (and pipeline architecture); and both instructions can be performed using only two registers (address and value), fitting naturally into common 2-operand ISAs. CAS, on the other hand, requires three registers (address, old value, new value) and a dependency between the value read and the value written. x86, being a CISC architecture, does not have this constraint, though modern chips may well translate a CAS instruction into separate LL/SC micro-operations internally.
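
On RISC-V, for instance, the usual CAS-from-LR/SC sequence makes the register usage visible: each of lr.w and sc.w names only an address and a value register, while the routine as a whole takes three inputs. A sketch using GCC extended inline assembly for RV32A/RV64A (illustrative only; memory-ordering suffixes are omitted):

    #include <stdint.h>
    #include <stdbool.h>

    static inline bool cas32(volatile uint32_t *addr,
                             uint32_t expected, uint32_t desired)
    {
        uint32_t observed, fail;
        __asm__ volatile(
            "1: lr.w   %0, (%2)\n"       /* load-reserved: address + destination */
            "   bne    %0, %3, 2f\n"     /* value changed: give up */
            "   sc.w   %1, %4, (%2)\n"   /* store-conditional: address + value */
            "   bnez   %1, 1b\n"         /* reservation lost: retry */
            "2:"
            : "=&r"(observed), "=&r"(fail)
            : "r"(addr), "r"(expected), "r"(desired)
            : "memory");
        return observed == expected;
    }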

Extensions

Hardware LL/SC implementations typically do not allow nesting of LL/SC pairs. [15] A nested LL/SC mechanism can be used to provide an MCAS primitive (multi-word CAS, where the words can be scattered). [16] In 2013, Trevor Brown, Faith Ellen, and Eric Ruppert implemented in software a multi-address LL/SC extension (which they call LLX/SCX) that relies on automated code generation; [17] they used it to implement one of the best-performing concurrent binary search trees (actually a chromatic tree), slightly beating the JDK CAS-based skip list implementation. [18]

Related Research Articles

DEC Alpha

Alpha is a 64-bit reduced instruction set computer (RISC) instruction set architecture (ISA) developed by Digital Equipment Corporation (DEC). Alpha was designed to replace 32-bit VAX complex instruction set computers (CISC) and to be a highly competitive RISC processor for Unix workstations and similar markets.

MIPS is a family of reduced instruction set computer (RISC) instruction set architectures (ISA) developed by MIPS Computer Systems, now MIPS Technologies, based in the United States.

Reduced instruction set computer

In computer science, a reduced instruction set computer (RISC) is a computer architecture designed to simplify the individual instructions given to the computer to accomplish tasks. Compared to the instructions given to a complex instruction set computer (CISC), a RISC computer might require more instructions in order to accomplish a task because the individual instructions are written in simpler code. The goal is to offset the need to process more instructions by increasing the speed of each instruction, in particular by implementing an instruction pipeline, which may be simpler to achieve given simpler instructions.

In computer science, an instruction set architecture (ISA) is an abstract model of a computer. A device that executes instructions described by that ISA, such as a central processing unit (CPU), is called an implementation.

In computer science, a consistency model specifies a contract between the programmer and a system, wherein the system guarantees that if the programmer follows the rules for operations on memory, memory will be consistent and the results of reading, writing, or updating memory will be predictable. Consistency models are used in distributed systems like distributed shared memory systems or distributed data stores. Consistency is distinct from coherence: coherence deals with maintaining a global order in which writes to a single location or single variable are seen by all processors, while consistency deals with the ordering of operations to multiple locations with respect to all processors.

In computer science, an algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread; for some operations, these algorithms provide a useful alternative to traditional blocking implementations. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress. "Non-blocking" was used as a synonym for "lock-free" in the literature until the introduction of obstruction-freedom in 2003.

In computer science, compare-and-swap (CAS) is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple boolean response, or by returning the value read from the memory location.

In computing, a memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.

Linearizability

In concurrent programming, an operation is linearizable if it consists of an ordered list of invocation and response events that may be extended by adding response events such that:

  1. The extended list can be re-expressed as a sequential history.
  2. That sequential history is a subset of the original unextended list.

The AT&T Hobbit is a microprocessor design that AT&T Corporation developed in the early 1990s. It was based on the company's CRISP design, which resembled the classic RISC pipeline and in turn grew out of the C Machine design by Bell Labs of the late 1980s. All were optimized for running code compiled from the C programming language. The design concentrates on fast instruction decoding, indexed array access, and procedure calls.

Berkeley RISC is one of two seminal research projects into reduced instruction set computer (RISC)-based microprocessor design that took place under the Defense Advanced Research Projects Agency VLSI Project. The Berkeley project was led by David Patterson at the University of California, Berkeley between 1980 and 1984. The other project took place a short distance away at Stanford University under their MIPS effort, starting in 1981 and running until 1984.

Microarchitecture

In electronics, computer science and computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as µarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

Double compare-and-swap is an atomic primitive proposed to support certain concurrent programming techniques. DCAS takes two not necessarily contiguous memory locations and writes new values into them only if they match pre-supplied "expected" values; as such, it is an extension of the much more popular compare-and-swap (CAS) operation.

In computer science and engineering, transactional memory attempts to simplify concurrent programming by allowing a group of load and store instructions to execute in an atomic way. It is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing. Transactional memory systems provide high-level abstraction as an alternative to low-level thread synchronization. This abstraction allows for coordination between concurrent reads and writes of shared data in parallel systems.

In computer science, read–modify–write is a class of atomic operations that both read a memory location and write a new value into it simultaneously, either with a completely new value or some function of the previous value. These operations prevent race conditions in multi-threaded applications. Typically they are used to implement mutexes or semaphores. These atomic operations are also heavily used in non-blocking synchronization.

In multithreaded computing, the ABA problem occurs during synchronization, when a location is read twice, has the same value for both reads, and the read value being the same twice is used to conclude that nothing has happened in the interim. However, another thread can execute between the two reads and change the value, do other work, then change the value back, thus fooling the first thread into thinking nothing has changed even though the second thread did work that violates that assumption.

Transactional Synchronization Extensions (TSX), also called Transactional Synchronization Extensions New Instructions (TSX-NI), is an extension to the x86 instruction set architecture (ISA) that adds hardware transactional memory support, speeding up execution of multi-threaded software through lock elision. According to different benchmarks, TSX/TSX-NI can provide around 40% faster applications execution in specific workloads, and 4–5 times more database transactions per second (TPS).

RISC-V is an open standard instruction set architecture (ISA) based on established reduced instruction set computer (RISC) principles. Unlike most other ISA designs, RISC-V is provided under royalty-free open-source licenses. Many companies are offering or have announced RISC-V hardware; open source operating systems with RISC-V support are available, and the instruction set is supported in several popular software toolchains.

A non-blocking linked list is an example of a non-blocking data structure: a linked list implemented in shared memory using synchronization primitives such as compare-and-swap.

Processor Consistency is one of the consistency models used in the domain of concurrent computing.

References

  1. 1 2 "S-1 project". Stanford Computer Science wiki. 2018-11-30.
  2. Andrew Waterman; Krste Asanović, eds. (2017-05-07). "7.2 Load-Reserved/Store-Conditional Instructions". The RISC-V Instruction Set Manual, Volume 1: User-Level ISA, Version 2.2 (PDF).
  3. US20030217115A1, Rowlands, Joseph, "Load-linked/store conditional mechanism in a CC-NUMA system", issued 2003-11-20.
  4. Herlihy, Maurice (1993-11-01). "A methodology for implementing highly concurrent data objects". ACM Transactions on Programming Languages and Systems. 15 (5): 745–770. doi:10.1145/161468.161469. ISSN   0164-0925.
  5. Beckmann, Nathan. "Synchronization" (PDF). 15-740: Computer Architecture, Fall 2018. Carnegie Mellon University. Retrieved 23 April 2021.
  6. Keno Fischer (2020-05-02). "Julia 1.5 Feature Preview: Time Traveling (Linux) Bug Reporting" . Retrieved 2020-05-14.
  7. James H. Anderson; Mark Moir (1995). "Universal constructions for multi-object operations". PODC '95 Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing. ACM. pp. 184–193. doi:10.1145/224964.224985. ISBN   0-89791-710-3. S2CID   8204331. See their Table 1, Figures 1 & 2 and Section 2 in particular.
  8. May, Cathy; Silha, Ed; Simpson, Rick; Warren, Hank (1993). The PowerPC Architecture: A Specification for a New Family of RISC Processors. Morgan Kaufmann Publishers, Inc. pp. 336–338, 465. ISBN 1-55860-316-6.
  9. Kacmarcik, Cary (1995). Optimizing PowerPC Code. Addison-Wesley Publishing Company. pp. 71–72. ISBN   0-201-40839-2.
  10. "APPLICATION NOTE MIPS R4000 Synchronization Primitives" (PDF). p. 9. Retrieved 2023-12-27.
  11. "APPLICATION NOTE MIPS R4000 Synchronization Primitives" (PDF). p. 5. Retrieved 2023-12-27.
  12. "ARM11 MPCore™ Processor Revision: r2p0 Technical Reference Manual". p. 301-302(8-7,8-8). Retrieved 2023-12-14.
  13. "Arm®v8-M Architecture Reference Manual". p. 278. Retrieved 2023-12-14.
  14. "ARMv8-A Synchronization primitives". p. 6. Retrieved 2023-12-14.
  15. James R. Larus; Ravi Rajwar (2007). Transactional Memory. Morgan & Claypool. p. 55. ISBN   978-1-59829-124-7.
  16. Keir Fraser (February 2004). Practical lock-freedom (PDF) (Technical report). University of Cambridge Computer Laboratory. p. 20. UCAM-CL-TR-579.
  17. Brown, Trevor; Ellen, Faith; Ruppert, Eric (2013). "Pragmatic primitives for non-blocking data structures" (PDF). PODC '13 Proceedings of the 2013 ACM symposium on Principles of distributed computing. ACM. pp. 13–22. arXiv: 1712.06688 . doi:10.1145/2484239.2484273. ISBN   978-1-4503-2065-8. S2CID   6537417. See also slides
  18. Trevor Brown; Faith Ellen; Eric Ruppert (2014). "A general technique for non-blocking trees" (PDF). PPoPP '14 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM. pp. 329–342. arXiv: 1712.06687 . doi:10.1145/2555243.2555267. ISBN   978-1-4503-2656-8. S2CID   9442380.