Spinlock

Last updated

In software engineering, a spinlock is a lock that causes a thread trying to acquire it to simply wait in a loop ("spin") while repeatedly checking whether the lock is available. Since the thread remains active but is not performing a useful task, the use of such a lock is a kind of busy waiting. Once acquired, spinlocks will usually be held until they are explicitly released, although in some implementations they may be automatically released if the thread being waited on (the one that holds the lock) blocks or "goes to sleep".

Contents

Because they avoid overhead from operating system process rescheduling or context switching, spinlocks are efficient if threads are likely to be blocked for only short periods. For this reason, operating-system kernels often use spinlocks. However, spinlocks become wasteful if held for longer durations, as they may prevent other threads from running and require rescheduling. The longer a thread holds a lock, the greater the risk that the thread will be interrupted by the OS scheduler while holding the lock. If this happens, other threads will be left "spinning" (repeatedly trying to acquire the lock), while the thread holding the lock is not making progress towards releasing it. The result is an indefinite postponement until the thread holding the lock can finish and release it. This is especially true on a single-processor system, where each waiting thread of the same priority is likely to waste its quantum (allocated time where a thread can run) spinning until the thread that holds the lock is finally finished.

Implementing spinlocks correctly is challenging because programmers must take into account the possibility of simultaneous access to the lock, which could cause race conditions. Generally, such an implementation is possible only with special assembly language instructions, such as atomic (i.e. un-interruptible) test-and-set operations and cannot be easily implemented in programming languages not supporting truly atomic operations. [1] On architectures without such operations, or if high-level language implementation is required, a non-atomic locking algorithm may be used, e.g. Peterson's algorithm. However, such an implementation may require more memory than a spinlock, be slower to allow progress after unlocking, and may not be implementable in a high-level language if out-of-order execution is allowed.

Example implementation

The following example uses x86 assembly language to implement a spinlock. It will work on any Intel 80386 compatible processor.

; Intel syntaxlocked:; The lock variable. 1 = locked, 0 = unlocked.dd0spin_lock:moveax,1; Set the EAX register to 1.xchgeax,[locked]; Atomically swap the EAX register with;  the lock variable.; This will always store 1 to the lock, leaving;  the previous value in the EAX register.testeax,eax; Test EAX with itself. Among other things, this will;  set the processor's Zero Flag if EAX is 0.; If EAX is 0, then the lock was unlocked and;  we just locked it.; Otherwise, EAX is 1 and we didn't acquire the lock.jnzspin_lock; Jump back to the MOV instruction if the Zero Flag is;  not set; the lock was previously locked, and so; we need to spin until it becomes unlocked.ret; The lock has been acquired, return to the calling;  function.spin_unlock:xoreax,eax; Set the EAX register to 0.xchgeax,[locked]; Atomically swap the EAX register with;  the lock variable.ret; The lock has been released.

Significant optimizations

The simple implementation above works on all CPUs using the x86 architecture. However, a number of performance optimizations are possible:

On later implementations of the x86 architecture, spin_unlock can safely use an unlocked MOV instead of the slower locked XCHG. This is due to subtle memory ordering rules which support this, even though MOV is not a full memory barrier. However, some processors (some Cyrix processors, some revisions of the Intel Pentium Pro (due to bugs), and earlier Pentium and i486 SMP systems) will do the wrong thing and data protected by the lock could be corrupted. On most non-x86 architectures, explicit memory barrier or atomic instructions (as in the example) must be used. On some systems, such as IA-64, there are special "unlock" instructions which provide the needed memory ordering.

To reduce inter-CPU bus traffic, code trying to acquire a lock should loop reading without trying to write anything until it reads a changed value. Because of MESI caching protocols, this causes the cache line for the lock to become "Shared"; then there is remarkably no bus traffic while a CPU waits for the lock. This optimization is effective on all CPU architectures that have a cache per CPU, because MESI is so widespread. On Hyper-Threading CPUs, pausing with rep nop gives additional performance by hinting to the core that it can work on the other thread while the lock spins waiting. [2]

Transactional Synchronization Extensions and other hardware transactional memory instruction sets serve to replace locks in most cases. Although locks are still required as a fallback, they have the potential to greatly improve performance by having the processor handle entire blocks of atomic operations. This feature is built into some mutex implementations, for example in glibc. The Hardware Lock Elision (HLE) in x86 is a weakened but backwards-compatible version of TSE, and we can use it here for locking without losing any compatibility. In this particular case, the processor can choose to not lock until two threads actually conflict with each other. [3]

A simpler version of the test can use the cmpxchg instruction on x86, or the __sync_bool_compare_and_swap built into many Unix compilers.

With the optimizations applied, a sample would look like:

; In C: while (!__sync_bool_compare_and_swap(&locked, 0, 1)) while (locked) __builtin_ia32_pause();spin_lock:movecx,1; Set the ECX register to 1.retry:xoreax,eax; Zero out EAX, because cmpxchg compares against EAX.XACQUIRElockcmpxchg[locked],ecx; atomically decide: if locked is zero, write ECX to it.;  XACQUIRE hints to the processor that we are acquiring a lock.jeout; If we locked it (old value equal to EAX: 0), return.pause:moveax,[locked]; Read locked into EAX.testeax,eax; Perform the zero-test as before.jzretry; If it's zero, we can retry.repnop; Tell the CPU that we are waiting in a spinloop, so it can;  work on the other thread now. Also written as the "pause".jmppause; Keep check-pausing.out:ret; All done.spin_unlock:XRELEASEmov[locked],0; Assuming the memory ordering rules apply, release the ;  lock variable with a "lock release" hint.ret; The lock has been released.

On any multi-processor system that uses the MESI contention protocol, such a test-and-test-and-set lock (TTAS) performs much better than the simple test-and-set lock (TAS) approach. [4]

With large numbers of processors, adding a random exponential backoff delay before re-checking the lock performs even better than TTAS. [4] [5]

A few multi-core processors have a "power-conscious spin-lock" instruction that puts a processor to sleep, then wakes it up on the next cycle after the lock is freed. A spin-lock using such instructions is more efficient and uses less energy than spin locks with or without a back-off loop. [6]

Alternatives

The primary disadvantage of a spinlock is that, while waiting to acquire a lock, it wastes time that might be productively spent elsewhere. There are two ways to avoid this:

  1. Do not acquire the lock. In many situations it is possible to design data structures that do not require locking, e.g. by using per-thread or per-CPU data and disabling interrupts.
  2. Switch to a different thread while waiting. This typically involves attaching the current thread to a queue of threads waiting for the lock, followed by switching to another thread that is ready to do some useful work. This scheme also has the advantage that it guarantees that resource starvation does not occur as long as all threads eventually relinquish locks they acquire and scheduling decisions can be made about which thread should progress first. Spinlocks that never entail switching, usable by real-time operating systems, are sometimes called raw spinlocks. [7]

Most operating systems (including Solaris, Mac OS X and FreeBSD) use a hybrid approach called "adaptive mutex". The idea is to use a spinlock when trying to access a resource locked by a currently-running thread, but to sleep if the thread is not currently running. (The latter is always the case on single-processor systems.) [8]

OpenBSD attempted to replace spinlocks with ticket locks which enforced first-in-first-out behaviour, however this resulted in more CPU usage in the kernel and larger applications, such as Firefox, becoming much slower. [9] [10]

See also

Related Research Articles

<span class="mw-page-title-main">Accumulator (computing)</span> Register in which intermediate arithmetic and logic results of a CPU are stored

In a computer's central processing unit (CPU), the accumulator is a register in which intermediate arithmetic logic unit results are stored.

A low-level programming language is a programming language that provides little or no abstraction from a computer's instruction set architecture; commands or functions in the language are structurally similar to a processor's instructions. Generally, this refers to either machine code or assembly language. Because of the low abstraction between the language and machine language, low-level languages are sometimes described as being "close to the hardware". Programs written in low-level languages tend to be relatively non-portable, due to being optimized for a certain type of system architecture.

x86 assembly language is the name for the family of assembly languages which provide some level of backward compatibility with CPUs back to the Intel 8008 microprocessor, which was launched in April 1972. It is used to produce object code for the x86 class of processors.

In computer science, a lock or mutex is a synchronization primitive that prevents state from being modified or accessed by multiple threads of execution at once. Locks enforce mutual exclusion concurrency control policies, and with a variety of possible methods there exist multiple unique implementations for different applications.

In computer science, read-copy-update (RCU) is a synchronization mechanism that avoids the use of lock primitives while multiple threads concurrently read and update elements that are linked through pointers and that belong to shared data structures.

Peterson's algorithm is a concurrent programming algorithm for mutual exclusion that allows two or more processes to share a single-use resource without conflict, using only shared memory for communication. It was formulated by Gary L. Peterson in 1981. While Peterson's original formulation worked with only two processes, the algorithm can be generalized for more than two.

In computer science, the test-and-set instruction is an instruction used to write (set) 1 to a memory location and return its old value as a single atomic operation. The caller can then "test" the result to see if the state was changed by the call. If multiple processes may access the same memory location, and if a process is currently performing a test-and-set, no other process may begin another test-and-set until the first process's test-and-set is finished. A central processing unit (CPU) may use a test-and-set instruction offered by another electronic component, such as dual-port RAM; a CPU itself may also offer a test-and-set instruction.

The x86 instruction set refers to the set of instructions that x86-compatible microprocessors support. The instructions are usually part of an executable program, often stored as a computer file and executed on the processor.

In computer science, compare-and-swap (CAS) is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple boolean response, or by returning the value read from the memory location.

In computer science and software engineering, busy-waiting, busy-looping or spinning is a technique in which a process repeatedly checks to see if a condition is true, such as whether keyboard input or a lock is available. Spinning can also be used to generate an arbitrary time delay, a technique that was necessary on systems that lacked a method of waiting a specific length of time. Processor speeds vary greatly from computer to computer, especially as some processors are designed to dynamically adjust speed based on current workload. Consequently, spinning as a time-delay technique can produce inconsistent or even unpredictable results on different systems unless code is included to determine the time a processor takes to execute a "do nothing" loop, or the looping code explicitly checks a real-time clock.

<span class="mw-page-title-main">Linearizability</span> Property of some operation(s) in concurrent programming

In concurrent programming, an operation is linearizable if it consists of an ordered list of invocation and response events, that may be extended by adding response events such that:

  1. The extended list can be re-expressed as a sequential history.
  2. That sequential history is a subset of the original unextended list.
<span class="mw-page-title-main">Pentium F00F bug</span> Pentium processor bug

The Pentium F00F bug is a design flaw in the majority of Intel Pentium, Pentium MMX, and Pentium OverDrive processors. Discovered in 1997, it can result in the processor ceasing to function until the computer is physically rebooted. The bug has been circumvented through operating system updates.

In computer science, a ticket lock is a synchronization mechanism, or locking algorithm, that is a type of spinlock that uses "tickets" to control which thread of execution is allowed to enter a critical section.

In computer science, the fetch-and-add (FAA) CPU instruction atomically increments the contents of a memory location by a specified value.

In computer architecture, the test-and-set CPU instruction is designed to implement mutual exclusion in multiprocessor environments. Although a correct lock can be implemented with test-and-set, the test and test-and-set optimization lowers resource contention caused by bus locking, especially cache coherency protocol overhead on contended locks.

In the x86 architecture, the CPUID instruction is a processor supplementary instruction allowing software to discover details of the processor. It was introduced by Intel in 1993 with the launch of the Pentium and SL-enhanced 486 processors.

In computer science, synchronization is the task of coordinating multiple processes to join up or handshake at a certain point, in order to reach an agreement or commit to a certain sequence of action.

This article describes the calling conventions used when programming x86 architecture microprocessors.

<span class="mw-page-title-main">Cosmos (operating system)</span> Toolkit for building GUI and command-line based operating systems

C# Open Source Managed Operating System (Cosmos) is a toolkit for building GUI and command-line based operating systems, written mostly in the programming language C# and small amounts of a high-level assembly language named X#. Cosmos is a backronym, in that the acronym was chosen before the meaning. It is open-source software released under a BSD license.

Transactional Synchronization Extensions (TSX), also called Transactional Synchronization Extensions New Instructions (TSX-NI), is an extension to the x86 instruction set architecture (ISA) that adds hardware transactional memory support, speeding up execution of multi-threaded software through lock elision. According to different benchmarks, TSX/TSX-NI can provide around 40% faster applications execution in specific workloads, and 4–5 times more database transactions per second (TPS).

References

  1. Silberschatz, Abraham; Galvin, Peter B. (1994). Operating System Concepts (Fourth ed.). Addison-Wesley. pp. 176–179. ISBN   0-201-59292-4.
  2. "gcc - x86 spinlock using cmpxchg". Stack Overflow.
  3. "New Technologies in the Arm Architecture" (PDF). Archived (PDF) from the original on 2019-04-02. Retrieved 2019-09-26.
  4. 1 2 Maurice Herlihy and Nir Shavit. "The Art of Multiprocessor Programming". "Spin Locks and Contention".
  5. "Boost.Fiber Tuning: Exponential back-off".
  6. John Goodacre and Andrew N. Sloss. "Parallelism and the ARM Instruction Set Architecture". p. 47.
  7. Jonathan Corbet (9 December 2009). "Spinlock naming resolved". LWN.net. Archived from the original on 7 May 2013. Retrieved 14 May 2013.
  8. Silberschatz, Abraham; Galvin, Peter B. (1994). Operating System Concepts (Fourth ed.). Addison-Wesley. p. 198. ISBN   0-201-59292-4.
  9. Ted Unangst (2013-06-01). "src/lib/librthread/rthread.c - Revision 1.71". Archived from the original on 2021-02-27. Retrieved 2022-01-25.
  10. Ted Unangst (2016-05-06). "tedu comment on Locking in WebKit - Lobsters".