Fetch-and-add

In computer science, the fetch-and-add (FAA) CPU instruction atomically increments the contents of a memory location by a specified value.

That is, fetch-and-add performs the operation

increment the value at address x by a, where x is a memory location and a is some value, and return the original value at x

in such a way that if this operation is executed by one process in a concurrent system, no other process will ever see an intermediate result.

Fetch-and-add can be used to implement concurrency control structures such as mutex locks and semaphores.

Overview

The motivation for having an atomic fetch-and-add is that operations that appear in programming languages as

x = x + a

are not safe in a concurrent system, where multiple processes or threads are running concurrently (either in a multi-processor system, or preemptively scheduled onto some single-core systems). The reason is that such an operation is actually implemented as multiple machine instructions:

  1. load x into a register;
  2. add a to register;
  3. store register value back into x.

When one process is doing x = x + a and another is doing x = x + b concurrently, there is a data race. They might both fetch x_old and operate on it, then both store their results, with the effect that one overwrites the other and the stored value becomes either x_old + a or x_old + b, not x_old + a + b as might be expected.
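As a concrete illustration, the following C program (a minimal sketch assuming POSIX threads and C11 atomics; compile with -pthread, and the variable names are illustrative) typically loses updates on the plain counter, while the fetch-and-add counter always ends at exactly 200000:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int plain_counter = 0;        /* incremented with load/add/store */
    static atomic_int faa_counter = 0;   /* incremented with fetch-and-add  */

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            plain_counter = plain_counter + 1;  /* racy: three machine steps */
            atomic_fetch_add(&faa_counter, 1);  /* one indivisible step      */
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* faa_counter is always 200000; plain_counter is typically smaller */
        printf("plain: %d, fetch-and-add: %d\n", plain_counter, faa_counter);
        return 0;
    }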

In uniprocessor systems that do not support kernel preemption, it is sufficient to disable interrupts before accessing a critical section. However, in multiprocessor systems (even with interrupts disabled) two or more processors could be attempting to access the same memory at the same time. The fetch-and-add instruction allows any processor to atomically increment a value in memory, preventing such multiple processor collisions.

Maurice Herlihy (1991) proved that fetch-and-add has a finite consensus number, in contrast to the compare-and-swap operation, whose consensus number is infinite. The fetch-and-add operation can solve the wait-free consensus problem for no more than two concurrent processes. [1]
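One standard construction illustrates the two-process bound (this is a sketch of the idea, not the cited paper's notation): each process first announces its input, then performs a single fetch-and-add; whoever fetched 0 wins, and the other adopts the winner's announced input.

    #include <stdatomic.h>

    static int proposed[2];          /* inputs announced by processes 0 and 1 */
    static atomic_int arrivals = 0;  /* how many processes have moved         */

    /* Called once by each process (id is 0 or 1) with its input v.
       Both calls return the same decided value, and neither ever waits. */
    int decide(int id, int v) {
        proposed[id] = v;                            /* announce input        */
        int order = atomic_fetch_add(&arrivals, 1);  /* atomic tie-break      */
        return order == 0 ? v : proposed[1 - id];    /* loser adopts winner's */
    }

With three processes the construction breaks down: a process that fetches 1 cannot tell which of the other two arrived first, so it does not know whose value to adopt.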

Implementation

The fetch-and-add instruction behaves like the following function. Crucially, the entire function is executed atomically: no process can interrupt the function mid-execution and hence see a state that only exists during the execution of the function. This code serves only to help explain the behaviour of fetch-and-add; atomicity requires explicit hardware support and hence cannot be implemented as a simple high-level function.

<< atomic >> function FetchAndAdd(address location, int inc) {
    int value := *location
    *location := value + inc
    return value
}
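The same primitive is exposed directly by C11's <stdatomic.h>: atomic_fetch_add performs the read-modify-write indivisibly and, like the pseudocode above, returns the old value. A minimal sketch (the wrapper name is illustrative):

    #include <stdatomic.h>

    /* Equivalent behaviour in C11: returns the previous contents of
       *location and adds inc to it as one indivisible operation. */
    int fetch_and_add_c11(atomic_int *location, int inc) {
        return atomic_fetch_add(location, inc);
    }

For example, if an atomic_int holds 10, then atomic_fetch_add(&x, 5) returns 10 and leaves 15 in x.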

To implement a mutual exclusion lock, we define the operation FetchAndIncrement, which is equivalent to FetchAndAdd with inc=1. With this operation, a mutual exclusion lock can be implemented using the ticket lock algorithm as:

record locktype {
    int ticketnumber
    int turn
}

procedure LockInit(locktype* lock) {
    lock.ticketnumber := 0
    lock.turn := 0
}

procedure Lock(locktype* lock) {
    int myturn := FetchAndIncrement(&lock.ticketnumber) // must be atomic, since many threads might ask for a lock at the same time
    while lock.turn ≠ myturn
        skip // spin until lock is acquired
}

procedure UnLock(locktype* lock) {
    FetchAndIncrement(&lock.turn) // this need not be atomic, since only the possessor of the lock will execute this
}

These routines provide a mutual-exclusion lock when the following conditions are met:

  1. the locktype data structure is initialized with LockInit before use;
  2. the number of tasks waiting for the lock does not exceed INT_MAX at any time;
  3. the integer datatype used in the lock values can "wrap around" when continuously incremented.
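Under those assumptions, a minimal C11 sketch of the same ticket lock (using <stdatomic.h>; the translation is illustrative, not part of the original pseudocode) could read:

    #include <stdatomic.h>

    typedef struct {
        atomic_int ticketnumber; /* next ticket to hand out       */
        atomic_int turn;         /* ticket currently being served */
    } locktype;

    void LockInit(locktype *lock) {
        atomic_init(&lock->ticketnumber, 0);
        atomic_init(&lock->turn, 0);
    }

    void Lock(locktype *lock) {
        /* must be atomic: many threads may ask for the lock at once */
        int myturn = atomic_fetch_add(&lock->ticketnumber, 1);
        while (atomic_load(&lock->turn) != myturn)
            ; /* spin until our ticket is called */
    }

    void UnLock(locktype *lock) {
        /* only the lock holder runs this, but C requires the write to be
           atomic anyway so the spinning readers see it without a data race */
        atomic_fetch_add(&lock->turn, 1);
    }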

Hardware and software support

An atomic fetch_add function appears in the C++11 standard. [2] It is available as a proprietary extension to C in the Itanium ABI specification, [3] and (with the same syntax) in GCC. [4]
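For instance, the GCC builtin can be called from plain C; a minimal sketch (shared_counter and take_ticket are illustrative names):

    /* __sync_fetch_and_add returns the value the location held before
       the addition, like x86 XADD. */
    static int shared_counter = 0;

    int take_ticket(void) {
        return __sync_fetch_and_add(&shared_counter, 1);
    }

In new code the C11 atomic_fetch_add (or GCC's __atomic builtins) is generally preferred, since it lets the caller state an explicit memory ordering.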

x86 implementation

In the x86 architecture, the ADD instruction with a memory destination operand performs a fetch-and-add; it has existed since the 8086 (though it was not called that at the time), and with the LOCK prefix it is atomic across multiple processors. However, it could not return the original value of the memory location (it only set some flags) until the 486 introduced the XADD instruction.

The following is a C implementation for the GCC compiler, for both 32- and 64-bit x86 Intel platforms, based on extended asm syntax:

static inline int fetch_and_add(int* variable, int value)
{
    __asm__ volatile("lock; xaddl %0, %1"
                     : "+r" (value), "+m" (*variable) // input + output
                     : // No input-only
                     : "memory");
    return value;
}
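For illustration, a caller sees the pre-increment contents of the location as the return value:

    int shared = 10;
    int old = fetch_and_add(&shared, 3); /* old == 10, shared is now 13 */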

History

Fetch-and-add was introduced by the Ultracomputer project, which also produced a multiprocessor supporting fetch-and-add and containing custom VLSI switches that were able to combine concurrent memory references (including fetch-and-adds) to prevent them from serializing at the memory module containing the destination operand.

See also

  Compare-and-swap
  Test-and-set
  Load-linked/store-conditional
  Read–modify–write
  Ticket lock
  Spinlock

References

  1. Herlihy, Maurice (January 1991). "Wait-free synchronization" (PDF). ACM Transactions on Programming Languages and Systems. 13 (1): 124–149. CiteSeerX 10.1.1.56.5659. doi:10.1145/114005.102808. S2CID 2181446. Retrieved 2007-05-20.
  2. "std::atomic::fetch_add". cppreference.com. Retrieved 1 June 2015.
  3. "Intel Itanium Processor-specific Application Binary Interface (ABI)" (PDF). Intel Corporation. 2001.
  4. "Atomic Builtins". Using the GNU Compiler Collection (GCC). Free Software Foundation. 2005.