Non-blocking algorithm

Last updated

In computer science, an algorithm is called non-blocking if failure or suspension of any thread cannot cause failure or suspension of another thread; [1] for some operations, these algorithms provide a useful alternative to traditional blocking implementations. A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress. "Non-blocking" was used as a synonym for "lock-free" in the literature until the introduction of obstruction-freedom in 2003. [2]

Contents

The word "non-blocking" was traditionally used to describe telecommunications networks that could route a connection through a set of relays "without having to re-arrange existing calls" (see Clos network). Also, if the telephone exchange "is not defective, it can always make the connection" (see nonblocking minimal spanning switch).

Motivation

The traditional approach to multi-threaded programming is to use locks to synchronize access to shared resources. Synchronization primitives such as mutexes, semaphores, and critical sections are all mechanisms by which a programmer can ensure that certain sections of code do not execute concurrently, if doing so would corrupt shared memory structures. If one thread attempts to acquire a lock that is already held by another thread, the thread will block until the lock is free.

Blocking a thread can be undesirable for many reasons. An obvious reason is that while the thread is blocked, it cannot accomplish anything: if the blocked thread had been performing a high-priority or real-time task, it would be highly undesirable to halt its progress.

Other problems are less obvious. For example, certain interactions between locks can lead to error conditions such as deadlock, livelock, and priority inversion. Using locks also involves a trade-off between coarse-grained locking, which can significantly reduce opportunities for parallelism, and fine-grained locking, which requires more careful design, increases locking overhead and is more prone to bugs.

Unlike blocking algorithms, non-blocking algorithms do not suffer from these downsides, and in addition are safe for use in interrupt handlers: even though the preempted thread cannot be resumed, progress is still possible without it. In contrast, global data structures protected by mutual exclusion cannot safely be accessed in an interrupt handler, as the preempted thread may be the one holding the lock—but this can be rectified easily by masking the interrupt request during the critical section. [3]

A lock-free data structure can be used to improve performance. A lock-free data structure increases the amount of time spent in parallel execution rather than serial execution, improving performance on a multi-core processor, because access to the shared data structure does not need to be serialized to stay coherent. [4]

Implementation

With few exceptions, non-blocking algorithms use atomic read-modify-write primitives that the hardware must provide, the most notable of which is compare and swap (CAS). Critical sections are almost always implemented using standard interfaces over these primitives (in the general case, critical sections will be blocking, even when implemented with these primitives). In the 1990s all non-blocking algorithms had to be written "natively" with the underlying primitives to achieve acceptable performance. However, the emerging field of software transactional memory promises standard abstractions for writing efficient non-blocking code. [5] [6]

Much research has also been done in providing basic data structures such as stacks, queues, sets, and hash tables. These allow programs to easily exchange data between threads asynchronously.

Additionally, some non-blocking data structures are weak enough to be implemented without special atomic primitives. These exceptions include:

Several libraries internally use lock-free techniques, [7] [8] [9] but it is difficult to write lock-free code that is correct. [10] [11] [12] [13]

Non-blocking algorithms generally involve a series of read, read-modify-write, and write instructions in a carefully designed order. Optimizing compilers can aggressively re-arrange operations. Even when they don't, many modern CPUs often re-arrange such operations (they have a "weak consistency model"), unless a memory barrier is used to tell the CPU not to reorder. C++11 programmers can use std::atomic in <atomic>, and C11 programmers can use <stdatomic.h>, both of which supply types and functions that tell the compiler not to re-arrange such instructions, and to insert the appropriate memory barriers. [14]

Wait-freedom

Wait-freedom is the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes. [15] This property is critical for real-time systems and is always nice to have as long as the performance cost is not too high.

It was shown in the 1980s [16] that all algorithms can be implemented wait-free, and many transformations from serial code, called universal constructions, have been demonstrated. However, the resulting performance does not in general match even naïve blocking designs. Several papers have since improved the performance of universal constructions, but still, their performance is far below blocking designs.

Several papers have investigated the difficulty of creating wait-free algorithms. For example, it has been shown [17] that the widely available atomic conditional primitives, CAS and LL/SC, cannot provide starvation-free implementations of many common data structures without memory costs growing linearly in the number of threads.

But in practice these lower bounds do not present a real barrier as spending a cache line or exclusive reservation granule (up to 2 KB on ARM) of store per thread in the shared memory is not considered too costly for practical systems (typically the amount of store logically required is a word, but physically CAS operations on the same cache line will collide, and LL/SC operations in the same exclusive reservation granule will collide, so the amount of store physically required[ citation needed ] is greater).

Wait-free algorithms were rare until 2011, both in research and in practice. However, in 2011 Kogan and Petrank [18] presented a wait-free queue building on the CAS primitive, generally available on common hardware. Their construction expanded the lock-free queue of Michael and Scott, [19] which is an efficient queue often used in practice. A follow-up paper by Kogan and Petrank [20] provided a method for making wait-free algorithms fast and used this method to make the wait-free queue practically as fast as its lock-free counterpart. A subsequent paper by Timnat and Petrank [21] provided an automatic mechanism for generating wait-free data structures from lock-free ones. Thus, wait-free implementations are now available for many data-structures.

Lock-freedom

Lock-freedom allows individual threads to starve but guarantees system-wide throughput. An algorithm is lock-free if, when the program threads are run for a sufficiently long time, at least one of the threads makes progress (for some sensible definition of progress). All wait-free algorithms are lock-free.

In particular, if one thread is suspended, then a lock-free algorithm guarantees that the remaining threads can still make progress. Hence, if two threads can contend for the same mutex lock or spinlock, then the algorithm is not lock-free. (If we suspend one thread that holds the lock, then the second thread will block.)

An algorithm is lock-free if infinitely often operation by some processors will succeed in a finite number of steps. For instance, if N processors are trying to execute an operation, some of the N processes will succeed in finishing the operation in a finite number of steps and others might fail and retry on failure. The difference between wait-free and lock-free is that wait-free operation by each process is guaranteed to succeed in a finite number of steps, regardless of the other processors.

In general, a lock-free algorithm can run in four phases: completing one's own operation, assisting an obstructing operation, aborting an obstructing operation, and waiting. Completing one's own operation is complicated by the possibility of concurrent assistance and abortion, but is invariably the fastest path to completion.

The decision about when to assist, abort or wait when an obstruction is met is the responsibility of a contention manager. This may be very simple (assist higher priority operations, abort lower priority ones), or may be more optimized to achieve better throughput, or lower the latency of prioritized operations.

Correct concurrent assistance is typically the most complex part of a lock-free algorithm, and often very costly to execute: not only does the assisting thread slow down, but thanks to the mechanics of shared memory, the thread being assisted will be slowed, too, if it is still running.

Obstruction-freedom

Obstruction-freedom is the weakest natural non-blocking progress guarantee. An algorithm is obstruction-free if at any point, a single thread executed in isolation (i.e., with all obstructing threads suspended) for a bounded number of steps will complete its operation. [15] All lock-free algorithms are obstruction-free.

Obstruction-freedom demands only that any partially completed operation can be aborted and the changes made rolled back. Dropping concurrent assistance can often result in much simpler algorithms that are easier to validate. Preventing the system from continually live-locking is the task of a contention manager.

Some obstruction-free algorithms use a pair of "consistency markers" in the data structure. Processes reading the data structure first read one consistency marker, then read the relevant data into an internal buffer, then read the other marker, and then compare the markers. The data is consistent if the two markers are identical. Markers may be non-identical when the read is interrupted by another process updating the data structure. In such a case, the process discards the data in the internal buffer and tries again.

See also

Related Research Articles

<span class="mw-page-title-main">Garbage collection (computer science)</span> Form of automatic memory management

In computer science, garbage collection (GC) is a form of automatic memory management. The garbage collector attempts to reclaim memory which was allocated by the program, but is no longer referenced; such memory is called garbage. Garbage collection was invented by American computer scientist John McCarthy around 1959 to simplify manual memory management in Lisp.

In computer science, reference counting is a programming technique of storing the number of references, pointers, or handles to a resource, such as an object, a block of memory, disk space, and others.

<span class="mw-page-title-main">Mutual exclusion</span> In computing, restricting data to be accessible by one thread at a time

In computer science, mutual exclusion is a property of concurrency control, which is instituted for the purpose of preventing race conditions. It is the requirement that one thread of execution never enters a critical section while a concurrent thread of execution is already accessing said critical section, which refers to an interval of time during which a thread of execution accesses a shared resource or shared memory.

In computer science, a lock or mutex is a synchronization primitive that prevents state from being modified or accessed by multiple threads of execution at once. Locks enforce mutual exclusion concurrency control policies, and with a variety of possible methods there exist multiple unique implementations for different applications.

In software engineering, a spinlock is a lock that causes a thread trying to acquire it to simply wait in a loop ("spin") while repeatedly checking whether the lock is available. Since the thread remains active but is not performing a useful task, the use of such a lock is a kind of busy waiting. Once acquired, spinlocks will usually be held until they are explicitly released, although in some implementations they may be automatically released if the thread being waited on blocks or "goes to sleep".

In computer science, read-copy-update (RCU) is a synchronization mechanism that avoids the use of lock primitives while multiple threads concurrently read and update elements that are linked through pointers and that belong to shared data structures.

In computer science, compare-and-swap (CAS) is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple boolean response, or by returning the value read from the memory location.

In computing, a memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.

<span class="mw-page-title-main">Linearizability</span> Property of some operation(s) in concurrent programming

In concurrent programming, an operation is linearizable if it consists of an ordered list of invocation and response events, that may be extended by adding response events such that:

  1. The extended list can be re-expressed as a sequential history.
  2. That sequential history is a subset of the original unextended list.

In concurrent programming, a monitor is a synchronization construct that prevents threads from concurrently accessing a shared object's state and allows them to wait for the state to change. They provide a mechanism for threads to temporarily give up exclusive access in order to wait for some condition to be met, before regaining exclusive access and resuming their task. A monitor consists of a mutex (lock) and at least one condition variable. A condition variable is explicitly 'signalled' when the object's state is modified, temporarily passing the mutex to another thread 'waiting' on the conditional variable.

In computer science, software transactional memory (STM) is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing. It is an alternative to lock-based synchronization. STM is a strategy implemented in software, rather than as a hardware component. A transaction in this context occurs when a piece of code executes a series of reads and writes to shared memory. These reads and writes logically occur at a single instant in time; intermediate states are not visible to other (successful) transactions. The idea of providing hardware support for transactions originated in a 1986 paper by Tom Knight. The idea was popularized by Maurice Herlihy and J. Eliot B. Moss. In 1995, Nir Shavit and Dan Touitou extended this idea to software-only transactional memory (STM). Since 2005, STM has been the focus of intense research and support for practical implementations is growing.

In computer science, a ticket lock is a synchronization mechanism, or locking algorithm, that is a type of spinlock that uses "tickets" to control which thread of execution is allowed to enter a critical section.

Double compare-and-swap is an atomic primitive proposed to support certain concurrent programming techniques. DCAS takes two not necessarily contiguous memory locations and writes new values into them only if they match pre-supplied "expected" values; as such, it is an extension of the much more popular compare-and-swap (CAS) operation.

In computer science, load-linked/store-conditional (LL/SC), sometimes known as load-reserved/store-conditional (LR/SC), are a pair of instructions used in multithreading to achieve synchronization. Load-link returns the current value of a memory location, while a subsequent store-conditional to the same memory location will store a new value only if no updates have occurred to that location since the load-link. Together, this implements a lock-free, atomic, read-modify-write operation.

In computer science, a readers–writer is a synchronization primitive that solves one of the readers–writers problems. An RW lock allows concurrent access for read-only operations, whereas write operations require exclusive access. This means that multiple threads can read the data in parallel but an exclusive lock is needed for writing or modifying data. When a writer is writing the data, all other writers and readers will be blocked until the writer is finished writing. A common use might be to control access to a data structure in memory that cannot be updated atomically and is invalid until the update is complete.

In computer science, synchronization is the task of coordinating multiple of processes to join up or handshake at a certain point, in order to reach an agreement or commit to a certain sequence of action.

In computer science, a mark–compact algorithm is a type of garbage collection algorithm used to reclaim unreachable memory. Mark–compact algorithms can be regarded as a combination of the mark–sweep algorithm and Cheney's copying algorithm. First, reachable objects are marked, then a compacting step relocates the reachable (marked) objects towards the beginning of the heap area. Compacting garbage collection is used by modern JVMs, Microsoft's Common Language Runtime and by the Glasgow Haskell Compiler.

oneAPI Threading Building Blocks, is a C++ template library developed by Intel for parallel programming on multi-core processors. Using TBB, a computation is broken down into tasks that can run in parallel. The library manages and schedules threads to execute these tasks.

In computer science, a concurrent data structure is a particular way of storing and organizing data for access by multiple computing threads on a computer.

A non-blocking linked list is an example of non-blocking data structures designed to implement a linked list in shared memory using synchronization primitives:

References

  1. Göetz, Brian; Peierls, Tim; Bloch, Joshua; Bowbeer, Joseph; Holmes, David; Lea, Doug (2006). Java concurrency in practice . Upper Saddle River, NJ: Addison-Wesley. p.  41. ISBN   9780321349606.
  2. Herlihy, M.; Luchangco, V.; Moir, M. (2003). Obstruction-Free Synchronization: Double-Ended Queues as an Example (PDF). 23rd International Conference on Distributed Computing Systems. p. 522.
  3. Butler W. Lampson; David D. Redell (February 1980). "Experience with Processes and Monitors in Mesa". Communications of the ACM. 23 (2): 105–117. CiteSeerX   10.1.1.142.5765 . doi:10.1145/358818.358824. S2CID   1594544.
  4. Guillaume Marçais, and Carl Kingsford. "A fast, lock-free approach for efficient parallel counting of occurrences of k-mers". Bioinformatics (2011) 27(6): 764-770. doi : 10.1093/bioinformatics/btr011 "Jellyfish mer counter".
  5. Harris, Tim; Fraser, Keir (26 November 2003). "Language support for lightweight transactions" (PDF). ACM SIGPLAN Notices. 38 (11): 388. CiteSeerX   10.1.1.58.8466 . doi:10.1145/949343.949340.
  6. Harris, Tim; Marlow, S.; Peyton-Jones, S.; Herlihy, M. (June 15–17, 2005). "Composable memory transactions". Proceedings of the 2005 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '05 : Chicago, Illinois. New York, NY: ACM Press. pp. 48–60. doi:10.1145/1065944.1065952. ISBN   978-1-59593-080-4. S2CID   53245159.
  7. libcds - C++ library of lock-free containers and safe memory reclamation schema
  8. liblfds - A library of lock-free data structures, written in C
  9. Concurrency Kit - A C library for non-blocking system design and implementation
  10. Herb Sutter. "Lock-Free Code: A False Sense of Security". Archived from the original on 2015-09-01.
  11. Herb Sutter. "Writing Lock-Free Code: A Corrected Queue". Archived from the original on 2008-12-05.
  12. Herb Sutter. "Writing a Generalized Concurrent Queue".
  13. Herb Sutter. "The Trouble With Locks".
  14. Bruce Dawson. "ARM and Lock-Free Programming".
  15. 1 2 Anthony Williams. "Safety: off: How not to shoot yourself in the foot with C++ atomics". 2015. p. 20.
  16. Herlihy, Maurice P. (1988). Impossibility and universality results for wait-free synchronization. Proc. 7th Annual ACM Symp. on Principles of Distributed Computing. pp. 276–290. doi: 10.1145/62546.62593 . ISBN   0-89791-277-2.
  17. Fich, Faith; Hendler, Danny; Shavit, Nir (2004). On the inherent weakness of conditional synchronization primitives. Proc. 23rd Annual ACM Symp.on Principles of Distributed Computing (PODC). pp. 80–87. doi:10.1145/1011767.1011780. ISBN   1-58113-802-4.
  18. Kogan, Alex; Petrank, Erez (2011). Wait-free queues with multiple enqueuers and dequeuers (PDF). Proc. 16th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPOPP). pp. 223–234. doi:10.1145/1941553.1941585. ISBN   978-1-4503-0119-0.
  19. Michael, Maged; Scott, Michael (1996). Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. Proc. 15th Annual ACM Symp. on Principles of Distributed Computing (PODC). pp. 267–275. doi: 10.1145/248052.248106 . ISBN   0-89791-800-2.
  20. Kogan, Alex; Petrank, Erez (2012). A method for creating fast wait-free data structures. Proc. 17th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPOPP). pp. 141–150. doi:10.1145/2145816.2145835. ISBN   978-1-4503-1160-1.
  21. Timnat, Shahar; Petrank, Erez (2014). A Practical Wait-Free Simulation for Lock-Free Data Structures. Proc. 17th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPOPP). pp. 357–368. doi:10.1145/2692916.2555261. ISBN   978-1-4503-2656-8.