Test-and-set

Last updated

In computer science, the test-and-set instruction is an instruction used to write (set) 1 to a memory location and return its old value as a single atomic (i.e., non-interruptible) operation. The caller can then "test" the result to see if the state was changed by the call. If multiple processes may access the same memory location, and if a process is currently performing a test-and-set, no other process may begin another test-and-set until the first process's test-and-set is finished. A central processing unit (CPU) may use a test-and-set instruction offered by another electronic component, such as dual-port RAM; a CPU itself may also offer a test-and-set instruction.

Contents

A lock can be built using an atomic test-and-set [1] instruction as follows:

This code assumes that the memory location was initialized to 0 at some point prior to the first test-and-set. The calling process obtains the lock if the old value was 0, otherwise the while-loop spins waiting to acquire the lock. This is called a spinlock. At any point, the holder of the lock can simply set the memory location back to 0 to release the lock for acquisition by another--this does not require any special handling as the holder "owns" this memory location. "Test and test-and-set" is another example.

Maurice Herlihy (1991) proved that test-and-set (1-bit comparand) has a finite consensus number and can solve the wait-free consensus problem for at-most two concurrent processes. [2] In contrast, compare-and-swap (32-bit comparand) offers a more general solution to this problem, and in some implementations compare-double-and-swap (64-bit comparand) is also available for extended utility.

Hardware implementation of test-and-set

DPRAM test-and-set instructions can work in many ways. Here are two variations, both of which describe a DPRAM which provides exactly 2 ports, allowing 2 separate electronic components (such as 2 CPUs) access to every memory location on the DPRAM.

Variation 1

When CPU 1 issues a test-and-set instruction, the DPRAM first makes an "internal note" of this by storing the address of the memory location in a special place. If at this point, CPU 2 happens to issue a test-and-set instruction for the same memory location, the DPRAM first checks its "internal note", recognizes the situation, and issues a BUSY interrupt, which tells CPU 2 that it must wait and retry. This is an implementation of a busy waiting or spinlock using the interrupt mechanism. Since all this happens at hardware speeds, CPU 2's wait to get out of the spin-lock is very short.

Whether or not CPU 2 was trying to access the memory location, the DPRAM performs the test given by CPU 1. If the test succeeds, the DPRAM sets the memory location to the value given by CPU 1. Then the DPRAM wipes out its "internal note" that CPU 1 was writing there. At this point, CPU 2 could issue a test-and-set, which would succeed.

Variation 2

CPU 1 issues a test-and-set instruction to write to "memory location A". The DPRAM does not immediately store the value in memory location A, but instead simultaneously moves the current value to a special register, while setting the contents of memory location A to a special "flag value". If at this point, CPU 2 issues a test-and-set to memory location A, the DPRAM detects the special flag value, and as in Variation 1, issues a BUSY interrupt.

Whether or not CPU 2 was trying to access the memory location, the DPRAM now performs CPU 1's test. If the test succeeds, the DPRAM sets memory location A to the value specified by CPU 1. If the test fails, the DPRAM copies the value back from the special register to memory location A. Either operation wipes out the special flag value. If CPU 2 now issues a test-and-set, it will succeed.

Software implementation of test-and-set

Some instruction sets have an atomic test-and-set machine language instruction. Examples include x86 [3] and IBM System/360 and its successors (including z/Architecture). [4] Those that do not can still implement an atomic test-and-set using a read-modify-write or compare-and-swap instruction.

The test and set instruction, when used with boolean values, uses logic like that shown in the following function, except that the function must execute atomically. That is, no other process must be able to interrupt the function mid-execution, thereby seeing a state that only exists while the function executes. That requires hardware support; it cannot be implemented as shown. Nevertheless, the code shown helps to explain the behaviour of test-and-set. NOTE: In this example, 'lock' is assumed to be passed by reference (or by name) but the assignment to 'initial' creates a new value (not just copying a reference).

function TestAndSet(boolean_ref lock) {     boolean initial = lock;     lock = true;     return initial; }

Not only is the code shown not atomic, in the sense of the test-and-set instruction, it also differs from the descriptions of DPRAM hardware test-and-set above. Here, the value being set and the test are fixed and invariant, and the value is updated regardless of the outcome of the test, whereas for the DPRAM test-and-set, the memory is set only when the test succeeds, and the value to set and the test condition are specified by the CPU. Here, the value to set can only be 1, but if 0 and 1 are considered the only valid values for the memory location, and "value is nonzero" is the only allowed test, then this equates to the case described for DPRAM hardware (or, more specifically, the DPRAM case reduces to this under these constraints). From that viewpoint, this can, correctly, be called "test-and-set" in the full, conventional sense of that term. The essential point to note is the general intent and principle of test-and-set: a value is both tested and set in one atomic operation such that no other program thread or process can change the target memory location after it is tested but before it is set. (This is because the location must only be set if it currently has a certain value, not if it had that value sometime earlier.)

In the C programming language, the implementation would be like:

#define LOCKED 1inttest_and_set(int*lockPtr){intoldValue;// -- Start of atomic segment --// This should be interpreted as pseudocode for illustrative purposes only.// Traditional compilation of this code will not guarantee atomicity, the// use of shared memory (i.e., non-cached values), protection from compiler// optimizations, or other required properties.oldValue=*lockPtr;*lockPtr=LOCKED;// -- End of atomic segment --returnoldValue;}

The code also shows that there are really two operations: an atomic read-modify-write and a test. Only the read-modify-write needs to be atomic. (This is true because delaying the value comparison by any amount of time will not change the result of the test once the value to test has been obtained. Once the code writes the initial value, the result of the test has been established, even if it has not been computed yet — e.g., by the == operator.)

Mutual exclusion using test-and-set

One way to implement mutual exclusion is by using a test-and-set based lock [5] [6] as follows:

Pseudo-C implementation of a spin lock

volatileintlock=0;voidcritical(){// Spin lock: loop forever until we get the lock; we know the lock was// successfully obtained after exiting this while loop because the // test_and_set() function locks the lock and returns the previous lock // value. If the previous lock value was 1 then the lock was **already**// locked by another thread or process. Once the previous lock value// was 0, however, then it indicates the lock was **not** locked before we// locked it, but now it **is** locked because we locked it, indicating// we own the lock.while(test_and_set(&lock)==1);criticalsection// only one process can be in this section at a timelock=0;// release lock when finished with the critical section}

The lock variable is a shared variable i.e. it can be accessed by all processors/threads. Note the volatile keyword. In absence of volatile, the compiler and/or the CPU(s) may optimize access to lock and/or use cached values, thus rendering the above code erroneous. Conversely, and unfortunately, the presence of volatile does not guarantee that reads and writes are committed to memory. Some compilers issue memory barriers to ensure that operations are committed to memory, but since the semantics of volatile in C/C++ is quite vague, not all compilers will do that. Consult your compiler's documentation to determine if it does.

This spin lock function can be called by multiple processes, but it is guaranteed that only one process will be in the critical section at a time. The rest of the processes will keep spinning until they get the lock. It is possible that a process is never granted the lock. In such a case it will loop endlessly. This is a drawback of a spin lock implementation as it doesn't ensure fairness. These issues are further elaborated in the performance section.

Assembly implementation

enter_region:; A "jump to" tag; function entry point.tslreg,flag; Test and Set Lock; flag is the; shared variable; it is copied; into the register reg and flag; then atomically set to 1.cmpreg,#0        ; Was flag zero on entry_region?jnzenter_region; Jump to enter_region if; reg is non-zero; i.e.,; flag was non-zero on entry.ret; Exit; i.e., flag was zero on; entry. If we get here, tsl; will have set it non-zero; thus,; we have claimed the resource; associated with flag.leave_region:moveflag,#0      ; store 0 in flagret; return to caller

Here tsl is an atomic instruction and flag is the lock variable. The process doesn't return unless it acquires the lock.

Performance evaluation of test-and-set locks

The four major evaluation metrics for locks in general are uncontended lock-acquisition latency, bus traffic, fairness, and storage. [7]

Test-and-set scores low on two of them, namely, high bus traffic and unfairness.

When processor P1 has obtained a lock and processor P2 is also waiting for the lock, P2 will keep incurring bus transactions in attempts to acquire the lock. When a processor has obtained a lock, all other processors which also wish to obtain the same lock keep trying to obtain the lock by initiating bus transactions repeatedly until they get hold of the lock. This increases the bus traffic requirement of test-and-set significantly. This slows down all other traffic from cache and coherence misses. It slows down the overall section, since the traffic is saturated by failed lock acquisition attempts. Test-and-test-and-set is an improvement over TSL since it does not initiate lock acquisition requests continuously.

When we consider fairness, we consider if a processor gets a fair chance of acquiring the lock when it is set free. In an extreme situation the processor might starve i.e. it might not be able to acquire the lock for an extended period of time even though it has become free during that time.

Storage overhead for TSL is next to nothing since only one lock is required. Uncontended latency is also low since only one atomic instruction and branch are needed.

See also

Related Research Articles

Dekker's algorithm is the first known correct solution to the mutual exclusion problem in concurrent programming where processes only communicate via shared memory. The solution is attributed to Dutch mathematician Th. J. Dekker by Edsger W. Dijkstra in an unpublished paper on sequential process descriptions and his manuscript on cooperating sequential processes. It allows two threads to share a single-use resource without conflict, using only shared memory for communication.

<span class="mw-page-title-main">Process (computing)</span> Particular execution of a computer program

In computing, a process is the instance of a computer program that is being executed by one or many threads. There are many different process models, some of which are light weight, but almost all processes are rooted in an operating system (OS) process which comprises the program code, assigned system resources, physical and logical access permissions, and data structures to initiate, control and coordinate execution activity. Depending on the OS, a process may be made up of multiple threads of execution that execute instructions concurrently.

In software engineering, double-checked locking is a software design pattern used to reduce the overhead of acquiring a lock by testing the locking criterion before acquiring the lock. Locking occurs only if the locking criterion check indicates that locking is required.

In software engineering, a spinlock is a lock that causes a thread trying to acquire it to simply wait in a loop ("spin") while repeatedly checking whether the lock is available. Since the thread remains active but is not performing a useful task, the use of such a lock is a kind of busy waiting. Once acquired, spinlocks will usually be held until they are explicitly released, although in some implementations they may be automatically released if the thread being waited on blocks or "goes to sleep".

In computer science, read-copy-update (RCU) is a synchronization mechanism that avoids the use of lock primitives while multiple threads concurrently read and update elements that are linked through pointers and that belong to shared data structures.

Peterson's algorithm is a concurrent programming algorithm for mutual exclusion that allows two or more processes to share a single-use resource without conflict, using only shared memory for communication. It was formulated by Gary L. Peterson in 1981. While Peterson's original formulation worked with only two processes, the algorithm can be generalized for more than two.

A translation lookaside buffer (TLB) is a memory cache that stores the recent translations of virtual memory to physical memory. It is used to reduce the time taken to access a user memory location. It can be called an address-translation cache. It is a part of the chip's memory-management unit (MMU). A TLB may reside between the CPU and the CPU cache, between CPU cache and the main memory or between the different levels of the multi-level cache. The majority of desktop, laptop, and server processors include one or more TLBs in the memory-management hardware, and it is nearly always present in any processor that utilizes paged or segmented virtual memory.

In computer science, compare-and-swap (CAS) is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple boolean response, or by returning the value read from the memory location.

In computer science and software engineering, busy-waiting, busy-looping or spinning is a technique in which a process repeatedly checks to see if a condition is true, such as whether keyboard input or a lock is available. Spinning can also be used to generate an arbitrary time delay, a technique that was necessary on systems that lacked a method of waiting a specific length of time. Processor speeds vary greatly from computer to computer, especially as some processors are designed to dynamically adjust speed based on current workload. Consequently, spinning as a time-delay technique can produce unpredictable or even inconsistent results on different systems unless code is included to determine the time a processor takes to execute a "do nothing" loop, or the looping code explicitly checks a real-time clock.

In computing, a memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.

In computer programming, volatile means that a value is prone to change over time, outside the control of some code. Volatility has implications within function calling conventions, and also impacts how variables are stored, accessed and cached.

<span class="mw-page-title-main">Linearizability</span> Property of some operation(s) in concurrent programming

In concurrent programming, an operation is linearizable if it consists of an ordered list of invocation and response events, that may be extended by adding response events such that:

  1. The extended list can be re-expressed as a sequential history.
  2. That sequential history is a subset of the original unextended list.

Memory segmentation is an operating system memory management technique of division of a computer's primary memory into segments or sections. In a computer system using segmentation, a reference to a memory location includes a value that identifies a segment and an offset within that segment. Segments or sections are also used in object files of compiled programs when they are linked together into a program image and when the image is loaded into memory.

In concurrent programming, a monitor is a synchronization construct that allows threads to have both mutual exclusion and the ability to wait (block) for a certain condition to become false. Monitors also have a mechanism for signaling other threads that their condition has been met. A monitor consists of a mutex (lock) object and condition variables. A condition variable essentially is a container of threads that are waiting for a certain condition. Monitors provide a mechanism for threads to temporarily give up exclusive access in order to wait for some condition to be met, before regaining exclusive access and resuming their task.

In computer science, a ticket lock is a synchronization mechanism, or locking algorithm, that is a type of spinlock that uses "tickets" to control which thread of execution is allowed to enter a critical section.

In computer science, the fetch-and-add (FAA) CPU instruction atomically increments the contents of a memory location by a specified value.

Lamport's bakery algorithm is a computer algorithm devised by computer scientist Leslie Lamport, as part of his long study of the formal correctness of concurrent systems, which is intended to improve the safety in the usage of shared resources among multiple threads by means of mutual exclusion.

In computing, the producer-consumer problem is a family of problems described by Edsger W. Dijkstra since 1965.

<span class="mw-page-title-main">Circular buffer</span> A circular shaped buffer shown while obtaining data

In computer science, a circular buffer, circular queue, cyclic buffer or ring buffer is a data structure that uses a single, fixed-size buffer as if it were connected end-to-end. This structure lends itself easily to buffering data streams. There were early circular buffer implementations in hardware.

Array-Based Queuing Lock (ABQL) is an advanced lock algorithm that ensures that threads spin on unique memory locations thus ensuring fairness of lock acquisition coupled with improved scalability.

References

  1. Anderson, T. E. (1990-01-01). "The performance of spin lock alternatives for shared-money multiprocessors". IEEE Transactions on Parallel and Distributed Systems. 1 (1): 6–16. doi:10.1109/71.80120. ISSN   1045-9219.
  2. Herlihy, Maurice (January 1991). "Wait-free synchronization" (PDF). ACM Trans. Program. Lang. Syst. 13 (1): 124–149. CiteSeerX   10.1.1.56.5659 . doi:10.1145/114005.102808. S2CID   2181446 . Retrieved 2007-05-20.
  3. "BTS—Bit Test and Set". www.felixcloutier.com. Retrieved 2016-11-21.
  4. "IBM Knowledge Center". www.ibm.com. Retrieved 2016-11-21.
  5. Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau (2015). Operating Systems: Three Easy Pieces (0.91 ed.). Arpaci-Dusseau Books via http://www.ostep.org/.{{cite book}}: External link in |via= (help)
  6. Solihin, Yan (2009). Fundamentals of parallel computer architecture : multichip and multicore systems. p. 252. ISBN   9780984163007.
  7. Solihin, Yan (2016). Fundamentals of Parallel Architecture. Boca Raton, FL: CRC Press. ISBN   978-1-4822-1118-4.