Array Based Queuing Locks

Array-Based Queuing Lock (ABQL) is an advanced lock algorithm in which each thread spins on a unique memory location, providing fairness of lock acquisition together with improved scalability.

Overview

Synchronization is a major issue in the design and programming of shared-memory multiprocessors.[1] A common problem with lock implementations is high network contention caused by processors spinning on a shared synchronization flag or memory location, which significantly reduces the scalability of the lock as the number of contending processors grows.

The array-based queuing lock is an extension of the ticket lock algorithm which ensures that, on a lock release, only one processor attempts to acquire the lock, decreasing the number of cache misses. This is achieved by having each processor spin on a unique memory location.[2] An analogy used to explain the mechanism is a relay race, in which an athlete passes the baton to the next athlete in the queue, ensuring that only one athlete holds the baton at a time.
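For contrast, the two spin loops can be sketched as follows (illustrative C-style fragments; now_serving is a hypothetical ticket-lock counter, while can_serve and my_ticket correspond to the pseudocode given later in this article):

// Ticket lock: every waiting processor spins on the same shared
// location, so each release invalidates all waiters' cached copies.
while (now_serving != my_ticket)
    ;   // spin

// ABQL: each waiting processor spins on its own array element, so a
// release invalidates only the next waiter's cached copy.
while (can_serve[my_ticket] != 1)
    ;   // spin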

ABQL also guarantees fairness in lock acquisition by using a first in, first out (FIFO) queue-based mechanism. In addition, the amount of invalidation is significantly lower than in ticket-based lock implementations, since only one processor incurs a cache miss on a lock release.

Implementation

The foremost requirement of an array-based queuing lock implementation is that all threads spin on unique memory locations. This is achieved with an array whose length equals the number of threads contending for the lock. All elements of the array are initialized to 0, except the first element, which takes the value 1, ensuring successful lock acquisition by the first thread in the queue. On a lock release, ownership is passed to the next thread in the queue by setting the next element of the array to 1. Requests are granted to the threads in FIFO order.

A pseudocode example is listed below.

ABQL_init(int *next_ticket, int *can_serve)
{
    *next_ticket = 0;
    for (int i = 1; i < MAXSIZE; i++)
        can_serve[i] = 0;
    can_serve[0] = 1;                      // first thread in the queue may acquire the lock
}

ABQL_acquire(int *next_ticket, int *can_serve)
{
    // my_ticket is private to the acquiring thread
    *my_ticket = fetch_and_inc(next_ticket);
    while (can_serve[*my_ticket] != 1) ;   // spin on a unique memory location
}

ABQL_release(int *can_serve)
{
    can_serve[*my_ticket] = 0;             // prepare for next time
    can_serve[*my_ticket + 1] = 1;         // pass the lock to the next thread in the queue
}

To implement ABQL, the pseudocode above introduces three variables, namely can_serve, next_ticket and my_ticket. The role of each is described below:

- next_ticket holds the next available ticket number and is shared by all threads.
- can_serve is the array of flags on which the threads spin; each thread spins only on its own element.
- my_ticket is a thread-private variable holding the ticket number (array index) assigned to that thread.

In the initialization method (ABQL_init), the variable next_ticket is initialized to 0. All elements of the can_serve array except the first are initialized to 0. Initializing the first element of can_serve to 1 ensures successful lock acquisition by the first thread in the queue.

The acquire method uses the atomic operation fetch_and_inc to obtain the thread's ticket number: it returns the current value of next_ticket and then increments it by 1, so every thread receives a distinct ticket. Each waiting thread then spins on its own location, can_serve[my_ticket], until that element is set to 1 by the previous lock holder. On acquiring the lock, the thread enters the critical section of the code.
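As an illustration of the atomic primitive (a sketch, not part of the article's pseudocode), fetch_and_inc can be expressed with C11's atomic_fetch_add, which returns the value held before the increment:

#include <stdatomic.h>

atomic_int next_ticket = 0;

// Returns the caller's ticket: the value of next_ticket before the
// increment. The read and the increment happen as one atomic step,
// so no two threads can obtain the same ticket.
int fetch_and_inc(atomic_int *counter)
{
    return atomic_fetch_add(counter, 1);
}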

On releasing the lock, a thread passes control to the next thread in the queue by resetting its own element of can_serve to 0 and setting the next element to 1. The next waiting thread can now acquire the lock successfully.

The working of ABQL is illustrated in the table below for 4 processors contending to enter the critical section, under the assumption that each thread enters the critical section only once.

Execution step | next_ticket | can_serve | my_ticket (P1) | my_ticket (P2) | my_ticket (P3) | my_ticket (P4) | Comments
initially | 0 | [1, 0, 0, 0] | 0 | 0 | 0 | 0 | initial value of all variables is 0
P1: fetch_and_inc | 1 | [1, 0, 0, 0] | 0 | 0 | 0 | 0 | P1 attempts and successfully acquires the lock
P2: fetch_and_inc | 2 | [1, 0, 0, 0] | 0 | 1 | 0 | 0 | P2 attempts to acquire the lock
P3: fetch_and_inc | 3 | [1, 0, 0, 0] | 0 | 1 | 2 | 0 | P3 attempts to acquire the lock
P4: fetch_and_inc | 4 | [1, 0, 0, 0] | 0 | 1 | 2 | 3 | P4 attempts to acquire the lock
P1: can_serve[1] = 1; can_serve[0] = 0 | 4 | [0, 1, 0, 0] | 0 | 1 | 2 | 3 | P1 releases the lock and P2 successfully acquires the lock
P2: can_serve[2] = 1; can_serve[1] = 0 | 4 | [0, 0, 1, 0] | 0 | 1 | 2 | 3 | P2 releases the lock and P3 successfully acquires the lock
P3: can_serve[3] = 1; can_serve[2] = 0 | 4 | [0, 0, 0, 1] | 0 | 1 | 2 | 3 | P3 releases the lock and P4 successfully acquires the lock
P4: can_serve[3] = 0 | 4 | [0, 0, 0, 0] | 0 | 1 | 2 | 3 | P4 releases the lock
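The scenario in the table can be reproduced with a small test program. The following is a minimal C11/pthreads sketch (an illustration, not the article's original code): each can_serve flag is padded to its own cache line so that threads really spin on distinct memory locations, the ticket index wraps modulo MAXSIZE, and each of the 4 threads enters the critical section exactly once, as assumed in the table.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define MAXSIZE 4            // number of contending threads
#define CACHE_LINE 64        // assumed cache-line size in bytes

// One flag per ticket, padded so that each flag occupies its own cache line.
struct slot {
    atomic_int can_serve;
    char pad[CACHE_LINE - sizeof(atomic_int)];
};

static struct slot slots[MAXSIZE];
static atomic_int next_ticket;

static void abql_init(void)
{
    atomic_store(&next_ticket, 0);
    for (int i = 1; i < MAXSIZE; i++)
        atomic_store(&slots[i].can_serve, 0);
    atomic_store(&slots[0].can_serve, 1);    // first thread in the queue may enter
}

static int abql_acquire(void)
{
    int my_ticket = atomic_fetch_add(&next_ticket, 1) % MAXSIZE;
    while (atomic_load(&slots[my_ticket].can_serve) != 1)
        ;                                    // spin on a private location
    return my_ticket;
}

static void abql_release(int my_ticket)
{
    atomic_store(&slots[my_ticket].can_serve, 0);                     // prepare for next time
    atomic_store(&slots[(my_ticket + 1) % MAXSIZE].can_serve, 1);     // hand the lock to the next waiter
}

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    int ticket = abql_acquire();
    printf("P%d is in the critical section (ticket %d)\n", id, ticket);
    abql_release(ticket);
    return NULL;
}

int main(void)
{
    pthread_t threads[MAXSIZE];
    abql_init();
    for (long i = 0; i < MAXSIZE; i++)
        pthread_create(&threads[i], NULL, worker, (void *)(i + 1));
    for (int i = 0; i < MAXSIZE; i++)
        pthread_join(threads[i], NULL);
    return 0;
}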

Performance metrics

The following performance criteria can be used to analyse lock implementations:

- Uncontended lock-acquisition latency: the time needed to acquire the lock when there is no contention.
- Traffic: the amount of interconnect and cache-coherence traffic generated while acquiring, waiting for, and releasing the lock.
- Fairness: whether processors acquire the lock in the order in which their requests were issued.
- Storage: the amount of memory required per lock.

Advantages

- Only one processor incurs a cache miss on each lock release, so the invalidation traffic per acquisition does not grow with the number of waiting processors, improving scalability under contention.
- The FIFO ordering of tickets guarantees fairness: the lock is granted in the order in which threads requested it, so no thread starves.

Disadvantages

- Each lock requires an array with one element per contending thread, so the storage cost grows linearly with the number of threads, unlike simpler locks that need only a single flag.
- The uncontended acquisition latency is somewhat higher than that of a simple spinlock, since every acquisition must perform an atomic fetch_and_inc operation.

See also

- Ticket lock
- Fetch-and-add
- Spinlock
- Lamport's bakery algorithm

References

  1. "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors".
  2. Anderson, James H.; Kim, Yong-Jik; Herman, Ted (January 2003) [First Published in June 2001]. "Shared-memory Mutual Exclusion: Major Research Trends Since 1986∗" (PDF).