Barrier (computer science)

In parallel computing, a barrier is a type of synchronization method. [1] A barrier for a group of threads or processes means that each thread or process must stop at that point in the source code and cannot proceed until all the other threads or processes have reached the barrier. [2]

Many collective routines and directive-based parallel languages impose implicit barriers. For example, a parallel do loop in Fortran with OpenMP will not be allowed to continue on any thread until the last iteration is completed.[ citation needed ] This is in case the program relies on the result of the loop immediately after its completion. In message passing, any global communication (such as reduction or scatter) may imply a barrier.

In concurrent computing, a barrier may be in a raised or lowered state. The term latch is sometimes used to refer to a barrier that starts in the raised state and cannot be re-raised once it is in the lowered state. The term count-down latch is sometimes used to refer to a latch that is automatically lowered once a pre-determined number of threads/processes have arrived.

Implementation

Take threads as an example; this is known as the thread barrier. The thread barrier needs a variable that keeps the total number of threads that have entered the barrier. [3] Whenever enough threads have entered the barrier, it is lifted. A synchronization primitive such as a mutex is also needed when implementing the thread barrier.

This thread barrier method is also known as a Centralized Barrier, as the threads have to wait in front of a "central barrier" until the expected number of threads has reached the barrier, at which point it is lifted.

The following C code, which implements a thread barrier using POSIX Threads, demonstrates this procedure: [1]

#include <stdio.h>
#include <pthread.h>

#define TOTAL_THREADS           2
#define THREAD_BARRIERS_NUMBER  3
#define PTHREAD_BARRIER_ATTR    NULL // pthread barrier attribute

typedef struct _thread_barrier {
    int thread_barrier_number;
    pthread_mutex_t lock;
    int total_thread;
} thread_barrier;

thread_barrier barrier;

void thread_barrier_init(thread_barrier *barrier, pthread_mutexattr_t *mutex_attr, int thread_barrier_number)
{
    pthread_mutex_init(&(barrier->lock), mutex_attr);
    barrier->thread_barrier_number = thread_barrier_number;
    barrier->total_thread = 0; // Init total thread to be 0
}

void thread_barrier_wait(thread_barrier *barrier)
{
    if (!pthread_mutex_lock(&(barrier->lock))) {
        barrier->total_thread += 1;
        pthread_mutex_unlock(&(barrier->lock));
    }
    // Busy-wait until enough threads have arrived
    while (barrier->total_thread < barrier->thread_barrier_number);
    if (!pthread_mutex_lock(&(barrier->lock))) {
        barrier->total_thread -= 1; // Decrease one thread as it has passed the thread barrier
        pthread_mutex_unlock(&(barrier->lock));
    }
}

void thread_barrier_destroy(thread_barrier *barrier)
{
    pthread_mutex_destroy(&(barrier->lock));
}

void *thread_func(void *ptr)
{
    printf("thread id %ld is waiting at the barrier, as not enough %d threads are running ...\n",
           pthread_self(), THREAD_BARRIERS_NUMBER);
    thread_barrier_wait(&barrier);
    printf("The barrier is lifted, thread id %ld is running now\n", pthread_self());
    return NULL;
}

int main()
{
    pthread_t thread_id[TOTAL_THREADS];
    thread_barrier_init(&barrier, PTHREAD_BARRIER_ATTR, THREAD_BARRIERS_NUMBER);
    for (int i = 0; i < TOTAL_THREADS; i++) {
        pthread_create(&thread_id[i], NULL, thread_func, NULL);
    }
    // As pthread_join() blocks the process until all the threads it specifies
    // are finished, and there are not enough threads waiting at the barrier,
    // this process is blocked
    for (int i = 0; i < TOTAL_THREADS; i++) {
        pthread_join(thread_id[i], NULL);
    }
    thread_barrier_destroy(&barrier);
    printf("Thread barrier is lifted\n"); // This line won't be reached as TOTAL_THREADS < THREAD_BARRIERS_NUMBER
}

In this program, the thread barrier is defined as a struct, struct _thread_barrier, which includes: thread_barrier_number, the number of threads required to lift the barrier; lock, a mutex protecting the counter; and total_thread, the number of threads that have reached the barrier.

Based on the definition of a barrier, we need to implement a function like thread_barrier_wait() in this program, which will "monitor" the total number of threads in the program in order to lift the barrier.

In this program, every thread that calls thread_barrier_wait() will be blocked until THREAD_BARRIERS_NUMBER threads have reached the thread barrier.

The result of that program is:

thread id <thread_id, e.g. 139997337872128> is waiting at the barrier, as not enough 3 threads are running ...
thread id <thread_id, e.g. 139997329479424> is waiting at the barrier, as not enough 3 threads are running ...
// (the main process is blocked as fewer than 3 threads have arrived)
// The line printf("Thread barrier is lifted\n") won't be reached

As we can see from the program, only 2 threads are created. Both threads use thread_func() as their thread function handler, which calls thread_barrier_wait(&barrier), while the thread barrier expects 3 threads to call thread_barrier_wait() (THREAD_BARRIERS_NUMBER = 3) in order to be lifted. Change TOTAL_THREADS to 3 and the thread barrier is lifted:

thread id <thread ID, e.g. 140453108946688> is waiting at the barrier, as not enough 3 threads are running ...
thread id <thread ID, e.g. 140453117339392> is waiting at the barrier, as not enough 3 threads are running ...
thread id <thread ID, e.g. 140453100553984> is waiting at the barrier, as not enough 3 threads are running ...
The barrier is lifted, thread id <thread ID, e.g. 140453108946688> is running now
The barrier is lifted, thread id <thread ID, e.g. 140453117339392> is running now
The barrier is lifted, thread id <thread ID, e.g. 140453100553984> is running now
Thread barrier is lifted

Sense-Reversal Centralized Barrier

Besides decreasing the total thread count by one for every thread that successfully passes the thread barrier, a thread barrier can use opposite values to mark each thread's state as passing or stopping. [4] For example, thread 1 with a state value of 0 means it is stopping at the barrier, thread 2 with a state value of 1 means it has passed the barrier, thread 3 with a state value of 0 means it is stopping at the barrier, and so on. [5] This is known as Sense-Reversal. [1]

The following C code demonstrates this: [3] [6]

#include <stdio.h>
#include <stdbool.h>
#include <pthread.h>

#define TOTAL_THREADS           2
#define THREAD_BARRIERS_NUMBER  3
#define PTHREAD_BARRIER_ATTR    NULL // pthread barrier attribute

typedef struct _thread_barrier {
    int thread_barrier_number;
    int total_thread;
    pthread_mutex_t lock;
    bool flag;
} thread_barrier;

thread_barrier barrier;

void thread_barrier_init(thread_barrier *barrier, pthread_mutexattr_t *mutex_attr, int thread_barrier_number)
{
    pthread_mutex_init(&(barrier->lock), mutex_attr);
    barrier->total_thread = 0;
    barrier->thread_barrier_number = thread_barrier_number;
    barrier->flag = false;
}

void thread_barrier_wait(thread_barrier *barrier)
{
    bool local_sense = barrier->flag;
    if (!pthread_mutex_lock(&(barrier->lock))) {
        barrier->total_thread += 1;
        local_sense = !local_sense;
        if (barrier->total_thread == barrier->thread_barrier_number) {
            barrier->total_thread = 0;
            barrier->flag = local_sense;
            pthread_mutex_unlock(&(barrier->lock));
        } else {
            pthread_mutex_unlock(&(barrier->lock));
            while (barrier->flag != local_sense); // wait for flag
        }
    }
}

void thread_barrier_destroy(thread_barrier *barrier)
{
    pthread_mutex_destroy(&(barrier->lock));
}

void *thread_func(void *ptr)
{
    printf("thread id %ld is waiting at the barrier, as not enough %d threads are running ...\n",
           pthread_self(), THREAD_BARRIERS_NUMBER);
    thread_barrier_wait(&barrier);
    printf("The barrier is lifted, thread id %ld is running now\n", pthread_self());
    return NULL;
}

int main()
{
    pthread_t thread_id[TOTAL_THREADS];
    thread_barrier_init(&barrier, PTHREAD_BARRIER_ATTR, THREAD_BARRIERS_NUMBER);
    for (int i = 0; i < TOTAL_THREADS; i++) {
        pthread_create(&thread_id[i], NULL, thread_func, NULL);
    }
    // As pthread_join() blocks the process until all the threads it specifies
    // are finished, and there are not enough threads waiting at the barrier,
    // this process is blocked
    for (int i = 0; i < TOTAL_THREADS; i++) {
        pthread_join(thread_id[i], NULL);
    }
    thread_barrier_destroy(&barrier);
    printf("Thread barrier is lifted\n"); // This line won't be reached as TOTAL_THREADS < THREAD_BARRIERS_NUMBER
}

This program has all the features of the previous Centralized Barrier source code. It differs only in using two new variables: [1] a shared flag member in struct _thread_barrier, and a local_sense variable private to each thread.

When a thread arrives at the barrier, its local_sense value is toggled. [1] While fewer than THREAD_BARRIERS_NUMBER threads have stopped at the thread barrier, those threads keep waiting as long as the flag member of struct _thread_barrier is not equal to the private local_sense variable.

When there are exactly THREAD_BARRIERS_NUMBER threads stopping at the thread barrier, the total thread number is reset to 0, and the flag is set to local_sense.

Combining Tree Barrier

The potential problem with the Centralized Barrier is that, because all the threads repeatedly access the same global variable to check for pass/stop, the communication traffic is rather high, which decreases scalability.

This problem can be resolved by regrouping the threads and using a multi-level barrier, e.g. the Combining Tree Barrier. Hardware implementations may also have the advantage of higher scalability.

A Combining Tree Barrier is a hierarchical way of implementing a barrier that addresses scalability by avoiding the situation where all threads spin at the same location. [4]

In a k-Tree Barrier, all threads are equally divided into subgroups of k threads, and a first round of synchronization is done within these subgroups. Once all subgroups have completed their synchronization, the first thread in each subgroup enters the second level for further synchronization. In the second level, as in the first, the threads form new subgroups of k threads and synchronize within those groups, sending one thread from each subgroup to the next level, and so on. Eventually, at the final level there is only one subgroup to synchronize. After the final-level synchronization, the releasing signal is transmitted back up through the levels and all threads get past the barrier. [6] [7]

Hardware Barrier Implementation

The hardware barrier uses hardware to implement the above basic barrier model. [3]

The simplest hardware implementation uses dedicated wires to transmit the barrier signal. These dedicated wires perform OR/AND operations to act as the pass/block flags and the thread counter. For small systems such a model works, and communication speed is not a major concern. In large multiprocessor systems, however, this hardware design can give the barrier implementation high latency. Using the network connection among processors is one implementation that lowers the latency; it is analogous to the Combining Tree Barrier. [8]

POSIX Thread barrier functions

The POSIX Threads standard directly supports thread barrier functions, which can be used to block the specified threads or the whole process at a barrier until other threads reach that barrier. [2] The three main APIs POSIX provides to implement thread barriers are:

pthread_barrier_init()
Initializes the thread barrier with the number of threads that must wait at the barrier in order to lift it [9]
pthread_barrier_destroy()
Destroys the thread barrier and releases its resources [9]
pthread_barrier_wait()
Blocks the calling thread until the number of threads specified by pthread_barrier_init() have called pthread_barrier_wait(), lifting the barrier. [10]

The following example (implemented in C with the pthread API) uses a thread barrier to block all the threads of the main process and therefore block the whole process:

#include <stdio.h>
#include <pthread.h>

#define TOTAL_THREADS           2
#define THREAD_BARRIERS_NUMBER  3
#define PTHREAD_BARRIER_ATTR    NULL // pthread barrier attribute

pthread_barrier_t barrier;

void *thread_func(void *ptr)
{
    printf("Waiting at the barrier as not enough %d threads are running ...\n", THREAD_BARRIERS_NUMBER);
    pthread_barrier_wait(&barrier);
    printf("The barrier is lifted, thread id %ld is running now\n", pthread_self());
    return NULL;
}

int main()
{
    pthread_t thread_id[TOTAL_THREADS];
    pthread_barrier_init(&barrier, PTHREAD_BARRIER_ATTR, THREAD_BARRIERS_NUMBER);
    for (int i = 0; i < TOTAL_THREADS; i++) {
        pthread_create(&thread_id[i], NULL, thread_func, NULL);
    }
    // As pthread_join() blocks the process until all the threads it specifies
    // are finished, and there are not enough threads waiting at the barrier,
    // this process is blocked
    for (int i = 0; i < TOTAL_THREADS; i++) {
        pthread_join(thread_id[i], NULL);
    }
    pthread_barrier_destroy(&barrier);
    printf("Thread barrier is lifted\n"); // This line won't be reached as TOTAL_THREADS < THREAD_BARRIERS_NUMBER
}

The result of that source code is:

Waiting at the barrier as not enough 3 threads are running ...
Waiting at the barrier as not enough 3 threads are running ...
// (the main process is blocked as fewer than 3 threads have arrived)
// The line printf("Thread barrier is lifted\n") won't be reached

As we can see from the source code, only two threads are created. Both threads use thread_func() as their thread function handler, which calls pthread_barrier_wait(&barrier), while the thread barrier expects 3 threads to call pthread_barrier_wait() (THREAD_BARRIERS_NUMBER = 3) in order to be lifted. Change TOTAL_THREADS to 3 and the thread barrier is lifted:

Waiting at the barrier as not enough 3 threads are running ...
Waiting at the barrier as not enough 3 threads are running ...
Waiting at the barrier as not enough 3 threads are running ...
The barrier is lifted, thread id 140643372406528 is running now
The barrier is lifted, thread id 140643380799232 is running now
The barrier is lifted, thread id 140643389191936 is running now
Thread barrier is lifted

As main() is treated as a thread, i.e. the "main" thread of the process, [11] calling pthread_barrier_wait() inside main() will block the whole process until the other threads reach the barrier. The following example uses a thread barrier, with pthread_barrier_wait() inside main(), to block the process/main thread for 5 seconds while waiting for the 2 newly created threads to reach the thread barrier:

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

#define TOTAL_THREADS           2
#define THREAD_BARRIERS_NUMBER  3
#define PTHREAD_BARRIER_ATTR    NULL // pthread barrier attribute

pthread_barrier_t barrier;

void *thread_func(void *ptr)
{
    printf("Waiting at the barrier as not enough %d threads are running ...\n", THREAD_BARRIERS_NUMBER);
    sleep(5);
    pthread_barrier_wait(&barrier);
    printf("The barrier is lifted, thread id %ld is running now\n", pthread_self());
    return NULL;
}

int main()
{
    pthread_t thread_id[TOTAL_THREADS];
    pthread_barrier_init(&barrier, PTHREAD_BARRIER_ATTR, THREAD_BARRIERS_NUMBER);
    for (int i = 0; i < TOTAL_THREADS; i++) {
        pthread_create(&thread_id[i], NULL, thread_func, NULL);
    }
    pthread_barrier_wait(&barrier); // main() is the 3rd thread to arrive
    printf("Thread barrier is lifted\n"); // Reached once both created threads arrive
    pthread_barrier_destroy(&barrier);
}

This example doesn't use pthread_join() to wait for the 2 newly created threads to complete. Instead, it calls pthread_barrier_wait() inside main() to block the main thread, so the process is blocked until the 2 threads reach the barrier after their 5-second wait (the sleep(5) call).


References

  1. "Implementing Barriers". Carnegie Mellon University.
  2. GNU Operating System. "Implementation of pthread_barrier". gnu.org. Retrieved 2024-03-02.
  3. Solihin, Yan (2015-01-01). Fundamentals of Parallel Multicore Architecture (1st ed.). Chapman & Hall/CRC. ISBN 978-1482211184.
  4. Culler, David (1998). Parallel Computer Architecture: A Hardware/Software Approach. Gulf Professional. ISBN 978-1558603431.
  5. Culler, David (1998). Parallel Computer Architecture: A Hardware/Software Approach. Gulf Professional. ISBN 978-1558603431.
  6. Nanjegowda, Ramachandra; Hernandez, Oscar; Chapman, Barbara; Jin, Haoqiang H. (2009-06-03). Müller, Matthias S.; Supinski, Bronis R. de; Chapman, Barbara M. (eds.). Evolving OpenMP in an Age of Extreme Parallelism. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 42–52. doi:10.1007/978-3-642-02303-3_4. ISBN 9783642022845.
  7. Nikolopoulos, Dimitrios S.; Papatheodorou, Theodore S. (1999-01-01). "A quantitative architectural evaluation of synchronization algorithms and disciplines on ccNUMA systems". Proceedings of the 13th International Conference on Supercomputing. ICS '99. New York, NY, USA: ACM. pp. 319–328. doi:10.1145/305138.305209. ISBN 978-1581131642. S2CID 6097544. Archived from the original on 2017-07-25. Retrieved 2019-01-18.
  8. Adiga, N.R., et al. (2002). "An Overview of the BlueGene/L Supercomputer". Proceedings of the Conference on High Performance Networking and Computing.
  9. "pthread_barrier_init(), pthread_barrier_destroy()". Linux man page. Retrieved 2024-03-16.
  10. "pthread_barrier_wait()". Linux man page. Retrieved 2024-03-16.
  11. "How to get number of processes and threads in a C program?". Stack Overflow. Retrieved 2024-03-16.

"Parallel Programming with Barrier Synchronization". sourceallies.com. March 2012.