Privatization is a technique used in shared-memory programming to enable parallelism, by removing dependencies that occur across different threads in a parallel program. Dependencies between threads arise from two or more threads reading or writing a variable at the same time. Privatization gives each thread a private copy, so it can read and write it independently and thus, simultaneously. [1]
Each parallel algorithm specifies whether a variable is shared or private. Many errors in implementation can arise if the variable is declared to be shared but the algorithm requires it to be private, or vice versa. [2]
Traditionally, parallelizing compilers could apply privatization to scalar elements only. To exploit parallelism that occurs across iterations within a parallel program (loop-level parallelism), the need grew for compilers that can also perform array variable privatization. [3] Most of today's compilers can performing array privatization with more features and functions to enhance the performance of the parallel program in general. An example is the Polaris parallelizing compiler. [4]
A shared-memory multiprocessor is a "computer system composed of multiple independent processors that execute different instruction streams". [4] The shared memory programming model is the most widely used for parallel processor designs. [1] This programming model starts by identifying possibilities for parallelism within a piece of code and then mapping these parallel tasks into threads.
The next step is to determine the scope of variables used in a parallel program, which is one of the key steps and main concerns within this model.
The next step in the model groups tasks together into bigger tasks, as there are typically more tasks than available processors. Typically, the number of execution threads that the tasks are assigned to, is chosen to be less than or equal to the number of processors, with each thread assigned to a unique processor. [1]
Right after this step, the use of variables within tasks needs to be analyzed. This step determines whether each variable should be shared-by-all or private-to-each thread. [1] This step is unique to shared-memory programming. (An alternative is message passing, in which all variables are private.) [1]
According to their behavior, the variables are then categorized as:
As it appears from their definition, Read/Write Conflicting variables introduce dependencies between different execution threads and hence prevent the automatic parallelization of the program. The two major techniques used to remove these dependencies are privatization and reduction . In reduction, each thread is provided with a copy of the R/W Conflicting variable to operate on it and produce a partial result, which is then combined with other threads' copies to produce a global result. [1] Another technique similar to privatization is called expansion, in which a scalar variable is expanded into an array, which makes each thread access a different array element. [5] If the variable to be expanded is an array, then expansion adds another dimension to the array. [6]
Dependencies – potential conflicts between different threads during execution – prevent parallelization, and these conflicts appear when we have Read/Write Conflicting variables. One technique to remove these conflicts is privatization. The basic principle involves making a private copy of a variable for each thread, rather than share one instance. This changes the category of the variable from Read/Write Conflicting to Read/Write Non-conflicting.
The actual local (private) instances of the Read/Write Conflicting variables are created at compile time, by allocating several areas of memory for the variables stored at different memory locations. The architecture of shared-memory multiprocessors helps, as threads share an address space.
There are two situations in which a variable can be described as privatizable:
Because Read/Write Conflicting variables are the only category that prevents parallelization, there is no need explicitly to declare Read-only and Read/Write Non-conflicting variables as private. Doing so will not affect the correctness of the program, but may use more memory for unnecessary copies.
Sometimes a variable can be neither privatized nor reduced to remove the read/write conflict. In these cases, the Read/Write Conflicting variable needs to be updated by different tasks at different points of time. See § Example 3.
This problem can sometimes be solved by changing the scope of parallelism to explore a different parallel region. This might produce good results, as it is often that after reanalyzing the code, some Read/Write Conflicting variables may change to Read/Write Non-conflicting. [1] If the variable still causes conflicts, the last resort is to declaring it as shared and protecting its access by some form of mutual exclusion, and providing synchronization if accesses to the variable needs to happen in a specified order to ensure correctness.
Read/Write Conflicting variables can be scalar, or compound types such as arrays, matrices, structured types an so on. Privatization can be applied to both types of variables.
When applied to scalar variables, the additional space and overhead introduced by making the extra private copies per thread is relatively small, because scalars are small. [1] However, applying privatization on arrays, matrices or other compound types is much more complex.
When dealing with arrays, the compiler tries to analyze the behavior of each array element separately and check for the order it is read and written. If each element is written before it is read in the same iteration, this array can be privatized. To do this, the compiler needs to further analyze the array to combine its accesses into sections. Moreover, the compiler should have extra functions, to be able to manipulate and deal with the array elements. For example, some array expressions may have symbolic terms, hence, to be able to privatize such array, the compiler needs to have some advanced symbolic manipulation functions. [5]
A variable can be privatized if each task will write to it before reading from it. In this case, it does not matter whether other threads are doing so. In the code below, the variable x
is used to help swap three different pairs of variables. Because it is always written to before being read, it can be privatized.
//Sequential Code: x = a; a = b; b = x; x = c; c = d; d = x; x = e; e = f; b = x;
This code cannot be made parallel without privatizing x
. With x
privatized, it can run on three different threads, each with its own private x
:
//Parallel Code: //Thread 1: x[1] = a; a = b; b = x[1]; // Thread 2: x[2] = c; c = d; d = x[2]; // Thread 3: x[3]= e; e = f; b = x[3];
Privatization is possible when a variable's value is known before it is used – even if it is written to by a different task. The code below demonstrates this. The variable x
is written to in the middle of each task, but that value could be computed when the program was compiled. By making x
private and defining it at the beginning of each task, the code can be run in parallel:
//Sequential Code: x = 1; y = x * 3; x = 4; z = y/x; a = x * 9; x = 3; b = a/x; c = x * 1; x = 11; d = c/x;
To make the sequential code above parallel, a few lines of code must be added so that x
can be privatized:
//Parallel Code //Thread 0: x[0] = 1; y = x[0] * 3; x[0] = 4; z = y/x[0]; //Thread 1: x[1] = 4; a = x[1] * 9; x[1] = 3; b = a/x[1]; //Thread 2: x[2] = 3; c = x[2] * 1; x[2] = 11; d = c/x[2];
Because of the extra code, this short example may not actually see much of a speed up. But in real-life, longer code this technique can greatly improve performance.
Privatization fails is when a variable is written in one task and read in another, and the value is not known ahead of time. An example is summing the elements of an array. The sum is a shared variable and is read/written in each iteration of the loop.
In sequential code, this works fine. But if the iterations were each done in a different thread, the wrong sum would be calculated. In this case, privatization does not work. The sum
cannot be made private because it relies on its value from the previous iteration.
//Sequential Code: sum = 0; for (i = 0; i < 100; i++) sum += a[i];
This problem can be solved in part by loop unrolling. Because it does not matter which order the elements are added, the loop can be split into an arbitrary number of parts:
// Thread 0: sum[0] = 0; for (i[0] = 0; i[0] < 100; i[0] += 3) sum[0] += a[i[0]]; // Thread 1: sum[1] = 0; for (i[1] = 1; i[1] < 100; i[1] += 3) sum[1] += a[i[1]]; // Thread 2: sum[2] = 0; for (i[2] = 2; i[2] < 100; i[2] += 3) sum[2] += a[i[2]]; // "Master" thread: wait_for_all(thread[0], thread[1], thread[2]); sum = sum[0] + sum[1] + sum[2];
OpenMP is a programming language that supports multiprocessor programming with shared memory. Because of this, Read/Write Conflicting variables will undoubtedly occur. In these cases, privatization can sometimes be used to allow parallel execution of the code.
Given the sequential code:
do i = 10, N - 1 x = (b(i) + c(i))/2 b(i) = a(i + 1) + x enddo
For each iteration of the loop, x
is written to and then read from. Because x
is only a scalar variable, the loop cannot be executed in parallel because it would be overwritten in different threads, and b(i)
would not always be assigned the correct value. [2]
Equivalent parallelized code using privatization is:
!$omp parallel do shared(a, b) private(x) do i = 10, N - 1 x = (b(i) + c(i))/2 b(i) = a(i + 1) + x enddo
Because x
is declared as private, each thread gets its own copy and the dependence is removed. [2]
Normally, when a variable is Read/Write Conflicting, the solution will be declaring it as shared and protecting access to it by mutual exclusion, providing synchronization when needed. Because mutual exclusion slows things down, this technique is avoided as much as possible.
Thus, the compiler or programmer first checks whether the variable can be reduced. If it cannot, the next check is for privatization. Privatization trades space for time, so mutual exclusion can be a better option when memory is limited.
Compared to reduction, privatization requires one modelling step: merely to analyze the code to identify the privatizable variables. On the other hand, the two steps required by reduction are: identifying the reduction variable, and then parallelizing the reduction operator. [7] By observing each of the two techniques, it is easy to tell what type of overhead each one adds to the parallel program; reduction increases the computation overhead while privatization increases the memory consumed by the program. [5]
Compared to expansion, privatization has less memory overhead. The memory space needed for privatization is proportional to the number of processors, while in expansion, it is proportional to the number of iterations. [5] Since the number of tasks is typically higher than the number of processors, the memory required by expansion is much larger than by privatization.
Changing the scope of parallelism can be done to explore a different parallel region. Doing so may sometimes greatly change the behavior of the variables. So reanalyzing the code and performing this technique may often change Read/Write Conflicting variables to be Non-conflicting. [1]
In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. In many cases, a thread is a component of a process.
A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.
In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. This is in contrast to scalar processors, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SWAR Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators.
Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.
D, also known as dlang, is a multi-paradigm system programming language created by Walter Bright at Digital Mars and released in 2001. Andrei Alexandrescu joined the design and development effort in 2007. Though it originated as a re-engineering of C++, D is now a very different language drawing inspiration from other high-level programming languages, notably Java, Python, Ruby, C#, and Eiffel.
OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.
A race condition or race hazard is the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events, leading to unexpected or inconsistent results. It becomes a bug when one or more of the possible behaviors is undesirable.
In computing, a memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.
Coarray Fortran (CAF), formerly known as F--, started as an extension of Fortran 95/2003 for parallel processing created by Robert Numrich and John Reid in the 1990s. The Fortran 2008 standard now includes coarrays, as decided at the May 2005 meeting of the ISO Fortran Committee; the syntax in the Fortran 2008 standard is slightly different from the original CAF proposal.
Cilk, Cilk++, Cilk Plus and OpenCilk are general-purpose programming languages designed for multithreaded parallel computing. They are based on the C and C++ programming languages, which they extend with constructs to express parallel loops and the fork–join idiom.
In computer science, a parallel random-access machine is a shared-memory abstract machine. As its name indicates, the PRAM is intended as the parallel-computing analogy to the random-access machine (RAM). In the same way that the RAM is used by sequential-algorithm designers to model algorithmic performance, the PRAM is used by parallel-algorithm designers to model parallel algorithmic performance. Similar to the way in which the RAM model neglects practical issues, such as access time to cache memory versus main memory, the PRAM model neglects such issues as synchronization and communication, but provides any (problem-size-dependent) number of processors. Algorithm cost, for instance, is estimated using two parameters O(time) and O(time × processor_number).
In computing, a parallel programming model is an abstraction of parallel computer architecture, with which it is convenient to express algorithms and their composition in programs. The value of a programming model can be judged on its generality: how well a range of different problems can be expressed for a variety of different architectures, and its performance: how efficiently the compiled programs can execute. The implementation of a parallel programming model can take the form of a library invoked from a programming language, as an extension to an existing languages.
Automatic parallelization, also auto parallelization, or autoparallelization refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is a challenge because it requires complex program analysis and the best approach may depend upon parameter values that are not known at compilation time.
Concurrent computing is a form of computing in which several computations are executed concurrently—during overlapping time periods—instead of sequentially—with one completing before the next starts.
Binary Modular Dataflow Machine (BMDFM) is a software package that enables running an application in parallel on shared memory symmetric multiprocessing (SMP) computers using the multiple processors to speed up the execution of single applications. BMDFM automatically identifies and exploits parallelism due to the static and mainly dynamic scheduling of the dataflow instruction sequences derived from the formerly sequential program.
Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once. For example, modern conventional computers, including specialized supercomputers, typically have vector operations that simultaneously perform operations such as the following four additions :
Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures like arrays and matrices by working on each element in parallel. It contrasts to task parallelism as another form of parallelism.
Loop-level parallelism is a form of parallelism in software programming that is concerned with extracting parallel tasks from loops. The opportunity for loop-level parallelism often arises in computing programs where data is stored in random access data structures. Where a sequential program will iterate over the data structure and operate on indices one at a time, a program exploiting loop-level parallelism will use multiple threads or processes which operate on some or all of the indices at the same time. Such parallelism provides a speedup to overall execution time of the program, typically in line with Amdahl's law.
For several years parallel hardware was only available for distributed computing but recently it is becoming available for the low end computers as well. Hence it has become inevitable for software programmers to start writing parallel applications. It is quite natural for programmers to think sequentially and hence they are less acquainted with writing multi-threaded or parallel processing applications. Parallel programming requires handling various issues such as synchronization and deadlock avoidance. Programmers require added expertise for writing such applications apart from their expertise in the application domain. Hence programmers prefer to write sequential code and most of the popular programming languages support it. This allows them to concentrate more on the application. Therefore, there is a need to convert such sequential applications to parallel applications with the help of automated tools. The need is also non-trivial because large amount of legacy code written over the past few decades needs to be reused and parallelized.
In computing, a memory access pattern or IO access pattern is the pattern with which a system or program reads and writes memory on secondary storage. These patterns differ in the level of locality of reference and drastically affect cache performance, and also have implications for the approach to parallelism and distribution of workload in shared memory systems. Further, cache coherency issues can affect multiprocessor performance, which means that certain memory access patterns place a ceiling on parallelism.