Loop-level parallelism

Last updated May 02, 2024

Loop-level parallelism is a form of parallelism in software programming that is concerned with extracting parallel tasks from loops. The opportunity for loop-level parallelism often arises in computing programs where data is stored in random access data structures. Where a sequential program will iterate over the data structure and operate on indices one at a time, a program exploiting loop-level parallelism will use multiple threads or processes which operate on some or all of the indices at the same time. Such parallelism provides a speedup to overall execution time of the program, typically in line with Amdahl's law.

Description
Example
Dependencies in code
Example of true dependence
Example of anti-dependence
Example of output-dependence
Example of input-dependence
Dependence in loops
Loop-carried vs loop-independent dependence
Loop carried dependence graph
Types
DISTRIBUTED loop
DOALL parallelism
DOACROSS parallelism
DOPIPE parallelism
See also
References

Description

For simple loops, where each iteration is independent of the others, loop-level parallelism can be embarrassingly parallel, as parallelizing only requires assigning a process to handle each iteration. However, many algorithms are designed to run sequentially, and fail when parallel processes race due to dependence within the code. Sequential algorithms are sometimes applicable to parallel contexts with slight modification. Usually, though, they require process synchronization. Synchronization can be either implicit, via message passing, or explicit, via synchronization primitives like semaphores.

Example

Consider the following code operating on a list L of length n.

for(inti=0;i<n;++i){S1:L[i]+=10;}

Each iteration of the loop takes the value from the current index of L, and increments it by 10. If statement S1 takes T time to execute, then the loop takes time n * T to execute sequentially, ignoring time taken by loop constructs. Now, consider a system with p processors where p > n. If n threads run in parallel, the time to execute all n steps is reduced to T.

Less simple cases produce inconsistent, i.e. non-serializable outcomes. Consider the following loop operating on the same list L.

for(inti=1;i<n;++i){S1:L[i]=L[i-1]+10;}

Each iteration sets the current index to be the value of the previous plus ten. When run sequentially, each iteration is guaranteed that the previous iteration will already have the correct value. With multiple threads, process scheduling and other considerations prevent the execution order from guaranteeing an iteration will execute only after its dependence is met. It very well may happen before, leading to unexpected results. Serializability can be restored by adding synchronization to preserve the dependence on previous iterations.

Dependencies in code

There are several types of dependences that can be found within code.^[1]^[2]

Type	Notation	Description
True (Flow) Dependence	`S1 ->T S2`	A true dependence between S1 and S2 means that S1 writes to a location later read from by S2
Anti Dependence	`S1 ->A S2`	An anti-dependence between S1 and S2 means that S1 reads from a location later written to by S2.
Output Dependence	`S1 ->O S2`	An output dependence between S1 and S2 means that S1 and S2 write to the same location.
Input Dependence	`S1 ->I S2`	An input dependence between S1 and S2 means that S1 and S2 read from the same location.

In order to preserve the sequential behaviour of a loop when run in parallel, True Dependence must be preserved. Anti-Dependence and Output Dependence can be dealt with by giving each process its own copy of variables (known as privatization).^[1]

Example of true dependence

S1:inta,b;S2:a=2;S3:b=a+40;

S2 ->T S3, meaning that S2 has a true dependence on S3 because S2 writes to the variable a, which S3 reads from.

Example of anti-dependence

S1:inta,b=40;S2:a=b-38;S3:b=-1;

S2 ->A S3, meaning that S2 has an anti-dependence on S3 because S2 reads from the variable b before S3 writes to it.

Example of output-dependence

S1:inta,b=40;S2:a=b-38;S3:a=2;

S2 ->O S3, meaning that S2 has an output dependence on S3 because both write to the variable a.

Example of input-dependence

S1:inta,b,c=2;S2:a=c-1;S3:b=c+1;

S2 ->I S3, meaning that S2 has an input dependence on S3 because S2 and S3 both read from variable c.

Dependence in loops

Loop-carried vs loop-independent dependence

Loops can have two types of dependence:

Loop-carried dependence
Loop-independent dependence

In loop-independent dependence, loops have inter-iteration dependence, but do not have dependence between iterations. Each iteration may be treated as a block and performed in parallel without other synchronization efforts.

In the following example code used for swapping the values of two array of length n, there is a loop-independent dependence of S1 ->T S3.

for(inti=1;i<n;++i){S1:tmp=a[i];S2:a[i]=b[i];S3:b[i]=tmp;}

In loop-carried dependence, statements in an iteration of a loop depend on statements in another iteration of the loop. Loop-Carried Dependence uses a modified version of the dependence notation seen earlier.

Example of loop-carried dependence where S1[i] ->T S1[i + 1], where i indicates the current iteration, and i + 1 indicates the next iteration.

for(inti=1;i<n;++i){S1:a[i]=a[i-1]+1;}

Loop carried dependence graph

A Loop-carried dependence graph graphically shows the loop-carried dependencies between iterations. Each iteration is listed as a node on the graph, and directed edges show the true, anti, and output dependencies between each iteration.

Types

There are a variety of methodologies for parallelizing loops.

DISTRIBUTED Loop
DOALL Parallelism
DOACROSS Parallelism
HELIX ^[3]
DOPIPE Parallelism

Each implementation varies slightly in how threads synchronize, if at all. In addition, parallel tasks must somehow be mapped to a process. These tasks can either be allocated statically or dynamically. Research has shown that load-balancing can be better achieved through some dynamic allocation algorithms than when done statically.^[4]

The process of parallelizing a sequential program can be broken down into the following discrete steps.^[1] Each concrete loop-parallelization below implicitly performs them.

Type	Description
Decomposition	The program is broken down into tasks, the smallest exploitable unit of concurrence.
Assignment	Tasks are assigned to processes.
Orchestration	Data access, communication, and synchronization of processes.
Mapping	Processes are bound to processors.

DISTRIBUTED loop

When a loop has a loop-carried dependence, one way to parallelize it is to distribute the loop into several different loops. Statements that are not dependent on each other are separated so that these distributed loops can be executed in parallel. For example, consider the following code.

for(inti=1;i<n;++i){S1:a[i]=a[i-1]+b[i];S2:c[i]+=d[i];}

The loop has a loop carried dependence S1[i] ->T S1[i+1] but S2 and S1 do not have a loop-independent dependence so we can rewrite the code as follows.

loop1:for(inti=1;i<n;++i){S1:a[i]=a[i-1]+b[i];}loop2:for(inti=1;i<n;++i){S2:c[i]+=d[i];}

Note that now loop1 and loop2 can be executed in parallel. Instead of single instruction being performed in parallel on different data as in data level parallelism, here different loops perform different tasks on different data. Let's say the time of execution of S1 and S2 be $T_{S_{1}}$ and $T_{S_{2}}$ then the execution time for sequential form of above code is $n*(T_{S_{1}}+T_{S_{2}})$ , Now because we split the two statements and put them in two different loops, gives us an execution time of $n*T_{S_{1}}+T_{S_{2}}$ . We call this type of parallelism either function or task parallelism.

DOALL parallelism

DOALL parallelism exists when statements within a loop can be executed independently (situations where there is no loop-carried dependence).^[1] For example, the following code does not read from the array a, and does not update the arrays b, c. No iterations have a dependence on any other iteration.

for(inti=0;i<n;++i){S1:a[i]=b[i]+c[i];}

Let's say the time of one execution of S1 be $T_{S_{1}}$ then the execution time for sequential form of above code is $n*T_{S_{1}}$ , Now because DOALL Parallelism exists when all iterations are independent, speed-up may be achieved by executing all iterations in parallel which gives us an execution time of $T_{S_{1}}$ , which is the time taken for one iteration in sequential execution.

The following example, using a simplified pseudo code, shows how a loop might be parallelized to execute each iteration independently.

begin_parallelism();for(inti=0;i<n;++i){S1:a[i]=b[i]+c[i];end_parallelism();}block();

DOACROSS parallelism

DOACROSS Parallelism exists where iterations of a loop are parallelized by extracting calculations that can be performed independently and running them simultaneously.^[5]

Synchronization exists to enforce loop-carried dependence.

Consider the following, synchronous loop with dependence S1[i] ->T S1[i+1].

for(inti=1;i<n;++i){a[i]=a[i-1]+b[i]+1;}

Each loop iteration performs two actions

Calculate a[i-1] + b[i] + 1
Assign the value to a[i]

Calculating the value a[i-1] + b[i] + 1, and then performing the assignment can be decomposed into two lines(statements S1 and S2):

S1:inttmp=b[i]+1;S2:a[i]=a[i-1]+tmp;

The first line, int tmp = b[i] + 1;, has no loop-carried dependence. The loop can then be parallelized by computing the temp value in parallel, and then synchronizing the assignment to a[i].

post(0);for(inti=1;i<n;++i){S1:inttmp=b[i]+1;wait(i-1);S2:a[i]=a[i-1]+tmp;post(i);}

Let's say the time of execution of S1 and S2 be $T_{S_{1}}$ and $T_{S_{2}}$ then the execution time for sequential form of above code is $n*(T_{S_{1}}+T_{S_{2}})$ , Now because DOACROSS Parallelism exists, speed-up may be achieved by executing iterations in a pipelined fashion which gives us an execution time of $T_{S_{1}}+n*T_{S_{2}}$ .

DOPIPE parallelism

DOPIPE Parallelism implements pipelined parallelism for loop-carried dependence where a loop iteration is distributed over multiple, synchronized loops.^[1] The goal of DOPIPE is to act like an assembly line, where one stage is started as soon as there is sufficient data available for it from the previous stage.^[6]

Consider the following, synchronous code with dependence S1[i] ->T S1[i+1].

for(inti=1;i<n;++i){S1:a[i]=a[i-1]+b[i];S2:c[i]+=a[i];}

S1 must be executed sequentially, but S2 has no loop-carried dependence. S2 could be executed in parallel using DOALL Parallelism after performing all calculations needed by S1 in series. However, the speedup is limited if this is done. A better approach is to parallelize such that the S2 corresponding to each S1 executes when said S1 is finished.

Implementing pipelined parallelism results in the following set of loops, where the second loop may execute for an index as soon as the first loop has finished its corresponding index.

for(inti=1;i<n;++i){S1:a[i]=a[i-1]+b[i];post(i);}for(inti=1;i<n;i++){wait(i);S2:c[i]+=a[i];}

Let's say the time of execution of S1 and S2 be $T_{S_{1}}$ and $T_{S_{2}}$ then the execution time for sequential form of above code is $n*(T_{S_{1}}+T_{S_{2}})$ , Now because DOPIPE Parallelism exists, speed-up may be achieved by executing iterations in a pipelined fashion which gives us an execution time of $n*T_{S_{1}}+(n/p)*T_{S_{2}}$ , where $p$ is the number of processor in parallel.

Related Research Articles

OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

Release consistency is one of the synchronization-based consistency models used in concurrent programming.

Cilk, Cilk++, Cilk Plus and OpenCilk are general-purpose programming languages designed for multithreaded parallel computing. They are based on the C and C++ programming languages, which they extend with constructs to express parallel loops and the fork–join idiom.

In compiler theory, dependence analysis produces execution-order constraints between statements/instructions. Broadly speaking, a statement S2 depends on S1 if S1 must be executed before S2. Broadly, there are two classes of dependencies--control dependencies and data dependencies.

In computer science, loop dependence analysis is a process which can be used to find dependencies within iterations of a loop with the goal of determining different relationships between statements. These dependent relationships are tied to the order in which different statements access memory locations. Using the analysis of these relationships, execution of the loop can be organized to allow multiple processors to work on different portions of the loop in parallel. This is known as parallel processing. In general, loops can consume a lot of processing time when executed as serial code. Through parallel processing, it is possible to reduce the total execution time of a program through sharing the processing load among multiple processors.

Automatic parallelization, also auto parallelization, or autoparallelization refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is a challenge because it requires complex program analysis and the best approach may depend upon parameter values that are not known at compilation time.

<span class="mw-page-title-main">Binary Modular Dataflow Machine</span>

Binary Modular Dataflow Machine (BMDFM) is a software package that enables running an application in parallel on shared memory symmetric multiprocessing (SMP) computers using the multiple processors to speed up the execution of single applications. BMDFM automatically identifies and exploits parallelism due to the static and mainly dynamic scheduling of the dataflow instruction sequences derived from the formerly sequential program.

Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once. For example, modern conventional computers, including specialized supercomputers, typically have vector operations that simultaneously perform operations such as the following four additions :

A data dependency in computer science is a situation in which a program statement (instruction) refers to the data of a preceding statement. In compiler theory, the technique used to discover data dependencies among statements is called dependence analysis.

Thread Level Speculation (TLS), also known as Speculative Multi-threading, or Speculative Parallelization, is a technique to speculatively execute a section of computer code that is anticipated to be executed later in parallel with the normal execution on a separate independent thread. Such a speculative thread may need to make assumptions about the values of input variables. If these prove to be invalid, then the portions of the speculative thread that rely on these input variables will need to be discarded and squashed. If the assumptions are correct the program can complete in a shorter time provided the thread was able to be scheduled efficiently.

Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures like arrays and matrices by working on each element in parallel. It contrasts to task parallelism as another form of parallelism.

In compiler theory, a greatest common divisor test is the test used in study of loop optimization and loop dependence analysis to test the dependency between loop statements.

Software is said to exhibit scalable parallelism if it can make use of additional processors to solve larger problems, i.e. this term refers to software for which Gustafson's law holds. Consider a program whose execution time is dominated by one or more loops, each of that updates every element of an array --- for example, the following finite difference heat equation stencil calculation:

for t := 0 to T dofor i := 1 to N-1 do  new(i) := * .25  // explicit forward-difference with R = 0.25  endfor i := 1 to N-1 do  A(i) := new(i)  endend

For several years parallel hardware was only available for distributed computing but recently it is becoming available for the low end computers as well. Hence it has become inevitable for software programmers to start writing parallel applications. It is quite natural for programmers to think sequentially and hence they are less acquainted with writing multi-threaded or parallel processing applications. Parallel programming requires handling various issues such as synchronization and deadlock avoidance. Programmers require added expertise for writing such applications apart from their expertise in the application domain. Hence programmers prefer to write sequential code and most of the popular programming languages support it. This allows them to concentrate more on the application. Therefore, there is a need to convert such sequential applications to parallel applications with the help of automated tools. The need is also non-trivial because large amount of legacy code written over the past few decades needs to be reused and parallelized.

Lyra2 is a password hashing scheme (PHS) that can also work as a key derivation function (KDF). It received a special recognition during the Password Hashing Competition in July 2015, which was won by Argon2. It is also used in proof-of-work algorithms such as Lyra2REv2, adopted by Vertcoin, MonaCoin, among other cryptocurrencies. Lyra2 was designed by Marcos A. Simplicio Jr., Leonardo C. Almeida, Ewerton R. Andrade, Paulo C. F. dos Santos, and Paulo S. L. M. Barreto from Escola Politécnica da Universidade de São Paulo. It is an improvement over Lyra, previously proposed by the same authors. Lyra2 preserves the security, efficiency and flexibility of its predecessor, including: (1) the ability to configure the desired amount of memory, processing time and parallelism to be used by the algorithm; and (2) the capacity of providing a high memory usage with a processing time similar to that obtained with scrypt. In addition, it brings the following improvements when compared to its predecessor:

DOPIPE parallelism is a method to perform loop-level parallelism by pipelining the statements in a loop. Pipelined parallelism may exist at different levels of abstraction like loops, functions and algorithmic stages. The extent of parallelism depends upon the programmers' ability to make best use of this concept. It also depends upon factors like identifying and separating the independent tasks and executing them parallelly.

Privatization is a technique used in shared-memory programming to enable parallelism, by removing dependencies that occur across different threads in a parallel program. Dependencies between threads arise from two or more threads reading or writing a variable at the same time. Privatization gives each thread a private copy, so it can read and write it independently and thus, simultaneously.

DOACROSS parallelism is a parallelization technique used to perform Loop-level parallelism by utilizing synchronisation primitives between statements in a loop. This technique is used when a loop cannot be fully parallelized by DOALL parallelism due to data dependencies between loop iterations, typically loop-carried dependencies. The sections of the loop which contain loop-carried dependence are synchronized, while treating each section as a parallel task on its own. Therefore, DOACROSS parallelism can be used to complement DOALL parallelism to reduce loop execution times.

In computer science, interference freedom is a technique for proving partial correctness of concurrent programs with shared variables. Hoare logic had been introduced earlier to prove correctness of sequential programs. In her PhD thesis under advisor David Gries, Susan Owicki extended this work to apply to concurrent programs.

Control dependency is a situation in which a program instruction executes if the previous instruction evaluates in a way that allows its execution.

References

1 2 3 4 5 Solihin, Yan (2016). Fundamentals of Parallel Architecture. Boca Raton, FL: CRC Press. ISBN 978-1-4822-1118-4.
↑ Goff, Gina (1991). "Practical dependence testing". Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation - PLDI '91. pp. 15–29. doi:10.1145/113445.113448. ISBN 0897914287. S2CID 2357293.
↑ Murphy, Niall. "Discovering and exploiting parallelism in DOACROSS loops" (PDF). University of Cambridge. Retrieved 10 September 2016.
↑ Kavi, Krishna. "Parallelization of DOALL and DOACROSS Loops-a Survey".{{cite journal}}: Cite journal requires |journal= (help)
↑ Unnikrishnan, Priya (2012), "A Practical Approach to DOACROSS Parallelization", Euro-Par 2012 Parallel Processing, Lecture Notes in Computer Science, vol. 7484, pp. 219–231, doi: 10.1007/978-3-642-32820-6_23 , ISBN 978-3-642-32819-0, S2CID 18571258
↑ "DoPipe: An Effective Approach to Parallelize Simulation" (PDF). Intel. Retrieved 13 September 2016.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Solihin-1] 1 2 3 4 5 Solihin, Yan (2016). Fundamentals of Parallel Architecture. Boca Raton, FL: CRC Press. ISBN 978-1-4822-1118-4.

[2] Goff, Gina (1991). "Practical dependence testing". Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation - PLDI '91. pp. 15–29. doi:10.1145/113445.113448. ISBN 0897914287. S2CID 2357293.

[Parallelism_in_DOACROSS_Loops-3] Murphy, Niall. "Discovering and exploiting parallelism in DOACROSS loops" (PDF). University of Cambridge. Retrieved 10 September 2016.

[4] Kavi, Krishna. "Parallelization of DOALL and DOACROSS Loops-a Survey".{{cite journal}}: Cite journal requires |journal= (help)

[5] Unnikrishnan, Priya (2012), "A Practical Approach to DOACROSS Parallelization", Euro-Par 2012 Parallel Processing, Lecture Notes in Computer Science, vol. 7484, pp. 219–231, doi: 10.1007/978-3-642-32820-6_23 , ISBN 978-3-642-32819-0, S2CID 18571258

[6] "DoPipe: An Effective Approach to Parallelize Simulation" (PDF). Intel. Retrieved 13 September 2016.

[1]

[2]

[3]

[4]

[5]

[6]

v t e Parallel computing
General	Distributed computing Parallel computing Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Simultaneous and heterogenous Speculative (SpMT) Preemptive Cooperative Clustered multi-thread (CMT) Hardware scout
Theory	PRAM model PEM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array
Coordination	Multiprocessing Memory coherence Cache coherence Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD Array processing (SIMT) Pipelined processing Associative processing MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Beowulf cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost Chapel HPX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP Global Arrays GPUOpen MPI OpenMP OpenCL OpenHMPP OpenACC Parallel Extensions PVM pthreads RaftLib ROCm UPC TBB ZPL
Problems	Automatic parallelization Deadlock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: Parallel computing