DOACROSS parallelism


DOACROSS parallelism is a parallelization technique that extracts loop-level parallelism by inserting synchronization primitives between statements in a loop. It is used when a loop cannot be fully parallelized with DOALL parallelism because of data dependencies between loop iterations, typically loop-carried dependencies. The portions of the loop body that carry such dependencies are synchronized across iterations, while each portion is otherwise treated as an independent parallel task. DOACROSS parallelism can therefore complement DOALL parallelism to reduce loop execution times.


Description

DOACROSS parallelism is particularly useful when one statement in a loop depends on values produced by another statement in an earlier iteration. In such a loop, DOALL parallelism cannot be applied in a straightforward manner. If the dependent statement simply blocks until the required value has been produced, the remaining statements can execute independently of one another; that is, each of them can be parallelized for simultaneous execution using DOALL parallelism. [1]

The following pseudocode illustrates the operation of DOACROSS parallelism in such a situation. [2]

Example

for (int i = 2; i < N; i++) {
    a[i] = a[i-2] + b[i]*c[i] + d[i]/e[i] + 1;
}

In this example, each iteration of the loop requires the value written into a two iterations earlier (a[i-2]). However, the entire statement does not depend on that earlier iteration; only a portion of it does. The statement is split into two blocks below to illustrate this.

post(0);
post(1);
for (int i = 2; i < N; i++) {
    temp = b[i]*c[i] + d[i]/e[i] + 1;
    wait(i-2);
    a[i] = a[i-2] + temp;
    post(i);
}

The first statement now has no loop-carried dependence, and its result is stored in the variable temp (which must be private to each iteration). The post(i) primitive signals that the result of iteration i has been produced and is available to other threads. The wait(i-2) primitive blocks until the value a[i-2] has been produced. The speedup obtained from DOACROSS parallelism depends largely on what fraction of the loop body is involved in the loop-carried dependence: the smaller the dependent portion relative to the independent portion, the larger the gain, because more of each iteration can be overlapped with other iterations. [2]
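In real code, this post/wait pattern can be expressed directly with OpenMP's doacross (ordered) dependences, available since OpenMP 4.5. The sketch below is only an illustration under assumed declarations (double arrays a, b, c, d and e of length N, with a[0] and a[1] already initialized); it is not taken from the cited sources.

#include <omp.h>

/* DOACROSS via OpenMP doacross dependences (OpenMP 4.5 syntax).
 * The ordered directives play the roles of wait(i-2) and post(i). */
void doacross_openmp(double *a, const double *b, const double *c,
                     const double *d, const double *e, int N)
{
    #pragma omp parallel for ordered(1)
    for (int i = 2; i < N; i++) {
        /* Independent portion: no loop-carried dependence, overlaps freely. */
        double temp = b[i] * c[i] + d[i] / e[i] + 1;

        /* wait(i-2): block until iteration i-2 has signalled completion.
         * For i = 2 and i = 3 the sink lies outside the iteration space
         * and is ignored, matching the post(0)/post(1) calls above. */
        #pragma omp ordered depend(sink: i - 2)
        a[i] = a[i - 2] + temp;
        /* post(i): signal that a[i] is now available. */
        #pragma omp ordered depend(source)
    }
}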

Drawbacks

DOACROSS parallelism incurs significant space and granularity overheads because of the synchronization primitives it requires; modern compilers often pass over the technique for this reason. [1] The overhead can be lowered by synchronizing less frequently, applying the primitives to groups of statements or iterations at a time rather than to each one individually, as sketched below. [2]
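A minimal sketch of this idea, continuing the OpenMP example above: the loop is strip-mined into blocks of CHUNK iterations and the wait/post pair executes once per block instead of once per iteration. CHUNK is an assumed tuning parameter, not something prescribed by the sources; correctness only requires that the dependence distance (2) not exceed CHUNK, so a[j-2] always lies in the current or the immediately preceding block.

#include <omp.h>

#define CHUNK 64   /* assumed block size; tune for the target machine */

void doacross_chunked(double *a, const double *b, const double *c,
                      const double *d, const double *e, int N)
{
    int nblocks = (N - 2 + CHUNK - 1) / CHUNK;   /* assumes N >= 2 */
    #pragma omp parallel for ordered(1)
    for (int blk = 0; blk < nblocks; blk++) {
        int lo = 2 + blk * CHUNK;
        int hi = (lo + CHUNK < N) ? (lo + CHUNK) : N;
        double temp[CHUNK];

        /* Independent work for the whole block, no synchronization needed. */
        for (int j = lo; j < hi; j++)
            temp[j - lo] = b[j] * c[j] + d[j] / e[j] + 1;

        /* One wait per block: the previous block must have written its a[] values. */
        #pragma omp ordered depend(sink: blk - 1)
        for (int j = lo; j < hi; j++)
            a[j] = a[j - 2] + temp[j - lo];
        /* One post per block. */
        #pragma omp ordered depend(source)
    }
}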


Related Research Articles

Parallel computing – Programming paradigm in which many processes are executed simultaneously

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

OpenMP – Open standard for parallelizing

OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

Cilk, Cilk++, Cilk Plus and OpenCilk are general-purpose programming languages designed for multithreaded parallel computing. They are based on the C and C++ programming languages, which they extend with constructs to express parallel loops and the fork–join idiom.

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space–time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler. On modern processors, loop unrolling is often counterproductive, as the increased code size can cause more cache misses; cf. Duff's device.
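As a small generic illustration (not tied to any particular compiler), unrolling a reduction loop by a factor of four replaces four branch tests and counter updates with one, at the cost of a larger body:

/* Manual 4x unrolling; assumes n is a multiple of 4. A real compiler or
 * programmer would add a cleanup loop for the remaining iterations. */
double sum_unrolled(const double *x, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i += 4) {
        sum += x[i];
        sum += x[i + 1];
        sum += x[i + 2];
        sum += x[i + 3];
    }
    return sum;
}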

In computer science, instruction scheduling is a compiler optimization used to improve instruction-level parallelism, which improves performance on machines with instruction pipelines. Put more simply, it rearranges the order of instructions to avoid pipeline stalls, without changing the meaning of the code.

In compiler theory, loop optimization is the process of increasing execution speed and reducing the overheads associated with loops. It plays an important role in improving cache performance and making effective use of parallel processing capabilities. Most execution time of a scientific program is spent on loops; as such, many compiler optimization techniques have been developed to make them faster.

In computer science, loop dependence analysis is a process which can be used to find dependencies within iterations of a loop with the goal of determining different relationships between statements. These dependent relationships are tied to the order in which different statements access memory locations. Using the analysis of these relationships, execution of the loop can be organized to allow multiple processors to work on different portions of the loop in parallel. This is known as parallel processing. In general, loops can consume a lot of processing time when executed as serial code. Through parallel processing, it is possible to reduce the total execution time of a program through sharing the processing load among multiple processors.

Automatic parallelization, also auto parallelization, or autoparallelization refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is a challenge because it requires complex program analysis and the best approach may depend upon parameter values that are not known at compilation time.

Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once. For example, modern conventional computers, including specialized supercomputers, typically have vector operations that can perform, for instance, four additions simultaneously.
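For illustration (a minimal sketch with assumed float arrays), these are the four scalar additions that an auto-vectorizer can fuse into a single four-lane vector instruction:

/* Scalar form: one pair of operands per addition. A vectorizing compiler
 * can replace the four statements with one SIMD add over a[0..3] and b[0..3]. */
void add4(const float *a, const float *b, float *c)
{
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}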

In computer science, software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler instead of the processor. Some computer architectures have explicit support for software pipelining, notably Intel's IA-64 architecture.

A data dependency in computer science is a situation in which a program statement (instruction) refers to the data of a preceding statement. In compiler theory, the technique used to discover data dependencies among statements is called dependence analysis.

In computer science, synchronization is the task of coordinating multiple processes to join up or handshake at a certain point, in order to reach an agreement or commit to a certain sequence of actions.

Memory dependence prediction is a technique, employed by high-performance out-of-order execution microprocessors that execute memory access operations out of program order, to predict true dependencies between loads and stores at instruction execution time. With the predicted dependence information, the processor can then decide to speculatively execute certain loads and stores out of order, while preventing other loads and stores from executing out-of-order. Later in the pipeline, memory disambiguation techniques are used to determine if the loads and stores were correctly executed and, if not, to recover.

The Sieve C++ Parallel Programming System is a C++ compiler and parallel runtime designed and released by Codeplay that aims to simplify the parallelization of code so that it may run efficiently on multi-processor or multi-core systems. It is an alternative to other well-known parallelisation methods such as OpenMP, the RapidMind Development Platform and Threading Building Blocks (TBB).

Loop-level parallelism is a form of parallelism in software programming that is concerned with extracting parallel tasks from loops. The opportunity for loop-level parallelism often arises in computing programs where data is stored in random access data structures. Where a sequential program will iterate over the data structure and operate on indices one at a time, a program exploiting loop-level parallelism will use multiple threads or processes which operate on some or all of the indices at the same time. Such parallelism provides a speedup to overall execution time of the program, typically in line with Amdahl's law.
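A minimal sketch of the fully independent (DOALL) case that DOACROSS complements, here written with an OpenMP parallel for as just one of several ways to distribute the indices:

#include <omp.h>

/* DOALL: iterations carry no dependences, so the indices are simply
 * divided among the available threads. */
void scale(double *y, const double *x, double k, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = k * x[i];
}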

Use of the polyhedral model within a compiler requires software to represent the objects of this framework and perform operations upon them.

OpenHMPP is a programming standard for heterogeneous computing. Based on a set of compiler directives, the standard is a programming model designed to handle hardware accelerators without the complexity associated with GPU programming. This directive-based approach was adopted because directives allow a loose relationship between the application code and the use of a hardware accelerator (HWA).

For several years parallel hardware was only available for distributed computing, but recently it has become available for low-end computers as well. It has therefore become inevitable for software programmers to start writing parallel applications. Programmers naturally think sequentially, and are less accustomed to writing multi-threaded or parallel applications, which require handling issues such as synchronization and deadlock avoidance and demand expertise beyond the application domain. As a result, programmers prefer to write sequential code, which most popular programming languages support and which lets them concentrate on the application itself. There is therefore a need to convert such sequential applications into parallel ones with the help of automated tools. The need is far from trivial, because a large amount of legacy code written over the past few decades has to be reused and parallelized.

DOPIPE parallelism is a method of performing loop-level parallelism by pipelining the statements in a loop. Pipelined parallelism may exist at different levels of abstraction, such as loops, functions and algorithmic stages. The extent of parallelism depends on the programmer's ability to make the best use of this concept, as well as on factors such as identifying and separating the independent tasks and executing them in parallel.

Privatization is a technique used in shared-memory programming to enable parallelism, by removing dependencies that occur across different threads in a parallel program. Dependencies between threads arise from two or more threads reading or writing a variable at the same time. Privatization gives each thread a private copy, so it can read and write it independently and thus simultaneously.
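This is the property required of the temp variable in the DOACROSS example above. A minimal sketch, using OpenMP's private clause as one way to express it (function and array names assumed for illustration):

#include <omp.h>

/* Each thread gets its own copy of temp, so writes from different
 * iterations no longer conflict. */
void privatized(double *a, const double *b, const double *c, int n)
{
    double temp;
    #pragma omp parallel for private(temp)
    for (int i = 0; i < n; i++) {
        temp = b[i] * c[i];
        a[i] = temp + 1.0;
    }
}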

References

  1. Unnikrishnan, Priya; Shirako, Jun; Barton, Kit; Chatterjee, Sanjay; Silvera, Raul; Sarkar, Vivek (2012-08-27). Kaklamanis, Christos; Papatheodorou, Theodore; Spirakis, Paul G. (eds.). Euro-Par 2012 Parallel Processing. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 219–231. doi:10.1007/978-3-642-32820-6_23. ISBN 9783642328190.
  2. Solihin, Yan (2009). Fundamentals of Parallel Multicore Architecture. Chapman and Hall/CRC.