DOPIPE

Last updated

DOPIPE parallelism is a method to perform loop-level parallelism by pipelining the statements in a loop. Pipelined parallelism may exist at different levels of abstraction like loops, functions and algorithmic stages. The extent of parallelism depends upon the programmers' ability to make best use of this concept. It also depends upon factors like identifying and separating the independent tasks and executing them parallelly. [1]

Contents

Background

The main purpose of employing loop-level parallelism is to search and split sequential tasks of a program and convert them into parallel tasks without any prior information about the algorithm. Parts of data that are recurring and consume significant amount of execution time are good candidates for loop-level parallelism. Some common applications of loop-level parallelism are found in mathematical analysis that uses multiple-dimension matrices which are iterated in nested loops. [2]

There are different kind of parallelization techniques which are used on the basis of data storage overhead, degree of parallelization and data dependencies. Some of the known techniques are: DOALL , DOACROSS and DOPIPE.

DOALL: This technique is used where we can parallelize each iteration of the loop without any interaction between the iterations. Hence, the overall run-time gets reduced from N * T (for a serial processor, where T is the execution time for each iteration) to only T (since all the N iterations are executed in parallel).

DOACROSS: This technique is used wherever there is a possibility for data dependencies. Hence, we parallelize tasks in such a manner that all the data independent tasks are executed in parallel, but the dependent ones are executed sequentially. There is a degree of synchronization used to sync the dependent tasks across parallel processors.

Description

DOPIPE is a pipelined parallelization technique that is used in programs where each element produced during each iteration is consumed in the later iteration. The following example shows how to implement the DOPIPE technique to reduce the total execution time by breaking the tasks inside the loop and executing them in a pipelined manner. The breaking into tasks takes place in such a way that all the dependencies within the loop are unidirectional, i.e. the following iteration does not depend on the previous iteration.

Example

The program below shows a pseudocode [2] for DOPIPE parallelization.

In this code, we see that there are three tasks (F0, F1 and F2) inside a loop iterating over j for 1 to N. Following is a list of dependencies in the code:

F1[j] → T F1[j+1], implies that statement F1 in iteration j+1 must be executed after statement F1 in iteration j. This is also known as true dependency.

F1[j] → T F2[j], implies that statement F2 in iteration j must be executed after statement F1 in iteration j.

for (j=1; j<=N; j++) {     F0: o[j] = x[j] - a[j];     F1: z[j] = z[j-1] * 5;     F2: y[j] = z[j] * w[j]; } 

If this code would have been executed sequentially, then the total time consumed would be equal to N * (TF0 + TF1 + TF2), where TF0, TF1 and TF2 denote execution time for functions F0, F1 and F2 respectively per iteration. Now, if we parallelize the loop by pipelining the statements inside the loop in the following manner:

for (j=1; j<=N; j++) {     F0: o[j] = x[j] - a[j];             // DOALL parallelism }  for (j=1; j<=N; j++) {     F1: z[j] = z[j-1] * 5;              // DOPIPE parallelism     post(j);                            // The result of F1 is posted and available for use }  for (j=1; j<=N; j++) {     wait(j);                            // It waits till the F1 completes and produces the value z[j] to be used by F2     F2: y[j] = z[j] * w[j]; } 

Since, F0 is an independent function, i.e. it does not have any loop-carried dependency (no dependence on j+1 or j-1 iterations). Neither it has any dependency across other statements in the loop. Hence, we can completely separate out this function and run it parallelly using DOALL parallelism. On the other hand, Statements F1 and F2 are dependent (explained above), therefore we split them into two different loops and execute them in a pipelined fashion. We use post(j) and wait(j) to synchronize between F1 and F2 loops.

Starting from the first iteration of j, statement F1 gets executed in TF1 time. Meanwhile, F2 is not getting executed since it is waiting for the value z[j] to be produced by F1. When F1 completes its execution for iteration j, it posts the value using post(j). After waiting for F1's execution, using wait(j), F2 starts its execution since it has the value z[j] available for use. Also, since F1's execution is not restricted by F2, hence F1 executes j+1 simultaneously. The figure below shows the execution timeline of all the statements.

DOPIPE example Example for DOPIPE.png
DOPIPE example

From the figure, we see that the total time to execute F0 is TF0, since all the iterations of F0 are executed in parallel. While for F1 and F2, the total execution time is equal to N * TF1 + TF2 (considering negligible synchronization time).

This is considerably less than the time obtained during sequential execution.

Comparison with other models

DOALL parallelism mainly works on principle of divide and conquer. Here all the tasks run in different iterations that use unique set of data. So the problem with this implementation is that when large amount of data works is computed together, a large cache space is needed to work for different threads. Since there is no dependencies in the threads, there is no overhead for the inter - thread communication.

While in DOPIPE, there is a synchronization overhead between the threads. But, due to its pipelined structure, it requires less cache space because the data produced is immediately consumed by the consumer. [2]

See also

Related Research Articles

<span class="mw-page-title-main">Superscalar processor</span> CPU that implements instruction-level parallelism within a single processor

A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.

<span class="mw-page-title-main">Parallel computing</span> Programming paradigm in which many processes are executed simultaneously

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

<span class="mw-page-title-main">Instruction-level parallelism</span> Ability of computer instructions to be executed simultaneously with correct results

Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution.

<span class="mw-page-title-main">OpenMP</span> Open standard for parallelizing

OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

In computing, a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. Some amount of buffer storage is often inserted between elements.

In compiler theory, loop optimization is the process of increasing execution speed and reducing the overheads associated with loops. It plays an important role in improving cache performance and making effective use of parallel processing capabilities. Most execution time of a scientific program is spent on loops; as such, many compiler optimization techniques have been developed to make them faster.

In computer science, loop dependence analysis is a process which can be used to find dependencies within iterations of a loop with the goal of determining different relationships between statements. These dependent relationships are tied to the order in which different statements access memory locations. Using the analysis of these relationships, execution of the loop can be organized to allow multiple processors to work on different portions of the loop in parallel. This is known as parallel processing. In general, loops can consume a lot of processing time when executed as serial code. Through parallel processing, it is possible to reduce the total execution time of a program through sharing the processing load among multiple processors.

Automatic parallelization, also auto parallelization, or autoparallelization refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is a challenge because it requires complex program analysis and the best approach may depend upon parameter values that are not known at compilation time.

<span class="mw-page-title-main">Binary Modular Dataflow Machine</span>

Binary Modular Dataflow Machine (BMDFM) is a software package that enables running an application in parallel on shared memory symmetric multiprocessing (SMP) computers using the multiple processors to speed up the execution of single applications. BMDFM automatically identifies and exploits parallelism due to the static and mainly dynamic scheduling of the dataflow instruction sequences derived from the formerly sequential program.

Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once. For example, modern conventional computers, including specialized supercomputers, typically have vector operations that simultaneously perform operations such as the following four additions :

Thread Level Speculation (TLS), also known as Speculative Multithreading, or Speculative Parallelization, is a technique to speculatively execute a section of computer code that is anticipated to be executed later in parallel with the normal execution on a separate independent thread. Such a speculative thread may need to make assumptions about the values of input variables. If these prove to be invalid, then the portions of the speculative thread that rely on these input variables will need to be discarded and squashed. If the assumptions are correct the program can complete in a shorter time provided the thread was able to be scheduled efficiently.

<span class="mw-page-title-main">Data parallelism</span> Parallelization across multiple processors in parallel computing environments

Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures like arrays and matrices by working on each element in parallel. It contrasts to task parallelism as another form of parallelism.

Task parallelism is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing tasks—concurrently performed by processes or threads—across different processors. In contrast to data parallelism which involves running the same task on different components of data, task parallelism is distinguished by running many different tasks at the same time on the same data. A common type of task parallelism is pipelining, which consists of moving a single set of data through a series of separate tasks where each task can execute independently of the others.

<span class="mw-page-title-main">Multithreading (computer architecture)</span> Ability of a CPU to provide multiple threads of execution concurrently

In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).

<span class="mw-page-title-main">Parallel Extensions</span>

Parallel Extensions was the development name for a managed concurrency library developed by a collaboration between Microsoft Research and the CLR team at Microsoft. The library was released in version 4.0 of the .NET Framework. It is composed of two parts: Parallel LINQ (PLINQ) and Task Parallel Library (TPL). It also consists of a set of coordination data structures (CDS) – sets of data structures used to synchronize and co-ordinate the execution of concurrent tasks.

Loop-level parallelism is a form of parallelism in software programming that is concerned with extracting parallel tasks from loops. The opportunity for loop-level parallelism often arises in computing programs where data is stored in random access data structures. Where a sequential program will iterate over the data structure and operate on indices one at a time, a program exploiting loop-level parallelism will use multiple threads or processes which operate on some or all of the indices at the same time. Such parallelism provides a speedup to overall execution time of the program, typically in line with Amdahl's law.

For several years parallel hardware was only available for distributed computing but recently it is becoming available for the low end computers as well. Hence it has become inevitable for software programmers to start writing parallel applications. It is quite natural for programmers to think sequentially and hence they are less acquainted with writing multi-threaded or parallel processing applications. Parallel programming requires handling various issues such as synchronization and deadlock avoidance. Programmers require added expertise for writing such applications apart from their expertise in the application domain. Hence programmers prefer to write sequential code and most of the popular programming languages support it. This allows them to concentrate more on the application. Therefore, there is a need to convert such sequential applications to parallel applications with the help of automated tools. The need is also non-trivial because large amount of legacy code written over the past few decades needs to be reused and parallelized.

Pipelining is an important technique used in several applications such as digital signal processing (DSP) systems, microprocessors, etc. It originates from the idea of a water pipe with continuous water sent in without waiting for the water in the pipe to come out. Accordingly, it results in speed enhancement for the critical path in most DSP systems. For example, it can either increase the clock speed or reduce the power consumption at the same speed in a DSP system.

Privatization is a technique used in shared-memory programming to enable parallelism, by removing dependencies that occur across different threads in a parallel program. Dependencies between threads arise from two or more threads reading or writing a variable at the same time. Privatization gives each thread a private copy, so it can read and write it independently and thus, simultaneously.

DOACROSS parallelism is a parallelization technique used to perform Loop-level parallelism by utilizing synchronisation primitives between statements in a loop. This technique is used when a loop cannot be fully parallelized by DOALL parallelism due to data dependencies between loop iterations, typically loop-carried dependencies. The sections of the loop which contain loop-carried dependence are synchronized, while treating each section as a parallel task on its own. Therefore, DOACROSS parallelism can be used to complement DOALL parallelism to reduce loop execution times.

References

  1. Pankratius, Victor; Adl-Tabatabai, Ali-Reza; Tichy, Walter (2011). Fundamentals of Multicore Software Development. CRC press. ISBN   9781439812747.
  2. 1 2 3 Solihin, Yan (2016). Fundamentals of Parallel Multicore Architecture. Chapman and Hall/CRC. ISBN   9781482211191.