Granularity (parallel computing)

Last updated

In parallel computing, granularity (or grain size) of a task is a measure of the amount of work (or computation) which is performed by that task. [1]

Contents

Another definition of granularity takes into account the communication overhead between multiple processors or processing elements. It defines granularity as the ratio of computation time to communication time, wherein computation time is the time required to perform the computation of a task and communication time is the time required to exchange data between processors. [2]

If Tcomp is the computation time and Tcomm denotes the communication time, then the granularity G of a task can be calculated as: [2]

Granularity is usually measured in terms of the number of instructions which are executed in a particular task. [1] Alternately, granularity can also be specified in terms of the execution time of a program, combining the computation time and communication time. [1]

Types of parallelism

Depending on the amount of work which is performed by a parallel task, parallelism can be classified into three categories: fine-grained, medium-grained and coarse-grained parallelism.

Fine-grained parallelism

In fine-grained parallelism, a program is broken down to a large number of small tasks. These tasks are assigned individually to many processors. The amount of work associated with a parallel task is low and the work is evenly distributed among the processors. Hence, fine-grained parallelism facilitates load balancing. [3]

As each task processes less data, the number of processors required to perform the complete processing is high. This in turn, increases the communication and synchronization overhead.

Fine-grained parallelism is best exploited in architectures which support fast communication. Shared memory architecture which has a low communication overhead is most suitable for fine-grained parallelism.

It is difficult for programmers to detect parallelism in a program, therefore, it is usually the compilers' responsibility to detect fine-grained parallelism. [1]

An example of a fine-grained system (from outside the parallel computing domain) is the system of neurons in our brain. [4]

Connection Machine (CM-2) and J-Machine are examples of fine-grain parallel computers that have grain size in the range of 4-5 μs. [1]

Coarse-grained parallelism

In coarse-grained parallelism, a program is split into large tasks. Due to this, a large amount of computation takes place in processors. This might result in load imbalance, wherein certain tasks process the bulk of the data while others might be idle. Further, coarse-grained parallelism fails to exploit the parallelism in the program as most of the computation is performed sequentially on a processor. The advantage of this type of parallelism is low communication and synchronization overhead.

Message-passing architecture takes a long time to communicate data among processes which makes it suitable for coarse-grained parallelism. [1]

Cray Y-MP is an example of coarse-grained parallel computer which has a grain size of about 20s. [1]

Medium-grained parallelism

Medium-grained parallelism is used relatively to fine-grained and coarse-grained parallelism. Medium-grained parallelism is a compromise between fine-grained and coarse-grained parallelism, where we have task size and communication time greater than fine-grained parallelism and lower than coarse-grained parallelism. Most general-purpose parallel computers fall in this category. [4]

Intel iPSC is an example of medium-grained parallel computer which has a grain size of about 10ms. [1]

Example

Consider a 10*10 image that needs to be processed, given that, processing of the 100 pixels is independent of each other.

Fine-grained parallelism: Assume there are 100 processors that are responsible for processing the 10*10 image. Ignoring the communication overhead, the 100 processors can process the 10*10 image in 1 clock cycle. Each processor is working on 1 pixel of the image and then communicates the output to other processors. This is an example of fine-grained parallelism.

Medium-grained parallelism: Consider that there are 25 processors processing the 10*10 image. The processing of the image will now take 4 clock cycles. This is an example of medium-grained parallelism.

Coarse-grained parallelism: Further, if we reduce the processors to 2, then the processing will take 50 clock cycles. Each processor need to process 50 elements which increases the computation time, but the communication overhead decreases as the number of processors which share data decreases. This case illustrates coarse-grained parallelism.

Fine-grain : Pseudocode for 100 processorsMedium-grain : Pseudocode for 25 processorsCoarse-grain : Pseudocode for 2 processors
voidmain(){switch(Processor_ID){case1:Computeelement1;break;case2:Computeelement2;break;case3:Computeelement3;break;....case100:Computeelement100;break;}}
voidmain(){switch(Processor_ID){case1:Computeelements1-4;break;case2:Computeelements5-8;break;case3:Computeelements9-12;break;..case25:Computeelements97-100;break;}}
voidmain(){switch(Processor_ID){case1:Computeelements1-50;break;case2:Computeelements51-100;break;}}
Computation time - 1 clock cycleComputation time - 4 clock cyclesComputation time - 50 clock cycles

Levels of parallelism

Granularity is closely tied to the level of processing. A program can be broken down into 4 levels of parallelism -

  1. Instruction level.
  2. Loop level
  3. Sub-routine level and
  4. Program-level

The highest amount of parallelism is achieved at instruction level, followed by loop-level parallelism. At instruction and loop level, fine-grained parallelism is achieved. Typical grain size at instruction-level is 20 instructions, while the grain-size at loop-level is 500 instructions. [1]

At the sub-routine (or procedure) level the grain size is typically a few thousand instructions. Medium-grained parallelism is achieved at sub-routine level. [1]

At program-level, parallel execution of programs takes place. Granularity can be in the range of tens of thousands of instructions. [1] Coarse-grained parallelism is used at this level.

The below table shows the relationship between levels of parallelism, grain size and degree of parallelism

LevelsGrain SizeParallelism
Instruction levelFineHighest
Loop levelFineModerate
Sub-routine levelMediumModerate
Program levelCoarseLeast

Impact of granularity on performance

Granularity affects the performance of parallel computers. Using fine grains or small tasks results in more parallelism and hence increases the speedup. However, synchronization overhead, scheduling strategies etc. can negatively impact the performance of fine-grained tasks. Increasing parallelism alone cannot give the best performance. [5]

In order to reduce the communication overhead, granularity can be increased. Coarse grained tasks have less communication overhead but they often cause load imbalance. Hence optimal performance is achieved between the two extremes of fine-grained and coarse-grained parallelism. [6]

Various studies [5] [7] [8] have proposed their solution to help determine the best granularity to aid parallel processing. Finding the best grain size, depends on a number of factors and varies greatly from problem-to-problem.

See also

Citations

  1. 1 2 3 4 5 6 7 8 9 10 11 Hwang, Kai (1992). Advanced Computer Architecture: Parallelism, Scalability, Programmability (1st ed.). McGraw-Hill Higher Education. ISBN   978-0070316225.
  2. 1 2 Kwiatkowski, Jan (9 September 2001). "Evaluation of Parallel Programs by Measurement of Its Granularity". Parallel Processing and Applied Mathematics. Lecture Notes in Computer Science. Vol. 2328. pp. 145–153. doi:10.1007/3-540-48086-2_16. ISBN   9783540437925. ISBN   9783540480860.
  3. Barney, Blaise. Introduction to Parallel Computing.
  4. 1 2 Miller, Russ; Stout, Quentin F. (1996). Parallel Algorithms for Regular Architectures: Meshes and Pyramids. Cambridge, Mass.: MIT Press. pp. 5–6. ISBN   9780262132336.
  5. 1 2 Chen, Ding-Kai; Su, Hong-Men; Yew, Pen-Chung (1 January 1990). "The Impact of Synchronization and Granularity on Parallel Systems". Proceedings of the 17th Annual International Symposium on Computer Architecture. 18 (2SI): 239–248. CiteSeerX   10.1.1.51.3389 . doi:10.1145/325164.325150. S2CID   16193537.
  6. Yeung, Donald; Dally, William J.; Agarwal, Anant. "How to Choose the Grain Size of a Parallel Computer". CiteSeerX   10.1.1.66.3298 .{{cite journal}}: Cite journal requires |journal= (help)
  7. McCreary, Carolyn; Gill, Helen (1 September 1989). "Automatic Determination of Grain Size for Efficient Parallel Processing". Commun. ACM. 32 (9): 1073–1078. doi: 10.1145/66451.66454 . ISSN   0001-0782. S2CID   14807217.
  8. Kruatrachue, Boontee; Lewis, Ted (1 January 1988). "Grain Size Determination for Parallel Processing". IEEE Softw. 5 (1): 23–32. doi:10.1109/52.1991. ISSN   0740-7459. S2CID   2034255.

Related Research Articles

<span class="mw-page-title-main">Parallel computing</span> Programming paradigm in which many processes are executed simultaneously

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

Reconfigurable computing is a computer architecture combining some of the flexibility of software with the high performance of hardware by processing with flexible hardware platforms like field-programmable gate arrays (FPGAs). The principal difference when compared to using ordinary microprocessors is the ability to add custom computational blocks using FPGAs. On the other hand, the main difference from custom hardware, i.e. application-specific integrated circuits (ASICs) is the possibility to adapt the hardware during runtime by "loading" a new circuit on the reconfigurable fabric, thus providing new computational blocks without the need to manufacture and add new chips to the existing system.

In computer science, a lock or mutex is a synchronization primitive that prevents state from being modified or accessed by multiple threads of execution at once. Locks enforce mutual exclusion concurrency control policies, and with a variety of possible methods there exists multiple unique implementations for different applications.

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern processor architectures.

<span class="mw-page-title-main">OpenMP</span> Open standard for parallelizing

OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

Granularity is the degree to which a material or system is composed of distinguishable pieces, "granules" or "grains" (metaphorically). It can either refer to the extent to which a larger entity is subdivided, or the extent to which groups of smaller indistinguishable entities have joined together to become larger distinguishable entities.

In computing, single program, multiple data (SPMD) is a term that has been used to refer to computational models for exploiting parallelism where-by multiple processors cooperate in the execution of a program in order to obtain results faster.

<span class="mw-page-title-main">Hardware acceleration</span> Specialized computer hardware

Hardware acceleration is the use of computer hardware designed to perform specific functions more efficiently when compared to software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.

In computing, a parallel programming model is an abstraction of parallel computer architecture, with which it is convenient to express algorithms and their composition in programs. The value of a programming model can be judged on its generality: how well a range of different problems can be expressed for a variety of different architectures, and its performance: how efficiently the compiled programs can execute. The implementation of a parallel programming model can take the form of a library invoked from a sequential language, as an extension to an existing language, or as an entirely new language.

Automatic parallelization, also auto parallelization, or autoparallelization refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is a challenge because it requires complex program analysis and the best approach may depend upon parameter values that are not known at compilation time.

Concurrent computing is a form of computing in which several computations are executed concurrently—during overlapping time periods—instead of sequentially—with one completing before the next starts.

In computer architecture, a transport triggered architecture (TTA) is a kind of processor design in which programs directly control the internal transport buses of a processor. Computation happens as a side effect of data transports: writing data into a triggering port of a functional unit triggers the functional unit to start a computation. This is similar to what happens in a systolic array. Due to its modular structure, TTA is an ideal processor template for application-specific instruction set processors (ASIP) with customized datapath but without the inflexibility and design cost of fixed function hardware accelerators.

In computer programming, explicit parallelism is the representation of concurrent computations by means of primitives in the form of special-purpose directives or function calls. Most parallel primitives are related to process synchronization, communication or task partitioning. As they seldom contribute to actually carry out the intended computation of the program, their computational cost is often considered as parallelization overhead.

SIMD within a register (SWAR), also known by the name "packed SIMD" is a technique for performing parallel operations on data contained in a processor register. SIMD stands for single instruction, multiple data. Flynn's 1972 taxonomy categorises SWAR as "pipelined processing".

<span class="mw-page-title-main">Data parallelism</span> Parallelization across multiple processors in parallel computing environments

Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures like arrays and matrices by working on each element in parallel. It contrasts to task parallelism as another form of parallelism.

<span class="mw-page-title-main">Multithreading (computer architecture)</span> Ability of a CPU to provide multiple threads of execution concurrently

In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).

<span class="mw-page-title-main">Plurality (company)</span>

Plurality Ltd. is an Israeli semiconductor company, the developer of the HyperCore technology and the HAL multi-core processor. The company is a member of the Multicore Association.

Software is said to exhibit scalable parallelism if it can make use of additional processors to solve larger problems, i.e. this term refers to software for which Gustafson's law holds. Consider a program whose execution time is dominated by one or more loops, each of that updates every element of an array --- for example, the following finite difference heat equation stencil calculation:

for t := 0 to T dofor i := 1 to N-1 do  new(i) := * .25  // explicit forward-difference with R = 0.25  endfor i := 1 to N-1 do  A(i) := new(i)  endend

In parallel computing, work stealing is a scheduling strategy for multithreaded computer programs. It solves the problem of executing a dynamically multithreaded computation, one that can "spawn" new threads of execution, on a statically multithreaded computer, with a fixed number of processors. It does so efficiently in terms of execution time, memory usage, and inter-processor communication.

DOPIPE parallelism is a method to perform loop-level parallelism by pipelining the statements in a loop. Pipelined parallelism may exist at different levels of abstraction like loops, functions and algorithmic stages. The extent of parallelism depends upon the programmers' ability to make best use of this concept. It also depends upon factors like identifying and separating the independent tasks and executing them parallelly.