Granularity (parallel computing)

Last updated May 26, 2025

In parallel computing, granularity (or grain size) of a task is a measure of the amount of work (or computation) which is performed by that task.^[1]

Another definition of granularity takes into account the communication overhead between multiple processors or processing elements. It defines granularity as the ratio of computation time to communication time, wherein computation time is the time required to perform the computation of a task and communication time is the time required to exchange data between processors.^[2]

If $T comp$ is the computation time and $T comm$ denotes the communication time, then the granularity $G$ of a task can be calculated as:^[2]

G={\frac {T_{\mathrm {comp} }}{T_{\mathrm {comm} }}}

Granularity is usually measured in terms of the number of instructions which are executed in a particular task.^[1] Alternately, granularity can also be specified in terms of the execution time of a program, combining the computation time and communication time.^[1]

Types of parallelism

Depending on the amount of work which is performed by a parallel task, parallelism can be classified into three categories: fine-grained, medium-grained and coarse-grained parallelism.

Fine-grained parallelism

In fine-grained parallelism, a program is broken down to a large number of small tasks. These tasks are assigned individually to many processors. The amount of work associated with a parallel task is low and the work is evenly distributed among the processors. Hence, fine-grained parallelism facilitates load balancing.^[3]

As each task processes less data, the number of processors required to perform the complete processing is high. This in turn, increases the communication and synchronization overhead.

Fine-grained parallelism is best exploited in architectures which support fast communication. Shared memory architecture which has a low communication overhead is most suitable for fine-grained parallelism.

It is difficult for programmers to detect parallelism in a program, therefore, it is usually the compilers' responsibility to detect fine-grained parallelism.^[1]

An example of a fine-grained system (from outside the parallel computing domain) is the system of neurons in our brain.^[4]

Connection Machine (CM-2) and J-Machine are examples of fine-grain parallel computers that have grain size in the range of 4-5 μs.^[1]

Coarse-grained parallelism

In coarse-grained parallelism, a program is split into large tasks. Due to this, a large amount of computation takes place in processors. This might result in load imbalance, wherein certain tasks process the bulk of the data while others might be idle. Further, coarse-grained parallelism fails to exploit the parallelism in the program as most of the computation is performed sequentially on a processor. The advantage of this type of parallelism is low communication and synchronization overhead.

Message-passing architecture takes a long time to communicate data among processes which makes it suitable for coarse-grained parallelism.^[1]

Cray Y-MP is an example of coarse-grained parallel computer which has a grain size of about 20s.^[1]

Medium-grained parallelism

Medium-grained parallelism is used relatively to fine-grained and coarse-grained parallelism. Medium-grained parallelism is a compromise between fine-grained and coarse-grained parallelism, where we have task size and communication time greater than fine-grained parallelism and lower than coarse-grained parallelism. Most general-purpose parallel computers fall in this category.^[4]

Intel iPSC is an example of medium-grained parallel computer which has a grain size of about 10ms.^[1]

Example

Consider a stack of 20 images with size 10x10 pixels that need to be processed, assuming that each of the 100 pixels can be processed independently of each other. Processing 1 pixel takes 1 clock cycle.

Fine-grained parallelism: Each pixel will be processed individually by one processor at a time. Assuming there are 100 processors that are responsible for processing the image, the 100 processors can process one 10x10 image in a single clock cycle. With 20 processors, it would take 5 clock cycles per image. Each processor can be utilized for 100% of its available time but the result of each pixel-computation needs to be communicated and aggregated at the end of each image processing which can cause a lot of overhead (100 communications per image = 2000 total).

Medium-grained parallelism: The images are split into quarters. Each quarter will be processed individually by one processor at a time taking 25 clock cycles (for 5x5 pixels). Assuming there are 20 processors that are responsible for processing the stack of 20 images, 5 images can be processed in parallel with 4 processors working on each image. If 100 processors were available, 80 could process the stack in parallel taking 25 clock cycles while 20 processors sit idle without any work assigned to them. Once the four quarters have been processed, the results must be aggregated (4 communications per image = 80 total).

Coarse-grained parallelism: A full image is processed by a single processor taking 100 clock cycles. In this case only 20 processors can be used at a time, completing the work in 100 clock cycles without any communication.

The decision on which approach is best depends on the workload and available processing units. The goal should be to maximize parallelization (split work into enough units to evenly distribute it across most available processors) while minimizing communication overhead (ratio of time spend on communication vs time spend on computation). In our example, if the number of pictures to process is high compared to the number of workers, it does not make sense to break images down into smaller units since each worker will receive enough load. If the number of pictures is small compared to the number of workers, some workers might sit idle and waste computation time. However, this only is a problem if processing a single image takes a long time. If the processing is very fast then splitting the work into smaller units might make the total operation slower since the time lost to communication is more than the time gained through parallelization.

Levels of parallelism

Granularity is closely tied to the level of processing. A program can be broken down into 4 levels of parallelism -

Instruction level.
Loop level
Sub-routine level and
Program-level

The highest amount of parallelism is achieved at instruction level, followed by loop-level parallelism. At instruction and loop level, fine-grained parallelism is achieved. Typical grain size at instruction-level is 20 instructions, while the grain-size at loop-level is 500 instructions.^[1]

At the sub-routine (or procedure) level the grain size is typically a few thousand instructions. Medium-grained parallelism is achieved at sub-routine level.^[1]

At program-level, parallel execution of programs takes place. Granularity can be in the range of tens of thousands of instructions.^[1] Coarse-grained parallelism is used at this level.

The below table shows the relationship between levels of parallelism, grain size and degree of parallelism

Levels	Grain Size	Parallelism
Instruction level	Fine	Highest
Loop level	Fine	Moderate
Sub-routine level	Medium	Moderate
Program level	Coarse	Least

Impact of granularity on performance

Granularity affects the performance of parallel computers. Using fine grains or small tasks results in more parallelism and hence increases the speedup. However, synchronization overhead, scheduling strategies etc. can negatively impact the performance of fine-grained tasks. Increasing parallelism alone cannot give the best performance.^[5]

In order to reduce the communication overhead, granularity can be increased. Coarse grained tasks have less communication overhead but they often cause load imbalance. Hence optimal performance is achieved between the two extremes of fine-grained and coarse-grained parallelism.^[6]

Various studies^[5]^[7]^[8] have proposed their solution to help determine the best granularity to aid parallel processing. Finding the best grain size, depends on a number of factors and varies greatly from problem-to-problem.

Citations

1 2 3 4 5 6 7 8 9 10 11 Hwang, Kai (1992). Advanced Computer Architecture: Parallelism, Scalability, Programmability (1st ed.). McGraw-Hill Higher Education. ISBN 978-0070316225.
1 2 Kwiatkowski, Jan (9 September 2001). "Evaluation of Parallel Programs by Measurement of Its Granularity". Parallel Processing and Applied Mathematics. Lecture Notes in Computer Science. Vol. 2328. pp. 145–153. doi:10.1007/3-540-48086-2_16. ISBN 9783540437925. ISBN 9783540480860.
↑ Barney, Blaise. Introduction to Parallel Computing.
1 2 Miller, Russ; Stout, Quentin F. (1996). Parallel Algorithms for Regular Architectures: Meshes and Pyramids. Cambridge, Mass.: MIT Press. pp. 5–6. ISBN 9780262132336.
1 2 Chen, Ding-Kai; Su, Hong-Men; Yew, Pen-Chung (1 January 1990). "The impact of synchronization and granularity on parallel systems". Proceedings of the 17th annual international symposium on Computer Architecture - ISCA '90. Vol. 18. pp. 239–248. CiteSeerX 10.1.1.51.3389 . doi:10.1145/325164.325150. ISBN 0-89791-366-3. S2CID 16193537.
↑ Yeung, Donald; Dally, William J.; Agarwal, Anant. "How to Choose the Grain Size of a Parallel Computer". CiteSeerX 10.1.1.66.3298 .{{cite journal}}: Cite journal requires |journal= (help)
↑ McCreary, Carolyn; Gill, Helen (1 September 1989). "Automatic Determination of Grain Size for Efficient Parallel Processing". Commun. ACM. 32 (9): 1073–1078. doi: 10.1145/66451.66454 . ISSN 0001-0782. S2CID 14807217.
↑ Kruatrachue, Boontee; Lewis, Ted (1 January 1988). "Grain Size Determination for Parallel Processing" . IEEE Softw. 5 (1): 23–32. doi:10.1109/52.1991. ISSN 0740-7459. S2CID 2034255.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[hwang-1] 1 2 3 4 5 6 7 8 9 10 11 Hwang, Kai (1992). Advanced Computer Architecture: Parallelism, Scalability, Programmability (1st ed.). McGraw-Hill Higher Education. ISBN 978-0070316225.

[Kwiatkowski-2] 1 2 Kwiatkowski, Jan (9 September 2001). "Evaluation of Parallel Programs by Measurement of Its Granularity". Parallel Processing and Applied Mathematics. Lecture Notes in Computer Science. Vol. 2328. pp. 145–153. doi:10.1007/3-540-48086-2_16. ISBN 9783540437925. ISBN 9783540480860.

[3] Barney, Blaise. Introduction to Parallel Computing.

[Russ-4] 1 2 Miller, Russ; Stout, Quentin F. (1996). Parallel Algorithms for Regular Architectures: Meshes and Pyramids. Cambridge, Mass.: MIT Press. pp. 5–6. ISBN 9780262132336.

[chen-5] 1 2 Chen, Ding-Kai; Su, Hong-Men; Yew, Pen-Chung (1 January 1990). "The impact of synchronization and granularity on parallel systems". Proceedings of the 17th annual international symposium on Computer Architecture - ISCA '90. Vol. 18. pp. 239–248. CiteSeerX 10.1.1.51.3389 . doi:10.1145/325164.325150. ISBN 0-89791-366-3. S2CID 16193537.

[6] Yeung, Donald; Dally, William J.; Agarwal, Anant. "How to Choose the Grain Size of a Parallel Computer". CiteSeerX 10.1.1.66.3298 .{{cite journal}}: Cite journal requires |journal= (help)

[7] McCreary, Carolyn; Gill, Helen (1 September 1989). "Automatic Determination of Grain Size for Efficient Parallel Processing". Commun. ACM. 32 (9): 1073–1078. doi: 10.1145/66451.66454 . ISSN 0001-0782. S2CID 14807217.

[8] Kruatrachue, Boontee; Lewis, Ted (1 January 1988). "Grain Size Determination for Parallel Processing" . IEEE Softw. 5 (1): 23–32. doi:10.1109/52.1991. ISSN 0740-7459. S2CID 2034255.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

v t e Parallel computing
General	Distributed computing Parallel computing Parallel algorithm Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Simultaneous and heterogenous Speculative (SpMT) Preemptive Cooperative Clustered multi-thread (CMT) Hardware scout
Theory	PRAM model PEM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array
Coordination	Multiprocessing Memory coherence Cache coherence Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD Array processing (SIMT) Pipelined processing Associative processing MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Beowulf cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost Chapel HPX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP Global Arrays GPUOpen MPI OpenMP OpenCL OpenHMPP OpenACC Parallel Extensions PVM pthreads RaftLib ROCm UPC TBB ZPL
Problems	Automatic parallelization Deadlock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: Parallel computing