Embarrassingly parallel

Last updated

In parallel computing, an embarrassingly parallel workload or problem (also called perfectly parallel, delightfully parallel or pleasingly parallel) is one where little or no effort is needed to separate the problem into a number of parallel tasks. [1] This is often the case where there is little or no dependency or need for communication between those parallel tasks, or for results between them. [2]

Parallel computing programming paradigm in which many calculations or the execution of processes are carried out simultaneously

Parallel computing is a type of computation in which many calculations or the execution of processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but it's gaining broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.


Thus, these are different from distributed computing problems that need communication between tasks, especially communication of intermediate results. They are easy to perform on server farms which lack the special infrastructure used in a true supercomputer cluster. They are thus well suited to large, Internet-based distributed platforms such as BOINC, and do not suffer from parallel slowdown. The opposite of embarrassingly parallel problems are inherently serial problems, which cannot be parallelized at all.

Distributed computing is a field of computer science that studies distributed systems. A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. The components interact with one another in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications.

Server farm Collection of computer servers

A server farm or server cluster is a collection of computer servers – usually maintained by an organization to supply server functionality far beyond the capability of a single machine. Server farms often consist of thousands of computers which require a large amount of power to run and to keep cool. At the optimum performance level, a server farm has enormous costs associated with it. Server farms often have backup servers, which can take over the function of primary servers in the event of a primary-server failure. Server farms are typically collocated with the network switches and/or routers which enable communication between the different parts of the cluster and the users of the cluster. Server farmers typically mount the computers, routers, power supplies, and related electronics on 19-inch racks in a server room or data center.

Supercomputer Extremely powerful computer for its era

A supercomputer is a computer with a high level of performance compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instructions per second (MIPS). Since 2017, there are supercomputers which can perform over a hundred quadrillion FLOPS. Since November 2017, all of the world's fastest 500 supercomputers run Linux-based operating systems. Additional research is being conducted in China, the United States, the European Union, Taiwan and Japan to build even faster, more powerful and technologically superior exascale supercomputers.

A common example of an embarrassingly parallel problem is 3D video rendering handled by a graphics processing unit, where each frame (forward method) or pixel (ray tracing method) can be handled with no interdependency. [3] Password cracking is another embarrassingly parallel task that is easily distributed on central processing units, CPU cores, or clusters.

Graphics processing unit specialized electronic circuit; graphics accelerator

A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at manipulating computer graphics and image processing. Their highly parallel structure makes them more efficient than general-purpose central processing units (CPUs) for algorithms that process large blocks of data in parallel. In a personal computer, a GPU can be present on a video card or embedded on the motherboard. In certain CPUs, they are embedded on the CPU die.

Ray tracing (graphics) rendering method

In computer graphics, ray tracing is a rendering technique for generating an image by tracing the path of light as pixels in an image plane and simulating the effects of its encounters with virtual objects. The technique is capable of producing a very high degree of visual realism, usually higher than that of typical scanline rendering methods, but at a greater computational cost. This makes ray tracing best suited for applications where taking a relatively long time to render a frame can be tolerated, such as in still images and film and television visual effects, and more poorly suited for real-time applications such as video games where speed is critical. Ray tracing is capable of simulating a wide variety of optical effects, such as reflection and refraction, scattering, and dispersion phenomena.

In cryptanalysis and computer security, password cracking is the process which of recovering passwords from data that have been stored in or transmitted by a computer system. A common approach is to try guesses repeatedly for the password and check them against an available cryptographic hash of the password.


"Embarrassingly" is used here in the same sense as in the phrase "an embarrassment of riches", meaning an overabundance—here referring to parallelization problems which are "embarrassingly easy". [4] The term may also imply embarrassment on the part of developers or compilers: "Because so many important problems remain unsolved mainly due to their intrinsic computational complexity, it would be embarrassing not to develop parallel implementations of polynomial homotopy continuation methods." [5] The term is first found in the literature in a 1986 book on multiprocessors by MATLAB's creator Cleve Moler, [6] who claims to have invented the term. [7]

An embarrassment of riches is an idiom that means an overabundance of something, or too much of a good thing, that originated in 1738 as John Ozell's translation of a French play, L'Embarras des richesses (1726), by Léonor Jean Christine Soulas d'Allainval.

MATLAB multi-paradigm numerical computing environment

MATLAB is a multi-paradigm numerical computing environment and proprietary programming language developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages, including C, C++, C#, Java, Fortran and Python.

Cleve Barry Moler is an American mathematician and computer programmer specializing in numerical analysis. In the mid to late 1970s, he was one of the authors of LINPACK and EISPACK, Fortran libraries for numerical computing. He invented MATLAB, a numerical computing package, to give his students at the University of New Mexico easy access to these libraries without writing Fortran. In 1984, he co-founded MathWorks with Jack Little to commercialize this program.

An alternative term, pleasingly parallel, has gained some use, perhaps to avoid the negative connotations of embarrassment in favor of a positive reflection on the parallelizability of the problems: "Of course, there is nothing embarrassing about these programs at all." [8]


Some examples of embarrassingly parallel problems include:

Numerical integration family of algorithms for calculating the numerical value of a definite integral

In analysis, numerical integration comprises a broad family of algorithms for calculating the numerical value of a definite integral, and by extension, the term is also sometimes used to describe the numerical solution of differential equations. This article focuses on calculation of definite integrals. The term numerical quadrature is more or less a synonym for numerical integration, especially as applied to one-dimensional integrals. Some authors refer to numerical integration over more than one dimension as cubature; others take quadrature to include higher-dimensional integration.

Mandelbrot set Fractal named after mathematician Benoit Mandelbrot

The Mandelbrot set is the set of complex numbers for which the function does not diverge when iterated from , i.e., for which the sequence , , etc., remains bounded in absolute value.


See also

Related Research Articles

Symmetric multiprocessing multiprocessor architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single OS that treats all processors equally

Symmetric multiprocessing (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single operating system instance that treats all processors equally, reserving none for special purposes. Most multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors.

Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system. The term also refers to the ability of a system to support more than one processor or the ability to allocate tasks between them. There are many variations on this basic theme, and the definition of multiprocessing can vary with context, mostly as a function of how CPUs are defined.

Beowulf cluster type of parallel computing cluster

A Beowulf cluster is a computer cluster of what are normally identical, commodity-grade computers networked into a small local area network with libraries and programs installed which allow processing to be shared among them. The result is a high-performance parallel computing cluster from inexpensive personal computer hardware.

In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is an algorithm which can do multiple operations in a given time. It has been a tradition of computer science to describe serial algorithms in abstract machine models, often the one known as Random-access machine. Similarly, many computer science researchers have used a so-called parallel random-access machine (PRAM) as an parallel (shared-memory) abstract machine.

MIMD class of parallel computer architecture in Flynns taxonomy, in which multiple operations are performed on multiple data points simultaneously

In computing, MIMD is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data. MIMD architectures may be used in a number of application areas such as computer-aided design/computer-aided manufacturing, simulation, modeling, and as communication switches. MIMD machines can be of either shared memory or distributed memory categories. These classifications are based on how MIMD processors access memory. Shared memory machines may be of the bus-based, extended, or hierarchical type. Distributed memory machines may have hypercube or mesh interconnection schemes.

Distributed memory

In computer science, distributed memory refers to a multiprocessor computer system in which each processor has its own private memory. Computational tasks can only operate on local data, and if remote data is required, the computational task must communicate with one or more remote processors. In contrast, a shared memory multiprocessor offers a single memory space used by all processors. Processors do not have to be aware where data resides, except that there may be performance penalties, and that race conditions are to be avoided.

General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. In addition, even a single GPU-CPU framework provides advantages that multiple CPUs on their own do not offer due to the specialization in each chip.

Concurrent computing is a form of computing in which several computations are executed during overlapping time periods—concurrently—instead of sequentially. This is a property of a system—this may be an individual program, a computer, or a network—and there is a separate execution point or "thread of control" for each computation ("process"). A concurrent system is one where a computation can advance without waiting for all other computations to complete.

In computer science, multiprocessor scheduling is an NP-hard optimization problem. The problem statement is: "Given a set J of jobs where job ji has length li and a number of processors m, what is the minimum possible time required to schedule all jobs in J on m processors such that none overlap?" The applications of this problem are numerous, but are, as suggested by the name of the problem, most strongly associated with the scheduling of computational tasks in a multiprocessor environment.

Shahid H. Bokhari is a highly cited Pakistani researcher in the field of parallel and distributed computing. He is a fellow of both IEEE and ACM. Bokhari's ACM Fellow citation states that he received the award for his "research contributions to automatic load balancing and partitioning of distributed processes", while his IEEE Fellow award recognises his "contributions to the mapping problem in parallel and distributed computing".

Intel iPSC

The Intel Personal SuperComputer was a product line of parallel computers in the 1980s and 1990s. The iPSC/1 was superseded by the Intel iPSC/2, and then the Intel iPSC/860.

Alewife was a cache coherent multiprocessor developed in the early 1990s by a group led by Anant Agarwal at the Massachusetts Institute of Technology. It was based on a network of up to 512 processing nodes, each of which used the Sparcle computer architecture, which was formed by modifying a Sun Microsystems SPARC CPU to include the APRIL techniques for fast context switches.

Computer cluster group of computers

A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.

Asymmetric multiprocessing

An asymmetric multiprocessing (AMP) system is a multiprocessor computer system where not all of the multiple interconnected central processing units (CPUs) are treated equally. For example, a system might allow only one CPU to execute operating system code or might allow only one CPU to perform I/O operations. Other AMP systems might allow any CPU to execute operating system code and perform I/O operations, so that they were symmetric with regard to processor roles, but attached some or all peripherals to particular CPUs, so that they were asymmetric with respect to the peripheral attachment.

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications which devote most of their execution time to computational requirements are deemed compute-intensive, whereas computing applications which require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive.

Tachyon (software) ray tracing software

Tachyon is a parallel/multiprocessor ray tracing software. It is a parallel ray tracing library for use on distributed memory parallel computers, shared memory computers, and clusters of workstations. Tachyon implements rendering features such as ambient occlusion lighting, depth-of-field focal blur, shadows, reflections, and others. It was originally developed for the Intel iPSC/860 by John Stone for his M.S. thesis at University of Missouri-Rolla. Tachyon subsequently became a more functional and complete ray tracing engine, and it is now incorporated into a number of other open source software packages such as VMD, and SageMath. Tachyon is released under a permissive license.

In computing, massively parallel refers to the use of a large number of processors to perform a set of coordinated computations in parallel (simultaneously).


  1. Herlihy, Maurice; Shavit, Nir (2012). The Art of Multiprocessor Programming, Revised Reprint (revised ed.). Elsevier. p. 14. ISBN   9780123977953 . Retrieved 28 February 2016. Some computational problems are “embarrassingly parallel”: they can easily be divided into components that can be executed concurrently.
  2. Section 1.4.4 of: Foster, Ian (1995). Designing and Building Parallel Programs. Addison–Wesley. ISBN   9780201575941. Archived from the original on 2011-03-01.
  3. Alan Chalmers; Erik Reinhard; Tim Davis (21 March 2011). Practical Parallel Rendering. CRC Press. ISBN   978-1-4398-6380-0.
  4. Matloff, Norman (2011). The Art of R Programming: A Tour of Statistical Software Design, p.347. No Starch. ISBN   9781593274108.
  5. Leykin, Anton; Verschelde, Jan; Zhuang, Yan (2006). Parallel Homotopy Algorithms to Solve Polynomial Systems. Proceedings of ICMS. Lecture Notes in Computer Science. 4151. pp. 225–234. doi:10.1007/11832225_22. ISBN   978-3-540-38084-9.
  6. Moler, Cleve (1986). Heath, Michael T. (ed.). Matrix Computation on Distributed Memory Multiprocessors. Hypercube Multiprocessors. Society for Industrial and Applied Mathematics, Philadelphia. ISBN   978-0898712094.
  7. The Intel hypercube part 2 reposted on Cleve's Corner blog on The MathWorks website
  8. Kepner, Jeremy (2009). Parallel MATLAB for Multicore and Multinode Computers, p.12. SIAM. ISBN   9780898716733.
  9. Erricos John Kontoghiorghes (21 December 2005). Handbook of Parallel Computing and Statistics. CRC Press. ISBN   978-1-4200-2868-3.
  10. Yuefan Deng (2013). Applied Parallel Computing. World Scientific. ISBN   978-981-4307-60-4.
  11. Simon, Josefsson; Colin, Percival (August 2016). "The scrypt Password-Based Key Derivation Function". tools.ietf.org. Retrieved 2016-12-12.
  12. SeqAnswers forum
  13. How we made our face recognizer 25 times faster (developer blog post)
  14. Shigeyoshi Tsutsui; Pierre Collet (5 December 2013). Massively Parallel Evolutionary Computation on GPGPUs. Springer Science & Business Media. ISBN   978-3-642-37959-8.
  15. Youssef Hamadi; Lakhdar Sais (5 April 2018). Handbook of Parallel Constraint Reasoning. Springer. ISBN   978-3-319-63516-3.
  16. Simple Network of Workstations (SNOW) package