Single program, multiple data

Last updated January 26, 2024 • 8 min readFrom Wikipedia, The Free Encyclopedia

In computing, single program, multiple data (SPMD) is a term that has been used to refer to computational models for exploiting parallelism where-by multiple processors cooperate in the execution of a program in order to obtain results faster.

by Michel Auguin (University of Nice Sophia-Antipolis) and François Larbey (Thomson/Sintra),^[1]^[2]^[3] as a “fork-and-join” and data-parallel approach where the parallel tasks (“single program”) are split-up and run simultaneously in lockstep on multiple SIMD processors with different inputs, and
by Frederica Darema (IBM),^[4]^[5]^[6] where “all (processors) processes begin executing the same program... but through synchronization directives ... self-schedule themselves to execute different instructions and act on different data” and enabling MIMD parallelization of a given program, and is a more general approach than data-parallel and more efficient than the fork-and-join for parallel execution on general purpose multiprocessors.

The (IBM) SPMD is the most common style of parallel programming and can be considered a subcategory of MIMD in that it refers to MIMD execution of a given (“single”) program.^[7] It is also a prerequisite for research concepts such as active messages and distributed shared memory.

SPMD vs SIMD

An example of "Single program, multiple data".

In SPMD parallel execution, multiple autonomous processors simultaneously execute the same program at independent points, rather than in the lockstep that SIMD or SIMT imposes on different data. With SPMD, tasks can be executed on general purpose CPUs. In SIMD the same operation (instruction) is applied on multiple data to manipulate data streams (a version of SIMD is vector processing where the data are organized as vectors). Another class of processors, GPUs encompass multiple SIMD streams processing. Note that SPMD and SIMD are not mutually exclusive; SPMD parallel execution can include SIMD, or vector, or GPU sub-processing. SPMD has been used for parallel programming of both message passing and shared-memory machine architectures.

Distributed memory

On distributed memory computer architectures, SPMD implementations usually employ message passing programming. A distributed memory computer consists of a collection of interconnected, independent computers, called nodes. For parallel execution, each node starts its own program and communicates with other nodes by sending and receiving messages, calling send/receive routines for that purpose. Other parallelization directives such as Barrier synchronization may also be implemented by messages. The messages can be sent by a number of communication mechanisms, such as TCP/IP over Ethernet, or specialized high-speed interconnects such as Myrinet and Supercomputer Interconnect. For distributed memory environments, serial sections of the program can be implemented by identical computation of the serial section on all nodes rather than computing the result on one node and sending it to the others, if that improves performance by reducing communication overhead.

Nowadays, the programmer is isolated from the details of the message passing by standard interfaces, such as PVM and MPI.

Distributed memory is the programming style used on parallel supercomputers from homegrown Beowulf clusters to the largest clusters on the Teragrid, as well as present GPU-based supercomputers.

Shared memory

On a shared memory machine (a computer with several interconnected CPUs that access the same memory space), the sharing can be implemented in the context of either physically shared memory or logically shared (but physically distributed) memory; in addition to the shared memory, the CPUs in the computer system can also include local (or private) memory. For either of these contexts, synchronization can be enabled with hardware enabled primitives (such as compare-and-swap, or fetch-and-add. For machines that do not have such hardware support, locks can be used and data can be “exchanged” across processors (or, more generally, processes or threads) by depositing the sharable data in a shared memory area. When the hardware does not support shared memory, packing the data as a “message” is often the most efficient way to program (logically) shared memory computers with large number of processors, where the physical memory is local to processors and accessing memory of another processor takes longer. SPMD on a shared memory machine can be implemented by standard processes (heavyweight) or threads (lightweight).

Shared memory multiprocessing (both symmetric multiprocessing, SMP, and non-uniform memory access, NUMA) presents the programmer with a common memory space and the possibility to parallelize execution. With the (IBM) SPMD model the cooperating processors (or processes) take different paths through the program, using parallel directives (parallelization and synchronization directives, which can utilize compare-and-swap and fetch-and-add operations on shared memory synchronization variables), and perform operations on data in the shared memory (“shared data”); the processors (or processes) can also have access and perform operations on data in their local memory (“private data”). In distinction, with fork-and-join approaches, the program starts executing on one processor and the execution splits in a parallel region, which is started when parallel directives are encountered; in a parallel region, the processors execute a parallel task on different data. A typical example is the parallel DO loop, where different processors work on separate parts of the arrays involved in the loop. At the end of the loop, execution is synchronized (with soft- or hard-barriers^[6]), and processors (processes) continue to the next available section of the program to execute. The (IBM) SPMD has been implemented in the current standard interface for shared memory multiprocessing, OpenMP, which uses multithreading , usually implemented by lightweight processes, called threads.

Combination of levels of parallelism

Current computers allow exploiting many parallel modes at the same time for maximum combined effect. A distributed memory program using MPI may run on a collection of nodes. Each node may be a shared memory computer and execute in parallel on multiple CPUs using OpenMP. Within each CPU, SIMD vector instructions (usually generated automatically by the compiler) and superscalar instruction execution (usually handled transparently by the CPU itself), such as pipelining and the use of multiple parallel functional units, are used for maximum single CPU speed.

History

The acronym SPMD for “Single-Program Multiple-Data” has been used to describe two different computational models for exploiting parallel computing, and this is due to both terms being natural extensions of Flynn’s taxonomy.^[7] The two respective groups of researchers were unaware of each other’s use of the term SPMD to independently describe different models of parallel programming.

The term SPMD was proposed first in 1983 by Michel Auguin (University of Nice Sophia-Antipolis) and François Larbey (Thomson/Sintra) in the context of the OPSILA parallel computer and in the context of a fork-and-join and data parallel computational model approach.^[1] This computer consisted of a master (controller processor) and SIMD processors (or vector processor mode as proposed by Flynn). In Auguin’s SPMD model, the same (parallel) task (“same program”) is executed on different (SIMD) processors (“operating in lock-step mode”^[1] acting on a part (“slice”) of the data-vector. Specifically, in their 1985 paper^[2] (and similarly in^[3]^[1]) is stated: “we consider the SPMD (Single Program, Multiple Data) operating mode. This mode allows simultaneous execution of the same task (one per processor) but prevents data exchange between processors. Data exchanges are only performed under SIMD mode by means of vector assignments. We assume synchronizations are summed-up to switchings (sic) between SIMD and SPMD operatings (sic) modes using global fork-join primitives”).

Starting around the same timeframe (in late 1983 – early 1984), the SPMD term was proposed by Frederica Darema (at IBM at that time, and part of the RP3 group) to define a different SMPD computational model that she proposed,^[6]^[5]^[4] as a programming model which in the intervening years has been applied to a wide range of general-purpose high-performance computers (including RP3 - the 512-processor IBM Research Parallel Processor Prototype) and has led to the current parallel computing standards. The (IBM) SPMD programming model assumes a multiplicity of processors which operate cooperatively, all executing the same program but can take different paths through the program based on parallelization directives embedded in the program; and specifically as stated in^[6]^[5]^[4]^[9]^[10] “all processesparticipating in the parallel computation are created at the beginning of the execution and remain in existence until the end”, (the processors/processes) “execute different instructions and act on different data”, “the job(work) to be done by each process is allocated dynamically”, that is the processes “self-schedule themselves to execute different instructions and act on different data”, thus self-assign themselves to cooperate in execution of serial and parallel tasks (as well as replicate tasks) in the program. The notion process was used as a generalization of the term processor in the sense that multiple processes can execute on a processor (to for example exploit larger degrees of parallelism for more efficiency and load-balancing). The (IBM) SPMD model was proposed by Darema as an approach different and more efficient than the fork-and-join that was pursued by all others in the community at that time; it is also more general than just “data-parallel” computational model and can encompass fork&join (as a subcategory implementation). The original context of the (IBM) SPMD was the RP3 computer (the 512-prosessor IBM Research Parallel Processor Prototype), which supported general purpose computing, with both distributed and (logically) shared memory.^[9] The (IBM) SPMD model was implemented by Darema and IBM colleagues into the EPEX (Environment for Parallel Execution), one of the first prototype programming environments.^[6]^[5]^[4]^[9]^[10]^[11] The effectiveness of the (IBM) SPMD was demonstrated for a wide class of applications,^[9]^[4] and in 1988 was implemented in the IBM FORTRAN,^[12] the first vendor-product in parallel programming; and in MPI (1991 and on), OpenMP (1997 and on), and other environments which have adopted and cite the (IBM) SPMD Computational Model.

By the late 1980s, there were many distributed computers with proprietary message passing libraries. The first SPMD standard was PVM. The current de facto standard is MPI.

The Cray parallel directives were a direct predecessor of OpenMP.

Related Research Articles

A central processing unit (CPU)—also called a central processor or main processor—is the most important processor in a given computer. Its electronic circuitry executes instructions of a computer program, such as arithmetic, logic, controlling, and input/output (I/O) operations. This role contrasts with that of external components, such as main memory and I/O circuitry, and specialized coprocessors such as graphics processing units (GPUs).

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. Distributed computing is a field of computer science that studies distributed systems.

In computing, a process is the instance of a computer program that is being executed by one or many threads. There are many different process models, some of which are light weight, but almost all processes are rooted in an operating system (OS) process which comprises the program code, assigned system resources, physical and logical access permissions, and data structures to initiate, control and coordinate execution activity. Depending on the OS, a process may be made up of multiple threads of execution that execute instructions concurrently.

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. In many cases, a thread is a component of a process.

In computer science, an instruction set architecture (ISA) is an abstract model of a computer. A device that executes instructions described by that ISA, such as a central processing unit (CPU), is called an implementation.

Symmetric multiprocessing or shared-memory multiprocessing (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single operating system instance that treats all processors equally, reserving none for special purposes. Most multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors.

<span class="mw-page-title-main">Single instruction, multiple data</span> Type of parallel processing

Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal and it can be directly accessible through an instruction set architecture (ISA), but it should not be confused with an ISA. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.

Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system. The term also refers to the ability of a system to support more than one processor or the ability to allocate tasks between them. There are many variations on this basic theme, and the definition of multiprocessing can vary with context, mostly as a function of how CPUs are defined.

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

Message Passing Interface (MPI) is a standardized and portable message-passing standard designed to function on parallel computing architectures. The MPI standard defines the syntax and semantics of library routines that are useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. There are several open-source MPI implementations, which fostered the development of a parallel software industry, and encouraged development of portable and scalable large-scale parallel applications.

Flynn's taxonomy is a classification of computer architectures, proposed by Michael J. Flynn in 1966 and extended in 1972. The classification system has stuck, and it has been used as a tool in the design of modern processors and their functionalities. Since the rise of multiprocessing central processing units (CPUs), a multiprogramming context has evolved as an extension of the classification system. Vector processing, covered by Duncan's taxonomy, is missing from Flynn's work because the Cray-1 was released in 1977: Flynn's second paper was published in 1972.

OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

<span class="mw-page-title-main">Hardware acceleration</span> Specialized computer hardware

Hardware acceleration is the use of computer hardware designed to perform specific functions more efficiently when compared to software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.

In computing, a parallel programming model is an abstraction of parallel computer architecture, with which it is convenient to express algorithms and their composition in programs. The value of a programming model can be judged on its generality: how well a range of different problems can be expressed for a variety of different architectures, and its performance: how efficiently the compiled programs can execute. The implementation of a parallel programming model can take the form of a library invoked from a sequential language, as an extension to an existing language, or as an entirely new language.

Concurrent computing is a form of computing in which several computations are executed concurrently—during overlapping time periods—instead of sequentially—with one completing before the next starts.

In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures like arrays and matrices by working on each element in parallel. It contrasts to task parallelism as another form of parallelism.

Task parallelism is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing tasks—concurrently performed by processes or threads—across different processors. In contrast to data parallelism which involves running the same task on different components of data, task parallelism is distinguished by running many different tasks at the same time on the same data. A common type of task parallelism is pipelining, which consists of moving a single set of data through a series of separate tasks where each task can execute independently of the others.

<span class="mw-page-title-main">Computer cluster</span> Set of computers configured in a distributed computing system

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. The newest manifestation of cluster computing is cloud computing.

Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading. It is different from SPMD in that all instructions in all "threads" are executed in lock-step. The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU), e.g. some supercomputers combine CPUs with GPUs.

References

1 2 3 4 M. Auguin, F. Larbey (1983). "OPSILA: an advanced SIMD for numerical analysis and signal processing". Microcomputers: Developments in Industry, Business, and Education / Ninth EUROMICRO Symposium on Microprocessing and Microprogramming, Pp 311-318 Madrid, September 13–16, 1983.
1 2 M. Auguin, F. Labrey (1985). "A Multi-processor SIMD Machine: OPSILA". K. Waldschmidt and B. Myhrhaug Eds, @EUROMICRO, 1985, Elsevier Science Publishers B. V. – North Holland.
1 2 Auguin, M.; Boeri, F.; Dalban, J.P; Vincent-Carrefour, A. (1987). "Experience Using a SIMD/SPMD Multiprocessor Architecture". Multiprocessing and Microprogramming. 21 (1–5): 171–178. doi:10.1016/0165-6074(87)90034-2.
1 2 3 4 5 Darema, Frederica (2001). "SPMD model: past, present and future, Recent Advances in Parallel Virtual Machine and Message Passing Interface". 8th European PVM/MPI Users' Group Meeting, Santorini/Thera, Greece, September 23–26, 2001. Lecture Notes in Computer Science 2131.
1 2 3 4 F. Darema-Rogers, D. A. George, V. A. Norton, and G. F. Pfister (1985). "A VM Parallel Environment". IBM/RC11225 (1/23/85) and IBM/RC11381(9/19/85).{{cite journal}}: CS1 maint: multiple names: authors list (link)
1 2 3 4 5 Darema, F.; George, D.A.; Norton, V.A.; Pfister, G.F. (1988). "A single-program-multiple-data computational model for EPEX/FORTRAN". Journal of Parallel Computing. 7: 11–24. doi:10.1016/0167-8191(88)90094-4.
1 2 Flynn, Michael (1972). "Some Computer Organizations and Their Effectiveness" (PDF). IEEE Transactions on Computers. C-21 (9): 948–960. doi:10.1109/TC.1972.5009071. S2CID 18573685.
↑ Flynn, Michael J. (September 1972). "Some Computer Organizations and Their Effectiveness" (PDF). IEEE Transactions on Computers . C-21 (9): 948–960. doi:10.1109/TC.1972.5009071.
1 2 3 4 Darema, Frederica (1987). "Applications Environment for the IBM Research Parallel Processor Prototype (RP3)". IBMRC12627 (3/27/87) and in Proceedings of the 1st International Conference on Supercomputing (ICS'87), by Springer-Verlag (1987).
1 2 Darema, Frederica (1988). "Parallel Applications Development for Shared Memory Systems". IBM/RC12229(1986) and in Parallel Systems and Computation, G. Paul and G. S. Almasi Eds, Elsevier Science Publishers B. V. (North Holland), 1988.
↑ J. M. Stone, F. Darema-Rogers, V. A. Norton, G. F. Pfister (1985). "Introduction to the VM/EPEX Preprocessor and Reference". IBM/RC11407(9/30/85) and IBM/RC11408 (9/30/85).{{cite journal}}: CS1 maint: multiple names: authors list (link)
↑ Toomey, L. J.; Plachy, E. C.; Scarborough, R. G.; Sahulka, R. J.; Shaw, J. F.; Shannon, A. W. (1988). "IBM Parallel FORTRAN". IBM Systems Journal. 27 (4): 416–435. doi:10.1147/sj.274.0416.

External links

Parallel job management and message passing
Single Program Multiple Data stream Archived 2004-06-04 at the Wayback Machine
SPMD
Distributed-memory programming Archived 2013-12-13 at the Wayback Machine

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[:0-1] 1 2 3 4 M. Auguin, F. Larbey (1983). "OPSILA: an advanced SIMD for numerical analysis and signal processing". Microcomputers: Developments in Industry, Business, and Education / Ninth EUROMICRO Symposium on Microprocessing and Microprogramming, Pp 311-318 Madrid, September 13–16, 1983.

[:1-2] 1 2 M. Auguin, F. Labrey (1985). "A Multi-processor SIMD Machine: OPSILA". K. Waldschmidt and B. Myhrhaug Eds, @EUROMICRO, 1985, Elsevier Science Publishers B. V. – North Holland.

[:2-3] 1 2 Auguin, M.; Boeri, F.; Dalban, J.P; Vincent-Carrefour, A. (1987). "Experience Using a SIMD/SPMD Multiprocessor Architecture". Multiprocessing and Microprogramming. 21 (1–5): 171–178. doi:10.1016/0165-6074(87)90034-2.

[:3-4] 1 2 3 4 5 Darema, Frederica (2001). "SPMD model: past, present and future, Recent Advances in Parallel Virtual Machine and Message Passing Interface". 8th European PVM/MPI Users' Group Meeting, Santorini/Thera, Greece, September 23–26, 2001. Lecture Notes in Computer Science 2131.

[:4-5] 1 2 3 4 F. Darema-Rogers, D. A. George, V. A. Norton, and G. F. Pfister (1985). "A VM Parallel Environment". IBM/RC11225 (1/23/85) and IBM/RC11381(9/19/85).{{cite journal}}: CS1 maint: multiple names: authors list (link)

[:5-6] 1 2 3 4 5 Darema, F.; George, D.A.; Norton, V.A.; Pfister, G.F. (1988). "A single-program-multiple-data computational model for EPEX/FORTRAN". Journal of Parallel Computing. 7: 11–24. doi:10.1016/0167-8191(88)90094-4.

[:6-7] 1 2 Flynn, Michael (1972). "Some Computer Organizations and Their Effectiveness" (PDF). IEEE Transactions on Computers. C-21 (9): 948–960. doi:10.1109/TC.1972.5009071. S2CID 18573685.

[flynn-1972-8] Flynn, Michael J. (September 1972). "Some Computer Organizations and Their Effectiveness" (PDF). IEEE Transactions on Computers . C-21 (9): 948–960. doi:10.1109/TC.1972.5009071.

[:7-9] 1 2 3 4 Darema, Frederica (1987). "Applications Environment for the IBM Research Parallel Processor Prototype (RP3)". IBMRC12627 (3/27/87) and in Proceedings of the 1st International Conference on Supercomputing (ICS'87), by Springer-Verlag (1987).

[:8-10] 1 2 Darema, Frederica (1988). "Parallel Applications Development for Shared Memory Systems". IBM/RC12229(1986) and in Parallel Systems and Computation, G. Paul and G. S. Almasi Eds, Elsevier Science Publishers B. V. (North Holland), 1988.

[11] J. M. Stone, F. Darema-Rogers, V. A. Norton, G. F. Pfister (1985). "Introduction to the VM/EPEX Preprocessor and Reference". IBM/RC11407(9/30/85) and IBM/RC11408 (9/30/85).{{cite journal}}: CS1 maint: multiple names: authors list (link)

[12] Toomey, L. J.; Plachy, E. C.; Scarborough, R. G.; Sahulka, R. J.; Shaw, J. F.; Shannon, A. W. (1988). "IBM Parallel FORTRAN". IBM Systems Journal. 27 (4): 416–435. doi:10.1147/sj.274.0416.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

v t e Parallel computing
General	Distributed computing Parallel computing Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Speculative (SpMT) Preemptive Cooperative Clustered multi-thread (CMT) Hardware scout
Theory	PRAM model PEM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array
Coordination	Multiprocessing Memory coherence Cache coherence Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD Array processing (SIMT) Pipelined processing Associative processing MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Beowulf cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost Chapel HPX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP Global Arrays GPUOpen MPI OpenMP OpenCL OpenHMPP OpenACC Parallel Extensions PVM pthreads RaftLib ROCm UPC TBB ZPL
Problems	Automatic parallelization Deadlock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: Parallel computing