Symmetric multiprocessing

Last updated November 30, 2023

Symmetric multiprocessing or shared-memory multiprocessing^[1] (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single operating system instance that treats all processors equally, reserving none for special purposes. Most multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors.

Professor John D. Kubiatowicz considers traditionally SMP systems to contain processors without caches.^[2] Culler and Pal-Singh in their 1998 book "Parallel Computer Architecture: A Hardware/Software Approach" mention: "The term SMP is widely used but causes a bit of confusion. [...] The more precise description of what is intended by SMP is a shared memory multiprocessor where the cost of accessing a memory location is the same for all processors; that is, it has uniform access costs when the access actually is to memory. If the location is cached, the access will be faster, but cache access times and memory access times are the same on all processors."^[3]

SMP systems are tightly coupled multiprocessor systems with a pool of homogeneous processors running independently of each other. Each processor, executing different programs and working on different sets of data, has the capability of sharing common resources (memory, I/O device, interrupt system and so on) that are connected using a system bus or a crossbar.

Design

SMP systems have centralized shared memory called main memory (MM) operating under a single operating system with two or more homogeneous processors. Usually each processor has an associated private high-speed memory known as cache memory (or cache) to speed up the main memory data access and to reduce the system bus traffic.

Processors may be interconnected using buses, crossbar switches or on-chip mesh networks. The bottleneck in the scalability of SMP using buses or crossbar switches is the bandwidth and power consumption of the interconnect among the various processors, the memory, and the disk arrays. Mesh architectures avoid these bottlenecks, and provide nearly linear scalability to much higher processor counts at the sacrifice of programmability:

Serious programming challenges remain with this kind of architecture because it requires two distinct modes of programming; one for the CPUs themselves and one for the interconnect between the CPUs. A single programming language would have to be able to not only partition the workload, but also comprehend the memory locality, which is severe in a mesh-based architecture.^[4]

SMP systems allow any processor to work on any task no matter where the data for that task is located in memory, provided that each task in the system is not in execution on two or more processors at the same time. With proper operating system support, SMP systems can easily move tasks between processors to balance the workload efficiently.

History

The earliest production system with multiple identical processors was the Burroughs B5000, which was functional around 1961. However at run-time this was asymmetric, with one processor restricted to application programs while the other processor mainly handled the operating system and hardware interrupts. The Burroughs D825 first implemented SMP in 1962.^[5]^[6]

IBM offered dual-processor computer systems based on its System/360 Model 65 and the closely related Model 67 ^[7] and 67–2.^[8] The operating systems that ran on these machines were OS/360 M65MP^[9] and TSS/360. Other software developed at universities, notably the Michigan Terminal System (MTS), used both CPUs. Both processors could access data channels and initiate I/O. In OS/360 M65MP, peripherals could generally be attached to either processor since the operating system kernel ran on both processors (though with a "big lock" around the I/O handler).^[10] The MTS supervisor (UMMPS) has the ability to run on both CPUs of the IBM System/360 model 67–2. Supervisor locks were small and used to protect individual common data structures that might be accessed simultaneously from either CPU.^[11]

Other mainframes that supported SMP included the UNIVAC 1108 II, released in 1965, which supported up to three CPUs, and the GE-635 and GE-645,^[12]^[13] although GECOS on multiprocessor GE-635 systems ran in a master-slave asymmetric fashion, unlike Multics on multiprocessor GE-645 systems, which ran in a symmetric fashion.^[14]

Starting with its version 7.0 (1972), Digital Equipment Corporation's operating system TOPS-10 implemented the SMP feature, the earliest system running SMP was the DECSystem 1077 dual KI10 processor system.^[15] Later KL10 system could aggregate up to 8 CPUs in a SMP manner. In contrast, DECs first multi-processor VAX system, the VAX-11/782, was asymmetric,^[16] but later VAX multiprocessor systems were SMP.^[17]

Early commercial Unix SMP implementations included the Sequent Computer Systems Balance 8000 (released in 1984) and Balance 21000 (released in 1986).^[18] Both models were based on 10 MHz National Semiconductor NS32032 processors, each with a small write-through cache connected to a common memory to form a shared memory system. Another early commercial Unix SMP implementation was the NUMA based Honeywell Information Systems Italy XPS-100 designed by Dan Gielan of VAST Corporation in 1985. Its design supported up to 14 processors, but due to electrical limitations, the largest marketed version was a dual processor system. The operating system was derived and ported by VAST Corporation from AT&T 3B20 Unix SysVr3 code used internally within AT&T.

Earlier non-commercial multiprocessing UNIX ports existed, including a port named MUNIX created at the Naval Postgraduate School by 1975.^[19]

Uses

Time-sharing and server systems can often use SMP without changes to applications, as they may have multiple processes running in parallel, and a system with more than one process running can run different processes on different processors.

On personal computers, SMP is less useful for applications that have not been modified. If the system rarely runs more than one process at a time, SMP is useful only for applications that have been modified for multithreaded (multitasked) processing. Custom-programmed software can be written or modified to use multiple threads, so that it can make use of multiple processors.

Multithreaded programs can also be used in time-sharing and server systems that support multithreading, allowing them to make more use of multiple processors.

Advantages/disadvantages

In current SMP systems, all of the processors are tightly coupled inside the same box with a bus or switch; on earlier SMP systems, a single CPU took an entire cabinet. Some of the components that are shared are global memory, disks, and I/O devices. Only one copy of an OS runs on all the processors, and the OS must be designed to take advantage of this architecture. Some of the basic advantages involves cost-effective ways to increase throughput. To solve different problems and tasks, SMP applies multiple processors to that one problem, known as parallel programming.

However, there are a few limits on the scalability of SMP due to cache coherence and shared objects.

Programming

Uniprocessor and SMP systems require different programming methods to achieve maximum performance. Programs running on SMP systems may experience an increase in performance even when they have been written for uniprocessor systems. This is because hardware interrupts usually suspends program execution while the kernel that handles them can execute on an idle processor instead. The effect in most applications (e.g. games) is not so much a performance increase as the appearance that the program is running much more smoothly. Some applications, particularly building software and some distributed computing projects, run faster by a factor of (nearly) the number of additional processors. (Compilers by themselves are single threaded, but, when building a software project with multiple compilation units, if each compilation unit is handled independently, this creates an embarrassingly parallel situation across the entire multi-compilation-unit project, allowing near linear scaling of compilation time. Distributed computing projects are inherently parallel by design.)

Systems programmers must build support for SMP into the operating system, otherwise, the additional processors remain idle and the system functions as a uniprocessor system.

SMP systems can also lead to more complexity regarding instruction sets. A homogeneous processor system typically requires extra registers for "special instructions" such as SIMD (MMX, SSE, etc.), while a heterogeneous system can implement different types of hardware for different instructions/uses.

Performance

When more than one program executes at the same time, an SMP system has considerably better performance than a uni-processor, because different programs can run on different CPUs simultaneously. Conversely, asymmetric multiprocessing (AMP) usually allows only one processor to run a program or task at a time. For example, AMP can be used in assigning specific tasks to CPU based to priority and importance of task completion. AMP was created well before SMP in terms of handling multiple CPUs, which explains the lack of performance based on the example provided.

In cases where an SMP environment processes many jobs, administrators often experience a loss of hardware efficiency. Software programs have been developed to schedule jobs and other functions of the computer so that the processor utilization reaches its maximum potential. Good software packages can achieve this maximum potential by scheduling each CPU separately, as well as being able to integrate multiple SMP machines and clusters.

Access to RAM is serialized; this and cache coherency issues causes performance to lag slightly behind the number of additional processors in the system.

Alternatives

SMP uses a single shared system bus that represents one of the earliest styles of multiprocessor machine architectures, typically used for building smaller computers with up to 8 processors.

Larger computer systems might use newer architectures such as NUMA (Non-Uniform Memory Access), which dedicates different memory banks to different processors. In a NUMA architecture, processors may access local memory quickly and remote memory more slowly. This can dramatically improve memory throughput as long as the data are localized to specific processes (and thus processors). On the downside, NUMA makes the cost of moving data from one processor to another, as in workload balancing, more expensive. The benefits of NUMA are limited to particular workloads, notably on servers where the data are often associated strongly with certain tasks or users.

Finally, there is computer clustered multiprocessing (such as Beowulf), in which not all memory is available to all processors. Clustering techniques are used fairly extensively to build very large supercomputers.

Variable SMP

Variable Symmetric Multiprocessing (vSMP) is a specific mobile use case technology initiated by NVIDIA. This technology includes an extra fifth core in a quad-core device, called the Companion core, built specifically for executing tasks at a lower frequency during mobile active standby mode, video playback, and music playback.

Project Kal-El (Tegra 3),^[20] patented by NVIDIA, was the first SoC (System on Chip) to implement this new vSMP technology. This technology not only reduces mobile power consumption during active standby state, but also maximizes quad core performance during active usage for intensive mobile applications. Overall this technology addresses the need for increase in battery life performance during active and standby usage by reducing the power consumption in mobile processors.

Unlike current SMP architectures, the vSMP Companion core is OS transparent meaning that the operating system and the running applications are totally unaware of this extra core but are still able to take advantage of it. Some of the advantages of the vSMP architecture includes cache coherency, OS efficiency, and power optimization. The advantages for this architecture are explained below:

Cache coherency: There are no consequences for synchronizing caches between cores running at different frequencies since vSMP does not allow the companion core and the main cores to run simultaneously.
OS efficiency: It is inefficient when multiple CPU cores are run at different asynchronous frequencies because this could lead to possible scheduling issues.^{[ how? ]} With vSMP, the active CPU cores will run at similar frequencies to optimize OS scheduling.
Power optimization: In asynchronous clocking based architecture, each core is on a different power plane to handle voltage adjustments for different operating frequencies. The result of this could impact performance.^{[ how? ]} vSMP technology is able to dynamically enable and disable certain cores for active and standby usage, reducing overall power consumption.

These advantages lead the vSMP architecture to considerably benefit^{[ peacock prose ]} over other architectures using asynchronous clocking technologies.

Related Research Articles

A central processing unit (CPU)—also called a central processor or main processor—is the most important processor in a given computer. Its electronic circuitry executes instructions of a computer program, such as arithmetic, logic, controlling, and input/output (I/O) operations. This role contrasts with that of external components, such as main memory and I/O circuitry, and specialized coprocessors such as graphics processing units (GPUs).

<span class="mw-page-title-main">Computer multitasking</span> Concurrent execution of multiple processes

In computing, multitasking is the concurrent execution of multiple tasks over a certain period of time. New tasks can interrupt already started ones before they finish, instead of waiting for them to end. As a result, a computer executes segments of multiple tasks in an interleaved manner, while the tasks share common processing resources such as central processing units (CPUs) and main memory. Multitasking automatically interrupts the running program, saving its state and loading the saved state of another program and transferring control to it. This "context switch" may be initiated at fixed time intervals, or the running program may be coded to signal to the supervisory software when it can be interrupted.

Multiple Virtual Storage, more commonly called MVS, is the most commonly used operating system on the System/370, System/390 and IBM Z IBM mainframe computers. IBM developed MVS, along with OS/VS1 and SVS, as a successor to OS/360. It is unrelated to IBM's other mainframe operating system lines, e.g., VSE, VM, TPF.

In processor design, microcode serves as an intermediary layer situated between the central processing unit (CPU) hardware and the programmer-visible instruction set architecture of a computer, also known as its machine code. It consists of a set of hardware-level instructions that implement the higher-level machine code instructions or control internal finite-state machine sequencing in many digital processing components. While microcode is utilized in general-purpose CPUs in contemporary desktops, it also functions as a fallback path for scenarios that the faster hardwired control unit is unable to manage.

An operating system (OS) is system software that manages computer hardware and software resources, and provides common services for computer programs.

Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory. The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users.

In computing, a process is the instance of a computer program that is being executed by one or many threads. There are many different process models, some of which are light weight, but almost all processes are rooted in an operating system (OS) process which comprises the program code, assigned system resources, physical and logical access permissions, and data structures to initiate, control and coordinate execution activity. Depending on the OS, a process may be made up of multiple threads of execution that execute instructions concurrently.

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. In many cases, a thread is a component of a process.

Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system. The term also refers to the ability of a system to support more than one processor or the ability to allocate tasks between them. There are many variations on this basic theme, and the definition of multiprocessing can vary with context, mostly as a function of how CPUs are defined.

Sequent Computer Systems was a computer company that designed and manufactured multiprocessing computer systems. They were among the pioneers in high-performance symmetric multiprocessing (SMP) open systems, innovating in both hardware and software.

In computing, single program, multiple data (SPMD) is a term that has been used to refer to computational models for exploiting parallelism where-by multiple processors cooperate in the execution of a program in order to obtain results faster.

<span class="mw-page-title-main">Binary Modular Dataflow Machine</span>

Binary Modular Dataflow Machine (BMDFM) is a software package that enables running an application in parallel on shared memory symmetric multiprocessing (SMP) computers using the multiple processors to speed up the execution of single applications. BMDFM automatically identifies and exploits parallelism due to the static and mainly dynamic scheduling of the dataflow instruction sequences derived from the formerly sequential program.

A multi-core processor is a microprocessor on a single integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions. The instructions are ordinary CPU instructions but the single processor can run instructions on separate cores at the same time, increasing overall speed for programs that support multithreading or other parallel computing techniques. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package. The microprocessors currently used in almost all personal computers are multi-core.

The history of general-purpose CPUs is a continuation of the earlier history of computing hardware.

In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).

OS/360, officially known as IBM System/360 Operating System, is a discontinued batch processing operating system developed by IBM for their then-new System/360 mainframe computer, announced in 1964; it was influenced by the earlier IBSYS/IBJOB and Input/Output Control System (IOCS) packages for the IBM 7090/7094 and even more so by the PR155 Operating System for the IBM 1410/7010 processors. It was one of the earliest operating systems to require the computer hardware to include at least one direct access storage device.

The z10 is a microprocessor chip made by IBM for their System z10 mainframe computers, released February 26, 2008. It was called "z6" during development.

An asymmetric multiprocessing system is a multiprocessor computer system where not all of the multiple interconnected central processing units (CPUs) are treated equally. For example, a system might allow only one CPU to execute operating system code or might allow only one CPU to perform I/O operations. Other AMP systems might allow any CPU to execute operating system code and perform I/O operations, so that they were symmetric with regard to processor roles, but attached some or all peripherals to particular CPUs, so that they were asymmetric with respect to the peripheral attachment.

Heterogeneous computing refers to systems that use more than one kind of processor or core. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks.

A multiprocessor system is defined as "a system with more than one processor", and, more precisely, "a number of central processing units linked together to enable parallel processing to take place".

References

↑ Patterson, David; Hennessy, John (2018). Computer Organisation and Design: The Hardware/Software Interface (RISC-V ed.). Cambridge, United States: Morgan Kaufmann. p. 509. ISBN 978-0-12-812275-4.
↑ John Kubiatowicz. Introduction to Parallel Architectures and Pthreads. 2013 Short Course on Parallel Programming.
↑ David Culler; Jaswinder Pal Singh; Anoop Gupta (1999). Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. p. 47. ISBN 978-1-55860-343-1.
↑ Lina J. Karam; Ismail AlKamal; Alan Gatherer; Gene A. Frantz; David V. Anderson; Brian L. Evans (2009). "Trends in Multi-core DSP Platforms" (PDF). IEEE Signal Processing Magazine. 26 (6): 38–49. Bibcode:2009ISPM...26...38K. doi:10.1109/MSP.2009.934113. S2CID 9429714.
↑ Gregory V. Wilson (October 1994). "The History of the Development of Parallel Computing".
↑ Martin H. Weik (January 1964). "A Fourth Survey of Domestic Electronic Digital Computing Systems". Ballistic Research Laboratories, Aberdeen Proving Grounds. Burroughs D825.
↑ IBM System/360 Model 65 Functional Characteristics (PDF). Fourth Edition. IBM. September 1968. A22-6884-3.
↑ IBM System/360 Model 67 Functional Characteristics (PDF). Third Edition. IBM. February 1972. GA27-2719-2.
↑ M65MP: An Experiment in OS/360 multiprocessing
↑ Program Logic Manual, OS I/O Supervisor Logic, Release 21 (R21.7) (PDF) (Tenth ed.). IBM. April 1973. GY28-6616-9.
↑ Time Sharing Supervisor Programs by Mike Alexander (May 1971) has information on MTS, TSS, CP/67, and Multics
↑ GE-635 System Manual (PDF). General Electric. July 1964.
↑ GE-645 System Manual (PDF). General Electric. January 1968.
↑ Richard Shetron (May 5, 1998). "Fear of Multiprocessing?". Newsgroup: alt.folklore.computers. Usenet: 354e95a9.0@news.wizvax.net.
↑ DEC 1077 and SMP
↑ VAX Product Sales Guide, pages 1-23 and 1-24: the VAX-11/782 is described as an asymmetric multiprocessing system in 1982
↑ VAX 8820/8830/8840 System Hardware User's Guide: by 1988 the VAX operating system was SMP
↑ Hockney, R.W.; Jesshope, C.R. (1988). Parallel Computers 2: Architecture, Programming and Algorithms. Taylor & Francis. p. 46. ISBN 0-85274-811-6.
↑ Hawley, John Alfred (June 1975). "MUNIX, A Multiprocessing Version Of UNIX" (PDF). core.ac.uk. Retrieved 11 November 2018.
↑ Variable SMP – A Multi-Core CPU Architecture for Low Power and High Performance. NVIDIA. 2011.

External links

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Patterson, David; Hennessy, John (2018). Computer Organisation and Design: The Hardware/Software Interface (RISC-V ed.). Cambridge, United States: Morgan Kaufmann. p. 509. ISBN 978-0-12-812275-4.

[2] John Kubiatowicz. Introduction to Parallel Architectures and Pthreads. 2013 Short Course on Parallel Programming.

[3] David Culler; Jaswinder Pal Singh; Anoop Gupta (1999). Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. p. 47. ISBN 978-1-55860-343-1.

[AutoMQ-1-4] Lina J. Karam; Ismail AlKamal; Alan Gatherer; Gene A. Frantz; David V. Anderson; Brian L. Evans (2009). "Trends in Multi-core DSP Platforms" (PDF). IEEE Signal Processing Magazine. 26 (6): 38–49. Bibcode:2009ISPM...26...38K. doi:10.1109/MSP.2009.934113. S2CID 9429714.

[5] Gregory V. Wilson (October 1994). "The History of the Development of Parallel Computing".

[6] Martin H. Weik (January 1964). "A Fourth Survey of Domestic Electronic Digital Computing Systems". Ballistic Research Laboratories, Aberdeen Proving Grounds. Burroughs D825.

[7] IBM System/360 Model 65 Functional Characteristics (PDF). Fourth Edition. IBM. September 1968. A22-6884-3.

[8] IBM System/360 Model 67 Functional Characteristics (PDF). Third Edition. IBM. February 1972. GA27-2719-2.

[9] M65MP: An Experiment in OS/360 multiprocessing

[10] Program Logic Manual, OS I/O Supervisor Logic, Release 21 (R21.7) (PDF) (Tenth ed.). IBM. April 1973. GY28-6616-9.

[11] Time Sharing Supervisor Programs by Mike Alexander (May 1971) has information on MTS, TSS, CP/67, and Multics

[12] GE-635 System Manual (PDF). General Electric. July 1964.

[13] GE-645 System Manual (PDF). General Electric. January 1968.

[14] Richard Shetron (May 5, 1998). "Fear of Multiprocessing?". Newsgroup: alt.folklore.computers. Usenet: 354e95a9.0@news.wizvax.net.

[15] DEC 1077 and SMP

[16] VAX Product Sales Guide, pages 1-23 and 1-24: the VAX-11/782 is described as an asymmetric multiprocessing system in 1982

[17] VAX 8820/8830/8840 System Hardware User's Guide: by 1988 the VAX operating system was SMP

[18] Hockney, R.W.; Jesshope, C.R. (1988). Parallel Computers 2: Architecture, Programming and Algorithms. Taylor & Francis. p. 46. ISBN 0-85274-811-6.

[19] Hawley, John Alfred (June 1975). "MUNIX, A Multiprocessing Version Of UNIX" (PDF). core.ac.uk. Retrieved 11 November 2018.

[AutoMQ-4-20] Variable SMP – A Multi-Core CPU Architecture for Low Power and High Performance. NVIDIA. 2011.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

v t e Parallel computing
General	Distributed computing Parallel computing Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Speculative (SpMT) Preemptive Cooperative Clustered multi-thread (CMT) Hardware scout
Theory	PRAM model PEM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array
Coordination	Multiprocessing Memory coherence Cache coherence Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD Array processing (SIMT) Pipelined processing Associative processing MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Beowulf cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost Chapel HPX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP Global Arrays GPUOpen MPI OpenMP OpenCL OpenHMPP OpenACC Parallel Extensions PVM pthreads RaftLib ROCm UPC TBB ZPL
Problems	Automatic parallelization Deadlock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: Parallel computing