Multiprocessor system architecture

A multiprocessor system is defined as "a system with more than one processor", and, more precisely, "a number of central processing units linked together to enable parallel processing to take place". [1] [2] [3]

The key objective of a multiprocessor is to boost a system's execution speed. The other objectives are fault tolerance and application matching. [4]

The term "multiprocessor" can be confused with the term "multiprocessing". While multiprocessing is a type of processing in which two or more processors work together to execute multiple programs simultaneously, multiprocessor refers to a hardware architecture that allows multiprocessing. [5]

Multiprocessor systems are classified according to how the processors access memory and whether the processors are all of one type or of several types.

Multiprocessor system types

There are many types of multiprocessor systems:

Loosely-coupled (distributed memory) multiprocessor system

[Figure: Loosely coupled multiprocessor system]

In loosely-coupled multiprocessor systems, each processor has its own local memory, input/output (I/O) channels, and operating system. Processors exchange data over a high-speed communication network by sending messages via a technique known as "message passing". Loosely-coupled multiprocessor systems are also known as distributed-memory systems, as the processors do not share physical memory and have individual I/O channels.
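The message-passing idea described above can be illustrated with a minimal sketch. The `Node` and `Network` classes and the `receive_all` helper are hypothetical names invented for this illustration. They model only the principle that each processor keeps private memory and receives copies of data through messages, not any real interconnect.

```python
from collections import deque


class Node:
    """One processor with private local memory and a message inbox (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.local_memory = {}  # not visible to any other node
        self.inbox = deque()    # messages delivered by the network


class Network:
    """Toy interconnect that delivers messages between named nodes."""

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}

    def send(self, src, dst, key, value):
        # A message carries a copy of the data; no memory is shared.
        self.nodes[dst].inbox.append((src, key, value))


def receive_all(node):
    """Drain the inbox, storing each received value in the node's local memory."""
    while node.inbox:
        _src, key, value = node.inbox.popleft()
        node.local_memory[key] = value


a, b = Node("A"), Node("B")
net = Network([a, b])

a.local_memory["x"] = 42
net.send("A", "B", "x", a.local_memory["x"])
receive_all(b)

# A later change on A does not affect the copy B received.
a.local_memory["x"] = 7
```

After this runs, node B still holds the value it received in the message, because the two local memories are independent.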

System characteristics

Tightly-coupled (shared memory) multiprocessor system

Multiprocessor system with a shared memory closely connected to the processors.

A symmetric multiprocessing system is a system with centralized shared memory called main memory (MM) operating under a single operating system with two or more homogeneous processors.

There are two types of systems: uniform memory access (UMA) systems and non-uniform memory access (NUMA) systems.

Heterogeneous multiprocessor system

A heterogeneous multiprocessing system contains multiple, but not homogeneous, processing units – central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), or any type of application-specific integrated circuits (ASICs). The system architecture allows any accelerator – for instance, a graphics processor – to operate at the same processing level as the system's CPU.

Symmetric multiprocessor system

[Figure: Symmetric multiprocessing system]

Systems operating under a single OS (operating system) with two or more homogeneous processors and with a centralized shared main memory.

A symmetric multiprocessor (SMP) system is a pool of homogeneous processors running under a single OS with a centralized, shared main memory. Each processor, executing different programs and working on different sets of data, can share common resources (memory, I/O devices, the interrupt system, and so on) that are connected by a system bus, a crossbar, a mix of the two, or an address bus combined with a data crossbar.
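As a loose software analogy, several identical workers updating one shared memory can be sketched with threads. The thread count, the `shared_memory` dictionary, and the `worker` function are assumptions made for illustration. Python threads stand in for processors here and do not model real SMP hardware.

```python
import threading

# Shared "main memory" visible to every worker (hypothetical stand-in).
shared_memory = {"counter": 0}
lock = threading.Lock()  # serializes access to the shared resource


def worker(increments):
    """Each identical worker updates the same shared location."""
    for _ in range(increments):
        with lock:
            shared_memory["counter"] += 1


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All four workers see and modify the same data, so access has to be coordinated. Here a lock plays the role of arbitration over the shared resource.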

Each processor has its own cache memory that acts as a bridge between the processor and main memory. The function of the cache is to alleviate the need for main-memory data access, thus reducing system-bus traffic.
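The effect of a per-processor cache on bus traffic can be shown with a small sketch. `MainMemory` and `CachedProcessor` are hypothetical classes. The cache is assumed to be unbounded and read-only, so eviction and coherence between caches are deliberately left out.

```python
class MainMemory:
    """Shared main memory that counts how often it is read over the bus."""

    def __init__(self):
        self.data = {}
        self.bus_reads = 0

    def read(self, addr):
        self.bus_reads += 1  # every read here is one bus transaction
        return self.data.get(addr, 0)


class CachedProcessor:
    """Processor with a private cache in front of main memory."""

    def __init__(self, memory):
        self.memory = memory
        self.cache = {}

    def load(self, addr):
        # Go to main memory only on a cache miss.
        if addr not in self.cache:
            self.cache[addr] = self.memory.read(addr)
        return self.cache[addr]


mm = MainMemory()
mm.data[0x10] = 5
cpu = CachedProcessor(mm)

# One hundred loads of the same address.
values = [cpu.load(0x10) for _ in range(100)]
```

Only the first load reaches main memory. The remaining 99 are served from the cache, so `mm.bus_reads` stays at 1.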

Use of shared memory allows for a uniform memory-access time (UMA).

cc-NUMA system

[Figure: cc-NUMA system]
[Figure: cc-NUMA remote memory read]

The SMP system has limited scalability. To overcome this limitation, an architecture called "cc-NUMA" (cache-coherent non-uniform memory access) is normally used. The main characteristic of a cc-NUMA system is a shared global memory that is distributed across the nodes. A processor's access to the memory of a remote node is slower than access to its own local memory, which is why memory access is "non-uniform".

A cc-NUMA system is a cluster of SMP systems, each called a "node", connected via a high-speed connection network. A node can have a single processor, a multi-core processor, or a mix of the two, of one or more kinds of architecture. The connection network can be a link (a single ring, a double-reverse ring, or a multi-ring), point-to-point connections, [6] [7] a mix of these (e.g. IBM Power Systems [6] [8] ), a bus interconnection (e.g. NUMAq [9] ), a crossbar, a segmented bus (NUMA Bull HN ISI, ex Honeywell [10] ), a mesh router, etc.

cc-NUMA is also called "distributed shared memory" (DSM) architecture. [11]

The difference in access times between local and remote memory can be as much as an order of magnitude, depending on the kind of connection network used (faster with a segmented bus, crossbar, or point-to-point interconnection; slower with serial ring connections).
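A rough way to see what this gap means in practice is to compute a weighted average access time. The function name `average_access_time` and the latency figures below are assumptions chosen for illustration (remote access ten times slower than local). They are not measurements of any real system.

```python
def average_access_time(local_ns, remote_ns, remote_fraction):
    """Mean access time when a given fraction of accesses go to a remote node."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Hypothetical latencies: remote access is an order of magnitude slower.
LOCAL_NS = 100.0
REMOTE_NS = 1000.0

mostly_local = average_access_time(LOCAL_NS, REMOTE_NS, 0.05)
mixed = average_access_time(LOCAL_NS, REMOTE_NS, 0.20)
```

With these assumed figures, 5% remote accesses give an average of about 145 ns and 20% give about 280 ns, which shows how quickly remote traffic dominates the average.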

Examples of interconnection

[Figure: Double-reverse ring]
[Figure: Segmented bus]
[Figure: Crossbar]

To reduce the penalty of remote memory access, a large remote cache is normally used. With this solution, the behavior of a cc-NUMA system becomes very close to that of a large SMP system.

Tightly-coupled versus loosely-coupled architecture

Both architectures have trade-offs which may be summarized as follows:

Multiprocessor system featuring global data multiplication

An intermediate approach, between those of the two previous architectures, is having common resources and local resources, such as local memories (LM), in each processor.

The common resources are accessible from all processors via the system bus, while local resources are only accessible to the local processor. Cache memories can be viewed in this perspective as local memories.

This system (patented by F. Zulian [12] ) was used on the Unix-based DPX/2 300 system from Bull HN Information Systems Italia (ex Honeywell). [13] [14] It is a mix of tightly and loosely coupled systems and makes use of the advantages of both architectures.

The local memory is divided into two sectors, global data (GD) and local data (LD).

The basic concept of this architecture is to have global data, which is modifiable information, accessible by all processors. This information is duplicated and stored in each local memory of each processor.

Each time the global data is modified in a local memory, a hardware write-broadcast is sent over the system bus to all other local memories to maintain global data coherency. Thus, global data may be read by each processor from its own local memory without involving the system bus. System bus access is only required when global data is modified in a local memory, in order to update the copies of this data stored in the other local memories.
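The write-broadcast scheme described above can be sketched as a small simulation. `SystemBus` and `LocalMemory` are hypothetical classes written for this illustration. They follow the description in the text (replicated global data, local reads, broadcast writes) and make no claim to match the patented hardware in detail.

```python
class SystemBus:
    """Toy system bus that propagates global-data writes to all other memories."""

    def __init__(self):
        self.memories = []
        self.transactions = 0

    def attach(self, memory):
        self.memories.append(memory)

    def broadcast_write(self, source, addr, value):
        self.transactions += 1  # one bus transaction per global write
        for mem in self.memories:
            if mem is not source:
                mem.global_data[addr] = value


class LocalMemory:
    """Local memory split into a global data (GD) and a local data (LD) sector."""

    def __init__(self, bus):
        self.global_data = {}  # replicated in every local memory
        self.local_data = {}   # private to this processor
        self.bus = bus
        bus.attach(self)

    def read_global(self, addr):
        # Reads are served locally and never touch the bus.
        return self.global_data[addr]

    def write_global(self, addr, value):
        # Update the local copy, then broadcast to keep the other copies coherent.
        self.global_data[addr] = value
        self.bus.broadcast_write(self, addr, value)


bus = SystemBus()
lm0, lm1, lm2 = (LocalMemory(bus) for _ in range(3))

lm0.write_global("counter", 1)
reads = [lm.read_global("counter") for lm in (lm0, lm1, lm2)]
```

The single write produces one bus transaction, while the three subsequent reads are satisfied from each processor's own copy and generate no bus traffic.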

Local data can be exchanged via message passing, as in a loosely coupled system.

[Figure: Multiprocessor system with global data multiplication]
[Figure: Multiprocessor system with global data multiplication - global data write-broadcasting]

References

  1. "Multiprocessor definition and meaning - Collins English Dictionary". www.collinsdictionary.com.
  2. "Data" (PDF). www.cs.vu.nl.
  3. "multiprocessor – Definition of multiprocessor in English by Oxford Dictionaries". Oxford Dictionaries - English. Archived from the original on November 4, 2018.
  4. "What is a Multiprocessor? - Definition from Techopedia". Techopedia.com.
  5. "Multiprocessor dictionary definition - multiprocessor defined". www.yourdictionary.com.
  6. AMD Opteron Shared Memory MP Systems – http://www.cse.wustl.edu/~roger/569M.s09/28_AMD_Hammer_MP_HC_v8.pdf
  7. An Introduction to the Intel® QuickPath Interconnect – http://www.intel.ie/content/dam/doc/white-paper/quick-path-interconnect-introduction-paper.pdf
  8. "IBM POWER Systems Overview". computing.llnl.gov.
  9. SourceForge – http://lse.sourceforge.net/numa/faq/system_descriptions.html
  10. Bull HN F. Zulian – A. Zulian patent – Computer system with a bus having a segmented structure – http://www.freepatentsonline.com/6314484.html
  11. NUMA Architecture – http://www.dba-oracle.com/real_application_clusters_rac_grid/numa.html
  12. "Multiprocessor system featuring global data multiplation".
  13. "UNIX and Bull". www.feb-patrimoine.com.
  14. "Bull DPX". www.feb-patrimoine.com.