Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). [1] NUMA is beneficial for workloads with high memory locality of reference and low lock contention, because a processor may operate on a subset of memory mostly or entirely within its own NUMA node, reducing traffic on the memory bus. [2]
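To make the local/remote distinction concrete, the following C sketch (assuming a Linux system with the libnuma library installed; the program layout is illustrative, not canonical) prints the kernel's node-to-node distance table, in which 10 conventionally denotes local access and larger values denote proportionally slower remote access.

    /* Sketch: print the distance matrix between NUMA nodes.
     * Assumes Linux with libnuma; build with: gcc dist.c -lnuma */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this system\n");
            return EXIT_FAILURE;
        }
        int max = numa_max_node();   /* highest node number; 0 on non-NUMA machines */
        for (int from = 0; from <= max; from++) {
            for (int to = 0; to <= max; to++)
                /* numa_distance() returns 10 for local memory; remote nodes
                 * report proportionally larger values (e.g. 20 = twice as far). */
                printf("%4d", numa_distance(from, to));
            printf("\n");
        }
        return EXIT_SUCCESS;
    }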
NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. They were developed commercially during the 1990s by Unisys, Convex Computer (later Hewlett-Packard), Honeywell Information Systems Italy (HISI) (later Groupe Bull), Silicon Graphics (later Silicon Graphics International), Sequent Computer Systems (later IBM), Data General (later EMC, now Dell Technologies), Digital (later Compaq, then HP, now HPE) and ICL. Techniques developed by these companies later featured in a variety of Unix-like operating systems, and to an extent in Windows NT.
The first commercial implementation of a NUMA-based Unix system was the Symmetrical Multi Processing XPS-100 family of servers, designed by Dan Gielan of VAST Corporation for Honeywell Information Systems Italy.
Modern CPUs operate considerably faster than the main memory they use. In the early days of computing and data processing, the CPU generally ran slower than its own memory. The performance lines of processors and memory crossed in the 1960s with the advent of the first supercomputers. Since then, CPUs increasingly have found themselves "starved for data" and having to stall while waiting for data to arrive from memory (e.g., for computers based on the von Neumann architecture, see the Von Neumann bottleneck). Many supercomputer designs of the 1980s and 1990s focused on providing high-speed memory access as opposed to faster processors, allowing the computers to work on large data sets at speeds other systems could not approach.
Limiting the number of memory accesses provided the key to extracting high performance from a modern computer. For commodity processors, this meant installing an ever-increasing amount of high-speed cache memory and using increasingly sophisticated algorithms to avoid cache misses. But the dramatic increase in size of operating systems and of the applications run on them has generally overwhelmed these caching improvements. Multi-processor systems without NUMA make the problem considerably worse: such a system can starve several processors at the same time, notably because in a traditional bus-based design only one processor can access the memory at a time. [3]
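As a generic illustration of why access patterns matter to caches (not specific to any system described here), the two C loops below read the same array but differ sharply in cache behavior: the row-order walk reuses every fetched cache line, while the column-order walk strides past a full row per step and tends to miss.

    /* Illustration: cache-friendly vs cache-hostile traversal of a matrix.
     * N is arbitrary; any size well beyond the last-level cache shows the gap. */
    #include <stddef.h>

    #define N 4096
    static double a[N][N];

    /* Row-major order: consecutive addresses, each fetched cache line fully reused. */
    double sum_rows(void) {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-major order: strides of N * sizeof(double) bytes, so nearly every
     * access touches a new cache line and evicts lines before they are reused. */
    double sum_cols(void) {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }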
NUMA attempts to address the starvation problem by providing separate memory for each processor, avoiding the performance hit when several processors attempt to address the same memory. For problems whose data is spread across many tasks (common for servers and similar applications), NUMA can improve the performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks). [4] Another approach to the same problem is the multi-channel memory architecture, in which increasing the number of memory channels increases the memory access concurrency linearly. [5]
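A minimal sketch of what per-processor memory looks like to software, assuming Linux with libnuma (one possible interface, not part of the NUMA design itself): the program places one buffer on each node explicitly instead of relying on the default placement policy.

    /* Sketch: place one buffer on each NUMA node explicitly.
     * Assumes Linux with libnuma; build with: gcc alloc.c -lnuma */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0)
            return 1;
        size_t sz = 64 * 1024 * 1024;                /* 64 MiB per node */
        for (int node = 0; node <= numa_max_node(); node++) {
            void *buf = numa_alloc_onnode(sz, node); /* backed by this node's RAM */
            if (buf == NULL)
                continue;                            /* node may have no memory */
            memset(buf, 0, sz);                      /* fault the pages in on that node */
            printf("allocated %zu bytes on node %d\n", sz, node);
            numa_free(buf, sz);
        }
        return 0;
    }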
Of course, not all data ends up confined to a single task, which means that more than one processor may require the same data. To handle these cases, NUMA systems include additional hardware or software to move data between memory banks. This operation slows the processors attached to those banks, so the overall speed increase due to NUMA heavily depends on the nature of the running tasks. [4]
AMD implemented NUMA with its Opteron processor (2003), using HyperTransport. Intel announced NUMA compatibility for its x86 and Itanium servers in late 2007 with its Nehalem and Tukwila CPUs. [6] Both Intel CPU families share a common chipset; the interconnection is called Intel QuickPath Interconnect (QPI), which provides extremely high bandwidth to enable high on-board scalability and was replaced by a new version called Intel UltraPath Interconnect with the release of Skylake (2017). [7]
Nearly all CPU architectures use a small amount of very fast non-shared memory known as cache to exploit locality of reference in memory accesses. With NUMA, maintaining cache coherence across shared memory has a significant overhead. Although simpler to design and build, non-cache-coherent NUMA systems become prohibitively complex to program in the standard von Neumann architecture programming model. [8]
Typically, cache-coherent NUMA (ccNUMA) uses inter-processor communication between cache controllers to keep a consistent memory image when more than one cache stores the same memory location. For this reason, ccNUMA may perform poorly when multiple processors attempt to access the same memory area in rapid succession. Support for NUMA in operating systems attempts to reduce the frequency of this kind of access by allocating processors and memory in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make NUMA-unfriendly accesses necessary. [9]
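The cost of such rapid-succession sharing is easy to reproduce with ordinary threads. In the hedged pthreads sketch below (the 64-byte cache line size is an assumption), two threads increment counters that happen to share one cache line, so ownership of the line ping-pongs between the processors' caches via the coherence protocol.

    /* Sketch: coherence ping-pong ("false sharing") between two threads.
     * Build with: gcc pingpong.c -pthread   (generic POSIX; nothing NUMA-specific) */
    #include <pthread.h>
    #include <stdio.h>

    /* Two counters in one struct: without padding they share a cache line. */
    static struct { volatile long a, b; } line;
    /* Padded alternative: each counter gets its own line (64 bytes assumed). */
    struct { volatile long v; char pad[64 - sizeof(long)]; } padded[2];

    static void *bump(void *p) {
        volatile long *ctr = p;          /* volatile keeps the stores in memory */
        for (long i = 0; i < 100000000L; i++)
            (*ctr)++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        /* Each store invalidates the line in the other processor's cache, so
         * ownership of the line bounces back and forth through the coherence
         * protocol. Passing &padded[0].v and &padded[1].v instead removes it. */
        pthread_create(&t1, NULL, bump, (void *)&line.a);
        pthread_create(&t2, NULL, bump, (void *)&line.b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%ld b=%ld\n", line.a, line.b);
        return 0;
    }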
Alternatively, cache coherency protocols such as the MESIF protocol attempt to reduce the communication required to maintain cache coherency. Scalable Coherent Interface (SCI) is an IEEE standard defining a directory-based cache coherency protocol to avoid scalability limitations found in earlier multiprocessor systems. For example, SCI is used as the basis for the NumaConnect technology. [10] [11]
One can view NUMA as a tightly coupled form of cluster computing. The addition of virtual memory paging to a cluster architecture can allow the implementation of NUMA entirely in software. However, the inter-node latency of software-based NUMA remains several orders of magnitude greater (slower) than that of hardware-based NUMA. [2]
Since NUMA strongly influences memory access performance, software optimizations are needed that schedule threads and processes close to their in-memory data.
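One such optimization, sketched here under the assumption of Linux with libnuma (node 0 is an arbitrary, hypothetical choice of home node), is to pin a thread to a node and then allocate its working set from that same node, keeping subsequent accesses local.

    /* Sketch: run a worker on a chosen node and keep its data there.
     * Assumes Linux with libnuma; build with: gcc near_data.c -lnuma */
    #include <numa.h>
    #include <stdio.h>

    static double work(double *data, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += data[i];                           /* all accesses stay local */
        return s;
    }

    int main(void) {
        if (numa_available() < 0)
            return 1;
        int node = 0;                               /* hypothetical home node */
        if (numa_run_on_node(node) < 0)             /* pin thread to the node's CPUs */
            return 1;
        size_t n = 1 << 24;
        double *data = numa_alloc_local(n * sizeof *data); /* memory on same node */
        if (data == NULL)
            return 1;
        for (size_t i = 0; i < n; i++)
            data[i] = 1.0;
        printf("sum = %f\n", work(data, n));
        numa_free(data, n * sizeof *data);
        return 0;
    }

From the command line, numactl --cpunodebind=0 --membind=0 ./a.out imposes a similar CPU-and-memory binding on an unmodified program.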
As of 2011, ccNUMA systems are multiprocessor systems based on the AMD Opteron processor, which can be implemented without external logic, and the Intel Itanium processor, which requires the chipset to support NUMA. Examples of ccNUMA-enabled chipsets are the SGI Shub (Super hub), the Intel E8870, the HP sx2000 (used in the Integrity and Superdome servers), and those found in NEC Itanium-based systems. Earlier ccNUMA systems such as those from Silicon Graphics were based on MIPS processors and the DEC Alpha 21364 (EV7) processor.
Itanium is a discontinued family of 64-bit Intel microprocessors that implement the Intel Itanium architecture. The Itanium architecture originated at Hewlett-Packard (HP) and was later jointly developed by HP and Intel. Launched in June 2001, the processors were initially marketed by Intel for enterprise servers and high-performance computing systems. In the concept phase, engineers said "we could run circles around PowerPC...we could kill the x86." Early predictions were that IA-64 would expand to lower-end servers, supplanting Xeon, and eventually penetrate personal computers, supplanting reduced instruction set computing (RISC) and complex instruction set computing (CISC) architectures for all general-purpose applications.
Silicon Graphics, Inc. was an American high-performance computing manufacturer, producing computer hardware and software. Founded in Mountain View, California, in November 1981 by James Clark, its initial market was 3D graphics computer workstations, but its products, strategies and market positions developed significantly over time.
Symmetric multiprocessing or shared-memory multiprocessing (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single operating system instance that treats all processors equally, reserving none for special purposes. Most multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors.
Direct memory access (DMA) is a feature of computer systems that allows certain hardware subsystems to access main system memory independently of the central processing unit (CPU).
Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system. The term also refers to the ability of a system to support more than one processor or the ability to allocate tasks between them. There are many variations on this basic theme, and the definition of multiprocessing can vary with context, mostly as a function of how CPUs are defined.
IA-64 is the instruction set architecture (ISA) of the discontinued Itanium family of 64-bit Intel microprocessors. The basic ISA specification originated at Hewlett-Packard (HP), and was subsequently implemented by Intel in collaboration with HP. The first Itanium processor, codenamed Merced, was released in 2001.
In computer architecture, 64-bit integers, memory addresses, or other data units are those that are 64 bits wide. Also, 64-bit central processing units (CPU) and arithmetic logic units (ALU) are those that are based on processor registers, address buses, or data buses of that size. A computer that uses such a processor is a 64-bit computer.
Hyper-threading is Intel's proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations performed on x86 microprocessors. It was introduced on Xeon server processors in February 2002 and on Pentium 4 desktop processors in November 2002. Since then, Intel has included this technology in Itanium, Atom, and Core 'i' Series CPUs, among others.
Sequent Computer Systems was a computer company that designed and manufactured multiprocessing computer systems. They were among the pioneers in high-performance symmetric multiprocessing (SMP) open systems, innovating in both hardware and software.
A network interface controller is a computer hardware component that connects a computer to a computer network.
Altix is a line of server computers and supercomputers produced by Silicon Graphics, based on Intel processors. It succeeded the MIPS/IRIX-based Origin 3000 servers.
HPE Integrity Servers is a series of server computers produced by Hewlett Packard Enterprise since 2003, based on the Itanium processor. The Integrity brand name was inherited by HP from Tandem Computers via Compaq.
K42 is a discontinued open-source research operating system (OS) for cache-coherent 64-bit multiprocessor systems. It was developed primarily at IBM Thomas J. Watson Research Center in collaboration with the University of Toronto and University of New Mexico. The main focus of this OS is to address performance and scalability issues of system software on large-scale, shared memory, non-uniform memory access (NUMA) multiprocessing computers.
The HPE Superdome is a high-end server computer designed and manufactured by Hewlett Packard Enterprise. The product's most recent version, "Superdome 2," was released in 2010 supporting 2 to 32 sockets and 4 TB of memory. The Superdome used PA-RISC processors when it debuted in 2000. Since 2002, a second version of the machine based on Itanium 2 processors has been marketed as the HP Integrity Superdome.
Manycore processors are special kinds of multi-core processors designed for a high degree of parallel processing, containing numerous simpler, independent processor cores. Manycore processors are used extensively in embedded computers and high-performance computing.
Heterogeneous computing refers to systems that use more than one kind of processor or core. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks.
In computer science, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Depending on context, programs may run on a single processor or on multiple separate processors.
Since 1985, many processors implementing some version of the MIPS architecture have been designed and used widely.
Meltdown is one of the two original transient execution CPU vulnerabilities. Meltdown affects Intel x86 microprocessors, IBM Power microprocessors, and some ARM-based microprocessors. It allows a rogue process to read all memory, even when it is not authorized to do so.
A multiprocessor system is defined as "a system with more than one processor", and, more precisely, "a number of central processing units linked together to enable parallel processing to take place".