Manycore processor

Last updated December 20, 2023

Manycore processors are special kinds of multi-core processors designed for a high degree of parallel processing, containing numerous simpler, independent processor cores (from a few tens of cores to thousands or more). Manycore processors are used extensively in embedded computers and high-performance computing.

Contrast with multicore architecture

Manycore processors are distinct from multi-core processors in being optimized from the outset for a higher degree of explicit parallelism, and for higher throughput (or lower power consumption) at the expense of latency and lower single-thread performance.

The broader category of multi-core processors, by contrast, are usually designed to efficiently run both parallel and serial code, and therefore place more emphasis on high single-thread performance (e.g. devoting more silicon to out-of-order execution, deeper pipelines, more superscalar execution units, and larger, more general caches), and shared memory. These techniques devote runtime resources toward figuring out implicit parallelism in a single thread. They are used in systems where they have evolved continuously (with backward compatibility) from single core processors. They usually have a 'few' cores (e.g. 2, 4, 8) and may be complemented by a manycore accelerator (such as a GPU) in a heterogeneous system.

Motivation

Cache coherency is an issue limiting the scaling of multicore processors. Manycore processors may bypass this with methods such as message passing,^[1] scratchpad memory, DMA,^[2] partitioned global address space,^[3] or read-only/non-coherent caches. A manycore processor using a network on a chip and local memories gives software the opportunity to explicitly optimise the spatial layout of tasks (e.g. as seen in tooling developed for TrueNorth).^[4]

Manycore processors may have more in common (conceptually) with technologies originating in high-performance computing such as clusters and vector processors.^[5]

GPUs may be considered a form of manycore processor having multiple shader processing units, and only being suitable for highly parallel code (high throughput, but extremely poor single thread performance).

Suitable programming models

Message passing interface
OpenCL ^[6] or other APIs supporting compute kernels
Partitioned global address space
Actor model
OpenMP ^[7]
Dataflow

Classes of manycore systems

GPUs, which can be described as manycore vector processors
Massively parallel processor array
Asynchronous array of simple processors

Specific manycore architectures

ZettaScaler , Japanese PEZY Computing 2,048-core modules
Xeon Phi coprocessor, which has MIC (Many Integrated Cores) architecture
Tilera
Adapteva Epiphany Architecture, a manycore chip using PGAS scratchpad memory
Coherent Logix hx3100 Processor, a 100-core DSP/GPP processor based on HyperX Architecture
Movidius Myriad 2, a manycore vision processing unit (VPU)
Kalray, a manycore PCI-e accelerator for data-intensive tasks
Teraflops Research Chip, a manycore processor using message passing
TrueNorth, an AI accelerator with a manycore network on a chip architecture
Green arrays, a manycore processor using message passing aimed at low power applications
Sunway SW26010, a 260-core manycore processor used in the then top 1 supercomputer Sunway TaihuLight
- SW52020, an improved 520-core^[8]^[9] variant of SW26010, with 512-bit SIMD (also adding support for half-precision), used in a prototype, meant for an exascale system (and in the future 10 exascale system), and according to datacenterdynamics China is rumored to already have two separate exascale systems secretly^{[ citation needed ]}
Eyeriss, a manycore processor designed for running convolutional neural nets for embedded vision applications^[10]
Graphcore, a manycore AI accelerator

Specific manycore computers with 1M+ CPU cores

A number of computers built from multicore processors have one million or more individual CPU cores. Examples include:

Gyoukou (Japanese: 暁光 Hepburn: gyōkō, dawn light), a supercomputer developed by ExaScaler and PEZY Computing, with 20,480,000 processing elements total plus the 1,250 Intel Xeon D host processors.
SpiNNaker, a massively parallel (1 million CPU cores) manycore processor (ARM-based) built as part of the Human Brain Project.

Specific computers with 5 million or more CPU cores

Quite a few supercomputers have over 5 million CPU cores. When there are also coprocessors, e.g. GPUs used with, then those cores are not listed in the core-count, then quite a few more computers would hit those targets.

Frontier
Fugaku, a Japanese supercomputer using Fujitsu A64FX ARM-based cores, 7,630,848 in total.
Sunway TaihuLight, a massively parallel (10 million CPU cores) Chinese supercomputer, once one of the fastest supercomputers in the world, using a custom manycore architecture.^{[ citation needed ]} As of November 2018, it was the world's third fastest supercomputer (as ranked by the TOP500 list), obtaining its performance from 40,960 SW26010 manycore processors, each containing 256 cores.

Related Research Articles

In computing, floating point operations per second is a measure of computer performance, useful in fields of scientific computations that require floating-point calculations. For such cases, it is a more accurate measure than measuring instructions per second.

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

<span class="mw-page-title-main">Hardware acceleration</span> Specialized computer hardware

Hardware acceleration is the use of computer hardware designed to perform specific functions more efficiently when compared to software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.

A multi-core processor is a microprocessor on a single integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions. The instructions are ordinary CPU instructions but the single processor can run instructions on separate cores at the same time, increasing overall speed for programs that support multithreading or other parallel computing techniques. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package. The microprocessors currently used in almost all personal computers are multi-core.

Scratchpad memory (SPM), also known as scratchpad, scratchpad RAM or local store in computer terminology, is an internal memory, usually high-speed, used for temporary storage of calculations, data, and other work in progress. In reference to a microprocessor, scratchpad refers to a special high-speed memory used to hold small items of data for rapid retrieval. It is similar to the usage and size of a scratchpad in life: a pad of paper for preliminary notes or sketches or writings, etc. When the scratchpad is a hidden portion of the main memory then it is sometimes referred to as bump storage.

Larrabee is the codename for a cancelled GPGPU chip that Intel was developing separately from its current line of integrated graphics accelerators. It is named after either Mount Larrabee or Larrabee State Park in Whatcom County, Washington, near the town of Bellingham. The chip was to be released in 2010 as the core of a consumer 3D graphics card, but these plans were cancelled due to delays and disappointing early performance figures. The project to produce a GPU retail product directly from the Larrabee research project was terminated in May 2010 and its technology was passed on to the Xeon Phi. The Intel MIC multiprocessor architecture announced in 2010 inherited many design elements from the Larrabee project, but does not function as a graphics processing unit; the product is intended as a co-processor for high performance computing.

The TOP500 project ranks and details the 500 most powerful non-distributed computer systems in the world. The project was started in 1993 and publishes an updated list of the supercomputers twice a year. The first of these updates always coincides with the International Supercomputing Conference in June, and the second is presented at the ACM/IEEE Supercomputing Conference in November. The project aims to provide a reliable basis for tracking and detecting trends in high-performance computing and bases rankings on HPL benchmarks, a portable implementation of the high-performance LINPACK benchmark written in Fortran for distributed-memory computers.

In computing, performance per watt is a measure of the energy efficiency of a particular computer architecture or computer hardware. Literally, it measures the rate of computation that can be delivered by a computer for every watt of power consumed. This rate is typically measured by performance on the LINPACK benchmark when trying to compare between computing systems: an example using this is the Green500 list of supercomputers. Performance per watt has been suggested to be a more sustainable measure of computing than Moore’s Law.

A massively parallel processor array, also known as a multi purpose processor array (MPPA) is a type of integrated circuit which has a massively parallel array of hundreds or thousands of CPUs and RAM memories. These processors pass work to one another through a reconfigurable interconnect of channels. By harnessing a large number of processors working in parallel, an MPPA chip can accomplish more demanding tasks than conventional chips. MPPAs are based on a software parallel programming model for developing high-performance embedded system applications.

Zero ASIC Corporation, formerly Adapteva, Inc., is a fabless semiconductor company focusing on low power many core microprocessor design. The company was the second company to announce a design with 1,000 specialized processing cores on a single integrated circuit.

Xeon Phi was a series of x86 manycore processors designed and made by Intel. It was intended for use in supercomputers, servers, and high-end workstations. Its architecture allowed use of standard programming languages and application programming interfaces (APIs) such as OpenMP.

FeiTeng is the name of several computer central processing units designed and produced in China for supercomputing applications. The microprocessors have been developed by Tianjin Phytium Technology. The processors have also been described as the YinHeFeiTeng family. This CPU family has been developed by a team directed by NUDT's Professor Xing Zuocheng.

Approaches to supercomputer architecture have taken dramatic turns since the earliest systems were introduced in the 1960s. Early supercomputer architectures pioneered by Seymour Cray relied on compact innovative designs and local parallelism to achieve superior computational peak performance. However, in time the demand for increased computational power ushered in the age of massively parallel systems.

Massively parallel is the term for using a large number of computer processors to simultaneously perform a set of coordinated computations in parallel. GPUs are massively parallel architecture with tens of thousands of threads.

Heterogeneous computing refers to systems that use more than one kind of processor or core. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks.

Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading. It is different from SPMD in that all instructions in all "threads" are executed in lock-step. The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU), e.g. some supercomputers combine CPUs with GPUs.

A vision processing unit (VPU) is an emerging class of microprocessor; it is a specific type of AI accelerator, designed to accelerate machine vision tasks.

In computing, a memory access pattern or IO access pattern is the pattern with which a system or program reads and writes memory on secondary storage. These patterns differ in the level of locality of reference and drastically affect cache performance, and also have implications for the approach to parallelism and distribution of workload in shared memory systems. Further, cache coherency issues can affect multiprocessor performance, which means that certain memory access patterns place a ceiling on parallelism.

The Sunway TaihuLight is a Chinese supercomputer which, as of November 2021, is ranked fourth in the TOP500 list, with a LINPACK benchmark rating of 93 petaflops. The name is translated as divine power, the light of Taihu Lake. This is nearly three times as fast as the previous Tianhe-2, which ran at 34 petaflops. As of June 2017, it is ranked as the 16th most energy-efficient supercomputer in the Green500, with an efficiency of 6.1 GFlops/watt. It was designed by the National Research Center of Parallel Computer Engineering & Technology (NRCPC) and is located at the National Supercomputing Center in Wuxi in the city of Wuxi, in Jiangsu province, China.

The SW26010 is a 260-core manycore processor designed by the Shanghai Integrated Circuit Technology and Industry Promotion Center (Chinese: 上海集成电路技术与产业促进中心 ). It implements the Sunway architecture, a 64-bit reduced instruction set computing (RISC) architecture designed in China. The SW26010 has four clusters of 64 Compute-Processing Elements (CPEs) which are arranged in an eight-by-eight array. The CPEs support SIMD instructions and are capable of performing eight double-precision floating-point operations per cycle. Each cluster is accompanied by a more conventional general-purpose core called the Management Processing Element (MPE) that provides supervisory functions. Each cluster has its own dedicated DDR3 SDRAM controller and a memory bank with its own address space. The processor runs at a clock speed of 1.45 GHz.

References

↑ Mattson, Tim (January 2010). "The Future of Many Core Computing: A tale of two processors" (PDF).
↑ Hendry, Gilbert; Kretschmann, Mark. "IBM Cell Processor" (PDF).
↑ Olofsson, Andreas; Nordström, Tomas; Ul-Abdin, Zain (2014). "Kickstarting High-performance Energy-efficient Manycore Architectures with Epiphany". arXiv: 1412.5538 [cs.AR].
↑ Amir, Arnon (June 11, 2015). "IBM SyNAPSE Deep Dive Part 3". IBM Research. Archived from the original on 2021-12-21.
↑ "cell architecture"."The Cell architecture is like nothing we have ever seen in commodity microprocessors, it is closer in design to multiprocessor vector supercomputers"
↑ Rick Merritt (June 20, 2011), "OEMs show systems with Intel MIC chips", www.eetimes.com, EE Times
↑ Barker, J; Bowden, J (2013). "Manycore Parallelism through OpenMP". OpenMP in the Era of Low Power Devices and Accelerators. IWOMP. Lecture Notes in Computer Science, vol 8122. Springer. doi:10.1007/978-3-642-40698-0_4.
↑ Morgan, Timothy Prickett (2021-02-10). "A First Peek At China's Sunway Exascale Supercomputer". The Next Platform. Retrieved 2021-11-18.
↑ Hemsoth, Nicole (2021-04-19). "China's Exascale Prototype Supercomputer Tests AI Workloads". The Next Platform. Retrieved 2021-11-18.
↑ Chen, Yu-Hsin; Krishna, Tushar; Emer, Joel; Sze, Vivienne (2016). "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks". IEEE International Solid-State Circuits Conference, ISSCC 2016, Digest of Technical Papers. pp. 262–263.

External links

Architecting solutions for the Manycore future, published on Feb 19, 2010 (more than one dead link in the slide)
Eyeriss architecture

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Mattson, Tim (January 2010). "The Future of Many Core Computing: A tale of two processors" (PDF).

[2] Hendry, Gilbert; Kretschmann, Mark. "IBM Cell Processor" (PDF).

[3] Olofsson, Andreas; Nordström, Tomas; Ul-Abdin, Zain (2014). "Kickstarting High-performance Energy-efficient Manycore Architectures with Epiphany". arXiv: 1412.5538 [cs.AR].

[4] Amir, Arnon (June 11, 2015). "IBM SyNAPSE Deep Dive Part 3". IBM Research. Archived from the original on 2021-12-21.

[5] "cell architecture"."The Cell architecture is like nothing we have ever seen in commodity microprocessors, it is closer in design to multiprocessor vector supercomputers"

[6] Rick Merritt (June 20, 2011), "OEMs show systems with Intel MIC chips", www.eetimes.com, EE Times

[7] Barker, J; Bowden, J (2013). "Manycore Parallelism through OpenMP". OpenMP in the Era of Low Power Devices and Accelerators. IWOMP. Lecture Notes in Computer Science, vol 8122. Springer. doi:10.1007/978-3-642-40698-0_4.

[8] Morgan, Timothy Prickett (2021-02-10). "A First Peek At China's Sunway Exascale Supercomputer". The Next Platform. Retrieved 2021-11-18.

[9] Hemsoth, Nicole (2021-04-19). "China's Exascale Prototype Supercomputer Tests AI Workloads". The Next Platform. Retrieved 2021-11-18.

[10] Chen, Yu-Hsin; Krishna, Tushar; Emer, Joel; Sze, Vivienne (2016). "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks". IEEE International Solid-State Circuits Conference, ISSCC 2016, Digest of Technical Papers. pp. 262–263.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

v t e Parallel computing
General	Distributed computing Parallel computing Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Speculative (SpMT) Preemptive Cooperative Clustered multi-thread (CMT) Hardware scout
Theory	PRAM model PEM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array
Coordination	Multiprocessing Memory coherence Cache coherence Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD Array processing (SIMT) Pipelined processing Associative processing MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Beowulf cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost Chapel HPX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP Global Arrays GPUOpen MPI OpenMP OpenCL OpenHMPP OpenACC Parallel Extensions PVM pthreads RaftLib ROCm UPC TBB ZPL
Problems	Automatic parallelization Deadlock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: Parallel computing