Massively parallel processor array

Last updated

A massively parallel processor array, also known as a multi purpose processor array (MPPA) is a type of integrated circuit which has a massively parallel array of hundreds or thousands of CPUs and RAM memories. These processors pass work to one another through a reconfigurable interconnect of channels. By harnessing a large number of processors working in parallel, an MPPA chip can accomplish more demanding tasks than conventional chips. MPPAs are based on a software parallel programming model for developing high-performance embedded system applications.

Contents

Architecture

MPPA is a MIMD (Multiple Instruction streams, Multiple Data) architecture, with distributed memory accessed locally, not shared globally. Each processor is strictly encapsulated, accessing only its own code and memory. Point-to-point communication between processors is directly realized in the configurable interconnect. [1]

The MPPA's massive parallelism and its distributed memory MIMD architecture distinguishes it from multicore and manycore architectures, which have fewer processors and an SMP or other shared memory architecture, mainly intended for general-purpose computing. It's also distinguished from GPGPUs with SIMD architectures, used for HPC applications. [2]

Programming

An MPPA application is developed by expressing it as a hierarchical block diagram or workflow, whose basic objects run in parallel, each on their own processor. Likewise, large data objects may be broken up and distributed into local memories with parallel access. Objects communicate over a parallel structure of dedicated channels. The objective is to maximize aggregate throughput while minimizing local latency, optimizing performance and efficiency. An MPPA's model of computation is similar to a Kahn process network or communicating sequential processes (CSP). [3]

Applications

MPPAs are used in high-performance embedded systems and hardware acceleration of desktop computer and server applications, such as video compression, [4] [5] image processing, [6] medical imaging, network processing, software-defined radio and other compute-intensive streaming media applications, which otherwise would use FPGA, DSP and/or ASIC chips.

Examples

MPPAs developed in companies include ones designed at: Ambric, PicoChip, Intel, [7] IntellaSys, GreenArrays, ASOCS, Tilera, Kalray, Coherent Logix, Tabula, and Adapteva. Aspex (Ericsson) Linedancer differs in that it was a Massive wide SIMD Array rather than an MPPA. Strictly speaking it could qualify as SIMT due to all 4096 of the 3,000 gate cores having its own Content-Addressable Memory. [8] [9]

Fabricated MPPAs developed in universities include: 36-core [10] and 167-core [11] Asynchronous Array of Simple Processors (AsAP) arrays from the University of California, Davis, 16-core RAW [12] from MIT, and 16-core [13] and 24-core [14] arrays from Fudan University.

The Chinese Sunway project developed their own 260-core SW26010 manycore chip for the TaihuLight supercomputer, which is as of 2016 the world's fastest supercomputer. [15] [16]

Anton 3 processors, designed by D. E. Shaw Research for molecular dynamics simulations, contain arrays of 576 processors arranged in a 12×24 tiled grid of pairs of cores; a routed network links these tiles together and extends off-chip to other nodes in a full system. [17] [18]

See also

Related Research Articles

<span class="mw-page-title-main">Hardware acceleration</span> Specialized computer hardware

Hardware acceleration is the use of computer hardware designed to perform specific functions more efficiently when compared to software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.

<span class="mw-page-title-main">Multi-core processor</span> Microprocessor with more than one processing unit

A multi-core processor (MCP) is a microprocessor on a single integrated circuit (IC) with two or more separate central processing units (CPUs), called cores to emphasize their multiplicity. Each core reads and executes program instructions, specifically ordinary CPU instructions. However, the MCP can run instructions on separate cores at the same time, increasing overall speed for programs that support multithreading or other parallel computing techniques. Manufacturers typically integrate the cores onto a single IC die, known as a chip multiprocessor (CMP), or onto multiple dies in a single chip package. As of 2024, the microprocessors used in almost all new personal computers are multi-core.

The transistor count is the number of transistors in an electronic device. It is the most common measure of integrated circuit complexity. The rate at which MOS transistor counts have increased generally follows Moore's law, which observes that transistor count doubles approximately every two years. However, being directly proportional to the area of a die, transistor count does not represent how advanced the corresponding manufacturing technology is. A better indication of this is transistor density which is the ratio of a semiconductor's transistor count to its die area.

The asynchronous array of simple processors (AsAP) architecture comprises a 2-D array of reduced complexity programmable processors with small scratchpad memories interconnected by a reconfigurable mesh network. AsAP was developed by researchers in the VLSI Computation Laboratory (VCL) at the University of California, Davis and achieves high performance and energy efficiency, while using a relatively small circuit area. It was made in 2006.

A field-programmable analog array (FPAA) is an integrated circuit device containing computational analog blocks (CAB) and interconnects between these blocks offering field-programmability. Unlike their digital cousin, the FPGA, the devices tend to be more application driven than general purpose as they may be current mode or voltage mode devices. For voltage mode devices, each block usually contains an operational amplifier in combination with programmable configuration of passive components. The blocks can, for example, act as summers or integrators.

In computer science, partitioned global address space (PGAS) is a parallel programming model paradigm. PGAS is typified by communication operations involving a global memory address space abstraction that is logically partitioned, where a portion is local to each process, thread, or processing element. The novelty of PGAS is that the portions of the shared memory space may have an affinity for a particular process, thereby exploiting locality of reference in order to improve performance. A PGAS memory model is featured in various parallel programming languages and libraries, including: Coarray Fortran, Unified Parallel C, Split-C, Fortress, Chapel, X10, UPC++, Coarray C++, Global Arrays, DASH and SHMEM. The PGAS paradigm is now an integrated part of the Fortran language, as of Fortran 2008 which standardized coarrays.

Intel Teraflops Research Chip is a research manycore processor containing 80 cores, using a network-on-chip architecture, developed by Intel's Tera-Scale Computing Research Program. It was manufactured using a 65 nm CMOS process with eight layers of copper interconnect and contains 100 million transistors on a 275 mm2 die. Its design goal was to demonstrate a modular architecture capable of a sustained performance of 1.0 TFLOPS while dissipating less than 100 W. Research from the project was later incorporated into Xeon Phi. The technical lead of the project was Sriram R. Vangal.

Ambric, Inc. was a designer of computer processors that developed the Ambric architecture. Its Am2045 Massively Parallel Processor Array (MPPA) chips were primarily used in high-performance embedded systems such as medical imaging, video, and signal-processing.

Manycore processors are special kinds of multi-core processors designed for a high degree of parallel processing, containing numerous simpler, independent processor cores. Manycore processors are used extensively in embedded computers and high-performance computing.

Tile processors for computer hardware, are multicore or manycore chips that contain one-dimensional, or more commonly, two-dimensional arrays of identical tiles. Each tile comprises a compute unit, caches and a switch. Tiles can be viewed as adding a switch to each core, where a core comprises a compute unit and caches.

Zero ASIC Corporation, formerly Adapteva, Inc., is a fabless semiconductor company focusing on low power many core microprocessor design. The company was the second company to announce a design with 1,000 specialized processing cores on a single integrated circuit.

Sunway, or ShenWei,, is a series of computer microprocessors, developed by Jiangnan Computing Lab (江南计算技术研究所) in Wuxi, China. It uses a reduced instruction set computer (RISC) architecture, but details are still sparse.

<span class="mw-page-title-main">SpiNNaker</span>

SpiNNaker is a massively parallel, manycore supercomputer architecture designed by the Advanced Processor Technologies Research Group (APT) at the Department of Computer Science, University of Manchester. It is composed of 57,600 processing nodes, each with 18 ARM9 processors and 128 MB of mobile DDR SDRAM, totalling 1,036,800 cores and over 7 TB of RAM. The computing platform is based on spiking neural networks, useful in simulating the human brain.

Massively parallel is the term for using a large number of computer processors to simultaneously perform a set of coordinated computations in parallel. GPUs are massively parallel architecture with tens of thousands of threads.

<span class="mw-page-title-main">Tianhe-2</span> Supercomputer in Guangzhou, China

Tianhe-2 or TH-2 is a 3.86-petaflop supercomputer located in the National Supercomputer Center in Guangzhou, China. It was developed by a team of 1,300 scientists and engineers.

A cognitive computer is a computer that hardwires artificial intelligence and machine learning algorithms into an integrated circuit that closely reproduces the behavior of the human brain. It generally adopts a neuromorphic engineering approach. Synonyms include neuromorphic chip and cognitive chip.

A vision processing unit (VPU) is an emerging class of microprocessor; it is a specific type of AI accelerator, designed to accelerate machine vision tasks.

An AI accelerator, deep learning processor or neural processing unit (NPU) is a class of specialized hardware accelerator or computer system designed to accelerate artificial intelligence and machine learning applications, including artificial neural networks and computer vision. Typical applications include algorithms for robotics, Internet of Things, and other data-intensive or sensor-driven tasks. They are often manycore designs and generally focus on low-precision arithmetic, novel dataflow architectures or in-memory computing capability. As of 2024, a typical AI integrated circuit chip contains tens of billions of MOSFETs.

The Sunway TaihuLight is a Chinese supercomputer which, as of November 2023, is ranked 11th in the TOP500 list, with a LINPACK benchmark rating of 93 petaflops. The name is translated as divine power, the light of Taihu Lake. This is nearly three times as fast as the previous Tianhe-2, which ran at 34 petaflops. As of June 2017, it is ranked as the 16th most energy-efficient supercomputer in the Green500, with an efficiency of 6.1 GFlops/watt. It was designed by the National Research Center of Parallel Computer Engineering & Technology (NRCPC) and is located at the National Supercomputing Center in Wuxi in the city of Wuxi, in Jiangsu province, China.

The SW26010 is a 260-core manycore processor designed by the Shanghai Integrated Circuit Technology and Industry Promotion Center (Chinese: 上海集成电路技术与产业促进中心 ). It implements the Sunway architecture, a 64-bit reduced instruction set computing (RISC) architecture designed in China. The SW26010 has four clusters of 64 Compute-Processing Elements (CPEs) which are arranged in an eight-by-eight array. The CPEs support SIMD instructions and are capable of performing eight double-precision floating-point operations per cycle. Each cluster is accompanied by a more conventional general-purpose core called the Management Processing Element (MPE) that provides supervisory functions. Each cluster has its own dedicated DDR3 SDRAM controller and a memory bank with its own address space. The processor runs at a clock speed of 1.45 GHz.

References

  1. Mike Butts, "Synchronization through Communication in a Massively Parallel Processor Array", IEEE Micro, vol. 27, no. 5, September/October 2007, IEEE Computer Society
  2. Mike Butts, "Multicore and Massively Parallel Platforms and Moore's Law Scalability", Proceedings of the Embedded Systems Conference - Silicon Valley, April 2008
  3. Mike Butts, Brad Budlong, Paul Wasson, Ed White, "Reconfigurable Work Farms on a Massively Parallel Processor Array", Proceedings of FCCM, April 2008, IEEE Computer Society
  4. Laurent Bonetto, "Massively parallel processing arrays (MPPAs) for embedded HD video and imaging (Part 1)", Video/Imaging DesignLine, May 16, 2008 http://www.eetimes.com/document.asp?doc_id=1273823
  5. Laurent Bonetto, "Massively parallel processing arrays (MPPAs) for embedded HD video and imaging (Part 2)", Video/Imaging DesignLine, July 18, 2008 http://www.eetimes.com/document.asp?doc_id=1273830
  6. Paul Chen, "Multimode sensor processing using Massively Parallel Processor Arrays (MPPAs)", Programmable Logic DesignLine, March 18, 2008 http://www.pldesignline.com/howto/206904379
  7. Vangal, Sriram R., Jason Howard, Gregory Ruhl, Saurabh Dighe, Howard Wilson, James Tschanz, David Finan et al. "An 80-tile sub-100-w teraflops processor in 65-nm cmos." Solid-State Circuits, IEEE Journal of 43, no. 1 (2008): 29-41.
  8. Krikelis, A. (1990). "Artificial Neural Network on a Massively Parallel Associative Architecture". International Neural Network Conference. p. 673. doi:10.1007/978-94-009-0643-3_39. ISBN   978-0-7923-0831-7.
  9. https://core.ac.uk/download/pdf/25268094.pdf [ bare URL PDF ]
  10. Yu, Zhiyi, Michael Meeuwsen, Ryan Apperson, Omar Sattari, Michael Lai, Jeremy Webb, Eric Work, Tinoosh Mohsenin, Mandeep Singh, and Bevan Baas. "An asynchronous array of simple processors for DSP applications." In IEEE International Solid-State Circuits Conference,(ISSCC’06), vol. 49, pp. 428-429. 2006
  11. Truong, Dean, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu, Toney Jacobson, Gouri Landge, Michael Meeuwsen et al. "A 167-processor 65 nm computational platform with per-processor dynamic supply voltage and dynamic clock frequency scaling." In Symposium on VLSI Circuits, pp. 22-23. 2008
  12. Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffmann, Paul Johnson, Walter Lee, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Saman Amarasinghe, and Anant Agarwal, "A 16-issue multiple-program-counter microprocessor with point-to-point scalar operand network," Proceedings of the IEEE International Solid-State Circuits Conference, February 2003
  13. Yu, Zhiyi, Kaidi You, Ruijin Xiao, Heng Quan, Peng Ou, Yan Ying, Haofan Yang, and Xiaoyang Zeng. "An 800MHz 320mW 16-core processor with message-passing and shared-memory inter-core communication mechanisms." In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, pp. 64-66. IEEE, 2012.
  14. Ou, Peng, Jiajie Zhang, Heng Quan, Yi Li, Maofei He, Zheng Yu, Xueqiu Yu et al. "A 65nm 39GOPS/W 24-core processor with 11 Tb/s/W packet-controlled circuit-switched double-layer network-on-chip and heterogeneous execution array." In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International, pp. 56-57. IEEE, 2013.
  15. Dongarra, Jack (June 20, 2016). "Report on the Sunway TaihuLight System" (PDF). www.netlib.org. Retrieved June 20, 2016.
  16. Fu, Haohuan; Liao, Junfeng; Yang, Jinzhe; et al. (2016). "The Sunway TaihuLight Supercomputer: System and Applications". Sci. China Inf. Sci. 59 (7). doi: 10.1007/s11432-016-5588-7 .
  17. Shaw, David E.; Adams, Peter J.; Azaria, Asaph; Bank, Joseph A.; Batson, Brannon; Bell, Alistair; Bergdorf, Michael; Bhatt, Jhanvi; Butts, J. Adam; Correia, Timothy; Dirks, Robert M.; Dror, Ron O.; Eastwood, Michael P.; Edwards, Bruce; Even, Amos (2021-11-14). "Anton 3". Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. St. Louis Missouri: ACM. pp. 1–11. doi: 10.1145/3458817.3487397 . ISBN   978-1-4503-8442-1. S2CID   239036976.
  18. Adams, Peter J.; Batson, Brannon; Bell, Alistair; Bhatt, Jhanvi; Butts, J. Adam; Correia, Timothy; Edwards, Bruce; Feldmann, Peter; Fenton, Christopher H.; Forte, Anthony; Gagliardo, Joseph; Gill, Gennette; Gorlatova, Maria; Greskamp, Brian; Grossman, J.P. (2021-08-22). "The ΛNTON 3 ASIC: A Fire-Breathing Monster for Molecular Dynamics Simulations". 2021 IEEE Hot Chips 33 Symposium (HCS). Palo Alto, CA, USA: IEEE. pp. 1–22. doi:10.1109/HCS52781.2021.9567084. ISBN   978-1-6654-1397-8. S2CID   239039245.