WARP (systolic array)

The Warp machines were three generations of increasingly general-purpose systolic array processors. Each generation gained generality by increasing memory capacity and loosening the coupling between processors. Only the original WW-Warp forced truly lockstep sequencing of stages, which severely restricted its programmability but made it, in a sense, the purest “systolic-array” design.

History

The Warp machines were created by Carnegie Mellon University (CMU), in conjunction with industrial partners G.E., Honeywell and Intel, and funded by the U.S. Defense Advanced Research Projects Agency (DARPA). [1]

The Warp projects were started in 1984 by H. T. Kung at Carnegie Mellon University. They yielded research results, publications and advances in general-purpose systolic hardware design, compiler design and systolic software algorithms.

A two-cell prototype of WW-Warp was completed at Carnegie Mellon in June 1985. Two essentially identical ten-cell WW-Warp machines were produced in 1986, one by Honeywell and one by G.E., for use at Carnegie Mellon University. The system from G.E. was delivered in February 1986; the system from Honeywell was delivered in June 1986. The first of the significantly redesigned production model, the PC-Warp, was delivered by G.E. in April 1987. About twenty production models of the PC-Warp were produced and sold by G.E. during 1987-1989.

In 1986, Intel was selected, as a result of competitive bidding, to be the industrial partner for the integrated circuit implementation of Warp. The first iWarp system, a 12-node system, became operational in March 1990. After a number of steppings of the part, about 39 machines, each consisting of ten or more C-Step iWarp chips running at 20 MHz, were produced and sold by Intel in 1992 and 1993 to universities, government agencies and industrial research laboratories. [2]

Architecture

There were three distinct machine designs known as the WW-Warp (Wire Wrap Warp), PC-Warp (Printed Circuit Warp), and iWarp (integrated circuit Warp, conveniently also a play on the “i” for Intel). [3]

WW-Warp

WW-Warp forced truly lockstep sequencing of stages.

Linear array of ten or more programmable processing elements (PEs), each at 10 MFLOPS (SP).
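The lockstep behaviour of a linear array can be illustrated with a small simulation. The sketch below is not Warp microcode or a W2 program; it is a generic textbook-style systolic 1-D convolution, chosen here as an assumption because convolution was an intended Warp workload. Each cell keeps one weight, input samples move right two ticks per cell, partial sums move right one tick per cell, and every cell updates on the same tick. The function name `systolic_convolve` and all variable names are hypothetical.

```python
def systolic_convolve(weights, samples):
    """Simulate a linear systolic array computing a full 1-D convolution.

    Cell k holds weights[k]. On every tick all cells update together,
    reading only values their left neighbour latched on earlier ticks.
    """
    k = len(weights)
    n = len(samples)
    if k == 0:
        raise ValueError("need at least one cell (one weight)")

    x_now = [0.0] * k   # sample each cell used on the latest tick
    x_prev = [0.0] * k  # sample each cell used one tick before that
    y_out = [0.0] * k   # partial sum each cell produced on the latest tick
    results = []

    for tick in range(n + 2 * k - 2):
        # Cell 0 takes the next input sample (zero once the stream ends);
        # cell k+1 takes what cell k used two ticks ago (half-speed x stream).
        x_in = [samples[tick] if tick < n else 0.0] + x_prev[:-1]
        # Partial sums advance one cell per tick; a fresh sum enters cell 0.
        y_in = [0.0] + y_out[:-1]
        # All cells perform one multiply-accumulate in lockstep.
        y_new = [y + w * x for y, w, x in zip(y_in, weights, x_in)]
        x_prev, x_now, y_out = x_now, x_in, y_new
        # The last cell starts emitting finished sums after k - 1 ticks.
        if tick >= k - 1:
            results.append(y_out[-1])
    return results


if __name__ == "__main__":
    print(systolic_convolve([1, 2], [1, 1, 1]))
```

Because no cell can run ahead of or behind its neighbours, the data layout in time is fixed by the hardware, which is the kind of restriction on programmability that the later, more loosely coupled generations relaxed.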

PC-Warp

Linear array of ten or more programmable processing elements (PEs), each at 10 MFLOPS (SP).

iWarp

Linear array of ten or more programmable processing elements (PEs), each at 20 MFLOPS (SP). [4]

One PE consists of two main agents: a Computation Agent and a Communication Agent. [5]

The iWarp machines were based on a single-chip custom 700,000-transistor microprocessor, designed specifically for the Warp project, that used long-instruction-word (LIW) format instructions and tightly integrated communications with the computational processor. The standard iWarp machine configuration arranged iWarp nodes in a 2m x 2n torus. All iWarp machines included the “backedges” and, therefore, were tori. [6]
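The effect of the “backedges” can be shown with a short sketch: on a torus, a node on the edge of the grid still has four neighbours, because each row and column wraps around. The function `torus_neighbors`, its parameter names and the 4 x 8 grid used below are illustrative assumptions, not part of any iWarp software or a description of its routing hardware.

```python
def torus_neighbors(row, col, rows, cols):
    """Return the four neighbours of node (row, col) on a rows x cols torus.

    The modulo arithmetic models the wrap-around links ("backedges"):
    stepping off one edge of the grid re-enters at the opposite edge.
    """
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("node coordinates lie outside the grid")
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }


if __name__ == "__main__":
    # Corner node of a hypothetical 4 x 8 arrangement.
    for direction, node in torus_neighbors(0, 0, rows=4, cols=8).items():
        print(direction, node)
```

Without the wrap-around links the same corner node would have only two neighbours, so the arrangement would be a mesh rather than a torus.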

Applications

Warp machines were attached to Sun workstations (UNIX based). Software development for all models of Warp machines was done on Sun workstations.

The originally intended application for Warp machines was low-level computer vision (convolutions, filtering, etc.). They later found applications in magnetic resonance image processing, repetitive image texture analysis, and linear algebra. [7]

Neural network

The 10-cell Warp (not iWarp) computer was benchmarked on forward-backward propagation through the NETtalk network. It achieved 16.5 MC/s (million connections per second), meaning that one forward and one backward pass over NETtalk's 18,629 weights takes about 18,629 / (16.5 × 10^6) s ≈ 1.1 ms.

This was an 8x speedup over a backpropagation algorithm on the Connection Machine-1, and a 340x speedup over the original implementation on the Ridge 32. [8] When the 10-cell iWarp became available, the authors ran backpropagation on it with essentially the same implementation. It ran at 36 MC/s, a 760x speedup. [9]
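The quoted rates can be turned into time per training pass with simple arithmetic. The sketch below assumes, as the text does, that one forward-plus-backward pass touches each of NETtalk's 18,629 weights once and that the MC/s figure counts those connection updates; the function and constant names are hypothetical.

```python
NETTALK_WEIGHTS = 18_629  # weight count quoted in the text


def seconds_per_pass(weights, mega_connections_per_second):
    """Time for one forward-backward pass at a given rate in MC/s."""
    return weights / (mega_connections_per_second * 1e6)


warp_ms = seconds_per_pass(NETTALK_WEIGHTS, 16.5) * 1e3   # 10-cell Warp
iwarp_ms = seconds_per_pass(NETTALK_WEIGHTS, 36.0) * 1e3  # 10-cell iWarp
iwarp_over_warp = 36.0 / 16.5

if __name__ == "__main__":
    print(f"Warp:  {warp_ms:.3f} ms per pass")
    print(f"iWarp: {iwarp_ms:.3f} ms per pass")
    print(f"iWarp / Warp rate ratio: {iwarp_over_warp:.2f}")
```

Under these assumptions a pass takes roughly 1.1 ms on the Warp and roughly 0.5 ms on the iWarp, a little over a twofold improvement between the two machines.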

Compiler

A research compiler for a language known as “W2” targeted all three machines. It was the only compiler for the WW-Warp and PC-Warp, and it served as an early compiler during development of the iWarp. [10] The production compiler for iWarp was a C and Fortran compiler based on the AT&T pcc compiler for UNIX, ported under contract for Intel and then extensively modified and extended by Intel. [11]

Notes

  1. Thomas Gross and Monica Lam. 1998. Retrospective: a retrospective on the Warp machines. In 25 years of the international symposia on Computer architecture (selected papers) (ISCA '98), Gurindar S. Sohi (Ed.). ACM, New York, NY, USA, 45-47.
  2. Encyclopedia of Parallel Computing, Padua, David (Ed.), 2011, ISBN 978-0-387-09765-7
  3. Thomas Gross and David R. O'Hallaron. iWarp: anatomy of a parallel computing system, MIT Press, Cambridge, MA, 1998.
  4. Intel Corp. iWarp Microprocessor (Part Number 318153), Hillsboro, Oregon, 1991. Technical Information, Order Number 281006.
  5. Borkar, S.; Cohn, R.; Cox, G.; Gleason, S.; Gross, T. (1988-11-01). "iWarp: an integrated solution of high-speed parallel computing". Proceedings of the 1988 ACM/IEEE Conference on Supercomputing. Supercomputing '88. Washington, DC, USA: IEEE Computer Society Press: 330–339. ISBN 978-0-8186-0882-7.
  6. Shekhar Borkar, Robert Cohn, George Cox, Sha Gleason, and Thomas Gross. iWarp: an integrated solution of high-speed parallel computing, Proceedings of the 1988 ACM/IEEE conference on Supercomputing, p.330-339, November 12–17, 1988.
  7. Annaratone, M., et al. "Applications experience on Warp." Proceedings of the 1987 National Computer Conference. 1987.
  8. Pomerleau; Gusciora; Touretzky; Kung (1988). "Neural network simulation at Warp speed: How we got 17 million connections per second". IEEE International Conference on Neural Networks. IEEE. pp. 143–150 vol.2. doi:10.1109/icnn.1988.23922. ISBN 0-7803-0999-5.
  9. Borkar, S.; Cohn, R.; Cox, G.; Gleason, S.; Gross, T. (1988-11-01). "iWarp: an integrated solution of high-speed parallel computing". Proceedings of the 1988 ACM/IEEE Conference on Supercomputing. Supercomputing '88. Washington, DC, USA: IEEE Computer Society Press: 330–339. ISBN 978-0-8186-0882-7.
  10. Monica S. Lam. A Systolic Array Optimizing Compiler, Dordrecht, The Netherlands: Kluwer Academic Publishers, 1989.
  11. Ali-Reza Adl-Tabatabai, Thomas Gross, Guei-Yuan Lueh and James Reinders. Modeling Instruction-Level Parallelism for Software Pipelining. In Proceedings of the IFIP WG10.3 Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, Orlando, FL, pages 321-330.
