Network on a chip


A network on a chip or network-on-chip (NoC /ˌɛnˌoʊˈsiː/ en-oh-SEE or /nɒk/ knock) [nb 1] is a network-based communications subsystem on an integrated circuit ("microchip"), most typically between modules in a system on a chip (SoC). The modules on the IC are typically semiconductor IP cores implementing various functions of the computer system, and are designed to be modular in the sense of network science. The network on chip is a router-based packet switching network between SoC modules.


NoC technology applies the theory and methods of computer networking to on-chip communication and brings notable improvements over conventional bus and crossbar communication architectures. Networks-on-chip come in many network topologies, many of which are still experimental as of 2018.
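
As a concrete illustration of packet switching in a NoC (not taken from this article), many designs arrange routers in a two-dimensional mesh and use dimension-order ("XY") routing: a packet first travels along the X dimension until it reaches the destination column, then along the Y dimension. The Python sketch below models only that routing decision; the Packet class, coordinate scheme and port names are illustrative assumptions, not any particular product's interface.

    # Minimal sketch of dimension-order (XY) routing on a 2D-mesh NoC.
    # The Packet class, coordinates and port names are illustrative assumptions,
    # not part of any particular NoC implementation.
    from dataclasses import dataclass

    @dataclass
    class Packet:
        dst_x: int          # destination router column
        dst_y: int          # destination router row
        payload: bytes = b""

    def route_xy(cur_x: int, cur_y: int, pkt: Packet) -> str:
        """Return the output port a router at (cur_x, cur_y) should use."""
        if pkt.dst_x > cur_x:
            return "EAST"
        if pkt.dst_x < cur_x:
            return "WEST"
        # X dimension resolved; now move along Y.
        if pkt.dst_y > cur_y:
            return "NORTH"
        if pkt.dst_y < cur_y:
            return "SOUTH"
        return "LOCAL"      # deliver to the attached IP core

    # Example: a router at (1, 1) forwarding a packet destined for router (3, 0).
    print(route_xy(1, 1, Packet(dst_x=3, dst_y=0)))   # -> EAST

Deterministic schemes like this are popular on-chip because they are cheap to implement in hardware and are deadlock-free on a mesh; more experimental topologies typically require adaptive or table-based routing instead.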

Networks-on-chip improve the scalability of systems-on-chip and the power efficiency of complex SoCs compared to other communication subsystem designs. A very common application of NoCs in contemporary personal computers is inside the graphics processing unit (GPU), which is widely used in computer graphics, video gaming and accelerating artificial intelligence. NoCs are an emerging technology, with projections for large growth in the near future as manycore computer architectures become more common.

Structure

Networks-on-chip can span synchronous and asynchronous clock domains, transferring data across the domain boundaries (clock domain crossing), or use unclocked asynchronous logic. NoCs support globally asynchronous, locally synchronous (GALS) electronics architectures, allowing each processor core or functional unit on the system-on-chip to have its own clock domain. [1]

Architectures

NoC architectures typically model sparse small-world networks (SWNs) and scale-free networks (SFNs) to limit the number, length, area and power consumption of interconnection wires and point-to-point connections.
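
To make the small-world idea concrete, the sketch below (an illustration, not part of the original text) uses the NetworkX library, assumed to be available, to generate a Watts-Strogatz small-world graph as a candidate 64-router topology and report its link count and average hop count, rough proxies for wiring cost and latency; an actual architecture study would add placement, link-length, area and power models on top of this.

    # Illustrative only: generating a small-world candidate topology for a
    # 64-router NoC with NetworkX (an assumed dependency), then checking the
    # link count (wiring-cost proxy) and average hop count (latency proxy).
    import networkx as nx

    n = 64      # number of routers
    k = 4       # each router starts connected to its k nearest neighbours
    p = 0.1     # probability of rewiring each link to a random long-range link

    g = nx.connected_watts_strogatz_graph(n, k, p, seed=42)

    print("links:", g.number_of_edges())
    print("avg hops: %.2f" % nx.average_shortest_path_length(g))

The rewired long-range links shorten the average hop count while keeping the total number of links, and hence the wiring budget, close to that of the regular lattice.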

Benefits

Traditionally, ICs have been designed with dedicated point-to-point connections, with one wire dedicated to each signal. This results in a dense web of wiring which, for large designs in particular, has several limitations from a physical design viewpoint: it requires power that grows roughly quadratically with the number of interconnections, and the wires occupy much of the chip area. In nanometer CMOS technology, interconnect dominates both performance and dynamic power dissipation, since signal propagation in wires across the chip can require multiple clock cycles and long wires accumulate parasitic capacitance, resistance and inductance. (See Rent's rule for a discussion of wiring requirements for point-to-point connections.)
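
Rent's rule, referenced above, gives the usual back-of-envelope expression of this wiring pressure. A common formulation is

    T = t · g^p

where T is the number of external signal connections (terminals) of a logic block, g is the number of logic gates it contains, t is the average number of terminals per gate, and p is the Rent exponent, typically about 0.5 to 0.8 for real designs. The closer p is to 1, the faster wiring demand grows with block size, and the less well dedicated point-to-point wiring scales.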

Sparsity and locality of interconnections in the communications subsystem yield several improvements over traditional bus-based and crossbar-based systems.

Parallelism and scalability

The wires in the links of the network-on-chip are shared by many signals, and a high degree of parallelism is achieved because all links in the NoC can operate simultaneously on different data packets. Therefore, as the complexity of integrated systems keeps growing, a NoC provides enhanced performance (such as throughput) and scalability in comparison with previous communication architectures (e.g., dedicated point-to-point signal wires, shared buses, or segmented buses with bridges). Of course, the application's algorithms must be designed to expose enough parallelism to exploit the potential of the NoC.
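
A rough worked comparison, not taken from the article, shows why this sharing pays off. For a shared bus of width w bits clocked at frequency f, only one transfer can be in flight at a time, so peak throughput stays near

    B_bus ≈ w · f

no matter how many cores share it. In a k × k 2D-mesh NoC with the same link width and clock, the k links crossing a bisecting cut alone provide roughly

    B_bisection ≈ k · w · f

and the aggregate bandwidth of all links grows with the number of routers, so interconnect capacity scales with the system instead of remaining a fixed bottleneck.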

Current research

Some researchers argue that NoCs need to support quality of service (QoS), namely meeting requirements on throughput, end-to-end delay, fairness, [2] and deadlines. Real-time computation, including audio and video playback, is one reason for providing QoS support. However, current system implementations such as VxWorks, RTLinux or QNX are able to achieve sub-millisecond real-time computing without special hardware.

This may indicate that, for many real-time applications, the service quality of existing on-chip interconnect infrastructure is sufficient, and that dedicated hardware logic would only be necessary to achieve microsecond-level precision, a degree of guarantee rarely needed by end users (audio or video jitter tolerances are on the order of tenths of milliseconds). Another motivation for NoC-level quality of service is to support multiple concurrent users sharing the resources of a single chip multiprocessor in a public cloud computing infrastructure. In such cases, hardware QoS logic enables the service provider to make contractual guarantees on the level of service a user receives, a feature that may be deemed desirable by some corporate or government clients.

Many challenging research problems remain to be solved at all levels, from the physical link level through the network level, and all the way up to the system architecture and application software. The first dedicated research symposium on networks on chip was held at Princeton University, in May 2007. [3] The second IEEE International Symposium on Networks-on-Chip was held in April 2008 at Newcastle University.

Research has been conducted on integrated optical waveguides and devices comprising an optical network on a chip (ONoC). [4] [5]

Side benefits of NoC: Cache miss pattern prediction and data forwarding leveraging augmented switches

In a multi-core system connected by a NoC, coherency messages and cache-miss requests have to pass through switches. The switches can therefore be augmented with simple tracking and forwarding elements that detect which cache blocks will be requested in the future, and by which cores. The forwarding elements then multicast any requested block to all the cores that may request it, which reduces the cache miss rate. [6]
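
The following sketch is a hypothetical illustration of the basic idea, not the tree-based scheme of the cited paper: the augmented switch snoops miss requests passing through, remembers which cores asked for which blocks, and when the corresponding data response later traverses the switch it forwards a copy to every core that recently showed interest in that block.

    # Hypothetical sketch of a NoC switch augmented with a tracking/forwarding
    # element. The message format and interest-tracking policy are illustrative
    # assumptions and do not reproduce the tree-based scheme of the cited paper.
    from collections import defaultdict

    class AugmentedSwitch:
        def __init__(self):
            # block address -> set of core ids that recently missed on that block
            self.interest = defaultdict(set)

        def on_miss_request(self, core_id: int, block_addr: int) -> None:
            """Record a cache-miss request for block_addr passing through the switch."""
            self.interest[block_addr].add(core_id)

        def on_data_response(self, block_addr: int, data: bytes) -> list[int]:
            """Return the cores to which a copy of the returning block is multicast."""
            # In hardware the switch would enqueue a copy of `data` toward each
            # interested core; here we only report who would receive one.
            return sorted(self.interest.pop(block_addr, set()))

    # Example: cores 2 and 5 miss on block 0x40; the response is multicast to both.
    sw = AugmentedSwitch()
    sw.on_miss_request(2, 0x40)
    sw.on_miss_request(5, 0x40)
    print(sw.on_data_response(0x40, b"\x00" * 64))   # -> [2, 5]

A real implementation would also bound the tracking table's size and decide when predictions are worth the extra on-chip traffic that multicasting generates.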

Benchmarks

NoC development and study require the comparison of different proposals and options, and benchmark traffic patterns are being developed to support such evaluations. Existing NoC benchmarks include NoCBench and the MCSL NoC Traffic Patterns. [7]

Interconnect processing unit

An interconnect processing unit (IPU) [8] is an on-chip communication network with hardware and software components that jointly implement key functions of different system-on-chip programming models through a set of communication and synchronization primitives, and that provide low-level platform services to support advanced features of modern heterogeneous applications on a single die.

Notes

  1. This article uses the convention that "NoC" is pronounced /nɒk/ nock. Therefore, it uses the convention "a" for the indefinite article corresponding to NoC ("a NoC"). Other sources may pronounce it as /ˌɛnˌoʊˈsiː/ en-oh-SEE and therefore use "an NoC".

References

  1. Kundu, Santanu; Chattopadhyay, Santanu (2014). Network-on-chip: the Next Generation of System-on-Chip Integration (1st ed.). Boca Raton, FL: CRC Press. p. 3. ISBN 9781466565272. OCLC 895661009.
  2. "Balancing On-Chip Network Latency in Multi-Application Mapping for Chip-Multiprocessors". IPDPS. May 2014.
  3. NoCS 2007 website.
  4. On-Chip Networks Bibliography.
  5. Inter/Intra-Chip Optical Network Bibliography.
  6. Lenjani, Marzieh; Hashemi, Mahmoud Reza (2014). "Tree-based scheme for reducing shared cache miss rate leveraging regional, statistical and temporal similarities". IET Computers & Digital Techniques. 8: 30–48. doi:10.1049/iet-cdt.2011.0066.
  7. "NoC traffic". www.ece.ust.hk. Retrieved 2018-10-08.
  8. Coppola, Marcello; Grammatikakis, Miltos D.; Locatelli, Riccardo; Maruccia, Giuseppe; Pieralisi, Lorenzo (2008). Design of Cost-Efficient Interconnect Processing Units: Spidergon STNoC. CRC Press. ISBN 978-1-4200-4471-3.

Adapted from Avinoam Kolodny's column in the ACM SIGDA e-newsletter by Igor Markov.
The original text can be found at http://www.sigda.org/newsletter/2006/060415.txt
