Scalable Coherent Interface

Scalable Coherent Interface and Serial Express Users, Developers, and Manufacturers Association (SCIzzL)
Formation: 1996
Type: Non-profit group supporting the standard
Website: www.scizzl.com

The Scalable Coherent Interface or Scalable Coherent Interconnect (SCI) is a high-speed interconnect standard for shared memory multiprocessing and message passing. The goal was to scale well, provide system-wide memory coherence, and present a simple interface; that is, to replace the existing buses in multiprocessor systems with a standard that has no inherent scalability or performance limitations.


IEEE Std 1596-1992, IEEE Standard for Scalable Coherent Interface (SCI), was approved by the IEEE Standards Board on March 19, 1992.[1] It saw some use during the 1990s but never became widely adopted, and from the early 2000s it was superseded by other interconnect systems.

History

Soon after the Futurebus (IEEE 896) project, a follow-on to Fastbus (IEEE 960), got under way, some engineers predicted in 1987 that it would already be too slow for the high-performance computing marketplace by the time it could be released in the early 1990s. In response, a "Superbus" study group was formed in November 1987. In July 1988, another working group of the standards association of the Institute of Electrical and Electronics Engineers (IEEE) spun off to develop a standard targeted at this market.[2] It was essentially a subset of Futurebus features that could be implemented easily at high speed, along with minor additions to make it easier to connect to other systems, such as VMEbus. Most of the developers had backgrounds in high-speed computer buses. Representatives from companies in the computer industry and the research community included Amdahl, Apple Computer, BB&N, Hewlett-Packard, CERN, Dolphin Server Technology, Cray Research, Sequent, AT&T, Digital Equipment Corporation, McDonnell Douglas, National Semiconductor, Stanford Linear Accelerator Center, Tektronix, Texas Instruments, Unisys, the University of Oslo, and the University of Wisconsin.

The original intent was a single standard for all buses in the computer.[3] The working group soon came up with the idea of using point-to-point communication in the form of insertion rings. This avoided the problems of lumped capacitance and physical length limits imposed by the speed of light, eliminated stub reflections, and allowed parallel transactions. The use of insertion rings is credited to Manolis Katevenis, who suggested it at one of the early meetings of the working group. The working group developing the standard was led by David B. Gustavson (chair) and David V. James (vice chair).[4]

David V. James was a major contributor to the specifications, including writing the executable C code. Stein Gjessing's group at the University of Oslo used formal methods to verify the coherence protocol, and Dolphin Server Technology implemented a node controller chip including the cache coherence logic.

[Figure: block diagram of an example NUMA system]

Different versions and derivatives of SCI were implemented by companies such as Dolphin Interconnect Solutions, Convex, Data General (for its AViiON servers, using cache controller and link controller chips from Dolphin), Sequent, and Cray Research. Dolphin Interconnect Solutions implemented a PCI- and PCI Express-connected derivative of SCI that provides non-coherent shared memory access. This implementation was used by Sun Microsystems for its high-end clusters, by the Thales Group, and by several others, including volume applications for message passing within HPC clusters and medical imaging. SCI was often used to implement non-uniform memory access (NUMA) architectures. It was also used by Sequent Computer Systems as the processor-memory bus in their NUMA-Q systems. Numascale developed a derivative that connects with coherent HyperTransport.

The standard

The standard defined two interface levels:

- the physical level, which deals with electrical signals, connectors, and mechanical and thermal conditions
- the logical level, which describes the address space, data-transfer protocols, cache-coherence mechanisms, synchronization primitives, control and status registers, and initialization and error-recovery facilities

This structure allowed new developments in physical interface technology to be easily adapted without any redesign on the logical level.

Scalability for large systems is achieved through a distributed, directory-based cache coherence model. (The other popular class of cache-coherence models relies on system-wide eavesdropping, or snooping, of memory transactions, a scheme that is not very scalable.) In SCI, each node contains a directory with a pointer to the next node in a linked list that shares a particular cache line.

SCI defines a 64-bit flat address space (16 exabytes) where 16 bits are used for identifying a node (65,536 nodes) and 48 bits for address within the node (256 terabytes). A node can contain many processors and/or memory. The SCI standard defines a packet switched network.
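
As a sketch of this address format, the following C fragment packs and unpacks SCI-style 64-bit addresses. Only the 16/48-bit split comes from the standard; the macro and function names here are hypothetical.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative split of the 64-bit SCI address: the top 16 bits
       identify a node, the low 48 bits address memory within it. */
    #define SCI_OFFSET_BITS 48
    #define SCI_OFFSET_MASK ((UINT64_C(1) << SCI_OFFSET_BITS) - 1)

    static uint64_t sci_make_address(uint16_t node, uint64_t offset)
    {
        return ((uint64_t)node << SCI_OFFSET_BITS) | (offset & SCI_OFFSET_MASK);
    }

    static uint16_t sci_node_of(uint64_t addr)
    {
        return (uint16_t)(addr >> SCI_OFFSET_BITS);
    }

    static uint64_t sci_offset_of(uint64_t addr)
    {
        return addr & SCI_OFFSET_MASK;
    }

    int main(void)
    {
        uint64_t a = sci_make_address(0x00FF, 0x123456789ABCULL);
        printf("node 0x%04X, offset 0x%012llX\n",
               sci_node_of(a), (unsigned long long)sci_offset_of(a));
        return 0;
    }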

Topologies

SCI can be used to build systems with different types of switching topologies, from centralized to fully distributed switching:

- a single ring, the simplest case, where the node interfaces themselves provide the switching
- a central switch connecting nodes or small ring segments (centralized switching)
- multi-dimensional tori, in which switching is fully distributed among the nodes

The most common way to describe these multi-dimensional topologies is k-ary n-cubes (or tori). The SCI standard specification mentions several such topologies as examples.

The 2-D torus is a combination of rings in two dimensions. Switching between the two dimensions requires a small switching capability in the node. This can be expanded to three or more dimensions. The concept of folding rings can also be applied to torus topologies to avoid long connection segments.
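
A toy C sketch of such a k-ary 2-cube follows, showing how each node finds its successor on the unidirectional ring in each dimension. The node-numbering scheme and the 4-ary size are illustrative assumptions, not part of the standard.

    #include <stdio.h>

    /* Hypothetical node numbering on a k-ary 2-cube (2-D torus): each
       node sits on one X ring and one Y ring, and the wrap-around gives
       every node a successor in each dimension. k = 4 is arbitrary. */
    enum { K = 4 };

    static int node_id(int x, int y) { return y * K + x; }

    int main(void)
    {
        for (int y = 0; y < K; y++)
            for (int x = 0; x < K; x++)
                printf("node %2d: next-X %2d, next-Y %2d\n",
                       node_id(x, y),
                       node_id((x + 1) % K, y),   /* successor on the X ring */
                       node_id(x, (y + 1) % K));  /* successor on the Y ring */
        return 0;
    }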

Transactions

SCI sends information in packets. Each packet consists of an unbroken sequence of 16-bit symbols, each accompanied by a flag bit. A transition of the flag bit from 0 to 1 indicates the start of a packet; a transition from 1 to 0 occurs one symbol (for echoes) or four symbols before the packet end. A packet contains a header with address, command, and status information, a payload (from zero up to an optional amount of data), and a CRC check symbol. The first symbol in the packet header contains the destination node address. If the address is not within the domain handled by the receiving node, the packet is passed to the output through the bypass FIFO. Otherwise, the packet is fed to a receive queue and may be transferred to a ring in another dimension. All packets are marked when they pass the scrubber (one node is established as the scrubber when the ring is initialized). Packets without a valid destination address are removed when they pass the scrubber a second time, to avoid filling the ring with packets that would otherwise circulate indefinitely.
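
As a rough illustration of this flag-bit framing, the C sketch below scans a stream of symbol/flag pairs and reports packet starts. The struct layout and function names are hypothetical, and real SCI links implement framing in hardware rather than software.

    #include <stdint.h>
    #include <stdio.h>

    /* One 16-bit symbol with its accompanying flag bit (illustrative). */
    struct sci_symbol {
        uint16_t data;
        uint8_t  flag;   /* framing flag: a 0 -> 1 transition marks a start */
    };

    /* Report packet starts by watching for 0 -> 1 flag transitions; per
       the text, the first symbol carries the destination node address. */
    static void scan_packet_starts(const struct sci_symbol *s, size_t n)
    {
        uint8_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            if (s[i].flag && !prev)
                printf("packet start at symbol %zu, destination node 0x%04X\n",
                       i, s[i].data);
            prev = s[i].flag;
        }
    }

    int main(void)
    {
        const struct sci_symbol stream[] = {
            {0x0000, 0}, {0x00FF, 1}, {0xBEEF, 1}, {0x0000, 0}, {0x0042, 1},
        };
        scan_packet_starts(stream, sizeof stream / sizeof stream[0]);
        return 0;
    }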

Cache coherence

Cache coherence ensures data consistency in multiprocessor systems. The simplest form, applied in earlier systems, was based on clearing the cache contents between context switches and disabling the cache for data shared between two or more processors. These methods were feasible when the performance difference between cache and memory was less than one order of magnitude. Modern processors, with caches that are more than two orders of magnitude faster than main memory, would not perform anywhere near optimally without more sophisticated methods for data consistency. Bus-based systems use eavesdropping (snooping) methods, since buses are inherently broadcast media. Modern systems with point-to-point links use broadcast methods with snoop-filter options to improve performance. Since broadcast and eavesdropping are inherently non-scalable, they are not used in SCI.

Instead, SCI uses a distributed directory-based cache coherence protocol, with a linked list of the nodes containing processors that share a particular cache line. Each node holds a directory for its main memory with a tag for each line of memory (the same line length as the cache line). The memory tag holds a pointer to the head of the linked list and a state code for the line (three states: home, fresh, gone). Associated with each node is also a cache for holding remote data, with a directory containing forward and backward pointers to the nodes in the linked list sharing the cache line. The cache tag has seven states (invalid, only fresh, head fresh, only dirty, head dirty, mid valid, tail valid).
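
A minimal C sketch of this directory state follows. The state names, 16-bit node pointers, and forward/backward links come from the text above; the type names and field layouts are illustrative assumptions, not the standard's actual encoding.

    #include <stdint.h>
    #include <stdio.h>

    enum mem_state { MEM_HOME, MEM_FRESH, MEM_GONE };

    enum cache_state {
        CACHE_INVALID,
        CACHE_ONLY_FRESH, CACHE_HEAD_FRESH,
        CACHE_ONLY_DIRTY, CACHE_HEAD_DIRTY,
        CACHE_MID_VALID,  CACHE_TAIL_VALID
    };

    /* Per-line tag in a node's main-memory directory: a state plus a
       pointer (16-bit node ID) to the head of the sharing list. */
    struct mem_tag {
        enum mem_state state;
        uint16_t head_node;
    };

    /* Per-line tag in a node's remote-data cache: forward and backward
       pointers link the sharing nodes into a distributed list. */
    struct cache_tag {
        enum cache_state state;
        uint16_t forward_node;
        uint16_t backward_node;
    };

    int main(void)
    {
        /* A line whose only (dirty) copy is held by node 5: memory is in
           the "gone" state, and node 5's one-element list is "only dirty"
           (the forward/backward pointers are unused here). */
        struct mem_tag   mem   = { MEM_GONE, 5 };
        struct cache_tag cache = { CACHE_ONLY_DIRTY, 5, 5 };
        printf("memory state %d, head node %u, cache state %d\n",
               (int)mem.state, (unsigned)mem.head_node, (int)cache.state);
        return 0;
    }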

The distributed directory is scalable: the overhead for directory-based cache coherence is a constant percentage of each node's memory and cache. This percentage is on the order of 4% for memory and 7% for cache.
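
These figures are consistent with a rough, illustrative count, assuming for the sake of example a 64-byte (512-bit) line: a memory tag needs a 16-bit head pointer plus 2 bits for the three states, about 18/512, roughly 3.5% of the line, while a cache tag needs two 16-bit pointers plus 3 bits for the seven states, about 35/512, roughly 6.8%.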

Legacy

SCI is a standard for connecting the different resources within a multiprocessor computer system, so it is not as widely known to the public as, for example, the Ethernet family of standards for connecting separate systems. Different system vendors implemented different variants of SCI for their internal system infrastructure. These implementations interface with very intricate mechanisms in processors and memory systems, and each vendor had to preserve some degree of compatibility for both hardware and software.

Gustavson led a group called the Scalable Coherent Interface and Serial Express Users, Developers, and Manufacturers Association (SCIzzL) and maintained a web site for the technology starting in 1996.[3] A series of workshops was held through 1999. After the first 1992 edition,[1] follow-on projects defined shared data formats in 1993,[5] a version using low-voltage differential signaling in 1996,[6] and a memory interface known as RamLink later in 1996.[7] In January 1998, the SLDRAM corporation was formed to hold patents on an attempt to define a new memory interface, related to another working group effort called SerialExpress or Local Area Memory Port.[8][9] However, by early 1999 the new memory standard was abandoned.[10]

In 1999 a series of papers was published as a book on SCI.[11] An updated specification was published in July 2000 by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) as ISO/IEC 13961.[12]

See also

- Non-uniform memory access
- Network topology
- Futurebus
- QuickRing
- Cache coherence
- Bus snooping
- Directory-based coherence
- Distributed shared memory
- Cache-only memory architecture
- Content-addressable memory
- Dolphin Interconnect Solutions
- RapidIO
- Fireplane
- IEEE 1355
- Computer network
- Forwarding plane
- SGI Origin 2000 and Origin 3000
- Coherent Accelerator Processor Interface
- Butterfly network

References

  1. IEEE Standard for Scalable Coherent Interface (SCI). IEEE Std 1596-1992. IEEE Standards Board. 1992. ISBN 9780738129501.
  2. David B. Gustavson (September 1991). "The Scalable Coherent Interface and Related Standards Projects" (PDF). SLAC Publication 5656. Stanford Linear Accelerator Center. Retrieved August 31, 2013.
  3. "Scalable Coherent Interface and Serial Express Users, Developers, and Manufacturers Association". Group web site. Retrieved August 31, 2013.
  4. "1596 WG - Working Group for Scalable Coherent Interface". Working group web site. Retrieved August 31, 2013.
  5. IEEE Standard for Shared-Data Formats Optimized for Scalable Coherent Interface (SCI) Processors. IEEE Std 1596.5-1993. IEEE Standards Board. April 25, 1994. ISBN 9780738112091.
  6. IEEE Standard for Low-Voltage Differential Signals (LVDS) for Scalable Coherent Interface (SCI). IEEE Std 1596.3-1996. IEEE Standards Board. July 31, 1996. ISBN 9780738131368.
  7. IEEE Standard for High-Bandwidth Memory Interface Based on Scalable Coherent Interface (SCI) Signaling Technology (RamLink). IEEE Std 1596.4-1996. IEEE Standards Board. September 16, 1996. ISBN 9780738131375.
  8. David B. Gustavson (February 10, 1999). "Organizing for Alternatives".
  9. David V. James; David B. Gustavson; B. Fleischer (May–June 1998). "SerialExpress: a high performance workstation interconnect". IEEE Micro. 18 (3): 54–65. doi:10.1109/40.683105.
  10. David Lammers (February 19, 1999). "ISSCC: SLDRAM group morphs to DDR II". EE Times.
  11. Hermann Hellwagner; Alexander Reinefeld, eds. (1999). SCI: Scalable Coherent Interface: Architecture and Software for High-Performance Compute Clusters. Lecture Notes in Computer Science. Springer. ISBN 978-3540666967.
  12. Scalable Coherent Interface (SCI) (PDF). International Standard ISO/IEC 13961, IEEE Std 1596. July 10, 2000.