Load-balanced switch

Last updated September 15, 2022

A load-balanced switch is a switch architecture that guarantees 100% throughput with no central arbitration, at the cost of sending each packet across the crossbar twice. Load-balanced switches are a subject of research for large routers scaled past the point of practical central arbitration.

Introduction

Internet routers are typically built using line cards connected by a switch. Routers supporting moderate total bandwidth may use a bus as their switch, but high-bandwidth routers typically use a crossbar interconnection. In a crossbar, each output connects to at most one input at a time, so that information can flow through every output simultaneously. Crossbars used for packet switching are typically reconfigured tens of millions of times per second. The schedule of these configurations is determined by a central arbiter, for example a Wavefront arbiter, in response to requests by the line cards to send information to one another.

Perfect arbitration would result in throughput limited only by the maximum throughput of each crossbar input or output. For example, if all traffic coming into line cards A and B is destined for line card C, then the maximum traffic that cards A and B can process together is limited by the rate of C. Perfect arbitration has been shown to require massive amounts of computation, which scale up much faster than the number of ports on the crossbar. Practical systems use imperfect arbitration heuristics (such as iSLIP) that can be computed in reasonable amounts of time.

A load-balanced switch is not related to a load balancing switch, which refers to a kind of router used as a front end to a farm of web servers to spread requests to a single website across many servers.

Basic architecture

As shown in the figure to the right, a load-balanced switch has N input line cards, each of rate R, each connected to N buffers by a link of rate R/N. Those buffers are in turn each connected to N output line cards, each of rate R, by links of rate R/N. The buffers in the center are partitioned into N virtual output queues.

Each input line card spreads its packets evenly to the N buffers, something it can clearly do without contention. Each buffer writes these packets into a single buffer-local memory at a combined rate of R. Simultaneously, each buffer sends packets at the head of each virtual output queue to each output line card, again at rate R/N to each card. The output line card can clearly forward these packets out the line with no contention.
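The two-stage operation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes fixed-size packets, discrete time slots, at most one arrival per input per slot, and a fixed round-robin connection pattern in which input i is connected to buffer (i + t) mod N and buffer b is connected to output (b + t) mod N in slot t. The connection pattern and the names simulate, arrivals and voq are assumptions made for this example.

```python
from collections import deque


def simulate(n, arrivals, slots):
    """Run a toy load-balanced switch with n inputs, n buffers, n outputs.

    arrivals maps a time slot to a list of (input, output, packet_id)
    tuples, with at most one packet per input per slot (assumption).
    Returns, for each output, the packet ids in delivery order.
    """
    # voq[b][o] is the virtual output queue in buffer b for output o.
    voq = [[deque() for _ in range(n)] for _ in range(n)]
    delivered = [[] for _ in range(n)]
    for t in range(slots):
        # Second stage: buffer b is connected to output (b + t) % n and
        # forwards the head of the matching virtual output queue, if any.
        for b in range(n):
            o = (b + t) % n
            if voq[b][o]:
                delivered[o].append(voq[b][o].popleft())
        # First stage: input i is connected to buffer (i + t) % n, so
        # distinct inputs always reach distinct buffers (no contention).
        for i, o, packet_id in arrivals.get(t, []):
            b = (i + t) % n
            voq[b][o].append(packet_id)
    return delivered


example = simulate(
    n=2,
    arrivals={0: [(0, 1, "a"), (1, 0, "b")], 1: [(0, 1, "c")]},
    slots=6,
)
print(example)  # [['b'], ['a', 'c']]
```

Neither stage consults the traffic when choosing its connections, which is why no arbiter is needed. The only traffic-dependent state is the occupancy of the virtual output queues.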

Each buffer in a load-balanced switch acts as a shared-memory switch, and a load-balanced switch is essentially a way to scale up a shared-memory switch, at the cost of additional latency associated with forwarding packets at rate R/N twice.

The Stanford group investigating load-balanced switches is concentrating on implementations where the number of buffers is equal to the number of line cards. One buffer is placed on each line card, and the two interconnection meshes are actually the same mesh, supplying rate 2R/N between every pair of line cards. But the basic load-balanced switch architecture does not require that the buffers be placed on the line cards, or that there be the same number of buffers and line cards.
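The link rates in this combined-mesh arrangement follow directly from R and N. The short sketch below works through the arithmetic. The function name mesh_rates and the sample figures (R = 100, N = 16) are hypothetical, and the per-card total counts the link from a card to its own buffer as one of the N links, which is an assumption of this example.

```python
def mesh_rates(line_rate, n):
    """Return (per-stage link rate, combined link rate, per-card mesh total).

    line_rate is R and n is the number of line cards, with one buffer
    per line card as in the combined-mesh arrangement.
    """
    per_stage_link = line_rate / n        # R/N for each of the two stages
    combined_link = 2 * per_stage_link    # 2R/N when both stages share a mesh
    per_card_total = n * combined_link    # N links of 2R/N, i.e. 2R per card
    return per_stage_link, combined_link, per_card_total


# Hypothetical example: 16 line cards at a line rate of 100 (any unit).
print(mesh_rates(100.0, 16))  # (6.25, 12.5, 200.0)
```

The per-card total of 2R reflects the cost named at the top of the article: every packet crosses the mesh twice.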

One interesting property of a load-balanced switch is that, although the mesh connecting line cards to buffers is required to connect every line card to every buffer, there is no requirement that the mesh act as a non-blocking crossbar, nor that the connections be responsive to any traffic pattern. Such a connection is far simpler than a centrally arbitrated crossbar.

Keeping packets in-order

If two packets destined for the same output arrive back-to-back at one line card, they will be spread to two different buffers, which could have two different occupancies, and so the packets could be reordered by the time they are delivered to the output. Although reordering is legal, it is typically undesirable because TCP does not perform well with reordered packets.
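The reordering scenario can be reproduced with a small sketch. It assumes the same slotted model as a basic load-balanced switch with a round-robin connection pattern (an assumption, since the text does not fix one): input 0 sends to buffer t mod N in slot t, and buffer b is connected to output 0 whenever (b + t) mod N equals 0. The function name departure_order and the backlog parameter are hypothetical.

```python
from collections import deque


def departure_order(n, backlog, slots=64):
    """Order in which two back-to-back packets reach output 0.

    Packets "first" and "second" arrive at input 0 in slots 0 and 1,
    both destined for output 0. backlog[b] is the number of older
    packets already queued in buffer b for output 0.
    """
    queues = [deque(f"old{b}.{k}" for k in range(backlog[b])) for b in range(n)]
    out = []
    for t in range(slots):
        # Exactly one buffer is connected to output 0 in each slot.
        b = (-t) % n
        if queues[b]:
            out.append(queues[b].popleft())
        # Input 0 is connected to buffer t % n in slot t.
        if t == 0:
            queues[t % n].append("first")
        elif t == 1:
            queues[t % n].append("second")
    return [p for p in out if p in ("first", "second")]


print(departure_order(2, backlog=[0, 0]))  # ['first', 'second']
print(departure_order(2, backlog=[3, 0]))  # ['second', 'first']
```

With equal occupancies the packets leave in arrival order. With three older packets queued in buffer 0, the second packet overtakes the first.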

By adding yet more latency and buffering, the load-balanced switch can maintain packet order within flows using only local information. One such algorithm is FOFF (Fully Ordered Frames First). FOFF has the additional benefits of removing any vulnerability to pathological traffic patterns, and providing a mechanism for implementing priorities.

Implementations

Single chip crossbar plus load-balancing arbiter

The Stanford University Tiny Tera project (see Abrizio) introduced a switch architecture that required at least two chip designs for the switching fabric itself (the crossbar slice and the arbiter). Upgrading the arbiter to include load-balancing and combining these devices could have reliability, cost and throughput advantages.

Single global router

Since the line cards in a load-balanced switch do not need to be physically near one another, one possible implementation is to use an entire continent- or global-sized backbone network as the interconnection mesh, and core routers as the "line cards". Such an implementation suffers from having all latencies increased to twice the worst-case transmission latency. But it has a number of intriguing advantages:

Large backbone packet networks typically have massive overcapacity (10x or more) to deal with imperfect capacity planning, congestion, and other problems. A load-balanced switch backbone can deliver 100% throughput with an overcapacity of just 2x, as measured across the whole system.
The underpinnings of large backbone networks are usually optical channels that cannot be quickly switched. These map well to the constant-rate 2R/N channels of the load-balanced switch's mesh.
No route tables need be changed based on global congestion information, because there is no global congestion.
Rerouting in the case of a node failure does require changing the configuration of the optical channels. But the reroute can be precomputed (there are only a finite number of nodes that can fail), and the reroute causes no congestion that would then require further route table changes.

External links

Optimal Load-Balancing, by I. Keslassy, C. Chang, N. McKeown, and D. Lee
Scaling Internet Routers Using Optics, by I. Keslassy, S. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard, and N. McKeown
