Fat tree

[Figure: A fat tree]
[Figure: A 2-level fat tree with 8-port switches]

The fat tree network is a universal network for provably efficient communication.[1] It was invented by Charles E. Leiserson of the Massachusetts Institute of Technology in 1985.[1] k-ary n-trees, the type of fat-trees commonly used in most high-performance networks, were initially formalized in 1997.[2]


In a tree data structure, every branch has the same thickness, regardless of its place in the hierarchy: they are all "skinny" (skinny in this context means low-bandwidth). In a fat tree, branches nearer the top of the hierarchy are "fatter" (thicker) than branches further down the hierarchy. In a telecommunications network, the branches are data links; the varied thickness (bandwidth) of the data links allows for more efficient and technology-specific use.
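
A minimal Python sketch may make this concrete (the binary branching and the rule that a link's capacity equals the number of leaf processors beneath it are illustrative assumptions, not Leiserson's exact construction):

    # Illustrative binary fat tree: links get fewer toward the root,
    # but each link gets proportionally fatter.
    def fat_tree_link_capacities(levels, unit_bandwidth=1):
        """Map level -> (number of links, capacity per link); level 0 is at the leaves."""
        capacities = {}
        for level in range(levels):
            links = 2 ** (levels - level)            # halves at each step up
            per_link = unit_bandwidth * 2 ** level   # doubles at each step up
            capacities[level] = (links, per_link)
        return capacities

    for level, (links, cap) in fat_tree_link_capacities(4).items():
        print(f"level {level}: {links} links x {cap} units = {links * cap} aggregate")

In an ordinary "skinny" tree the capacity per link would stay constant, so the aggregate bandwidth would halve at every level and the root would become a bottleneck; in the fat tree above, every level offers the same aggregate bandwidth.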

Mesh and hypercube topologies, by contrast, have communication requirements that follow a rigid algorithm and cannot be tailored to specific packaging technologies.[3]

Applications in supercomputers

Supercomputers that use a fat tree network[4] include the two fastest as of late 2018,[5] Summit[6] and Sierra,[7] as well as Tianhe-2,[8] the Meiko Scientific CS-2, Yellowstone, the Earth Simulator, the Cray X2, the Connection Machine CM-5, and various Altix supercomputers.

Mercury Computer Systems applied a variant of the fat tree topology, the hypertree network, to their multicomputers. In this architecture, 2 to 360 compute nodes are arranged in a circuit-switched fat tree network. Each node has local memory that can be mapped by any other node. Each node in this heterogeneous system could be an Intel i860, a PowerPC, or a group of three SHARC digital signal processors.

The fat tree network was particularly well suited to fast Fourier transform computations, which customers used for signal processing tasks such as radar, sonar, and medical imaging.

In August 2008, a team of computer scientists at UCSD published a scalable design for network architecture[9] that uses a topology inspired by the fat tree to realize networks that scale better than those of previous hierarchical networks. The architecture uses commodity switches that are cheaper and more power-efficient than high-end modular data center switches.

Strictly speaking, this topology is a special instance of a Clos network rather than a fat tree as described above, because the fat links near the root are emulated by many links to separate parents instead of a single high-capacity link to a single parent. Nevertheless, many authors continue to use the term fat tree for it.
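
The element counts of this design follow directly from the topology in the paper:[9] identical k-port switches form k pods, each containing k/2 edge and k/2 aggregation switches, plus (k/2)^2 core switches, supporting k^3/4 hosts in all. A short Python sketch of that arithmetic (written here purely for illustration):

    # Element counts for a three-tier fat tree built from identical
    # k-port commodity switches, per Al-Fares et al. [9].
    def fat_tree_counts(k):
        assert k % 2 == 0, "port count k must be even"
        return {
            "pods": k,
            "edge switches": k * (k // 2),         # k/2 per pod
            "aggregation switches": k * (k // 2),  # k/2 per pod
            "core switches": (k // 2) ** 2,
            "hosts": k ** 3 // 4,                  # k/2 hosts per edge switch
        }

    print(fat_tree_counts(8))   # 8-port switches: 128 hosts, 16 core switches
    print(fat_tree_counts(48))  # 48-port switches: 27,648 hosts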

Related Research Articles

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. Distributed computing is a field of computer science that studies distributed systems.

Supercomputer

A supercomputer is a computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instructions per second (MIPS). Since 2017, there have existed supercomputers which can perform over 10^17 FLOPS (a hundred quadrillion FLOPS, 100 petaFLOPS or 100 PFLOPS). For comparison, a desktop computer has performance in the range of hundreds of gigaFLOPS (10^11) to tens of teraFLOPS (10^13). Since November 2017, all of the world's fastest 500 supercomputers have run on Linux-based operating systems. Additional research is being conducted in the United States, the European Union, Taiwan, Japan, and China to build faster, more powerful and technologically superior exascale supercomputers.
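
The magnitude gap quoted above can be checked with a line of arithmetic (the ratings below are just the round numbers from this paragraph):

    # Ratio of a 100 PFLOPS supercomputer to desktop-class machines.
    supercomputer = 1e17                     # 100 PFLOPS
    desktop_low, desktop_high = 1e11, 1e13   # hundreds of GFLOPS to tens of TFLOPS
    print(supercomputer / desktop_high)      # 1e4: ten thousand high-end desktops
    print(supercomputer / desktop_low)       # 1e6: a million low-end desktops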

Parallel computing

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.
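
A minimal Python sketch of the divide-and-solve-simultaneously idea, using task parallelism over processes (a generic illustration, not tied to any particular system):

    # Split a large summation into chunks and compute the chunks in parallel.
    from concurrent.futures import ProcessPoolExecutor

    def partial_sum(bounds):
        lo, hi = bounds
        return sum(range(lo, hi))

    if __name__ == "__main__":
        n, workers = 10_000_000, 4
        step = n // workers
        chunks = [(i * step, (i + 1) * step) for i in range(workers)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            total = sum(pool.map(partial_sum, chunks))
        assert total == n * (n - 1) // 2   # matches the serial result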

PARAM is a series of supercomputers designed and assembled by the Centre for Development of Advanced Computing (C-DAC) in Pune. PARAM means "supreme" in the Sanskrit language, whilst also forming an acronym for "PARAllel Machine". As of November 2022, the fastest machine in the series is the PARAM Siddhi AI, which ranks 120th in the world with an Rpeak of 5.267 petaflops.

Scalability is the property of a system to handle a growing amount of work. One definition for software systems specifies that this may be done by adding resources to the system.

In parallel computer architectures, a systolic array is a homogeneous network of tightly coupled data processing units (DPUs) called cells or nodes. Each node or DPU independently computes a partial result as a function of the data received from its upstream neighbours, stores the result within itself and passes it downstream. Systolic arrays were first used in Colossus, an early computer used to break German Lorenz ciphers during World War II. Due to the classified nature of Colossus, they were independently invented or rediscovered by H. T. Kung and Charles Leiserson, who described arrays for many dense linear algebra computations for banded matrices. Early applications include computing greatest common divisors of integers and polynomials. They are sometimes classified as multiple-instruction single-data (MISD) architectures under Flynn's taxonomy, but this classification is questionable, because a strong argument can be made to distinguish systolic arrays from any of Flynn's four categories (SISD, SIMD, MISD, MIMD).
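
A toy clock-by-clock simulation of a one-dimensional systolic array may make the data flow concrete. Here each cell holds one fixed FIR filter coefficient, receives a sample and a partial sum from its upstream neighbour, and passes both downstream; the two-stage sample delay per cell is one standard way to align the two streams, and the whole sketch is illustrative rather than a hardware description:

    # Each cell: y_out = y_in + w * x_in, then pass x and y downstream.
    # Samples pass through two delay stages per cell so that each partial
    # sum meets progressively older samples, yielding a convolution.
    def systolic_fir(weights, samples):
        n = len(weights)
        xd = [[0.0, 0.0] for _ in range(n)]   # two x-delay registers per cell
        yd = [0.0] * n                        # one y-delay register per cell
        outputs = []
        for t, x in enumerate(samples + [0.0] * (2 * n - 2)):  # flush the pipe
            new_xd, new_yd = [], []
            x_in, y_in = x, 0.0
            for i, w in enumerate(weights):
                new_yd.append(y_in + w * x_in)    # the cell's partial result
                new_xd.append([x_in, xd[i][0]])   # shift x one stage along
                x_in, y_in = xd[i][1], yd[i]      # handed to the next cell
            xd, yd = new_xd, new_yd
            if t >= n - 1:
                outputs.append(yd[-1])
        return outputs

    print(systolic_fir([1.0, 2.0, 3.0], [1.0, 1.0]))  # [1.0, 3.0, 5.0, 3.0]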

Multiple instruction, single data

In computing, multiple instruction, single data (MISD) is a type of parallel computing architecture where many functional units perform different operations on the same data. Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers that execute the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may also be considered to belong to this type. Applications for this architecture are much less common than for MIMD and SIMD, as the latter two are often more appropriate for common data-parallel techniques; specifically, they allow better scaling and use of computational resources. One prominent example of MISD in computing is the set of Space Shuttle flight control computers.
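
A small Python sketch of the task-replication idea (the three redundant "units" and the injected fault are hypothetical, chosen only to show the majority vote):

    # Redundant units all process the same datum; a majority vote masks
    # the faulty unit's result.
    from collections import Counter

    def redundant_units(x):
        unit_a = lambda v: v * v
        unit_b = lambda v: v ** 2
        unit_c = lambda v: v * v + 1          # injected fault
        return [unit(x) for unit in (unit_a, unit_b, unit_c)]

    def vote(results):
        value, count = Counter(results).most_common(1)[0]
        if count <= len(results) // 2:
            raise RuntimeError("no majority; fault cannot be masked")
        return value

    print(vote(redundant_units(7)))   # 49, outvoting the faulty unit's 50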

ASCI Red

ASCI Red was the first computer built under the Accelerated Strategic Computing Initiative (ASCI), the supercomputing initiative of the United States government created to help maintain the United States nuclear arsenal after the 1992 moratorium on nuclear testing.

Charles E. Leiserson

Charles Eric Leiserson is a computer scientist specializing in the theory of parallel and distributed computing, and particularly its practical applications. He invented the fat-tree interconnection network, a hardware-universal interconnection network used in many supercomputers, including the Connection Machine CM-5, for which he was network architect. He helped pioneer the development of VLSI theory, including the retiming method of digital optimization with James B. Saxe and systolic arrays with H. T. Kung. He conceived of the notion of cache-oblivious algorithms, which have no tuning parameters for cache size or cache-line length but nevertheless use cache near-optimally. He also developed the Cilk language for multithreaded programming, which uses a provably good work-stealing algorithm for scheduling. Leiserson coauthored the standard algorithms textbook Introduction to Algorithms together with Thomas H. Cormen, Ronald L. Rivest, and Clifford Stein.

Hardware acceleration

Hardware acceleration is the use of computer hardware designed to perform specific functions more efficiently when compared to software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.

Network on a chip

A network on a chip or network-on-chip is a network-based communications subsystem on an integrated circuit ("microchip"), most typically between modules in a system on a chip (SoC). The modules on the IC are typically semiconductor IP cores schematizing various functions of the computer system, and are designed to be modular in the sense of network science. The network on chip is a router-based packet switching network between SoC modules.

Computer cluster

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.

A hypertree network is a network topology that shares some traits with the binary tree network. It is a variation of the fat tree architecture.

SpiNNaker

SpiNNaker is a massively parallel, manycore supercomputer architecture designed by the Advanced Processor Technologies Research Group (APT) at the Department of Computer Science, University of Manchester. It is composed of 57,600 processing nodes, each with 18 ARM9 processors and 128 MB of mobile DDR SDRAM, totalling 1,036,800 cores and over 7 TB of RAM. The computing platform is based on spiking neural networks, useful in simulating the human brain.

A distributed file system for the cloud is a file system that allows many clients to have access to data and supports operations on that data. Each data file may be partitioned into several parts called chunks, and each chunk may be stored on a different remote machine, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured; confidentiality, availability and integrity are the main keys for a secure system.
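
A minimal sketch of the chunking idea in Python (the 64 MB chunk size and the round-robin placement are illustrative assumptions, not any particular system's policy):

    # Split a file into fixed-size chunks and assign each chunk to a machine.
    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, an assumed chunk size

    def chunk_placement(file_size, machines):
        chunks = -(-file_size // CHUNK_SIZE)  # ceiling division
        return {i: machines[i % len(machines)] for i in range(chunks)}

    print(chunk_placement(200 * 1024 * 1024, ["m0", "m1", "m2"]))
    # {0: 'm0', 1: 'm1', 2: 'm2', 3: 'm0'} -> 4 chunks for a 200 MB file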

A data center is a pool of resources interconnected using a communication network. The data center network (DCN) holds a pivotal role in a data center, as it interconnects all of the data center's resources. DCNs need to be scalable and efficient to connect tens or even hundreds of thousands of servers to handle the growing demands of cloud computing. Today's data centers are constrained by the interconnection network.

In computing, energy proportionality is a measure of the relationship between power consumed in a computer system, and the rate at which useful work is done. If the overall power consumption is proportional to the computer's utilization, then the machine is said to be energy proportional. Equivalently stated, for an idealized energy proportional computer, the overall energy per operation is constant for all possible workloads and operating conditions.
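
A small Python sketch of the definition (the power figures are illustrative assumptions, and the linear power model is a common simplification):

    # Energy per operation under a linear power model: idle power makes a
    # non-proportional machine waste energy at low utilization.
    def energy_per_op(utilization, idle_w=100.0, peak_w=200.0, peak_ops=1e9):
        power = idle_w + (peak_w - idle_w) * utilization  # watts drawn
        return power / (peak_ops * utilization)           # joules per operation

    for u in (0.1, 0.5, 1.0):
        print(f"{u:.0%}: {energy_per_op(u):.2e} J/op")
    # An ideally energy-proportional machine (idle_w = 0) would print the
    # same J/op at every utilization.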

In the high-performance computing environment, a burst buffer is a fast intermediate storage layer positioned between the front-end computing processes and the back-end storage systems. It bridges the performance gap between the processing speed of the compute nodes and the input/output (I/O) bandwidth of the storage systems. Burst buffers are often built from arrays of high-performance storage devices, such as NVRAM and SSD, and typically offer one to two orders of magnitude higher I/O bandwidth than the back-end storage systems.

A deep learning processor (DLP), or a deep learning accelerator, is an electronic circuit designed for deep learning algorithms, usually with separate data memory and dedicated instruction set architecture. Deep learning processors range from mobile devices, such as neural processing units (NPUs) in Huawei cellphones, to cloud computing servers such as tensor processing units (TPU) in the Google Cloud Platform.

Arun K. Somani is Associate Dean for Research of the College of Engineering, Distinguished Professor of Electrical and Computer Engineering, and Philip and Virginia Sproul Professor at Iowa State University. Somani was an elected Fellow of the Institute of Electrical and Electronics Engineers (IEEE) from 1999 to 2017, cited for "contributions to theory and applications of computer networks", and has been a Life Fellow of the IEEE since 2018. He is a Distinguished Engineer of the Association for Computing Machinery (ACM) and an elected Fellow of the American Association for the Advancement of Science (AAAS).

References

  1. Leiserson, Charles E. (October 1985). "Fat-trees: universal networks for hardware-efficient supercomputing" (PDF). IEEE Transactions on Computers. 34 (10): 892–901. doi:10.1109/TC.1985.6312192. S2CID 8927584.
  2. Petrini, Fabrizio (1997). "K-ary n-trees: High performance networks for massively parallel architectures". Proceedings of the 11th International Parallel Processing Symposium. pp. 87–93. doi:10.1109/IPPS.1997.580853. ISBN 0-8186-7793-7. S2CID 6608892.
  3. Leiserson, Charles E.; Abuhamdeh, Zahi S.; Douglas, David C.; Feynman, Carl R.; Ganmukhi, Mahesh N.; Hill, Jeffrey V.; Hillis, W. Daniel; Kuszmaul, Bradley C.; St. Pierre, Margaret A.; Wells, David S.; Wong, Monica C.; Yang, Shaw-Wen; Zak, Robert (1992). "The Network Architecture of the Connection Machine CM-5". SPAA '92: Proceedings of the Fourth Annual ACM Symposium on Parallel Algorithms and Architectures. ACM. pp. 272–285. doi:10.1145/140901.141883. ISBN 978-0-89791-483-3. S2CID 6307237.
  4. Deng, Yuefan (2013). "3.2.1 Hardware systems: Network Interconnections: Topology". Applied Parallel Computing. World Scientific. p. 25. ISBN 978-981-4307-60-4.
  5. "November 2018 TOP500". TOP500. November 2018. Retrieved 2019-02-11.
  6. "Summit - Oak Ridge National Laboratory's next High Performance Supercomputer". Oak Ridge Leadership Computing Facility. Retrieved 2019-02-11.
  7. Barney, Blaise (2019-01-18). "Using LC's Sierra Systems - Hardware - Mellanox EDR InfiniBand Network - Topology and LC Sierra Configuration". Lawrence Livermore National Laboratory. Retrieved 2019-02-11.
  8. Dongarra, Jack (2013-06-03). "Visit to the National University for Defense Technology Changsha, China" (PDF). Netlib. Retrieved 2013-06-17.
  9. Al-Fares, Mohammad; Loukissas, Alexander; Vahdat, Amin (2008). "A scalable, commodity data center network architecture" (PDF). Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication. ACM. pp. 63–74. doi:10.1145/1402958.1402967. ISBN 978-1-60558-175-0. S2CID 65842.

Further reading