Alps (supercomputer)

Alps
Active: operational since 2024
Sponsors: Swiss Confederation
Operator: Swiss National Supercomputing Centre (CSCS)
Location: Lugano-Cornaredo, Switzerland
Architecture: HPE Cray EX254n; Nvidia GH200 Grace Hopper superchips combining 72-core Grace ARMv9 Neoverse V2 CPUs with Hopper H100 Tensor Core GPUs (1,305,600 cores in total)
Power: 10 MW under full load
Operating system: Linux
Memory: 144 terabytes (TB)
Speed: 270 PFLOPS (Rmax)
Ranking: TOP500: 6th, June 2024
Website: cscs.ch
Sources: "Nvidia GH200 Grace Hopper Superchip"

The Alps supercomputer is a high-performance computer funded by the Swiss Confederation through the ETH Domain, with its main location in Lugano. It is part of the Swiss National Supercomputing Centre (CSCS), which provides computing services for selected scientific customers. [1]

History

The Swiss National Supercomputing Centre (CSCS) was founded in 1991. The centre operates a user lab for computing services; past examples include the analysis of data from the Large Hadron Collider (LHC) at CERN, data storage for the X-ray laser SwissFEL of the Paul Scherrer Institute, and simulations for weather forecasts by MeteoSwiss. [2] These services have been provided over time by increasingly powerful computing systems. Since 2020, with the commissioning of the HPE Cray EX high-performance computer, the name Alps has been used for the new machines. On September 14, 2024, the latest supercomputer, Alps (HPE Cray EX254n), was inaugurated. Even beforehand, the planned performance of Alps was described as sufficient to train OpenAI's LLM GPT-3 in two days. [3]

The supercomputer is based on Nvidia's Grace Hopper GH200 integrated circuits (ICs) [4] [5] and achieves a performance of 270 petaflops (PFLOPS), that is, 270 quadrillion floating-point operations per second. In 2024 it ranked 6th on the TOP500 list of the world's fastest computers, although the in-house machines of Meta, Microsoft, Alphabet Inc./Google LLC, and Oracle are likely more powerful; their performance is not publicly known.

A panel of experts from various natural sciences decides who may use the new computer. A research collaboration between EPFL and the Yale Institute for Global Health has already been approved; this group used an open-source AI model from Meta and trained it on Alps with health data from medical research. With Alps, scientists in Switzerland gain an infrastructure for exploiting many possibilities of artificial intelligence (AI). The new supercomputer is used as part of the Swiss AI Initiative of ETH Zurich and EPFL.

Structure

Office building of CSCS in Lugano, Switzerland
Underground water-cooling distribution for the CSCS computers

To suitably house and operate modern supercomputers, a new data center building and an adjacent office building were constructed in Lugano-Cornaredo. The data center building has three floors. The lowest floor houses the basic infrastructure with the primary power and water distribution as well as an emergency power supply based on batteries. The computers, and in summer also the buildings, are cooled with lake water from Lake Lugano: 460 liters of cold lake water per second are drawn from a depth of 45 meters and supplied to the data center through 2.8 km long pipes, where a heat exchanger transfers heat from the computers' internal cooling circuit to the lake water. [6] The secondary distribution takes place on the middle floor using power distribution units, which allow flexible installation of the computers above. The computers themselves are located on the top floor. [7] The latest, highly parallel Alps supercomputer was delivered by Hewlett Packard Enterprise (HPE), which acquired the supercomputer specialist Cray as a subsidiary in 2019. It is installed on an area of 2,000 m². The total cost was about 100 million CHF.
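As a rough, illustrative estimate (assuming, hypothetically, that the entire 10 MW full-load heat output is rejected into the 460 L/s lake-water flow, with water's specific heat capacity of about 4.19 kJ/(kg K)), the cooling water would warm by only a few kelvin on its way through the heat exchanger:

\Delta T = \frac{P}{\dot{m}\, c_p} \approx \frac{10 \times 10^{6}\,\mathrm{W}}{460\,\mathrm{kg/s} \times 4186\,\mathrm{J/(kg\,K)}} \approx 5\,\mathrm{K}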

Electronics

Interior of the electronics cabinets of the Alps supercomputer

To achieve superior performance, central processing units (CPUs) and graphics processing units (GPUs), together with their associated memories (128 GB LPDDR5X RAM; 96 GB HBM3), [8] are placed in close proximity within a single Nvidia superchip package. The CPU, called Grace, provides 72 ARMv9 Neoverse V2 cores, which are RISC processors. The GPU, the Hopper H100 Tensor Core, comprises 132 streaming multiprocessors. [9] The combination of one Grace CPU and one Hopper GPU in a tightly coupled package is called the GH200 Grace Hopper superchip, named after the computer pioneer Grace Hopper. A total of 1,305,600 processor cores (CPU and GPU cores) are available in the Alps system. Data exchange between the 2,688 nodes takes place over an Ethernet-based network called Slingshot-11 at a rate of 200 Gbit/s. [10] [8] A single node is composed of four GH200 superchips in a Quad GH200 configuration. Each Quad GH200 node acts as a single NUMA system with 288 CPU cores and 4 GPUs; the Grace CPUs communicate through a cache-coherent interconnect, while the Hopper GPUs communicate through NVLink. [11]
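The following is a minimal, illustrative CUDA sketch (not part of the CSCS software stack) of how an application running on one node could enumerate the visible GPUs and test peer-to-peer access between them; on a Quad GH200 node one would expect it to report four H100 GPUs with peer access available between every pair.

    /* Hypothetical example: list the GPUs visible on one node and check
       peer-to-peer (NVLink) accessibility between each pair of devices. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
            fprintf(stderr, "No CUDA devices visible on this node.\n");
            return 1;
        }
        printf("GPUs on this node: %d\n", count);  /* expected: 4 on a Quad GH200 node */
        for (int i = 0; i < count; ++i) {
            struct cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("  GPU %d: %s, %zu MiB of device memory\n",
                   i, prop.name, (size_t)(prop.totalGlobalMem >> 20));
            for (int j = 0; j < count; ++j) {
                if (i == j) continue;
                int canAccess = 0;
                cudaDeviceCanAccessPeer(&canAccess, i, j);
                printf("    peer access %d -> %d: %s\n", i, j, canAccess ? "yes" : "no");
            }
        }
        return 0;
    }

Compiled with nvcc and run under the node's batch scheduler, such a probe merely confirms the node topology described above.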

Operation

A CSCS team develops specialized software for the various applications. At full load the computer consumes 10 MW of power; the electricity costs are estimated at around 15 million CHF per year.
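As a rough plausibility check (an illustrative assumption, not a published figure: continuous operation at the full 10 MW load over a whole year, which overstates real consumption), the stated cost estimate corresponds to an electricity price on the order of 0.17 CHF per kilowatt-hour:

E \approx 10\,\mathrm{MW} \times 8760\,\mathrm{h} = 87.6\,\mathrm{GWh}, \qquad \frac{15 \times 10^{6}\,\mathrm{CHF}}{87.6\,\mathrm{GWh}} \approx 0.17\,\mathrm{CHF/kWh}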

Related Research Articles

Supercomputer: Type of extremely powerful computer

A supercomputer is a type of computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instructions per second (MIPS). Since 2022, supercomputers have existed which can perform over 10^18 FLOPS, so-called exascale supercomputers. For comparison, a desktop computer has performance in the range of hundreds of gigaFLOPS (10^11) to tens of teraFLOPS (10^13). Since November 2017, all of the world's fastest 500 supercomputers have run on Linux-based operating systems. Additional research is being conducted in the United States, the European Union, Taiwan, Japan, and China to build faster, more powerful and technologically superior exascale supercomputers.

Floating point operations per second (FLOPS) is a measure of computer performance, useful in fields of scientific computation that require floating-point calculations.

Cray Inc., a subsidiary of Hewlett Packard Enterprise, is an American supercomputer manufacturer headquartered in Seattle, Washington. It also manufactures systems for data storage and analytics. Several Cray supercomputer systems are listed in the TOP500, which ranks the most powerful supercomputers in the world.

TOP500: Database project devoted to the ranking of computers

The TOP500 project ranks and details the 500 most powerful non-distributed computer systems in the world. The project was started in 1993 and publishes an updated list of the supercomputers twice a year. The first of these updates always coincides with the International Supercomputing Conference in June, and the second is presented at the ACM/IEEE Supercomputing Conference in November. The project aims to provide a reliable basis for tracking and detecting trends in high-performance computing and bases rankings on HPL benchmarks, a portable implementation of the high-performance LINPACK benchmark written in Fortran for distributed-memory computers.

The Green500 is a biannual ranking of supercomputers, from the TOP500 list of supercomputers, in terms of energy efficiency. The list measures performance per watt using the TOP500 measure of high performance LINPACK benchmarks at double-precision floating-point format.

The National Center for Computational Sciences (NCCS) is a United States Department of Energy (DOE) Leadership Computing Facility that houses the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility charged with helping researchers solve challenging scientific problems of global interest with a combination of leading high-performance computing (HPC) resources and international expertise in scientific computing.

This list compares various amounts of computing power in instructions per second organized by order of magnitude in FLOPS.

Brutus is the central high-performance cluster of ETH Zurich. It was introduced to the public in May 2008. A new computing cluster called EULER was announced and opened to the public in May 2014.

The Swiss National Supercomputing Centre is the national high-performance computing centre of Switzerland. It was founded in Manno, canton Ticino, in 1991. In March 2012, the CSCS moved to its new location in Lugano-Cornaredo.

Supercomputing in Europe: Overview of supercomputing in Europe

Several centers for supercomputing exist across Europe, and distributed access to them is coordinated by European initiatives to facilitate high-performance computing. One such initiative, the HPC Europa project, fits within the Distributed European Infrastructure for Supercomputing Applications (DEISA), which was formed in 2002 as a consortium of eleven supercomputing centers from seven European countries. Operating within the CORDIS framework, HPC Europa aims to provide access to supercomputers across Europe.

Titan (supercomputer): American supercomputer

Titan or OLCF-3 was a supercomputer built by Cray at Oak Ridge National Laboratory for use in a variety of science projects. Titan was an upgrade of Jaguar, a previous supercomputer at Oak Ridge, that used graphics processing units (GPUs) in addition to conventional central processing units (CPUs). Titan was the first such hybrid to perform over 10 petaFLOPS. The upgrade began in October 2011, commenced stability testing in October 2012, and the system became available to researchers in early 2013. The initial cost of the upgrade was US$60 million, funded primarily by the United States Department of Energy.

XK7 is a supercomputing platform, produced by Cray, launched on October 29, 2012. XK7 is the second platform from Cray to use a combination of central processing units ("CPUs") and graphical processing units ("GPUs") for computing; the hybrid architecture requires a different approach to programming to that of CPU-only supercomputers. Laboratories that host XK7 machines host workshops to train researchers in the new programming languages needed for XK7 machines. The platform is used in Titan, the world's second fastest supercomputer in the November 2013 list as ranked by the TOP500 organization. Other customers include the Swiss National Supercomputing Centre which has a 272 node machine and Blue Waters has a machine that has Cray XE6 and XK7 nodes that performs at approximately 1 petaFLOPS (10^15 floating-point operations per second).

The Cray XC30 is a massively parallel multiprocessor supercomputer manufactured by Cray. It consists of Intel Xeon processors, with optional Nvidia Tesla or Xeon Phi accelerators, connected together by Cray's proprietary "Aries" interconnect, stored in air-cooled or liquid-cooled cabinets. Each liquid-cooled cabinet can contain up to 48 blades, each with eight CPU sockets, and uses 90 kW of power. The XC series supercomputers are available with the Cray DataWarp applications I/O accelerator technology.

Cray XC40: Supercomputer manufactured by Cray

The Cray XC40 is a massively parallel multiprocessor supercomputer manufactured by Cray. It consists of Intel Haswell Xeon processors, with optional Nvidia Tesla or Intel Xeon Phi accelerators, connected together by Cray's proprietary "Aries" interconnect, stored in air-cooled or liquid-cooled cabinets. The XC series supercomputers are available with the Cray DataWarp applications I/O accelerator technology.

Nvidia DGX: Line of Nvidia produced servers and workstations

The Nvidia DGX represents a series of servers and workstations designed by Nvidia, primarily geared towards enhancing deep learning applications through the use of general-purpose computing on graphics processing units (GPGPU). These systems typically come in a rackmount format featuring high-performance x86 server CPUs on the motherboard.

Piz Daint is a supercomputer in the Swiss National Supercomputing Centre, named after the mountain Piz Daint in the Swiss Alps.

Frontier (supercomputer): American supercomputer

Hewlett Packard Enterprise Frontier, or OLCF-5, is the world's first exascale supercomputer. It is hosted at the Oak Ridge Leadership Computing Facility (OLCF) in Tennessee, United States and became operational in 2022. As of November 2024, Frontier is the second fastest supercomputer in the world. It is based on the Cray EX and is the successor to Summit (OLCF-4). Frontier achieved an Rmax of 1.102 exaFLOPS, which is 1.102 quintillion floating-point operations per second, using AMD CPUs and GPUs.

Hopper (microarchitecture): GPU microarchitecture designed by Nvidia

Hopper is a graphics processing unit (GPU) microarchitecture developed by Nvidia. It is designed for datacenters and is used alongside the Lovelace microarchitecture. It is the latest generation of the line of products formerly branded as Nvidia Tesla, now Nvidia Data Centre GPUs.

LUMI: Supercomputer in Finland

LUMI is a petascale supercomputer located at the CSC data center in Kajaani, Finland. As of January 2023, the computer is the fastest supercomputer in Europe.

Selene is a supercomputer developed by Nvidia, capable of achieving 63.460 petaFLOPS, ranking as the fifth-fastest supercomputer in the world when it entered the list. Selene is based on the Nvidia DGX system, consisting of AMD CPUs, Nvidia A100 GPUs, and Mellanox HDR networking. Selene is based on the Nvidia DGX Superpod, which is a high performance turnkey supercomputer solution provided by Nvidia using DGX hardware. DGX Superpod is a tightly integrated system that combines high performance DGX compute nodes with fast storage and high bandwidth networking. It aims to provide a turnkey solution to high-demand machine learning workloads. Selene was built in three months and is the fastest industrial system in the US while being the second-most energy-efficient supercomputing system ever.

References

  1. Gioia da Silva: "ETH weiht einen der modernsten KI-Supercomputer der Welt ein" (in German). Neue Zürcher Zeitung, 14 September 2024. Retrieved 26 September 2024.
  2. "About CSCS". cscs.ch. Retrieved 26 September 2024.
  3. "Alps system to advance research across climate, physics, life sciences with 7x more powerful AI capabilities than current world-leading system for AI on MLPerf". nvidia.com, 12 April 2021. Retrieved 26 September 2024.
  4. Benedikt Schwan: "Nvidia: Die KI aus dem Monstercomputer" (in German). Zeit Online, 1 June 2023. Retrieved 26 September 2024.
  5. "Neue Forschungsinfrastruktur: 'Alps' Supercomputer eingeweiht" (in German). ETH Zürich, 14 September 2024. Retrieved 26 September 2024.
  6. "Lake water to cool supercomputers". cscs.ch, 2015. Retrieved 26 September 2024.
  7. "Innovative new building for CSCS in Lugano". cscs.ch, 2015. Retrieved 26 September 2024.
  8. "Alps: System Specification". cscs.ch. Retrieved 1 October 2024.
  9. "Datasheet: NVIDIA GH200 Grace Hopper Superchip". nvidia.com. Retrieved 30 September 2024.
  10. "Alps". TOP500, top500.org. Retrieved 30 September 2024.
  11. Fusco, Luigi; et al.: "Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip". arXiv:2408.11556.