TOSS (operating system)

Last updated
Tri-Lab Operating System Stack
OS family Unix-like
Working stateCurrent
Package manager RPM Package Manager [1]

The Tri-Lab Operating System Stack (TOSS) is a Linux distribution based on Red Hat Enterprise Linux (RHEL) that was created to provide a software stack [1] for high performance computing (HPC) clusters [2] for laboratories within the National Nuclear Security Administration (NNSA). [3] The operating system allows multiple smaller systems to emulate a high-performance computing (HPC) platform. [1]

Linux distribution

The name "tri-lab" refers to the three major NNSA labs, the Lawrence Livermore National Laboratory, the Los Alamos National Laboratory, and the Sandia National Laboratories. [4]

The OS is used by NNSA computers including the El Capitan supercomputer [5] and systems using ARM architecture including the ThunderX2 system on a chip (SoC). [6] In addition to being used by the National Nuclear Security Administration (NNSA), [2] most of the systems in NASA's High-End Computing Capabiity Project, part of the NASA Advanced Supercomputing Division, were all migrated to TOSS in March 2022. [7]

Many of the software packages included in TOSS are from the RHEL repository. Additional packages are built using Fedora's Koji build system to create RPM packages. [1] The system also uses SLURM and Flux scheduling and resource management software. [1]

Related Research Articles

<span class="mw-page-title-main">Supercomputer</span> Type of extremely powerful computer

A supercomputer is a computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instructions per second (MIPS). Since 2017, there have existed supercomputers which can perform over 1017 FLOPS (a hundred quadrillion FLOPS, 100 petaFLOPS or 100 PFLOPS). For comparison, a desktop computer has performance in the range of hundreds of gigaFLOPS (1011) to tens of teraFLOPS (1013). Since November 2017, all of the world's fastest 500 supercomputers run on Linux-based operating systems. Additional research is being conducted in the United States, the European Union, Taiwan, Japan, and China to build faster, more powerful and technologically superior exascale supercomputers.

Lawrence Livermore National Laboratory (LLNL) is a federal research facility in Livermore, California, United States. The lab was originally established as the University of California Radiation Laboratory, Livermore Branch in 1952 in response to the detonation of the first atomic bomb by the Soviet Union during the Cold War. It later became autonomous in 1971 and was designated a national laboratory in 1981.

<span class="mw-page-title-main">FLOPS</span> Measure of computer performance

In computing, floating point operations per second is a measure of computer performance, useful in fields of scientific computations that require floating-point calculations. For such cases, it is a more accurate measure than measuring instructions per second.

<span class="mw-page-title-main">Sandia National Laboratories</span> United States research lab

Sandia National Laboratories (SNL), also known as Sandia, is one of three research and development laboratories of the United States Department of Energy's National Nuclear Security Administration (NNSA). Headquartered in Kirtland Air Force Base in Albuquerque, New Mexico, it has a second principal facility next to Lawrence Livermore National Laboratory in California and a test facility in Waimea, Kauai, Hawaii. Sandia is owned by the U.S. federal government but privately managed and operated by National Technology and Engineering Solutions of Sandia, a wholly owned subsidiary of Honeywell International.

<span class="mw-page-title-main">High-performance computing</span> Computing with supercomputers and clusters

High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems.

<span class="mw-page-title-main">Advanced Simulation and Computing Program</span>

The Advanced Simulation and Computing Program is a super-computing program run by the National Nuclear Security Administration, in order to simulate, test, and maintain the United States nuclear stockpile. The program was created in 1995 in order to support the Stockpile Stewardship Program. The goal of the initiative is to extend the lifetime of the current aging stockpile.

<span class="mw-page-title-main">Edinburgh Parallel Computing Centre</span> Supercomputing centre at the University of Edinburgh

EPCC, formerly the Edinburgh Parallel Computing Centre, is a supercomputing centre based at the University of Edinburgh. Since its foundation in 1990, its stated mission has been to accelerate the effective exploitation of novel computing throughout industry, academia and commerce.

<span class="mw-page-title-main">National Energy Research Scientific Computing Center</span> Supercomputer facility operated by the US Department of Energy in Berkeley, California

The National Energy Research Scientific Computing Center (NERSC), is a high-performance computing (supercomputer) National User Facility operated by Lawrence Berkeley National Laboratory for the United States Department of Energy Office of Science. As the mission computing center for the Office of Science, NERSC houses high performance computing and data systems used by 9,000 scientists at national laboratories and universities around the country. NERSC's newest and largest supercomputer is Perlmutter, which debuted in 2021 ranked 5th on the TOP500 list of world's fastest supercomputers.

<span class="mw-page-title-main">Roadrunner (supercomputer)</span>

Roadrunner was a supercomputer built by IBM for the Los Alamos National Laboratory in New Mexico, USA. The US$100-million Roadrunner was designed for a peak performance of 1.7 petaflops. It achieved 1.026 petaflops on May 25, 2008, to become the world's first TOP500 LINPACK sustained 1.0 petaflops system.

<span class="mw-page-title-main">The Portland Group</span> American technology company

PGI was a company that produced a set of commercially available Fortran, C and C++ compilers for high-performance computing systems. On July 29, 2013, Nvidia acquired The Portland Group, Inc. As of August 5, 2020, the "PGI Compilers and Tools" technology is a part of the Nvidia HPC SDK product available as a free download from Nvidia.

<span class="mw-page-title-main">TOP500</span> Database project devoted to the ranking of computers

The TOP500 project ranks and details the 500 most powerful non-distributed computer systems in the world. The project was started in 1993 and publishes an updated list of the supercomputers twice a year. The first of these updates always coincides with the International Supercomputing Conference in June, and the second is presented at the ACM/IEEE Supercomputing Conference in November. The project aims to provide a reliable basis for tracking and detecting trends in high-performance computing and bases rankings on HPL, a portable implementation of the high-performance LINPACK benchmark written in Fortran for distributed-memory computers.

<span class="mw-page-title-main">Sequoia (supercomputer)</span>

IBM Sequoia was a petascale Blue Gene/Q supercomputer constructed by IBM for the National Nuclear Security Administration as part of the Advanced Simulation and Computing Program (ASC). It was delivered to the Lawrence Livermore National Laboratory (LLNL) in 2011 and was fully deployed in June 2012. Sequoia was dismantled in 2020, its last position on the top500.org list was #22 in the November 2019 list.

Exascale computing refers to computing systems capable of calculating at least "1018 IEEE 754 Double Precision (64-bit) operations (multiplications and/or additions) per second (exaFLOPS)"; it is a measure of supercomputer performance.

A lightweight kernel (LWK) operating system is one used in a large computer with many processor cores, termed a parallel computer.

<span class="mw-page-title-main">Supercomputer operating system</span>

A supercomputer operating system is an operating system intended for supercomputers. Since the end of the 20th century, supercomputer operating systems have undergone major transformations, as fundamental changes have occurred in supercomputer architecture. While early operating systems were custom tailored to each supercomputer to gain speed, the trend has been moving away from in-house operating systems and toward some form of Linux, with it running all the supercomputers on the TOP500 list in November 2017. In 2021, top 10 computers run for instance Red Hat Enterprise Linux (RHEL), or some variant of it or other Linux distribution e.g. Ubuntu.

<span class="mw-page-title-main">Appro</span> American technology company

Appro was a developer of supercomputing supporting High Performance Computing (HPC) markets focused on medium- to large-scale deployments. Appro was based in Milpitas, California with a computing center in Houston, Texas, and a manufacturing and support subsidiary in South Korea and Japan.

<span class="mw-page-title-main">PSSC Labs</span>

PSSC Labs is a California-based company that provides supercomputing solutions in the United States and internationally. Its products include "high-performance" servers, clusters, workstations, and RAID storage systems for scientific research, government and military, entertainment content creators, developers, and private clouds. The company has implemented clustering software from NASA Goddard's Beowulf project in its supercomputers designed for bioinformatics, medical imaging, computational chemistry and other scientific applications.

<span class="mw-page-title-main">Frontier (supercomputer)</span> American supercomputer

Hewlett Packard Enterprise Frontier, or OLCF-5, is the world's first exascale supercomputer, hosted at the Oak Ridge Leadership Computing Facility (OLCF) in Tennessee, United States and first operational in 2022. It is based on the Cray EX and is the successor to Summit (OLCF-4). As of June 2022, Frontier was the world's fastest supercomputer using AMD CPUs and GPUs. Frontier achieved an Rmax of 1.102 exaFLOPS, which is 1.102 quintillion operations per second. Measured at 62.68 gigaflops/watt, Frontier topped the Green500 list for most efficient supercomputer, until it was dethroned by Flatiron Institute's Henri supercomputer in November 2022.

ROCm is an Advanced Micro Devices (AMD) software stack for graphics processing unit (GPU) programming. ROCm spans several domains: general-purpose computing on graphics processing units (GPGPU), high performance computing (HPC), heterogeneous computing. It offers several programming models: HIP, OpenMP/Message Passing Interface (MPI), OpenCL.

JUWELS is a supercomputer developed by Atos Forschungszentrum Jülich, capable of 70.980 petaflops. It replaced the now disused JUQUEEN supercomputer. JUWELS Booster Module is ranked as the eight fastest supercomputer in the world. The JUWELS Booster Module is part of a modular system architecture and a second Xeon based JUWELS Module ranks separately as the 52nd fastest supercomputer in the world.

References

  1. 1 2 3 4 5 de Supinski, Bronis R. (August 29, 2019). The LLNL Near and Long Term Vision for Large-Scale Systems (PDF) (Report). Retrieved August 28, 2022.
  2. 1 2 "TOSS: Speeding Up Commodity Cluster Computing". Lawrence Livermore National Laboratory . Retrieved August 28, 2022.
  3. León, Edgar A.; D'Hooge, Trent; Hanford, Nathan; Karlin, Ian; Pankajakshan, Ramesh; Foraker, Jim; Chambreau, Chris; Leininger, Matthew L. (November 2020). TOSS-2020: a commodity software stack for HPC. SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Atlanta, Georgia. pp. 1–15. ISBN   978-1-7281-9998-6. OCLC   1223541587 via IEEE.
  4. Morgan, Timothy Prickett (November 26, 2018). "One Linux stack to rule HPC and AI". NextPlatform.com. Retrieved August 28, 2022.
  5. Степин, Алексей (June 23, 2022). "2-Эфлопс cуперкомпьютер El Capitan получит новейшие APU AMD MI300" [2-Eflops El Capitan supercomputer will receive the latest AMD MI300 APUs]. ServerNews.ru (in Russian). Retrieved August 29, 2022. В El Capitan лаборатория перейдет от использования проприетарного системного и управляющего ПО к собственному стеку NNSA Tri-Lab Operating System Stack (TOSS).[At El Capitan, the laboratory will move from using proprietary system and management software to its own NNSA Tri-Lab Operating System Stack (TOSS).]
  6. Feldman, Michael (June 18, 2018). "Sandia to Install First Petascale Supercomputer Powered by ARM Processors". Top500 . Retrieved August 29, 2022.
  7. "Migration to TOSS Operating System". NASA . July 20, 2022. Retrieved August 28, 2022.