High-throughput computing

In computer science, high-throughput computing (HTC) is the use of many computing resources over long periods of time to accomplish a computational task.

Challenges

The HTC community is also concerned with the robustness and reliability of jobs over long time scales: that is, with building a reliable system out of unreliable components. This research is similar to transaction processing, but at a much larger and more distributed scale.
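
One common way to build that reliability is to treat every job as retryable. The sketch below is a generic illustration rather than the recovery mechanism of any particular HTC system: run_job is a hypothetical stand-in for dispatching a job and checking its exit status, and the jobs are assumed to be idempotent, so resubmitting a failure is always safe.

    import random
    import time

    def run_job(job_id: int) -> bool:
        """Hypothetical stand-in for dispatching one job to an
        unreliable worker; here we simply simulate a 20% failure rate."""
        return random.random() > 0.2

    def run_reliably(job_ids, max_attempts=5):
        """Build reliable completion from unreliable components by
        resubmitting failed jobs until they succeed (jobs must be
        idempotent for this to be safe)."""
        pending = list(job_ids)
        for attempt in range(1, max_attempts + 1):
            pending = [j for j in pending if not run_job(j)]
            if not pending:
                return True
            time.sleep(0.5 * attempt)  # back off before resubmitting
        return False  # some jobs exhausted their attempts

    if __name__ == "__main__":
        print("all jobs completed:", run_reliably(range(100)))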

Some HTC systems, such as HTCondor and PBS, can run tasks on opportunistic resources. Operating in this environment is a difficult problem, however: the system needs to provide a reliable operating environment for the user's jobs, but at the same time it must not compromise the integrity of the execute node, and it must allow the owner to retain full control of their resources.
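
A typical workload for such a system is a large batch of independent jobs described in a submit file. The following is a minimal sketch of an HTCondor submit description; the executable name, file layout, and job count are hypothetical, and real submissions usually add resource requirements as well.

    # Sketch of an HTCondor submit description file (illustrative only;
    # the "analyze" executable and input naming scheme are hypothetical).
    executable = analyze
    arguments  = input_$(Process).dat
    output     = out/job_$(Process).out
    error      = out/job_$(Process).err
    log        = batch.log

    # Queue 1000 independent jobs; $(Process) takes the values 0-999.
    queue 1000

Each queued job is self-contained, so the scheduler is free to place it on whatever opportunistic resource happens to be idle.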

High-throughput vs. high-performance vs. many-task

There are many differences between high-throughput computing, high-performance computing (HPC), and many-task computing (MTC).

HPC tasks are characterized as needing large amounts of computing power for short periods of time, whereas HTC tasks also require large amounts of computing power, but for much longer times (months and years, rather than hours and days).[1][2] HPC environments are often measured in terms of FLOPS (floating-point operations per second).

The HTC community, however, is not concerned with operations per second, but rather with operations per month or per year. The HTC field is therefore more interested in how many jobs can be completed over a long period of time than in how fast any individual job completes.
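
The contrast is easy to make concrete with back-of-the-envelope arithmetic; the numbers below are purely illustrative.

    # Back-of-the-envelope HTC throughput (all numbers hypothetical).
    cores = 1000               # cores available in the pool
    cpu_hours_per_job = 10     # cost of a single job
    hours_per_year = 24 * 365  # 8,760

    jobs_per_year = cores * hours_per_year // cpu_hours_per_job
    print(f"{jobs_per_year:,} jobs per year")  # 876,000 jobs per year

By this yardstick the pool's peak FLOPS rating is beside the point; what matters is how many jobs it retires over the year.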

As an alternative definition, the European Grid Infrastructure defines HTC as "a computing paradigm that focuses on the efficient execution of a large number of loosely-coupled tasks",[3] while HPC systems tend to focus on tightly coupled parallel jobs, which must therefore execute within a single site with low-latency interconnects. Conversely, HTC workloads consist of independent, sequential jobs that can be individually scheduled on many different computing resources across multiple administrative boundaries. HTC systems achieve this using various grid computing technologies and techniques.
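
That loose coupling is what permits independent scheduling: each task reads its own input and writes its own output, never communicating with its peers. Below is a minimal single-machine sketch of the pattern using Python's standard library; the process_file work function is hypothetical, standing in for whatever per-task computation a real workload performs.

    from concurrent.futures import ProcessPoolExecutor, as_completed

    def process_file(path: str) -> str:
        """Hypothetical work function: a self-contained task that
        touches only its own input and output."""
        return f"processed {path}"

    if __name__ == "__main__":
        inputs = [f"input_{i}.dat" for i in range(100)]
        # Because the tasks never communicate, they could just as well
        # be scattered across many sites; here a local process pool
        # plays the role of the distributed resources.
        with ProcessPoolExecutor() as pool:
            futures = [pool.submit(process_file, p) for p in inputs]
            for done in as_completed(futures):
                print(done.result())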

MTC aims to bridge the gap between HTC and HPC. MTC is reminiscent of HTC, but it differs in its emphasis on using many computing resources over short periods of time to accomplish many computational tasks (i.e. both dependent and independent tasks), where the primary metrics are measured in seconds (e.g. FLOPS, tasks/s, MB/s I/O rates), as opposed to operations (e.g. jobs) per month. MTC denotes high-performance computations comprising multiple distinct activities, coupled via file system operations.

Related Research Articles

Supercomputer

A supercomputer is a computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instructions per second (MIPS). Since 2017, there have existed supercomputers which can perform over 10¹⁷ FLOPS (a hundred quadrillion FLOPS, 100 petaFLOPS or 100 PFLOPS). For comparison, a desktop computer has performance in the range of hundreds of gigaFLOPS (10¹¹) to tens of teraFLOPS (10¹³). Since November 2017, all of the world's fastest 500 supercomputers run on Linux-based operating systems. Additional research is being conducted in the United States, the European Union, Taiwan, Japan, and China to build faster, more powerful and technologically superior exascale supercomputers.

Computerized batch processing is a method of running software programs called jobs in batches automatically. While users are required to submit the jobs, no other interaction by the user is required to process the batch. Batches may automatically be run at scheduled times as well as being run contingent on the availability of computer resources.

Grid computing is the use of widely distributed computer resources to reach a common goal. A computing grid can be thought of as a distributed system with non-interactive workloads that involve many files. Grid computing is distinguished from conventional high-performance computing systems such as cluster computing in that grid computers have each node set to perform a different task/application. Grid computers also tend to be more heterogeneous and geographically dispersed than cluster computers. Although a single grid can be dedicated to a particular application, commonly a grid is used for a variety of purposes. Grids are often constructed with general-purpose grid middleware software libraries. Grid sizes can be quite large.

High-performance computing

High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems.

HTCondor is an open-source high-throughput computing software framework for coarse-grained distributed parallelization of computationally intensive tasks. It can be used to manage workload on a dedicated cluster of computers, or to farm out work to idle desktop computers – so-called cycle scavenging. HTCondor runs on Linux, Unix, Mac OS X, FreeBSD, and Microsoft Windows operating systems. HTCondor can integrate both dedicated resources and non-dedicated desktop machines into one computing environment.

UNICORE (UNiform Interface to COmputing REsources) is a grid computing technology for resources such as supercomputers or cluster systems and information stored in databases. UNICORE was developed in two projects funded by the German Ministry for Education and Research (BMBF). In European-funded projects, UNICORE evolved into a middleware system used at several supercomputer centers, and it has served as a basis for other research projects. The UNICORE technology is open source under the BSD licence and available at SourceForge.

Edinburgh Parallel Computing Centre

EPCC, formerly the Edinburgh Parallel Computing Centre, is a supercomputing centre based at the University of Edinburgh. Since its foundation in 1990, its stated mission has been to accelerate the effective exploitation of novel computing throughout industry, academia and commerce.

Open Grid Forum

The Open Grid Forum (OGF) is a community of users, developers, and vendors for standardization of grid computing. It was formed in 2006 in a merger of the Global Grid Forum and the Enterprise Grid Alliance. The OGF models its process on the Internet Engineering Task Force (IETF), and produces documents with many acronyms such as OGSA, OGSI, and JSDL.

Miron Livny

Miron Livny is a senior researcher and professor specializing in distributed computing at the University of Wisconsin–Madison. Livny has been a professor of computer science at Wisconsin since 1983, where he leads the HTCondor high-throughput computing system project. He is also a principal investigator and currently the facility coordinator for the Open Science Grid project, Director of the Center for High Throughput Computing, CTO of the Wisconsin Institutes for Discovery, and Director of Core Computational Technology of the Morgridge Institute for Research.

The Oak Ridge Leadership Computing Facility (OLCF), formerly the National Leadership Computing Facility, is a designated user facility operated by Oak Ridge National Laboratory and the Department of Energy. It contains several supercomputers, the largest of which is an HPE OLCF-5 named Frontier, which was ranked 1st on the TOP500 list of world's fastest supercomputers as of June 2022. It is located in Oak Ridge, Tennessee.

Univa

Univa was a software company that developed workload management and cloud management products for compute-intensive applications in the data center and across public, private, and hybrid clouds, before being acquired by Altair Engineering in September 2020.

In computing, performance per watt is a measure of the energy efficiency of a particular computer architecture or computer hardware. Literally, it measures the rate of computation that can be delivered by a computer for every watt of power consumed. When comparing computing systems, this rate is typically measured by performance on the LINPACK benchmark; an example using this is the Green500 list of supercomputers. Performance per watt has been suggested to be a more sustainable measure of computing than Moore's Law.

Computer cluster

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.

Many-task computing (MTC) in computational science is an approach to parallel computing that aims to bridge the gap between two computing paradigms: high-throughput computing (HTC) and high-performance computing (HPC).

Techila Distributed Computing Engine is a commercial grid computing software product. It speeds up simulation, analysis and other computational applications by enabling scalability across the IT resources in the user's on-premises data center and in the user's own cloud account. Techila Distributed Computing Engine is developed and licensed by Techila Technologies Ltd, a privately held company headquartered in Tampere, Finland. The product is also available as an on-demand solution in Google Cloud Launcher, the online marketplace created and operated by Google. According to IDC, the solution enables organizations to create HPC infrastructure without the major capital investments and operating expenses required by new HPC hardware.

Quasi-opportunistic supercomputing

Quasi-opportunistic supercomputing is a computational paradigm for supercomputing on a large number of geographically dispersed computers. Quasi-opportunistic supercomputing aims to provide a higher quality of service than opportunistic resource sharing.

History of computer clusters

The history of computer clusters is best captured by a footnote in Greg Pfister's In Search of Clusters: “Virtually every press release from DEC mentioning clusters says ‘DEC, who invented clusters...’. IBM did not invent them either. Customers invented clusters, as soon as they could not fit all their work on one computer, or needed a backup. The date of the first is unknown, but it would be surprising if it was not in the 1960s, or even late 1950s.”

Cycle Computing is a company that provides software for orchestrating computing and storage resources in cloud environments. The flagship product is CycleCloud, which supports Amazon Web Services, Google Compute Engine, Microsoft Azure, and internal infrastructure. The CycleCloud orchestration suite manages the provisioning of cloud infrastructure, orchestration of workflow execution and job queue management, automated and efficient data placement, full process monitoring and logging, within a secure process flow.

Computation offloading is the transfer of resource-intensive computational tasks to a separate processor, such as a hardware accelerator, or an external platform, such as a cluster, grid, or cloud. Offloading to a coprocessor can be used to accelerate applications including image rendering and mathematical calculations. Offloading computing to an external platform over a network can provide computing power and overcome hardware limitations of a device, such as limited computational power, storage, and energy.

References

  1. Beck, Alan (1997-06-27). "High Throughput Computing: An Interview with Miron Livny". HPCWire. (Original link broken.)
  2. "High Throughput Computing: An Interview with Miron Livny" (backup link).
  3. "EGI Glossary V1". European Grid Infrastructure.