Many-task computing

Many-task computing (MTC) [1] [2] [3] [4] [5] [6] [7] in computational science is an approach to parallel computing that aims to bridge the gap between two computing paradigms: high-throughput computing (HTC) [8] and high-performance computing (HPC).

Definition

MTC is reminiscent of HTC, but it "differs in the emphasis of using many computing resources over short periods of time to accomplish many computational tasks (i.e. including both dependent and independent tasks), where the primary metrics are measured in seconds (e.g. FLOPS, tasks/s, MB/s I/O rates), as opposed to operations (e.g. jobs) per month. MTC denotes high-performance computations comprising multiple distinct activities, coupled via file system operations. Tasks may be small or large, uniprocessor or multiprocessor, compute-intensive or data-intensive. The set of tasks may be static or dynamic, homogeneous or heterogeneous, loosely coupled or tightly coupled. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large. MTC includes loosely coupled applications that are generally communication-intensive but not naturally expressed using standard message passing interface commonly found in HPC, drawing attention to the many computations that are heterogeneous but not 'happily' parallel." [6]

Raicu et al. further state: "There is more to HPC than tightly coupled MPI, and more to HTC than embarrassingly parallel long running jobs. Like HPC applications, and science itself, applications are becoming increasingly complex opening new doors for many opportunities to apply HPC in new ways if we broaden our perspective. Some applications have just so many simple tasks that managing them is hard. Applications that operate on or produce large amounts of data need sophisticated data management in order to scale. There exist applications that involve many tasks, each composed of tightly coupled MPI tasks. Loosely coupled applications often have dependencies among tasks, and typically use files for inter-process communication. Efficient support for these sorts of applications on existing large scale systems will involve substantial technical challenges and will have big impact on science." [6]
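As an illustration only, the pattern described in the quotations above (many short tasks, some dependent on others, coupled through files, with throughput reported in tasks per second) can be sketched in a few lines of Python. This is a toy sketch and not how Falkon, Swift, or any other MTC system is implemented: the function names `simulate`, `combine`, and `run_workload`, the file names, and the use of a local thread pool in place of a cluster are all assumptions made for the example.

```python
from __future__ import annotations

import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def simulate(workdir: Path, index: int) -> Path:
    """Independent short task (hypothetical): write one result file."""
    out = workdir / f"part_{index}.txt"
    out.write_text(str(index * index))
    return out


def combine(workdir: Path, parts: list[Path]) -> Path:
    """Dependent task (hypothetical): read the files the earlier tasks wrote."""
    total = sum(int(p.read_text()) for p in parts)
    out = workdir / "total.txt"
    out.write_text(str(total))
    return out


def run_workload(num_tasks: int, workers: int = 8) -> tuple[int, float]:
    """Run num_tasks independent tasks followed by one dependent task.

    Returns the combined result and the measured rate in tasks per second.
    """
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Stage 1: many independent tasks, each producing a file.
            parts = list(pool.map(lambda i: simulate(workdir, i), range(num_tasks)))
            # Stage 2: a task that depends on every stage-1 output file.
            total_file = pool.submit(combine, workdir, parts).result()
        elapsed = time.perf_counter() - start
        total = int(total_file.read_text())
    return total, (num_tasks + 1) / elapsed


if __name__ == "__main__":
    total, rate = run_workload(1000)
    print(f"combined result = {total}, throughput = {rate:.0f} tasks/s")
```

The tasks here exchange data only through files in a shared directory, and the reported metric is a per-second rate rather than jobs per month, which mirrors the emphasis in the definition. A real MTC workload would dispatch such tasks across many nodes and a shared or parallel file system.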

Related areas include multiple program multiple data (MPMD), high-throughput computing (HTC), workflows, capacity computing, and embarrassingly parallel computing. Projects that could support MTC workloads include Condor, [9] MapReduce, [10] Hadoop, [11] BOINC, [12] Cobalt HTC-mode, [13] Falkon, [14] and Swift. [15] [16]

References

  1. IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08) 2008, http://datasys.cs.iit.edu/events/MTAGS08/
  2. ACM Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS09) 2009, http://datasys.cs.iit.edu/events/MTAGS09/
  3. IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS10) 2010, http://datasys.cs.iit.edu/events/MTAGS10/
  4. ACM Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS11) 2011, http://datasys.cs.iit.edu/events/MTAGS11/
  5. IEEE Transactions on Parallel and Distributed Systems, Special Issue on Many-Task Computing, June 2011, http://datasys.cs.iit.edu/events/TPDS_MTC/
  6. I. Raicu, I. Foster, Y. Zhao. "Many-Task Computing for Grids and Supercomputers", IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08), 2008
  7. "Many Task Computing: Bridging the performance-throughput gap", International Science Grid This Week (iSGTW), January 28th, 2009, http://www.isgtw.org/?pid=1001602 Archived 2011-01-01 at the Wayback Machine
  8. M. Livny, J. Basney, R. Raman, T. Tannenbaum. "Mechanisms for High Throughput Computing," SPEEDUP Journal 1(1), 1997
  9. D. Thain, T. Tannenbaum, M. Livny. "Distributed Computing in Practice: The Condor Experience," Concurrency and Computation: Practice and Experience 17(2-4), pp. 323-356, 2005
  10. J. Dean, S. Ghemawat. "MapReduce: Simplified data processing on large clusters." In OSDI, 2004
  11. A. Bialecki, M. Cafarella, D. Cutting, O. O'Malley. "Hadoop: A Framework for Running Applications on Large Clusters Built of Commodity Hardware," http://lucene.apache.org/hadoop/ (archived 2007-02-10 at the Wayback Machine), 2005
  12. D.P. Anderson, "BOINC: A System for Public-Resource Computing and Storage," IEEE/ACM International Workshop on Grid Computing, 2004
  13. IBM Corporation. "High-Throughput Computing (HTC) Paradigm," IBM System Blue Gene Solution: Blue Gene/P Application Development, IBM RedBooks, 2008
  14. I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, M. Wilde. "Falkon: A Fast and Lightweight Task Execution Framework," IEEE/ACM SC, 2007
  15. Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. Laszewski, I. Raicu, T. Stef-Praun, M. Wilde. "Swift: Fast, Reliable, Loosely Coupled Parallel Computation", IEEE SWF, 2007
  16. M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz, I. Foster. "Swift: A language for distributed parallel scripting," Parallel Computing, 37:633–652, 2011