Message passing is an inherent element of all computer clusters. All computer clusters, ranging from homemade Beowulfs to some of the fastest supercomputers in the world, rely on message passing to coordinate the activities of their many nodes. [1] [2] Message passing in computer clusters built with commodity servers and switches is used by virtually every internet service. [1]
Recently, the use of computer clusters with more than one thousand nodes has been spreading. As the number of nodes in a cluster increases, the rapid growth in the complexity of the communication subsystem makes message passing delays over the interconnect a serious performance issue in the execution of parallel programs. [3]
Specific tools may be used to simulate, visualize and understand the performance of message passing on computer clusters. Before a large computer cluster is assembled, a trace-based simulator can use a small number of nodes to help predict the performance of message passing on larger configurations. Following test runs on a small number of nodes, the simulator reads the execution and message transfer log files and simulates the performance of the messaging subsystem when many more messages are exchanged between a much larger number of nodes. [4] [5]
Historically, the two typical approaches to communication between cluster nodes have been PVM, the Parallel Virtual Machine, and MPI, the Message Passing Interface. [6] However, MPI has now emerged as the de facto standard for message passing on computer clusters. [7]
PVM predates MPI and was developed at the Oak Ridge National Laboratory around 1989. It provides a set of software libraries that allow a collection of computing nodes to be used as a single "parallel virtual machine". It provides a run-time environment for message passing, task and resource management, and fault notification, and must be directly installed on every cluster node. PVM can be used by user programs written in languages such as C, C++, or Fortran. [6] [8]
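For illustration, a minimal sketch of PVM-style message passing in C against the classic PVM 3 interface (pvm3.h) might look as follows; the executable name "child_task" and the single-child setup are illustrative assumptions rather than prescribed usage.

```c
/* Minimal PVM 3 sketch (illustrative): the parent spawns one child task
 * and they exchange a single integer message. */
#include <stdio.h>
#include "pvm3.h"

int main(void) {
    int mytid = pvm_mytid();          /* enroll this task in the virtual machine */
    int parent = pvm_parent();

    if (parent == PvmNoParent) {      /* parent: spawn one child task */
        int child;
        pvm_spawn("child_task", NULL, PvmTaskDefault, "", 1, &child);

        int msg;
        pvm_recv(child, 1);           /* wait for a message with tag 1 */
        pvm_upkint(&msg, 1, 1);
        printf("parent %d got %d from child\n", mytid, msg);
    } else {                          /* child: pack and send an integer to the parent */
        int msg = 99;
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&msg, 1, 1);
        pvm_send(parent, 1);
    }
    pvm_exit();                       /* leave the virtual machine */
    return 0;
}
```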
Unlike PVM, which has a concrete implementation, MPI is a specification rather than a specific set of libraries. The specification emerged in the early 1990s out of discussions among 40 organizations, the initial effort having been supported by ARPA and the National Science Foundation. The design of MPI drew on various features available in commercial systems of the time. The MPI specification then gave rise to specific implementations, which typically use TCP/IP and socket connections. [6] MPI is now a widely available communications model that enables parallel programs to be written in languages such as C, Fortran, and Python. [8] The MPI specification has been implemented in systems such as MPICH and Open MPI. [8] [9]
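By comparison, a minimal MPI point-to-point exchange in C, runnable under implementations such as MPICH or Open MPI, could look like the following sketch; the ranks, tag, and payload value are arbitrary illustrative choices.

```c
/* Minimal MPI sketch: rank 0 sends one integer to rank 1.
 * Build and run with e.g. mpicc msg.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload = 123;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
    } else if (rank == 1) {
        int payload;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", payload);
    }
    MPI_Finalize();
    return 0;
}
```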
Computer clusters use a number of strategies for dealing with the distribution of processing over multiple nodes and the resulting communication overhead. Some computer clusters such as Tianhe-I use different processors for message passing than those used for performing computations. Tianhe-I uses over two thousand FeiTeng-1000 processors to enhance the operation of its proprietary message passing system, while computations are performed by Xeon and Nvidia Tesla processors. [10] [11]
One approach to reducing communication overhead is the use of local neighborhoods (also called locales) for specific tasks. Here, computational tasks are assigned to specific "neighborhoods" in the cluster, to increase efficiency by using processors which are closer to each other. [3] However, given that in many cases the actual topology of the computer cluster nodes and their interconnections may not be known to application developers, attempting to fine-tune performance at the application program level is quite difficult. [3]
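The general idea of keeping communication between nearby ranks can be sketched at the application level with MPI's Cartesian topology routines, which let the library reorder ranks to better match the physical network. This is only an illustration of the principle, not the specific technique of the cited work.

```c
/* Sketch: nearest-neighbor exchange on a 2-D process grid. Allowing the
 * library to reorder ranks (reorder = 1) gives it a chance to map
 * neighboring grid positions onto nearby physical nodes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Dims_create(nprocs, 2, dims);            /* factor ranks into a 2-D grid */

    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

    int me, left, right;
    MPI_Comm_rank(grid, &me);
    MPI_Cart_shift(grid, 0, 1, &left, &right);   /* neighbors along dimension 0 */

    int out = me, in = -1;
    MPI_Sendrecv(&out, 1, MPI_INT, right, 0,
                 &in,  1, MPI_INT, left,  0, grid, MPI_STATUS_IGNORE);
    printf("rank %d received %d from its left neighbor\n", me, in);

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}
```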
Given that MPI has now emerged as the de facto standard on computer clusters, the increase in the number of cluster nodes has resulted in continued research to improve the efficiency and scalability of MPI libraries. These efforts have included research to reduce the memory footprint of MPI libraries. [7]
From the earliest days MPI has provided facilities for performance profiling via the PMPI "profiling system". [12] The use of the PMPI prefix allows the observation of the entry and exit points for messages. However, given the high-level nature of this profile, this type of information only provides a glimpse at the real behavior of the communication system. The need for more information resulted in the development of the MPI-Peruse system. Peruse provides a more detailed profile by enabling applications to gain access to state changes within the MPI library. This is achieved by registering callbacks with Peruse, which are then invoked as triggers when message events take place. [13] Peruse can work with the PARAVER visualization system. PARAVER has two components: a trace component and a visual component for analyzing the traces, the statistics related to specific events, etc. [14] PARAVER may use trace formats from other systems, or perform its own tracing. It operates at the task level, thread level, and in a hybrid format. Traces often include so much information that they become overwhelming, so PARAVER summarizes them to allow users to visualize and analyze them. [13] [14] [15]
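Returning to the PMPI interface mentioned above, a hedged sketch of the wrapper mechanism in C is shown below: the wrapper intercepts the application's calls to MPI_Send, records the entry and exit times, and forwards the call to the underlying implementation through PMPI_Send. It would be compiled into a profiling library linked with the application; the logging format here is purely illustrative.

```c
/* PMPI-style profiling wrapper (illustrative): every MPI_Send the
 * application issues passes through this function, which times the call
 * and then delegates to the real implementation via PMPI_Send. */
#include <mpi.h>
#include <stdio.h>

static long sends = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    double t0 = MPI_Wtime();                     /* entry point observed */
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    double t1 = MPI_Wtime();                     /* exit point observed */

    sends++;
    fprintf(stderr, "MPI_Send #%ld to rank %d took %g s\n", sends, dest, t1 - t0);
    return rc;
}
```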
When a large scale, often supercomputer level, parallel system is being developed, it is essential to be able to experiment with multiple configurations and simulate performance. There are a number of approaches to modeling message passing efficiency in this scenario, ranging from analytical models to trace-based simulation, while some approaches rely on the use of test environments based on "artificial communications" to perform synthetic tests of message passing performance. [3] Systems such as BIGSIM provide these facilities by allowing the simulation of performance on various node topologies, message passing and scheduling strategies. [4]
At the analytical level, it is necessary to model the communication time T in terms of a set of subcomponents such as the startup latency, the asymptotic bandwidth and the number of processors. A well known model is Hockney's model, which simply relies on point-to-point communication, using T = L + (M / R), where M is the message size, L is the startup latency and R is the asymptotic bandwidth in MB/s. [16]
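Written out, Hockney's point-to-point model for the communication time is:

```latex
% Hockney's model: T is the communication time, M the message size,
% L the startup latency, and R the asymptotic bandwidth.
\[
  T(M) \;=\; L \;+\; \frac{M}{R}
\]
```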
Xu and Hwang generalized Hockney's model to include the number of processors, so that both the latency and the asymptotic bandwidth are functions of the number of processors. [16] [17] Gunawan and Cai then generalized this further by introducing cache size, and separated the messages based on their sizes, obtaining two separate models, one for messages below cache size, and one for those above. [16]
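Schematically, these generalizations can be written as follows; the exact functional forms used in the cited works may differ, so this is only an indicative summary.

```latex
% Xu and Hwang: latency and asymptotic bandwidth depend on the
% number of processors p.
\[
  T(M, p) \;=\; L(p) \;+\; \frac{M}{R(p)}
\]
% Gunawan and Cai: separate parameters for messages below and above
% the cache size C (schematic form).
\[
  T(M) \;=\;
  \begin{cases}
    L_1 + \frac{M}{R_1}, & M \le C \\
    L_2 + \frac{M}{R_2}, & M > C
  \end{cases}
\]
```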
Specific tools may be used to simulate and understand the performance of message passing on computer clusters. For instance, ClusterSim uses a Java-based visual environment for discrete-event simulation. In this approach, the computing nodes and the network topology are visually modeled. Jobs and their duration and complexity are represented with specific probability distributions, allowing various parallel job scheduling algorithms to be proposed and experimented with. The communication overhead for MPI message passing can thus be simulated and better understood in the context of large-scale parallel job execution. [18]
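The flavor of such a simulation can be conveyed with a toy sketch in C (not ClusterSim itself): jobs with randomly drawn computation times and message counts are placed on the earliest-idle node, and each message is charged a Hockney-style cost L + M/R. All parameter values and distributions below are illustrative assumptions.

```c
/* Toy scheduling/communication simulation (illustrative only). */
#include <stdio.h>
#include <stdlib.h>

#define NODES 64
#define JOBS  256

int main(void) {
    const double latency   = 5e-6;     /* startup latency L in seconds (assumed) */
    const double bandwidth = 1e9;      /* asymptotic bandwidth R in bytes/s (assumed) */
    double node_free_at[NODES] = {0};  /* time at which each node becomes idle */
    double makespan = 0.0;

    srand(42);
    for (int j = 0; j < JOBS; j++) {
        /* draw a job: computation time, number of messages, message size */
        double compute = (rand() % 1000 + 1) * 1e-3;     /* 1 ms .. 1 s   */
        int    msgs    = rand() % 50;                     /* messages sent */
        double msg_sz  = (rand() % 1024 + 1) * 1024.0;    /* 1 KiB .. 1 MiB */

        /* communication overhead: Hockney cost per message */
        double comm = msgs * (latency + msg_sz / bandwidth);

        /* first-come-first-served: place the job on the earliest-idle node */
        int best = 0;
        for (int n = 1; n < NODES; n++)
            if (node_free_at[n] < node_free_at[best]) best = n;

        node_free_at[best] += compute + comm;
        if (node_free_at[best] > makespan) makespan = node_free_at[best];
    }
    printf("simulated makespan on %d nodes: %.3f s\n", NODES, makespan);
    return 0;
}
```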
Other simulation tools include MPI-sim and BIGSIM. [19] MPI-Sim is an execution-driven simulator that requires C or C++ programs to operate. [18] [19] ClusterSim, on the other hand, uses a hybrid higher-level modeling system independent of the programming language used for program execution. [18]
Unlike MPI-Sim, BIGSIM is a trace-driven system that simulates based on the logs of executions saved in files by a separate emulator program. [5] [19] BIGSIM includes an emulator and a simulator. The emulator executes applications on a small number of nodes and stores the results, so the simulator can use them and simulate activities on a much larger number of nodes. [5] The emulator stores information about sequential execution blocks (SEBs) for multiple processors in log files, with each SEB recording the messages sent, their sources and destinations, dependencies, timings, etc. The simulator reads the log files and simulates them, and may start additional messages which are then also stored as SEBs. [4] [5] The simulator can thus provide a view of the performance of very large applications, based on the execution traces provided by the emulator on a much smaller number of nodes, before the entire machine is available or configured. [5]
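As an illustration of the kind of information such logs carry (not BIGSIM's actual file format), a sequential execution block record might be modeled along these lines:

```c
/* Illustrative data layout for an SEB-style trace record: the messages
 * sent, their sources and destinations, dependencies and timings. */
#include <stddef.h>

typedef struct {
    int    source_rank;      /* sending processor */
    int    dest_rank;        /* receiving processor */
    size_t size_bytes;       /* message size */
    double send_time;        /* timestamp of the send */
} MessageRecord;

typedef struct {
    int            processor;     /* processor that executed this block */
    double         start_time;    /* when the block began */
    double         duration;      /* sequential execution time */
    int           *depends_on;    /* indices of SEBs this block waits for */
    int            num_deps;
    MessageRecord *messages;      /* messages emitted by this block */
    int            num_messages;
} SequentialExecutionBlock;
```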
A supercomputer is a computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instructions per second (MIPS). Since 2017, there have existed supercomputers which can perform over 10^17 FLOPS (a hundred quadrillion FLOPS, 100 petaFLOPS or 100 PFLOPS). For comparison, a desktop computer has performance in the range of hundreds of gigaFLOPS (10^11) to tens of teraFLOPS (10^13). Since November 2017, all of the world's fastest 500 supercomputers run on Linux-based operating systems. Additional research is being conducted in the United States, the European Union, Taiwan, Japan, and China to build faster, more powerful and technologically superior exascale supercomputers.
A Beowulf cluster is a computer cluster of what are normally identical, commodity-grade computers networked into a small local area network with libraries and programs installed which allow processing to be shared among them. The result is a high-performance parallel computing cluster from inexpensive personal computer hardware.
Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.
Message Passing Interface (MPI) is a standardized and portable message-passing standard designed to function on parallel computing architectures. The MPI standard defines the syntax and semantics of library routines that are useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. There are several open-source MPI implementations, which fostered the development of a parallel software industry, and encouraged development of portable and scalable large-scale parallel applications.
Quadrics was a supercomputer company formed in 1996 as a joint venture between Alenia Spazio and the technical team from Meiko Scientific. They produced hardware and software for clustering commodity computer systems into massively parallel systems. Their highpoint was in June 2003 when six out of the ten fastest supercomputers in the world were based on Quadrics' interconnect. They officially closed on June 29, 2009.
In computing, single program, multiple data (SPMD) is a term that has been used to refer to computational models for exploiting parallelism whereby multiple processors cooperate in the execution of a program in order to obtain results faster.
In computer networking, telecommunication and information theory, broadcasting is a method of transferring a message to all recipients simultaneously. Broadcasting can be performed as a high-level operation in a program, for example, broadcasting in Message Passing Interface, or it may be a low-level networking operation, for example broadcasting on Ethernet.
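As a high-level example, a broadcast in MPI delivers a value held by a designated root rank to every other rank in the communicator; the sketch below, with rank 0 as root and an arbitrary integer payload, is purely illustrative.

```c
/* Broadcast sketch: rank 0's value is delivered to every rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = (rank == 0) ? 42 : 0;   /* only the root holds the data initially */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d now holds %d\n", rank, value);
    MPI_Finalize();
    return 0;
}
```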
The Parallel Virtual File System (PVFS) is an open-source parallel file system. A parallel file system is a type of distributed file system that distributes file data across multiple servers and provides for concurrent access by multiple tasks of a parallel application. PVFS was designed for use in large scale cluster computing. PVFS focuses on high performance access to large data sets. It consists of a server process and a client library, both of which are written entirely as user-level code. A Linux kernel module and pvfs-client process allow the file system to be mounted and used with standard utilities. The client library provides for high performance access via the Message Passing Interface (MPI). PVFS is being jointly developed by The Parallel Architecture Research Laboratory at Clemson University, the Mathematics and Computer Science Division at Argonne National Laboratory, and the Ohio Supercomputer Center. PVFS development has been funded by NASA Goddard Space Flight Center, the DOE Office of Science Advanced Scientific Computing Research program, NSF PACI and HECURA programs, and other government and private agencies. PVFS is now known as OrangeFS in its newest development branch.
Roadrunner was a supercomputer built by IBM for the Los Alamos National Laboratory in New Mexico, USA. The US$100-million Roadrunner was designed for a peak performance of 1.7 petaflops. It achieved 1.026 petaflops on May 25, 2008, becoming the world's first TOP500 system to sustain 1.0 petaflops on the LINPACK benchmark.
HPX, short for High Performance ParalleX, is a runtime system for high-performance computing. It is currently under active development by the STE||AR group at Louisiana State University. Focused on scientific computing, it provides an alternative execution model to conventional approaches such as MPI. HPX aims to overcome the challenges MPI faces with increasingly large supercomputers by using asynchronous communication between nodes and lightweight control objects instead of global barriers, allowing application developers to exploit fine-grained parallelism.
A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.
Róbert Lovas is a Hungarian computer scientist at SZTAKI, Budapest, Hungary.
Japan operates a number of centers for supercomputing which hold world records in speed. The K computer became the world's fastest in June 2011, and Fugaku took the lead in June 2020, extending it by November 2020 to roughly three times the speed of the second-fastest machine.
Approaches to supercomputer architecture have taken dramatic turns since the earliest systems were introduced in the 1960s. Early supercomputer architectures pioneered by Seymour Cray relied on compact innovative designs and local parallelism to achieve superior computational peak performance. However, in time the demand for increased computational power ushered in the age of massively parallel systems.
A supercomputer operating system is an operating system intended for supercomputers. Since the end of the 20th century, supercomputer operating systems have undergone major transformations, as fundamental changes have occurred in supercomputer architecture. While early operating systems were custom tailored to each supercomputer to gain speed, the trend has been moving away from in-house operating systems and toward some form of Linux, which ran on all the supercomputers on the TOP500 list in November 2017. In 2021, the top 10 computers ran, for instance, Red Hat Enterprise Linux (RHEL) or some variant of it, or another Linux distribution such as Ubuntu.
BIGSIM is a computer simulation and performance modeling system for parallel computing, typically used for very large computer clusters. BIGSIM was developed at the University of Illinois.
In computer science, trace-based simulation refers to system simulation performed by looking at traces of program execution or system component access with the purpose of performance prediction.
SUPRENUM was a German research project to develop a parallel computer from 1985 through 1990. It was a major effort which was aimed at developing a national expertise in massively parallel processing both at hardware and at software level.
The high performance supercomputing program started in the mid-to-late 1980s in Pakistan. Supercomputing is a recent area of computer science in which Pakistan has made progress, driven in part by the growth of the information technology age in the country. The indigenous supercomputer program began in the 1980s, when the deployment of Cray supercomputers was initially denied.
SHMEM is a family of parallel programming libraries, providing one-sided, RDMA, parallel-processing interfaces for low-latency distributed-memory supercomputers. The SHMEM acronym was subsequently reverse engineered to mean "Symmetric Hierarchical MEMory". Later it was expanded to distributed memory parallel computer clusters, and is used as a parallel programming interface or as a low-level interface to build partitioned global address space (PGAS) systems and languages. "Libsma", the first SHMEM library, was created by Richard Smith at Cray Research in 1993 as a set of thin interfaces to access the CRAY T3D's inter-processor-communication hardware. SHMEM has been implemented by Cray Research, SGI, Cray Inc., Quadrics, HP, GSHMEM, IBM, QLogic, Mellanox, and the Universities of Houston and Florida; there is also the open-source OpenSHMEM.
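A minimal sketch of SHMEM-style one-sided communication, written against the open-source OpenSHMEM C interface, might look as follows; the ring pattern and the variable names are illustrative assumptions.

```c
/* OpenSHMEM sketch: each processing element (PE) writes its rank
 * directly into a symmetric variable on the next PE with a one-sided put. */
#include <shmem.h>
#include <stdio.h>

int main(void) {
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    static long src  = 0;    /* symmetric data objects (same address on every PE) */
    static long dest = -1;

    src = me;
    shmem_barrier_all();

    /* one-sided put: no receive call is needed on the target PE */
    shmem_long_put(&dest, &src, 1, (me + 1) % npes);

    shmem_barrier_all();     /* ensure all puts have completed before reading */
    printf("PE %d received %ld\n", me, dest);

    shmem_finalize();
    return 0;
}
```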