Gang scheduling

In computer science, gang scheduling is a scheduling algorithm for parallel systems that schedules related threads or processes to run simultaneously on different processors. Usually these will be threads all belonging to the same process, but they may also be from different processes, where the processes could have a producer-consumer relationship or come from the same MPI program.

Gang scheduling is used to ensure that if two or more threads or processes communicate with each other, they will all be ready to communicate at the same time. If they were not gang-scheduled, one could block while waiting to send a message to, or receive a message from, another that is currently descheduled, and vice versa. When processors are over-subscribed and gang scheduling is not used within a group of communicating processes or threads, each communication event could incur the overhead of a context switch.

Gang scheduling is based on a data structure called the Ousterhout matrix. In this matrix each row represents a time slice, and each column a processor. The threads or processes of each job are packed into a single row of the matrix. [1] During execution, coordinated context switching is performed across all nodes to switch from the processes in one row to those in the next row.
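The matrix described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `pack_jobs` and `row_for_slice` are hypothetical, jobs are packed with a simple first-fit rule (one of several packing options), and round-robin rotation over the rows is assumed.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]

def pack_jobs(jobs: Dict[str, int], num_processors: int) -> List[Row]:
    """Pack each job's threads into a single row (time slice) of the matrix."""
    matrix: List[Row] = []
    for name, size in jobs.items():
        if not 1 <= size <= num_processors:
            raise ValueError(f"job {name!r} cannot fit on {num_processors} processors")
        # Look for the first existing row with enough free columns.
        for row in matrix:
            free = [col for col, cell in enumerate(row) if cell is None]
            if len(free) >= size:
                break
        else:
            # No row fits: open a new time slice.
            row = [None] * num_processors
            matrix.append(row)
            free = list(range(num_processors))
        for col in free[:size]:
            row[col] = name
    return matrix

def row_for_slice(matrix: List[Row], time_slice: int) -> Row:
    """Coordinated context switch: every processor moves to the same row."""
    return matrix[time_slice % len(matrix)]

matrix = pack_jobs({"A": 3, "B": 2, "C": 1, "D": 4}, num_processors=4)
for t in range(len(matrix)):
    print(t, row_for_slice(matrix, t))
```

Each column is a processor and each row a time slice, so all threads of a job always run in the same slice.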

Gang scheduling is stricter than coscheduling. [2] It requires all threads of the same process to run concurrently, while coscheduling allows for fragments, which are sets of threads that do not run concurrently with the rest of the gang.

Gang scheduling was implemented and used in production mode on several parallel machines, most notably the Connection Machine CM-5.

Types

Bag of gangs (BoG)

Gang scheduling uses a one-to-one mapping: each task of a gang is mapped to its own processor. Usually, jobs are treated as independent gangs, but in a bag of gangs scheme several gangs are combined and submitted to the system together. A BoG is not complete until all the gangs that belong to it have completed their executions. [3] Thus, if one gang belonging to a BoG finishes early, it has to wait until all the other gangs finish as well. This leads to increased synchronization delay overhead.

The response time of a bag of gangs is defined as the time interval from the arrival of the BoG at the grid dispatcher to the completion of the jobs of all of the sub-gangs which belong to the BoG. The average response time is defined as follows:

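Response Time (RT) = $\frac{1}{N} \sum_{j=1}^{N} r_j$, where $N$ is the number of BoGs and $r_j$ is the response time of the $j$-th BoG. [3]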
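As a small worked example of the definition of response time given above, the sketch below computes the response time of each BoG as the completion time of its last sub-gang minus its arrival time, and then averages over all BoGs. The function names and the sample timings are illustrative assumptions, not values from the cited study.

```python
from typing import List, Tuple

def bog_response_time(arrival: float, gang_completions: List[float]) -> float:
    """A BoG finishes only when its last sub-gang finishes."""
    return max(gang_completions) - arrival

def average_response_time(bogs: List[Tuple[float, List[float]]]) -> float:
    """Mean response time over all BoGs, given (arrival, completions) pairs."""
    times = [bog_response_time(arrival, done) for arrival, done in bogs]
    return sum(times) / len(times)

# Two hypothetical BoGs: (arrival time, completion time of each sub-gang).
bogs = [(0.0, [4.0, 7.0, 5.0]), (2.0, [6.0, 5.0])]
print(average_response_time(bogs))  # (7.0 + 4.0) / 2 = 5.5
```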

The response time is further affected when a priority job arrives. Whenever a priority job arrives at the system, that job will be given priority with respect to all other jobs, even over the ones which are currently being executed on the processors. In this case, when a priority job arrives, the sub-gang which is currently executing on the system will be stopped and all the progress that has been made will be lost and need to be redone. This interruption of the job will further delay the total response time of the BoG. [3]

Adapted first come first served (AFCFS)

The adapted first come first served (AFCFS) scheme is an adaptation of the first-come, first-served scheme. Under first-come, first-served, whichever job arrives first is forwarded for execution first. Under AFCFS, a job that arrives at the system is not scheduled until enough processors are available to execute it. [3] When a large job is at the head of the ready queue but not enough processors are available for it, an AFCFS policy schedules a smaller job for which enough processors are available, even if that job is at the back of the queue. In other words, this scheme favors smaller jobs over larger ones based on processor availability, which leads to increased fragmentation in the system. [3] [4]
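The selection rule described above can be sketched as a single pass over the ready queue. This is a simplified model under stated assumptions: the name `afcfs_select` is hypothetical, each job is a `(name, size)` pair, and only the count of idle processors is tracked rather than which processors are idle.

```python
from typing import List, Tuple

Job = Tuple[str, int]  # (job name, number of processors required)

def afcfs_select(queue: List[Job], idle: int) -> List[str]:
    """Scan the queue in arrival order and start every job that currently fits."""
    started: List[str] = []
    waiting: List[Job] = []
    for name, size in queue:
        if size <= idle:
            started.append(name)
            idle -= size
        else:
            # Not enough processors: the job keeps its place in the queue.
            waiting.append((name, size))
    queue[:] = waiting
    return started

queue = [("big", 8), ("mid", 3), ("small", 2)]
print(afcfs_select(queue, idle=5))  # ['mid', 'small']
print(queue)                        # [('big', 8)]
```

The large job at the head of the queue is bypassed by the two smaller jobs behind it, which is the bias toward small jobs that the text describes.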

Largest gang first served (LGFS)

In the LGFS scheme, tasks are queued in order of job size, with the tasks belonging to the largest gang scheduled first. This method tends to starve smaller jobs of resources and is therefore unsuitable for systems where the number of processors is comparatively low. [5]

AFCFS and LGFS also have to deal with possible processor failure. In such a case, tasks executing on the failed processor are submitted to other processors for execution. The tasks wait at the head of the queue on these processors while the failed processor is being repaired.
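A minimal sketch of largest-gang-first selection follows. It assumes, for illustration only, that jobs are `(name, size)` pairs, that ties between equal-sized gangs are broken by arrival order, and that a gang which does not fit is skipped in favor of the next largest one; `lgfs_select` is a hypothetical name.

```python
from typing import List, Tuple

Job = Tuple[str, int]  # (job name, number of processors required)

def lgfs_select(queue: List[Job], idle: int) -> List[str]:
    """Consider gangs from largest to smallest and start each one that fits."""
    # sorted() is stable, so equal-sized gangs keep their arrival order.
    by_size = sorted(queue, key=lambda job: job[1], reverse=True)
    started: List[str] = []
    for name, size in by_size:
        if size <= idle:
            started.append(name)
            idle -= size
    queue[:] = [job for job in queue if job[0] not in started]
    return started

queue = [("a", 2), ("b", 4), ("c", 1), ("d", 3)]
print(lgfs_select(queue, idle=5))  # ['b', 'c']
print(queue)                       # [('a', 2), ('d', 3)]
```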

There are two scenarios which emerge from the above issue: [5]

Paired gang scheduling

When a gang of I/O-bound processes runs under gang scheduling, its CPUs sit idle while the processes wait for I/O, even though those processors could be executing other tasks. Because CPU-bound and I/O-bound processes interfere little with each other's operation, algorithms can be defined to keep both the CPUs and the I/O devices busy at the same time, exploiting parallelism. Paired gang scheduling does this by matching pairs of gangs, one I/O-bound and one CPU-bound. Since the two gangs use different devices, each can proceed as if it were working in isolation. [6]

Scheduling algorithm

  • General case: In the general case, a central node in the network is designated to handle task allocation and resource allocation. It maintains the information in an Ousterhout matrix. In strict gang scheduling, one row is selected at a time, after which each node scheduler schedules the process in its cell of that row. [6]
  • Paired gang: In paired gang scheduling, two rows are selected instead of one: one for an I/O-bound gang and one for a CPU-bound gang. It is at the discretion of the local scheduler to allot jobs to the appropriate processors so as to achieve the maximum allowed parallelism. [6]
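The difference between the two cases can be illustrated by how rows are picked for a time slice. The sketch below is only a toy model: it assumes each gang has already been classified as `"io"` or `"cpu"` (how that classification is made is outside its scope), and `pair_gangs` is a hypothetical helper that matches gangs in queue order.

```python
from itertools import zip_longest
from typing import List, Optional, Tuple

def pair_gangs(gangs: List[Tuple[str, str]]) -> List[Tuple[Optional[str], Optional[str]]]:
    """Match each I/O-bound gang with a CPU-bound gang, one pair per time slice."""
    io_bound = [name for name, kind in gangs if kind == "io"]
    cpu_bound = [name for name, kind in gangs if kind == "cpu"]
    # A gang without a partner runs alone, as in strict gang scheduling.
    return list(zip_longest(io_bound, cpu_bound))

gangs = [("g1", "io"), ("g2", "cpu"), ("g3", "cpu")]
print(pair_gangs(gangs))  # [('g1', 'g2'), (None, 'g3')]
```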

Synchronization methods

Concurrent gang scheduling (CGS)

Concurrent gang scheduling is a highly scalable and versatile algorithm that assumes the existence of a synchronizer utilizing the internal clock of each node. CGS primarily consists of the following three components. [7]

The synchronization algorithm is performed in two stages. [7]

The algorithm assumes the existence of a synchronizer that sends a signal to all the nodes in a cluster at a constant interval. CGS exploits the fact that timer interrupts are the most common events on a PC and uses them as the internal clock. [7]

SHARE scheduling system

The SHARE scheduling system utilizes the internal clock of each node, with the clocks synchronized using NTP. Scheduling is implemented by collecting jobs with the same resource requirements into a group and executing that group for a pre-defined time slice. Incomplete jobs are pre-empted after the time slice is exhausted. [8]

The local memory of the node is utilized as the swap space for pre-empted jobs. The main advantages of the SHARE scheduled system are that it guarantees the service time for accepted jobs and supports both batch and interactive jobs.

Synchronization:

Each gang of processes utilizing the same resources is mapped to a different processor. The SHARE system primarily consists of three collaborating modules. [8]

Packing criteria

A new slot is created when a job cannot be packed into any available slot. If a new slot is opened even though the job could have been packed into an available slot, the run fraction, which is equal to one over the number of slots used, decreases. Several packing algorithms have therefore been devised and are described below:

Capacity based algorithm

This algorithm monitors the capacity of the slots and decides whether a new slot needs to be opened. There are two variants of this algorithm:

1. First fit. Here, the used slots are checked for capacity in sequential order, and the first one that has sufficient capacity is chosen. If none of the available slots has enough capacity, a new slot is opened and the processing elements (PEs) are allocated in the slot in sequential order. [9]

2. Best fit. Unlike first fit, the used slots are ordered by capacity rather than checked in sequential order. The slot with the smallest sufficient capacity is chosen. Only if none of the used slots has sufficient capacity is a new slot opened. Once the new slot is opened, the processing elements (PEs) are allocated in the slot in sequential order, as in first fit. [9]
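The two capacity-based variants differ only in which used slot they pick, which the following sketch makes explicit. It is a simplified model: each slot is represented only by its number of free PEs, and the names `first_fit`, `best_fit`, and `place` are chosen here for illustration.

```python
from typing import Callable, List, Optional

Chooser = Callable[[List[int], int], Optional[int]]

def first_fit(free: List[int], size: int) -> Optional[int]:
    """Return the first slot, in sequential order, with enough free PEs."""
    for slot, capacity in enumerate(free):
        if capacity >= size:
            return slot
    return None

def best_fit(free: List[int], size: int) -> Optional[int]:
    """Return the slot with the smallest capacity that is still sufficient."""
    fitting = [(capacity, slot) for slot, capacity in enumerate(free) if capacity >= size]
    return min(fitting)[1] if fitting else None

def place(free: List[int], size: int, num_pes: int, choose: Chooser) -> int:
    """Pack a job of `size` threads, opening a new slot only when none fits."""
    if not 1 <= size <= num_pes:
        raise ValueError("job size must be between 1 and the number of PEs")
    slot = choose(free, size)
    if slot is None:
        free.append(num_pes)  # new, completely empty slot
        slot = len(free) - 1
    free[slot] -= size
    return slot

free = [5, 3, 8]                 # free PEs in three used slots
print(first_fit(free, 3))        # 0
print(best_fit(free, 3))         # 1
print(1 / len(free))             # run fraction with three slots in use
```

Passing the chooser as a parameter keeps the "open a new slot only if nothing fits" rule in one place for both variants.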

Left-right based algorithms

This algorithm is a modified version of the best fit algorithm. In the best fit algorithm, the PEs are allocated in sequential order, but in this algorithm the PEs can be inserted from both directions so as to reduce the overlap between the sets of PEs assigned to different jobs. [9]

1. Left-right by size. Here, the PEs are inserted in sequential order or in reverse sequential order depending on the size of the job. If the job is small, the PEs are inserted from left to right; if the job is large, the PEs are inserted from right to left. [9]

2. Left-right by slots. Unlike the previous algorithm, where the choice was based on the size of the job, here the choice depends on the slot. Each slot is marked as being filled either from the left or from the right, and PEs are allocated to a job in that slot's direction. The number of slots filled from each side is kept approximately equal, so when a new slot is opened, its direction is chosen based on the number of existing slots in each direction. [9]
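Both left-right variants can be sketched with one allocation helper that fills a slot from either end. The code below is illustrative only: the size `threshold` separating "small" from "large" jobs is an assumption (the text does not give one), and the function names are hypothetical.

```python
from typing import List, Optional

Slot = List[Optional[str]]

def allocate(slot: Slot, job: str, size: int, from_left: bool) -> Optional[List[int]]:
    """Give `job` the first (or last) `size` free PEs of the slot."""
    if size < 1:
        raise ValueError("size must be positive")
    free = [pe for pe, owner in enumerate(slot) if owner is None]
    if len(free) < size:
        return None
    chosen = free[:size] if from_left else free[-size:]
    for pe in chosen:
        slot[pe] = job
    return chosen

def left_right_by_size(slot: Slot, job: str, size: int, threshold: int) -> Optional[List[int]]:
    """Small jobs fill from the left, large jobs from the right."""
    return allocate(slot, job, size, from_left=size <= threshold)

def new_slot_direction(directions: List[str]) -> str:
    """Left-right by slots: keep the two directions roughly balanced."""
    left = directions.count("left")
    right = directions.count("right")
    return "left" if left <= right else "right"

slot: Slot = [None] * 8
print(left_right_by_size(slot, "small", 2, threshold=4))  # [0, 1]
print(left_right_by_size(slot, "large", 5, threshold=4))  # [3, 4, 5, 6, 7]
print(new_slot_direction(["left", "right", "left"]))      # right
```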

Load based algorithms

Neither the capacity-based nor the left-right based algorithms account for the load on individual PEs. Load-based algorithms take into account the load on each individual PE while tracking the overlap between the sets of PEs assigned to different jobs. [9]

1. Minimal maximum load. In this scheme, the PEs are sorted by the load that jobs place on them. The number of free PEs in a slot determines the capacity of the slot. Suppose that PEs are to be allocated to a job which has x threads; the x-th PE in the load order (the last one) then determines the maximum load on any of the PEs that would be used in the slot. The slot which has the minimal maximum load on any PE is chosen, and the x least loaded free PEs in that slot are used. [9]

2. Minimal average load. Unlike the previous scheme, in which slots are chosen based on the minimal maximum load on a PE, here slots are chosen based on the average load on the x least loaded PEs. [9]
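The two load-based criteria can be compared with a short sketch. It rests on several assumptions made for illustration: `load[pe]` is some per-PE load measure supplied by the caller, each slot is given as the set of its free PEs, ties are broken by the lower slot index, and `choose_slot` is a hypothetical name.

```python
from typing import List, Optional, Set, Tuple

def choose_slot(slots_free: List[Set[int]], load: List[int], x: int,
                criterion: str) -> Optional[Tuple[float, int, List[int]]]:
    """Pick a slot for a job with x threads by 'max' or 'avg' load."""
    if x < 1:
        raise ValueError("x must be positive")
    best: Optional[Tuple[float, int, List[int]]] = None
    for slot, free_pes in enumerate(slots_free):
        if len(free_pes) < x:
            continue  # slot capacity is too small for this job
        # The x least loaded free PEs of this slot, in load order.
        pes = sorted(free_pes, key=lambda pe: (load[pe], pe))[:x]
        loads = [load[pe] for pe in pes]
        score = float(max(loads)) if criterion == "max" else sum(loads) / x
        if best is None or score < best[0]:
            best = (score, slot, pes)
    return best

load = [3, 1, 1, 2, 5, 0]                     # current load on PEs 0..5
slots_free = [{0, 1, 2}, {3, 4, 5}, {1, 3, 5}]
print(choose_slot(slots_free, load, 2, "max"))  # (1.0, 0, [1, 2])
print(choose_slot(slots_free, load, 2, "avg"))  # (0.5, 2, [5, 1])
```

With these sample loads the two criteria select different slots, which shows that they are not interchangeable.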

Buddy based algorithm

In this algorithm the PEs are assigned in clusters, not individually. The PEs are first partitioned into groups whose sizes are powers of two, and each group is assigned a controller. When a job of size n arrives, it is assigned to a controller of size $2^{\lceil \lg n \rceil}$ (the smallest power of 2 that is greater than or equal to n). The controller is chosen by first sorting all the used slots and then identifying groups of $2^{\lceil \lg n \rceil}$ contiguous free processors. A newly arrived job is assigned to a controller only if all of that controller's PEs are free in some slot; otherwise a new slot is opened. [9]
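The controller size and the search for a free group can be sketched as follows. This is a minimal illustration under stated assumptions: the number of PEs in a slot is a power of two, groups are aligned on multiples of their size, and `controller_size` and `find_free_group` are hypothetical names.

```python
from typing import List, Optional

def controller_size(n: int) -> int:
    """Smallest power of two that is greater than or equal to n."""
    if n < 1:
        raise ValueError("job size must be positive")
    return 1 << (n - 1).bit_length()

def find_free_group(slot: List[Optional[str]], n: int) -> Optional[int]:
    """Return the start index of an aligned, fully free group for a job of size n."""
    size = controller_size(n)
    for start in range(0, len(slot) - size + 1, size):
        if all(owner is None for owner in slot[start:start + size]):
            return start
    return None  # caller would try another slot or open a new one

print(controller_size(5))                 # 8
slot = ["X", None, None, None, None, None, None, None]
print(find_free_group(slot, 3))           # 4
```

A job of size 3 needs a group of 4 PEs; the first group is blocked by the occupied PE 0, so the group starting at PE 4 is used.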

Migration based algorithm

In all the above-mentioned algorithms, the initial placement policy is fixed and jobs stay on the PEs to which they were first allocated. The migration-based scheme instead moves jobs from one set of PEs to another, which in turn improves the run fraction of the system. [9]

References

  1. Feitelson, Dror G. (1996). "Packing schemes for gang scheduling". Job Scheduling Strategies for Parallel Processing. Lecture Notes in Computer Science. Vol. 1162. Springer-Verlag. pp. 89–110.
  2. Feitelson, Dror G.; Rudolph, Larry (1992). "Gang Scheduling Performance Benefits for Fine-Grain Synchronization". Journal of Parallel and Distributed Computing. 16 (4): 306–318. CiteSeerX 10.1.1.79.7070. doi:10.1016/0743-7315(92)90014-e.
  3. Papazachos, Zafeirios C.; Karatza, Helen D. (August 2010). "Performance evaluation of bag of gangs scheduling in a heterogeneous distributed system". Journal of Systems and Software. 83 (8): 1346–1354. doi:10.1016/j.jss.2010.01.009.
  4. Papazachos, Zafeirios C.; Karatza, Helen D. (2009). "Performance evaluation of gang scheduling in a two-cluster system with migrations". Parallel and Distributed Processing Symposium, International (IPDPS). pp. 1–8. doi:10.1109/IPDPS.2009.5161172.
  5. "Performance Analysis of Gang Scheduling in a Distributed System under Processor Failures" (PDF).
  6. "Paired Gang Scheduling" (PDF).
  7. Hyoudou, Kazuki; Kozakai, Yasuyuki; Nakayama, Yasuichi (2007). "An Implementation of a Concurrent Gang Scheduler for a PC-Based Cluster System". Systems and Computers in Japan. 38 (3): 39–48. doi:10.1002/scj.20458.
  8. IEEE Computer Society (1996). Gang Scheduling for Highly Efficient Distributed Multiprocessor Systems. Frontiers '96. pp. 4–. ISBN 9780818675515.
  9. "Packing Schemes for Gang Scheduling" (PDF).