Comparison of cluster software

The following tables compare general and technical information for notable computer cluster software. This software can be broadly divided into four categories: job schedulers, node management, node installation, and integrated stacks (all of the above).

General information


| Software | Maintainer | Category | Development status | Latest release | Architecture | High-Performance / High-Throughput Computing | License | Platforms supported | Cost | Paid support available |
|---|---|---|---|---|---|---|---|---|---|---|
| Amoeba | | | No active development | | | | MIT | | | |
| Base One Foundation Component Library | | | | | | | Proprietary | | | |
| DIET | INRIA, SysFera, Open Source | All in one | | | GridRPC, SPMD, hierarchical and distributed architecture, CORBA | HTC/HPC | CeCILL | Unix-like, Mac OS X, AIX | Free | |
| DxEnterprise | DH2i | Nodes management | Actively developed | v23.0 | | | Proprietary | Windows 2012R2/2016/2019/2022 and 8+, RHEL 7/8/9, CentOS 7, Ubuntu 16.04/18.04/20.04/22.04, SLES 15.4 | Cost | Yes |
| Enduro/X | Mavimax, Ltd. | Job/Data Scheduler | Actively developed | | SOA grid | HTC/HPC/HA | GPLv2 or commercial | Linux, FreeBSD, macOS, Solaris, AIX | Free / Cost | Yes |
| Ganglia | | Monitoring | Actively developed | 3.7.2 [1] (14 June 2016) | | | BSD | Unix, Linux, Microsoft Windows NT/XP/2000/2003/2008, FreeBSD, NetBSD, OpenBSD, DragonflyBSD, Mac OS X, Solaris, AIX, IRIX, Tru64, HP-UX | Free | |
| Grid MP | Univa (formerly United Devices) | Job Scheduler | No active development | | Distributed master/worker | HTC/HPC | Proprietary | Windows, Linux, Mac OS X, Solaris | Cost | |
| Apache Mesos | Apache | | Actively developed | | | | Apache License 2.0 | Linux | Free | Yes |
| Moab Cluster Suite | Adaptive Computing | Job Scheduler | Actively developed | | | HPC | Proprietary | Linux, Mac OS X, Windows, AIX, OSF/Tru-64, Solaris, HP-UX, IRIX, FreeBSD and other UNIX platforms | Cost | Yes |
| NetworkComputer | Runtime Design Automation | | Actively developed | | | HTC/HPC | Proprietary | Unix-like, Windows | Cost | |
| OCS | | | | | | | | | | |
| OpenHPC | OpenHPC project | All in one | Actively developed | v2.6.1 (February 2, 2023) | | HPC | | Linux (CentOS / openSUSE Leap) | Free | No |
| OpenLava | None (formerly Teraproc) | Job Scheduler | Halted by injunction | | Master/worker, multiple admin/submit nodes | HTC/HPC | Illegal, as a pirated version of IBM Spectrum LSF | Linux | Not legally available | No |
| PBS Pro | Altair | Job Scheduler | Actively developed | | Master/worker, distributed with fail-over | HPC/HTC | AGPL or proprietary | Linux, Windows | Free or Cost | Yes |
| Proxmox Virtual Environment | Proxmox Server Solutions | Complete | Actively developed | | | | Open-source AGPLv3 | Linux, Windows; other operating systems are known to work and are community supported | Free | Yes |
| Rocks Cluster Distribution | Open Source/NSF grant | All in one | Actively developed | 7.0 (Manzanita) [2] (1 December 2017) | | HTC/HPC | Open source | CentOS | Free | |
| Popular Power | | | | | | | | | | |
| ProActive | INRIA, ActiveEon, Open Source | All in one | Actively developed | | Master/worker, SPMD, distributed component model, skeletons | HTC/HPC | GPL | Unix-like, Windows, Mac OS X | Free | |
| RPyC | Tomer Filiba | | Actively developed | | | | MIT License | *nix/Windows | Free | |
| SLURM | SchedMD | Job Scheduler | Actively developed | v23.11.3 (January 24, 2024) | | HPC/HTC | GPL | Linux/*nix | Free | Yes |
| Spectrum LSF | IBM | Job Scheduler | Actively developed | | Master node with failover/exec clients, multiple admin/submit nodes, suite add-ons | HPC/HTC | Proprietary | Unix, Linux, Windows | Cost and academic model (Academic, Express, Standard, Advanced and Suites) | Yes |
| Oracle Grid Engine (Sun Grid Engine, SGE) | Altair | Job Scheduler | Active (development moved to Altair Grid Engine) | | Master node/exec clients, multiple admin/submit nodes | HPC/HTC | Proprietary | *nix/Windows | Cost | |
| Some Grid Engine / Son of Grid Engine / Sun Grid Engine | daimh | Job Scheduler | Actively developed (stable/maintenance) | | Master node/exec clients, multiple admin/submit nodes | HPC/HTC | Open-source SISSL | *nix | Free | No |
| SynfiniWay | Fujitsu | | Actively developed | | | HPC/HTC | ? | Unix, Linux, Windows | Cost | |
| Techila Distributed Computing Engine | Techila Technologies Ltd. | All in one | Actively developed | | Distributed master/worker | HTC | Proprietary | Linux, Windows | Cost | Yes |
| TORQUE Resource Manager | Adaptive Computing | Job Scheduler | Actively developed | | | | Proprietary | Linux, *nix | Cost | Yes |
| UniCluster | Univa | All in one | Functionality and development moved to UniCloud | | | | | | Free | Yes |
| UNICORE | | | | | | | | | | |
| Xgrid | Apple Computer | | | | | | | | | |
| Warewulf | | Provision and clusters management | Actively developed | v4.4.1 (July 6, 2023) | | HPC | Open source | Linux | Free | |
| xCAT | | Provision and clusters management | Actively developed | v2.16.5 (March 7, 2023) | | HPC | Eclipse Public License | Linux | Free | |

Table explanation

Technical information

| Software | Implementation language | Authentication | Encryption | Integrity | Global file system | Global file system + Kerberos | Heterogeneous / homogeneous exec node | Jobs priority | Group priority | Queue type | SMP aware | Max exec nodes | Max jobs submitted | CPU scavenging | Parallel job | Job checkpointing | Python interface |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Enduro/X | C/C++ | OS authentication | GPG, AES-128, SHA1 | None | Any cluster POSIX FS (GFS, GPFS, OCFS, etc.) | Any cluster POSIX FS (GFS, GPFS, OCFS, etc.) | Heterogeneous | OS nice level | OS nice level | SOA queues, FIFO | Yes | OS limits | OS limits | Yes | Yes | No | No |
| HTCondor | C++ | GSI, SSL, Kerberos, password, file system, remote file system, Windows, claim-to-be, anonymous | None, Triple DES, Blowfish | None, MD5 | None, NFS, AFS | Not official; hack with ACL and NFSv4 | Heterogeneous | Yes | Yes | Fair-share with some programmability | Basic (hard separation into different nodes) | Tested ~10,000? | Tested ~100,000? | Yes | MPI, OpenMP, PVM | Yes | Yes, with native Python binding |
| PBS Pro | C/Python | OS authentication, Munge | | | Any, e.g. NFS, Lustre, GPFS, AFS | Limited availability | Heterogeneous | Yes | Yes | Fully configurable | Yes | Tested ~50,000 | Millions | Yes | MPI, OpenMP | Yes | Yes |
| OpenLava | C/C++ | OS authentication | None | | NFS | | Heterogeneous Linux | Yes | Yes | Configurable | Yes | | | Yes, supports preemption based on priority | Yes | Yes | No |
| Slurm | C | Munge, none, Kerberos | | | | | Heterogeneous | Yes | Yes | Multifactor fair-share | Yes | Tested 120k | Tested 100k | No | Yes | Yes | PySlurm |
| Spectrum LSF | C/C++ | Multiple: OS authentication/Kerberos | Optional | Optional | Any: GPFS/Spectrum Scale, NFS, SMB | Any: GPFS/Spectrum Scale, NFS, SMB | Heterogeneous; hardware- and OS-agnostic (AIX, Linux or Windows) | Policy-based; no queue-to-compute-node binding | Policy-based; no queue-to-compute-group binding | Batch, interactive, checkpointing, parallel and combinations | Yes, and GPU-aware (GPU license free) | >9,000 compute hosts | >4 million jobs a day | Yes; supports preemption based on priority and checkpoint/resume | Yes, e.g. parallel submissions for job collaboration over e.g. MPI | Yes, with support for user-, kernel- or library-level checkpointing environments | Yes |
| Torque | C | SSH, Munge | None, any | | | | Heterogeneous | Yes | Yes | Programmable | Yes | Tested | Tested | Yes | Yes | Yes | Yes |

Table explanation

See also

  1. "Release 3.7.2".
  2. "Rocks 7.0 is Released". 1 December 2017. Retrieved 17 November 2022.

Related Research Articles

Grid computing is the use of widely distributed computer resources to reach a common goal. A computing grid can be thought of as a distributed system with non-interactive workloads that involve many files. Grid computing is distinguished from conventional high-performance computing systems such as cluster computing in that grid computers have each node set to perform a different task/application. Grid computers also tend to be more heterogeneous and geographically dispersed than cluster computers. Although a single grid can be dedicated to a particular application, commonly a grid is used for a variety of purposes. Grids are often constructed with general-purpose grid middleware software libraries. Grid sizes can be quite large.

A web hosting service is a type of Internet hosting service that hosts websites for clients, i.e. it offers the facilities required for them to create and maintain a site and makes it accessible on the World Wide Web. Companies providing web hosting services are sometimes called web hosts.

Scalability is the property of a system to handle a growing amount of work. One definition for software systems specifies that this may be done by adding resources to the system.

A shared-nothing architecture (SN) is a distributed computing architecture in which each update request is satisfied by a single node in a computer cluster. The intent is to eliminate contention among nodes. Nodes do not share the same memory or storage. One alternative architecture is shared everything, in which requests are satisfied by arbitrary combinations of nodes. This may introduce contention, as multiple nodes may seek to update the same data at the same time.
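
The routing idea is easy to see in code. Below is a minimal, hypothetical sketch (not any particular product's API): keys are hash-partitioned across nodes, so each update is handled by exactly one node that owns its own private store and never contends with its peers.

```python
# Shared-nothing sketch: each node owns a disjoint partition of the keys,
# so an update touches exactly one node; nodes share no memory or storage.
# All names here are illustrative.
from hashlib import sha256

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}          # private storage; nothing is shared

    def update(self, key, value):
        self.store[key] = value
        return f"{self.name} updated {key!r}"

class ShardedCluster:
    def __init__(self, nodes):
        self.nodes = nodes

    def route(self, key):
        # deterministic key -> node mapping; a real system would use
        # consistent hashing to survive membership changes
        digest = int(sha256(key.encode()).hexdigest(), 16)
        return self.nodes[digest % len(self.nodes)]

    def update(self, key, value):
        return self.route(key).update(key, value)

cluster = ShardedCluster([Node("n0"), Node("n1"), Node("n2")])
print(cluster.update("user:42", {"plan": "pro"}))
```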

MOSIX is a proprietary distributed operating system. Although early versions were based on older UNIX systems, since 1999 it has focused on Linux clusters and grids. In a MOSIX cluster/grid there is no need to modify or to link applications with any library, to copy files or log in to remote nodes, or even to assign processes to different nodes – it is all done automatically, as in an SMP.

In distributed computing, a single system image (SSI) cluster is a cluster of machines that appears to be one single system. The concept is often considered synonymous with that of a distributed operating system, but a single image may be presented for more limited purposes, just job scheduling for instance, which may be achieved by means of an additional layer of software over conventional operating system images running on each node. The interest in SSI clusters is based on the perception that they may be simpler to use and administer than more specialized clusters.

Utility computing, or computer utility, is a service provisioning model in which a service provider makes computing resources and infrastructure management available to the customer as needed, and charges them for specific usage rather than a flat rate. Like other types of on-demand computing, the utility model seeks to maximize the efficient use of resources and/or minimize associated costs. Utility is the packaging of system resources, such as computation, storage and services, as a metered service. This model has the advantage of a low or no initial cost to acquire computer resources; instead, resources are essentially rented.
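
As a toy illustration of the metered model, the sketch below compares a usage-based bill against a flat fee; the unit prices and the flat rate are invented for the example.

```python
# Toy metered-billing calculation contrasting utility pricing with a flat
# rate; all prices are hypothetical.
CPU_HOUR = 0.05    # $ per CPU-hour (invented)
GB_MONTH = 0.02    # $ per GB-month of storage (invented)

def utility_bill(cpu_hours, gb_months):
    """Charge for specific usage rather than a flat rate."""
    return cpu_hours * CPU_HOUR + gb_months * GB_MONTH

flat_rate = 500.0  # hypothetical monthly flat fee
usage = utility_bill(cpu_hours=2_000, gb_months=1_500)
print(f"metered: ${usage:.2f} vs flat: ${flat_rate:.2f}")
# metered: $130.00 -- a light user pays far less than the flat rate
```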

High-availability clusters are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability software to harness redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.
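
The failover sequence described above can be sketched as a small control loop. Everything here is a hypothetical stand-in: the health probe, the mount and IP-takeover steps, and the node records are placeholders, not a real HA product's API.

```python
# Failover sketch: poll a health check; on failure, prepare the standby
# node (file systems, network) before restarting the service on it.

def is_healthy(node):
    return node.get("alive", False)       # stand-in for a heartbeat probe

def prepare(node):
    node["fs_mounted"] = True             # import/mount required file systems
    node["vip_bound"] = True              # take over the service IP address

def start_service(node):
    node["service_running"] = True

def failover_once(active, standby):
    if not is_healthy(active):
        prepare(standby)                  # configure the node first ...
        start_service(standby)            # ... then restart the app on it
        return standby                    # standby becomes the new active node
    return active

active = {"name": "node-a", "alive": False}
standby = {"name": "node-b", "alive": True}
active = failover_once(active, standby)
print("service now on", active["name"])   # no administrator intervention
```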

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
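
The MapReduce model itself fits in a few lines. Below is a minimal pure-Python word count in that style; Hadoop would run the same three phases (map, shuffle, reduce) distributed across many machines rather than in one process.

```python
# Word count in the MapReduce style: map emits (word, 1) pairs, the
# shuffle groups pairs by key, and reduce sums each group.
from collections import defaultdict

def map_phase(document):
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))   # {'the': 3, 'quick': 1, ...}
```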

A grid file system is a computer file system whose goal is improved reliability and availability by taking advantage of many smaller file storage areas.

Within cluster and parallel computing, a cluster manager is usually backend graphical user interface (GUI) or command-line interface (CLI) software that runs on a set of cluster nodes that it manages. The cluster manager works together with a cluster management agent. These agents run on each node of the cluster to manage and configure services, a set of services, or the complete cluster server itself. In some cases the cluster manager is mostly used to dispatch work for the cluster to perform; in that case a subset of the cluster manager can be a remote desktop application used not for configuration but simply to send work to the cluster and collect the results. In other cases the cluster is more related to availability and load balancing than to computational or specific service clusters.
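
A minimal sketch of that dispatch role, with threads standing in for per-node agents; the names and the queue-based protocol are illustrative, not any particular cluster manager's interface.

```python
# Dispatch sketch: a manager hands work items to per-node agents and
# collects results; a sentinel value tells each agent to shut down.
import queue
import threading

work = queue.Queue()
results = queue.Queue()

def agent(node_name):
    while True:
        task = work.get()
        if task is None:                            # shutdown signal
            break
        results.put((node_name, task, task ** 2))   # "run" the task
        work.task_done()

threads = [threading.Thread(target=agent, args=(f"node{i}",)) for i in range(3)]
for t in threads:
    t.start()
for task in range(6):                               # manager dispatches work
    work.put(task)
work.join()                                         # wait for all results
for t in threads:
    work.put(None)                                  # stop every agent
for t in threads:
    t.join()
while not results.empty():
    print(results.get())
```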

The China National Grid (CNGrid) is the Chinese national high performance computing network supported by 863 Program.

The Base One Foundation Component Library (BFC) is a rapid application development toolkit for building secure, fault-tolerant, database applications on Windows and ASP.NET. In conjunction with Microsoft's Visual Studio integrated development environment, BFC provides a general-purpose web application framework for working with databases from Microsoft, Oracle, IBM, Sybase, and MySQL, running under Windows, Linux/Unix, or IBM iSeries or z/OS. BFC includes facilities for distributed computing, batch processing, queuing, and database command scripting, and these run under Windows or Linux with Wine.

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. The newest manifestation of cluster computing is cloud computing.

Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user. Large clouds often have functions distributed over multiple locations, each of which is a data center. Cloud computing relies on sharing of resources to achieve coherence and typically uses a pay-as-you-go model, which can help in reducing capital expenses but may also lead to unexpected operating expenses for users.

Eucalyptus is paid and open-source software for building Amazon Web Services (AWS)-compatible private and hybrid cloud computing environments, originally developed by the company Eucalyptus Systems. Eucalyptus is an acronym for Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems. Eucalyptus enables pooling compute, storage, and network resources that can be dynamically scaled up or down as application workloads change. Mårten Mickos was the CEO of Eucalyptus. In September 2014, Eucalyptus was acquired by Hewlett-Packard and then maintained by DXC Technology. After DXC stopped developing the product in late 2017, AppScale Systems forked the code and started supporting Eucalyptus customers.

gLite is a middleware computer software project for grid computing used by the CERN LHC experiments and other scientific domains. It was implemented through the collaborative efforts of more than 80 people in 12 different academic and industrial research centers in Europe. gLite provides a framework for building applications tapping into distributed computing and storage resources across the Internet. The gLite services were adopted by more than 250 computing centres and used by more than 15,000 researchers in Europe and around the world.

In computer science, memory virtualization decouples volatile random access memory (RAM) resources from individual systems in the data centre, and then aggregates those resources into a virtualized memory pool available to any computer in the cluster. The memory pool is accessed by the operating system or applications running on top of the operating system. The distributed memory pool can then be utilized as a high-speed cache, a messaging layer, or a large, shared memory resource for a CPU or a GPU application.
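
A rough sketch of the pooling idea, with invented names and byte-sized capacities: several nodes contribute capacity, and a get/put interface hides which node actually holds the data.

```python
# Memory-pool sketch: RAM contributed by several nodes is presented as one
# allocator; the index records placement so reads are transparent.
class PoolNode:
    def __init__(self, name, capacity):
        self.name, self.free, self.data = name, capacity, {}

class MemoryPool:
    def __init__(self, nodes):
        self.nodes, self.index = nodes, {}

    def put(self, key, blob):
        node = max(self.nodes, key=lambda n: n.free)   # most free space wins
        if node.free < len(blob):
            raise MemoryError("pool exhausted")
        node.data[key] = blob
        node.free -= len(blob)
        self.index[key] = node                         # remember placement

    def get(self, key):
        return self.index[key].data[key]               # caller never sees where

pool = MemoryPool([PoolNode("a", 64), PoolNode("b", 128)])
pool.put("frame:1", b"x" * 100)
print(len(pool.get("frame:1")), "bytes served from", pool.index["frame:1"].name)
```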

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system.
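
The chunking scheme can be sketched in a few lines; the chunk size, node names, and round-robin replica placement below are illustrative rather than any real file system's policy.

```python
# Chunking sketch: split a file into fixed-size chunks and place each
# chunk (with replicas) on different machines, enabling parallel reads.
CHUNK_SIZE = 4          # bytes here; real systems use e.g. 64-128 MB
NODES = ["m0", "m1", "m2"]
REPLICAS = 2

def split(data, size=CHUNK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]

def place(chunks, nodes=NODES, replicas=REPLICAS):
    table = {}
    for idx, chunk in enumerate(chunks):
        # replicas land on consecutive nodes so no two copies share a machine
        owners = [nodes[(idx + r) % len(nodes)] for r in range(replicas)]
        table[idx] = {"data": chunk, "nodes": owners}
    return table

layout = place(split(b"hello distributed world"))
for idx, entry in layout.items():
    print(idx, entry["nodes"], entry["data"])
```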

Computation offloading is the transfer of resource intensive computational tasks to a separate processor, such as a hardware accelerator, or an external platform, such as a cluster, grid, or a cloud. Offloading to a coprocessor can be used to accelerate applications including: image rendering and mathematical calculations. Offloading computing to an external platform over a network can provide computing power and overcome hardware limitations of a device, such as limited computational power, storage, and energy.
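
A small sketch of the idea using Python's standard process pool as a stand-in for the external platform; in a real deployment the expensive function would be shipped to a cluster or cloud endpoint instead (for example over an RPC layer such as RPyC, listed in the tables above).

```python
# Offloading sketch: submit a resource-intensive task to a separate pool
# of workers rather than running it on the caller's own processor.
from concurrent.futures import ProcessPoolExecutor

def render_tile(n):
    # stand-in for an expensive task such as image rendering
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(render_tile, n) for n in (10**5, 10**6)]
        print([f.result() for f in futures])   # results return to the caller
```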