Comparison of cluster software

The following tables compare general and technical information for notable computer cluster software. This software can be broadly divided into four categories: job schedulers, node management, node installation, and integrated stacks (all of the above).

General information

| Software | Maintainer | Category | Development status | Latest release | Architecture | HPC/HTC | License | Platforms supported | Cost | Paid support available |
|---|---|---|---|---|---|---|---|---|---|---|
| Amoeba | | | No active development | | | | MIT | | | |
| Base One Foundation Component Library | | | | | | | Proprietary | | | |
| DIET | INRIA, SysFera, Open Source | All in one | | | GridRPC, SPMD, hierarchical and distributed architecture, CORBA | HTC/HPC | CeCILL | Unix-like, Mac OS X, AIX | Free | |
| DxEnterprise | DH2i | Nodes management | Actively developed | v23.0 | | | Proprietary | Windows 2012R2/2016/2019/2022 and 8+, RHEL 7/8/9, CentOS 7, Ubuntu 16.04/18.04/20.04/22.04, SLES 15.4 | Cost | Yes |
| Enduro/X | Mavimax, Ltd. | Job/data scheduler | Actively developed | | SOA grid | HTC/HPC/HA | GPLv2 or commercial | Linux, FreeBSD, macOS, Solaris, AIX | Free / Cost | Yes |
| Ganglia | | Monitoring | Actively developed | 3.7.2 [1] (14 June 2016) | | | BSD | Unix, Linux, Microsoft Windows NT/XP/2000/2003/2008, FreeBSD, NetBSD, OpenBSD, DragonflyBSD, Mac OS X, Solaris, AIX, IRIX, Tru64, HP-UX | Free | |
| Grid MP | Univa (formerly United Devices) | Job scheduler | No active development | | Distributed master/worker | HTC/HPC | Proprietary | Windows, Linux, Mac OS X, Solaris | Cost | |
| Apache Mesos | Apache | | Actively developed | | | | Apache License 2.0 | Linux | Free | Yes |
| Moab Cluster Suite | Adaptive Computing | Job scheduler | Actively developed | | | HPC | Proprietary | Linux, Mac OS X, Windows, AIX, OSF/Tru-64, Solaris, HP-UX, IRIX, FreeBSD and other UNIX platforms | Cost | Yes |
| NetworkComputer | Runtime Design Automation | | Actively developed | | | HTC/HPC | Proprietary | Unix-like, Windows | Cost | |
| OpenHPC | OpenHPC project | All in one | Actively developed | v2.6.1 (February 2, 2023) | | HPC | | Linux (CentOS / openSUSE Leap) | Free | No |
| OpenLava | None (formerly Teraproc) | Job scheduler | Halted by injunction | | Master/worker, multiple admin/submit nodes | HTC/HPC | Illegal, as a pirated version of IBM Spectrum LSF | Linux | Not legally available | No |
| PBS Pro | Altair | Job scheduler | Actively developed | | Master/worker distributed with fail-over | HPC/HTC | AGPL or proprietary | Linux, Windows | Free or Cost | Yes |
| Proxmox Virtual Environment | Proxmox Server Solutions | Complete | Actively developed | | | | Open-source AGPLv3 | Linux, Windows; other operating systems are known to work and are community supported | Free | Yes |
| Rocks Cluster Distribution | Open Source/NSF grant | All in one | Actively developed | 7.0 (Manzanita) [2] (1 December 2017) | | HTC/HPC | Open source | CentOS | Free | |
| Popular Power | | | | | | | | | | |
| ProActive | INRIA, ActiveEon, Open Source | All in one | Actively developed | | Master/worker, SPMD, distributed component model, skeletons | HTC/HPC | GPL | Unix-like, Windows, Mac OS X | Free | |
| RPyC | Tomer Filiba | | Actively developed | | | | MIT License | *nix/Windows | Free | |
| Slurm | SchedMD | Job scheduler | Actively developed | v23.11.3 (January 24, 2024) | | HPC/HTC | GPL | Linux/*nix | Free | Yes |
| Spectrum LSF | IBM | Job scheduler | Actively developed | | Master node with failover/exec clients, multiple admin/submit nodes, suite add-ons | HPC/HTC | Proprietary | Unix, Linux, Windows | Cost; academic model (Academic, Express, Standard, Advanced and Suites) | Yes |
| Oracle Grid Engine (Sun Grid Engine, SGE) | Altair | Job scheduler | Active; development moved to Altair Grid Engine | | Master node/exec clients, multiple admin/submit nodes | HPC/HTC | Proprietary | *nix/Windows | Cost | |
| Some Grid Engine / Son of Grid Engine / Sun Grid Engine | daimh | Job scheduler | Actively developed (stable/maintenance) | | Master node/exec clients, multiple admin/submit nodes | HPC/HTC | Open-source SISSL | *nix | Free | No |
| SynfiniWay | Fujitsu | | Actively developed | | | HPC/HTC | ? | Unix, Linux, Windows | Cost | |
| Techila Distributed Computing Engine | Techila Technologies Ltd. | All in one | Actively developed | | Master/worker distributed | HTC | Proprietary | Linux, Windows | Cost | Yes |
| TORQUE Resource Manager | Adaptive Computing | Job scheduler | Actively developed | | | | Proprietary | Linux, *nix | Cost | Yes |
| UniCluster | Univa | All in one | Functionality and development moved to UniCloud | | | | | | Free | Yes |
| UNICORE | | | | | | | | | | |
| Xgrid | Apple Computer | | | | | | | | | |
| Warewulf | | Provision and cluster management | Actively developed | v4.4.1 (July 6, 2023) | | HPC | Open source | Linux | Free | |
| xCAT | | Provision and cluster management | Actively developed | v2.16.5 (March 7, 2023) | | HPC | Eclipse Public License | Linux | Free | |

Table explanation

Technical information

| Software | Implementation language | Authentication | Encryption | Integrity | Global file system | Global file system + Kerberos | Heterogeneous/homogeneous exec node | Jobs priority | Group priority | Queue type | SMP aware | Max exec nodes | Max jobs submitted | CPU scavenging | Parallel job | Job checkpointing | Python interface |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Enduro/X | C/C++ | OS authentication | GPG, AES-128, SHA1 | None | Any clustered POSIX FS (GFS, GPFS, OCFS, etc.) | Any clustered POSIX FS (GFS, GPFS, OCFS, etc.) | Heterogeneous | OS nice level | OS nice level | SOA queues, FIFO | Yes | OS limits | OS limits | Yes | Yes | No | No |
| HTCondor | C++ | GSI, SSL, Kerberos, password, file system, remote file system, Windows, claim-to-be, anonymous | None, Triple DES, Blowfish | None, MD5 | None, NFS, AFS | Not official; hack with ACL and NFSv4 | Heterogeneous | Yes | Yes | Fair-share with some programmability | Basic (hard separation into different nodes) | Tested ~10,000? | Tested ~100,000? | Yes | MPI, OpenMP, PVM | Yes | Yes, with native Python binding |
| PBS Pro | C/Python | OS authentication, Munge | | | Any, e.g. NFS, Lustre, GPFS, AFS | Limited availability | Heterogeneous | Yes | Yes | Fully configurable | Yes | Tested ~50,000 | Millions | Yes | MPI, OpenMP | Yes | Yes |
| OpenLava | C/C++ | OS authentication | None | | NFS | | Heterogeneous Linux | Yes | Yes | Configurable | Yes | | | Yes, supports preemption based on priority | Yes | Yes | No |
| Slurm | C | Munge, none, Kerberos | | | | | Heterogeneous | Yes | Yes | Multifactor fair-share | Yes | Tested 120k | Tested 100k | No | Yes | Yes | PySlurm |
| Spectrum LSF | C/C++ | Multiple: OS authentication/Kerberos | Optional | Optional | Any: GPFS/Spectrum Scale, NFS, SMB | Any: GPFS/Spectrum Scale, NFS, SMB | Heterogeneous; HW- and OS-agnostic (AIX, Linux or Windows) | Policy-based; no queue-to-compute-node binding | Policy-based; no queue-to-compute-group binding | Batch, interactive, checkpointing, parallel and combinations | Yes, and GPU-aware (GPU license free) | >9,000 compute hosts | >4 million jobs a day | Yes; supports preemption based on priority and checkpointing/resume | Yes, e.g. parallel submissions for job collaboration over MPI | Yes, with support for user-, kernel- or library-level checkpointing environments | Yes |
| Torque | C | SSH, Munge | | | None, any | | Heterogeneous | Yes | Yes | Programmable | Yes | Tested | Tested | Yes | Yes | Yes | Yes |
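
As a concrete illustration of how the job schedulers in this table are driven, the sketch below submits a batch job to Slurm by generating a script and calling sbatch through Python's standard library. It assumes a working Slurm installation with sbatch on PATH; the job name, resource requests, and time limit are placeholder values, not recommendations.

```python
# Illustrative sketch: submitting a batch job to Slurm from Python using only
# the standard library. Assumes `sbatch` is available on PATH; all resource
# values below are placeholders.
import subprocess
import tempfile

script = """#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
srun hostname
"""

with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(script)
    path = f.name

# On success sbatch prints e.g. "Submitted batch job 12345".
result = subprocess.run(["sbatch", path], capture_output=True, text=True)
print(result.stdout or result.stderr)
```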

Table explanation

See also

Related Research Articles

Grid computing is the use of widely distributed computer resources to reach a common goal. A computing grid can be thought of as a distributed system with non-interactive workloads that involve many files. Grid computing is distinguished from conventional high-performance computing systems such as cluster computing in that grid computers have each node set to perform a different task/application. Grid computers also tend to be more heterogeneous and geographically dispersed than cluster computers. Although a single grid can be dedicated to a particular application, commonly a grid is used for a variety of purposes. Grids are often constructed with general-purpose grid middleware software libraries. Grid sizes can be quite large.

Scalability is the property of a system to handle a growing amount of work. One definition for software systems specifies that this may be done by adding resources to the system.

A shared-nothing architecture (SN) is a distributed computing architecture in which each update request is satisfied by a single node in a computer cluster. The intent is to eliminate contention among nodes. Nodes do not share the same memory or storage.
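
A minimal sketch of the idea, not any particular product's implementation: requests are routed by hashing the record key, so each update is satisfied by exactly one node, and the per-node stores never contend. The node names and storage layout are invented for the example.

```python
# Toy shared-nothing routing: hash the record key to pick the single node
# that owns the corresponding partition; nodes share no memory or storage.
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names

def owner(key: str) -> str:
    """Map a record key to the one node that owns its partition."""
    digest = hashlib.sha256(key.encode()).digest()
    return NODES[int.from_bytes(digest[:4], "big") % len(NODES)]

# Each node keeps its own private store; nothing is shared between them.
stores = {node: {} for node in NODES}

def put(key: str, value: str) -> None:
    stores[owner(key)][key] = value   # exactly one node satisfies the update

put("user:42", "alice")
print(owner("user:42"), stores[owner("user:42")])
```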

MOSIX is a proprietary distributed operating system. Although early versions were based on older UNIX systems, since 1999 it focuses on Linux clusters and grids. In a MOSIX cluster/grid there is no need to modify or to link applications with any library, to copy files or login to remote nodes, or even to assign processes to different nodes – it is all done automatically, like in an SMP.

In distributed computing, a single system image (SSI) cluster is a cluster of machines that appears to be one single system. The concept is often considered synonymous with that of a distributed operating system, but a single image may be presented for more limited purposes, just job scheduling for instance, which may be achieved by means of an additional layer of software over conventional operating system images running on each node. The interest in SSI clusters is based on the perception that they may be simpler to use and administer than more specialized clusters.

Utility computing, or computer utility, is a service provisioning model in which a service provider makes computing resources and infrastructure management available to the customer as needed, and charges for specific usage rather than a flat rate. Like other types of on-demand computing, the utility model seeks to maximize the efficient use of resources and/or minimize associated costs. Utility computing is the packaging of system resources, such as computation, storage, and services, as a metered service. This model has the advantage of low or no initial cost to acquire computer resources; instead, resources are essentially rented.
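
The arithmetic behind metered pricing is simple; the toy calculation below contrasts a usage-based bill with a flat rate. All rates and usage figures are invented for illustration.

```python
# Illustrative arithmetic only: metered (utility) pricing vs. a flat rate.
CPU_HOUR_RATE = 0.05      # $ per CPU-hour, hypothetical
STORAGE_GB_MONTH = 0.02   # $ per GB-month, hypothetical

cpu_hours, storage_gb = 1200, 500
metered = cpu_hours * CPU_HOUR_RATE + storage_gb * STORAGE_GB_MONTH
print(f"metered bill: ${metered:.2f}")   # $70.00, vs. a fixed monthly fee
```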

High-availability clusters are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability software to harness redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.
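
The failover loop described above can be sketched in a few lines. This is a toy simulation, not real HA software: the heartbeat probe, node preparation, and application start are stand-in functions that only print what a real cluster manager would do.

```python
# Toy failover monitor: detect a failed heartbeat, prepare the standby node
# (mount file systems, configure network), and restart the application there
# without administrative intervention.
import random
import time

def check_heartbeat(node: str) -> bool:
    """Stand-in for a real heartbeat probe; randomly fails for the demo."""
    return random.random() > 0.3

def prepare_node(node: str) -> None:
    print(f"{node}: mounting file systems, configuring network")

def start_app(node: str, app: str) -> None:
    print(f"{node}: starting {app}")

def monitor(app: str, nodes: list[str], rounds: int = 5) -> None:
    active = nodes[0]
    start_app(active, app)
    for _ in range(rounds):
        if not check_heartbeat(active):            # fault detected
            standby = next(n for n in nodes if n != active)
            prepare_node(standby)                  # configure the node first
            start_app(standby, app)                # failover, no admin action
            active = standby
        time.sleep(0.1)                            # polling interval

monitor("webserver", ["node1", "node2"])
```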

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
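
The MapReduce model that Hadoop implements can be illustrated with a single-process toy: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. Hadoop runs these same phases distributed across a cluster with automatic fault handling; the word count below is only a sketch of the programming model, not Hadoop's API.

```python
# Toy, single-process imitation of the MapReduce programming model.
from collections import defaultdict

def map_phase(doc: str):
    for word in doc.split():
        yield word, 1                      # map: emit (word, 1) pairs

def reduce_phase(word: str, counts: list) -> tuple:
    return word, sum(counts)               # reduce: aggregate per key

docs = ["the cat sat", "the cat ran"]
shuffled = defaultdict(list)               # shuffle: group values by key
for doc in docs:
    for word, count in map_phase(doc):
        shuffled[word].append(count)

print(dict(reduce_phase(w, c) for w, c in shuffled.items()))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```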

A grid file system is a computer file system whose goal is improved reliability and availability by taking advantage of many smaller file storage areas.

Within cluster and parallel computing, a cluster manager is usually backend graphical user interface (GUI) or command-line interface (CLI) software that runs on a set of cluster nodes that it manages. The cluster manager works together with a cluster management agent; these agents run on each node of the cluster to manage and configure services, a set of services, or the complete cluster server itself. In some cases the cluster manager is mostly used to dispatch work for the cluster to perform; here a subset of the cluster manager can be a remote desktop application that is used not for configuration but just to send work to the cluster and collect the results. In other cases the cluster manager is more concerned with availability and load balancing than with computational or service-specific clusters.
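
A toy sketch of the dispatch role described above, using threads to stand in for per-node agents; the node names and the "work" performed are invented for illustration and do not reflect any specific cluster manager's API.

```python
# Toy manager/agent dispatch: the manager fills a work queue, per-node agents
# drain it concurrently, and results flow back through a result queue.
from queue import Queue, Empty
from threading import Thread

def agent(node: str, work: Queue, results: Queue) -> None:
    """Per-node agent: pull tasks until the queue is drained."""
    while True:
        try:
            task = work.get_nowait()
        except Empty:
            return
        results.put((node, task, task ** 2))   # squaring stands in for work

work, results = Queue(), Queue()
for task in range(6):
    work.put(task)

agents = [Thread(target=agent, args=(f"node{i}", work, results))
          for i in range(3)]
for t in agents:
    t.start()
for t in agents:
    t.join()
while not results.empty():
    print(results.get())   # (node, task, result) triples
```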

The China National Grid (CNGrid) is the Chinese national high performance computing network supported by 863 Program.

The Base One Foundation Component Library (BFC) is a rapid application development toolkit for building secure, fault-tolerant, database applications on Windows and ASP.NET. In conjunction with Microsoft's Visual Studio integrated development environment, BFC provides a general-purpose web application framework for working with databases from Microsoft, Oracle, IBM, Sybase, and MySQL, running under Windows, Linux/Unix, or IBM iSeries or z/OS. BFC includes facilities for distributed computing, batch processing, queuing, and database command scripting, and these run under Windows or Linux with Wine.

Computer cluster: Set of computers configured in a distributed computing system

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. The newest manifestation of cluster computing is cloud computing.

Eucalyptus is paid and open-source software for building Amazon Web Services (AWS)-compatible private and hybrid cloud computing environments, originally developed by the company Eucalyptus Systems. Eucalyptus is an acronym for Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems. It enables pooling compute, storage, and network resources that can be dynamically scaled up or down as application workloads change. Mårten Mickos was the CEO of Eucalyptus. In September 2014, Eucalyptus was acquired by Hewlett-Packard and then maintained by DXC Technology. After DXC stopped developing the product in late 2017, AppScale Systems forked the code and started supporting Eucalyptus customers.

gLite: Grid computing software

gLite is a middleware computer software project for grid computing used by the CERN LHC experiments and other scientific domains. It was implemented by collaborative efforts of more than 80 people in 12 different academic and industrial research centers in Europe. gLite provides a framework for building applications tapping into distributed computing and storage resources across the Internet. The gLite services were adopted by more than 250 computing centres, and used by more than 15000 researchers in Europe and around the world.

In computer science, memory virtualization decouples volatile random access memory (RAM) resources from individual systems in the data center, and then aggregates those resources into a virtualized memory pool available to any computer in the cluster. The memory pool is accessed by the operating system or applications running on top of the operating system. The distributed memory pool can then be utilized as a high-speed cache, a messaging layer, or a large, shared memory resource for a CPU or a GPU application.
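
A toy model of the pooling idea: each node contributes a slice of RAM, and clients see one keyed pool whose capacity is the sum of the contributions. Real systems expose the pool over a network as a cache, messaging layer, or shared-memory region; the class below is purely illustrative and keeps everything in one process.

```python
# Toy virtualized memory pool: aggregate per-node RAM contributions into one
# keyed store with a combined capacity; placement policy is deliberately
# elided, as it varies between real systems.
class MemoryPool:
    def __init__(self, node_mb: dict):
        self.node_mb = node_mb              # RAM contributed per node, in MB
        self.data = {}

    @property
    def capacity_mb(self) -> int:
        return sum(self.node_mb.values())   # aggregate of all contributions

    def put(self, key: str, value: bytes) -> None:
        self.data[key] = value              # real systems pick a target node

    def get(self, key: str):
        return self.data.get(key)

pool = MemoryPool({"node1": 4096, "node2": 8192, "node3": 4096})
pool.put("frame:17", b"pixel data")
print(pool.capacity_mb, pool.get("frame:17"))   # 16384 b'pixel data'
```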

A distributed operating system is system software over a collection of independent, networked, communicating, and physically separate computational nodes. It handles jobs which are serviced by multiple CPUs. Each individual node holds a specific software subset of the global aggregate operating system. Each subset is a composite of two distinct service provisioners. The first is a ubiquitous minimal kernel, or microkernel, that directly controls that node's hardware. The second is a higher-level collection of system management components that coordinate the node's individual and collaborative activities. These components abstract microkernel functions and support user applications.

Message passing in computer clusters: Aspect of computer clusters

Message passing is an inherent element of all computer clusters. All computer clusters, ranging from homemade Beowulfs to some of the fastest supercomputers in the world, rely on message passing to coordinate the activities of the many nodes they encompass. Message passing in computer clusters built with commodity servers and switches is used by virtually every internet service.
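
A minimal point-to-point example of the message passing described above, using the mpi4py bindings (assuming an MPI implementation and mpi4py are installed); it would be launched with something like `mpirun -n 2 python demo.py`.

```python
# Minimal MPI point-to-point message passing with mpi4py: rank 0 sends a
# message, rank 1 receives it. This is the coordination primitive clusters
# rely on, whatever their scale.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"step": 1, "payload": "hello"}, dest=1, tag=0)  # node 0 -> 1
elif rank == 1:
    msg = comm.recv(source=0, tag=0)
    print(f"rank 1 received: {msg}")
```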

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system.
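
The chunking scheme described above can be sketched as follows: a file is split into fixed-size chunks and each chunk is assigned to a different machine so applications can operate on the parts in parallel. The 64 MiB chunk size and round-robin placement are illustrative assumptions, not any particular file system's policy.

```python
# Toy chunk placement: split a file of a given size into fixed-size chunks
# and spread the chunks round-robin across machines.
CHUNK_SIZE = 64 * 1024 * 1024          # 64 MiB, an assumed chunk size

def place_chunks(file_size: int, machines: list) -> list:
    """Return (chunk_index, machine) pairs, round-robin over machines."""
    n_chunks = -(-file_size // CHUNK_SIZE)          # ceiling division
    return [(i, machines[i % len(machines)]) for i in range(n_chunks)]

for idx, host in place_chunks(200 * 1024 * 1024, ["m1", "m2", "m3"]):
    print(f"chunk {idx} -> {host}")
# chunk 0 -> m1, chunk 1 -> m2, chunk 2 -> m3, chunk 3 -> m1
```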

Computation offloading is the transfer of resource-intensive computational tasks to a separate processor, such as a hardware accelerator, or an external platform, such as a cluster, grid, or cloud. Offloading to a coprocessor can be used to accelerate applications including image rendering and mathematical calculations. Offloading computing to an external platform over a network can provide computing power and overcome the hardware limitations of a device, such as limited computational power, storage, and energy.
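
A small offloading sketch using only Python's standard library: a client ships a task to an external worker over XML-RPC and receives the result. The host, port, and the stand-in workload are placeholders for a real accelerator, cluster, or cloud endpoint.

```python
# Toy computation offloading over XML-RPC: the "remote" worker runs in a
# background thread here, but the client-side pattern is the same for a
# genuinely external platform.
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def heavy_compute(n: int) -> int:
    return sum(i * i for i in range(n))     # stand-in for an expensive job

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(heavy_compute)
threading.Thread(target=server.serve_forever, daemon=True).start()

proxy = ServerProxy("http://localhost:8000")
print(proxy.heavy_compute(1_000))           # work runs on the worker process
```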

References

  1. "Release 3.7.2".
  2. "Rocks 7.0 is Released". 1 December 2017. Retrieved 17 November 2022.