MOSIX

Developer(s): Amnon Barak [1]
Stable release: 4.4.4 / 24 October 2017 [2]
Operating system: Linux
Type: Cluster software
License: Proprietary [3]
Website: www.mosix.cs.huji.ac.il/index.html

MOSIX is a proprietary distributed operating system. [4] Although early versions were based on older UNIX systems, since 1999 it has focused on Linux clusters and grids. In a MOSIX cluster/grid there is no need to modify or link applications with any library, to copy files or log in to remote nodes, or even to assign processes to different nodes; it is all done automatically, as in an SMP system.

History

MOSIX has been researched and developed since 1977 at The Hebrew University of Jerusalem by the research team of Prof. Amnon Barak. So far, ten major versions have been developed. The first version, called MOS (for Multicomputer OS) and developed in 1981–83, was based on Bell Labs' Seventh Edition Unix and ran on a cluster of PDP-11 computers. Later versions were based on Unix System V Release 2 (1987–89) and ran on a cluster of VAX and NS32332-based computers, followed by a BSD/OS-derived version (1991–93) for a cluster of 486/Pentium computers. Since 1999, MOSIX has been tuned to Linux for x86 platforms.

MOSIX2

The second version of MOSIX, called MOSIX2, is compatible with the Linux 2.6 and 3.0 kernels. MOSIX2 is implemented as an OS virtualization layer that provides users and applications with a single-system image and the Linux run-time environment. It allows applications to run on remote nodes as if they were running locally. Users run their regular (sequential and parallel) applications while MOSIX transparently and automatically seeks resources and migrates processes among nodes to improve overall performance.

MOSIX2 can manage a cluster and a multicluster (grid) as well as workstations and other shared resources. Flexible management of a grid allows owners of clusters to share their computational resources, while still preserving their autonomy over their own clusters and their ability to disconnect their nodes from the grid at any time, without disrupting already running programs.

A MOSIX grid can extend indefinitely as long as there is trust between its cluster owners. This must include guarantees that guest applications will not be modified while running in remote clusters and that no hostile computers can be connected to the local network. Nowadays these requirements are standard within clusters and organizational grids.

MOSIX2 can run in native mode or in a virtual machine (VM). In native mode, performance is better, but it requires modifications to the base Linux kernel, whereas a VM can run on top of any unmodified operating system that supports virtualization, including Microsoft Windows, Linux and Mac OS X.

MOSIX2 is most suitable for running compute-intensive applications with low to moderate amounts of input/output (I/O). Tests of MOSIX2 show that the performance of several such applications over a 1 Gbit/s campus grid is nearly identical to that of a single cluster.[citation needed]

Main features

  • Provides aspects of a single-system image:
    • Users can log in on any node and do not need to know where their programs run.
    • No need to modify or link applications with special libraries.
    • No need to copy files to remote nodes.
  • Automatic resource discovery and workload distribution by process migration (see the sketch after this list):
    • Load-balancing.
    • Migrating processes from slower to faster nodes and away from nodes that have run out of free memory.
  • Migratable sockets for direct communication between migrated processes.
  • Secure run time environment (sandbox) for guest processes.
  • Live queuing – queued jobs preserve their full generic Linux environment.
  • Batch jobs.
  • Checkpoint and recovery.
  • Tools: automatic installation and configuration scripts, on-line monitors.
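
The migration policy behind the load-balancing features above can be illustrated with a small model. The following Python sketch picks a single beneficial migration in a toy cluster; the Node fields, the speed-normalized load metric, and the migration test are illustrative assumptions made for exposition, not MOSIX's published algorithm.

    # Toy model of a MOSIX-style load-balancing decision.
    # All names and metrics here are hypothetical simplifications.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        name: str
        speed: float           # relative CPU speed (1.0 = baseline)
        free_mem_mb: int       # currently free memory
        processes: list = field(default_factory=list)

        def load(self) -> float:
            # Normalize by speed so a faster node looks less loaded
            # than a slower one running the same number of processes.
            return len(self.processes) / self.speed

    def pick_migration(nodes, mem_needed_mb=128):
        """Return (process, source, target) for one beneficial
        migration, or None if the cluster is already balanced."""
        donors = [n for n in nodes if n.processes]
        if not donors:
            return None
        src = max(donors, key=Node.load)
        # Candidate targets: not the source, enough free memory.
        targets = [n for n in nodes
                   if n is not src and n.free_mem_mb >= mem_needed_mb]
        if not targets:
            return None
        dst = min(targets, key=Node.load)
        # Migrate only if the target, after taking one process,
        # would still be less loaded than the source is now.
        if dst.load() + 1 / dst.speed >= src.load():
            return None
        return src.processes[-1], src, dst

    cluster = [Node("a", speed=1.0, free_mem_mb=512,
                    processes=["p1", "p2", "p3"]),
               Node("b", speed=2.0, free_mem_mb=2048,
                    processes=["p4"])]
    move = pick_migration(cluster)
    if move:
        proc, src, dst = move
        print(f"migrate {proc}: {src.name} -> {dst.name}")

In this toy run the slow, loaded node "a" sheds one process to the faster node "b"; MOSIX applies decisions of this general kind continuously and transparently, without any application involvement.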

MOSIX for HPC

MOSIX is most suitable for running HPC applications with low to moderate amounts of I/O. It is particularly suitable for:

  • Efficient utilization of grid-wide resources, by automatic resource discovery and load-balancing.[citation needed]
  • Running applications with unpredictable resource requirements or run times.[citation needed]
  • Running long processes, which are automatically sent to grid nodes and migrated back when those nodes are disconnected from the grid (see the checkpoint/restart sketch after this list).[citation needed]
  • Combining nodes of different speeds, by migrating processes among nodes based on their respective speeds, current loads, and available memory.[citation needed]
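
Long-running jobs of this kind are typically paired with the checkpoint-and-recovery feature listed earlier, so that completed work survives node failures and disconnections. The following Python sketch shows a generic application-level checkpoint/restart loop; the file name, state layout, and checkpoint interval are assumptions made for illustration, whereas MOSIX itself checkpoints unmodified processes without such application changes.

    # Generic checkpoint/restart loop for a long-running computation.
    # The checkpoint file name and state layout are hypothetical.
    import os
    import pickle

    CHECKPOINT = "job.ckpt"

    def load_state():
        """Resume from the last checkpoint if one exists."""
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)
        return {"step": 0, "total": 0}

    def save_state(state):
        # Write to a temporary file, then rename atomically, so a
        # crash mid-write cannot corrupt the previous snapshot.
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT)

    def run(steps=1_000_000, ckpt_every=10_000):
        state = load_state()
        while state["step"] < steps:
            state["total"] += state["step"]   # stand-in for real work
            state["step"] += 1
            if state["step"] % ckpt_every == 0:
                save_state(state)
        print("done:", state["total"])

    run()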

MOSIX4

MOSIX4 was released in July 2014. [2] As of version 4, MOSIX does not require kernel patching. [2]

openMosix

After MOSIX became proprietary software in late 2001, Moshe Bar forked the last free version and started the openMosix project on February 10, 2002. [5]

On July 15, 2007, Bar decided to end the openMosix project effective March 1, 2008, claiming that "the increasing power and availability of low cost multi-core processors is rapidly making single-system image (SSI) clustering less of a factor in computing". These plans were reconfirmed in March 2008. [6] The LinuxPMI project is continuing development of the former openMosix code.

Further reading

  • MOSIX4
  • MOSIX2 for Linux 2.6
  • MOSIX for Linux 2.2 & 2.4
  • MOSIX Version 1 book

Notes

  1. "MOSIX Frequently Asked Questions".
  2. "MOSIX Changelog".
  3. www.mosix.cs.huji.ac.il/txt_distributions.html
  4. The MOSIX Distributed Operating System: Load Balancing for UNIX. Lecture Notes in Computer Science, vol. 672. Springer-Verlag, New York, 1993.
  5. "The openMosix Project".
  6. "OpenMosix".
