Failover

Last updated

Failover is switching to a redundant or standby computer server, system, hardware component or network upon the failure or abnormal termination of the previously active application, [1] server, system, hardware component, or network in a computer network. Failover and switchover are essentially the same operation, except that failover is automatic and usually operates without warning, while switchover requires human intervention.

Contents

4G cellular failover for network resiliency Duckfone.png
4G cellular failover for network resiliency

Systems designers usually provide failover capability in servers, systems or networks requiring near-continuous availability and a high degree of reliability.

At the server level, failover automation usually uses a "heartbeat" system that connects two servers, either through using a separate cable (for example, RS-232 serial ports/cable) or a network connection. In the most common design, as long as a regular "pulse" or "heartbeat" continues between the main server and the second server, the second server will not bring its systems online; however a few systems actively use all servers and can failover their work to remaining servers after a failure. There may also be a third "spare parts" server that has running spare components for "hot" switching to prevent downtime. The second server takes over the work of the first as soon as it detects an alteration in the "heartbeat" of the first machine. Some systems have the ability to send a notification of failover.

Certain systems, intentionally, do not failover entirely automatically, but require human intervention. This "automated with manual approval" configuration runs automatically once a human has approved the failover.

Failback is the process of restoring a system, component, or service previously in a state of failure back to its original, working state, and having the standby system go from functioning back to standby.

The use of virtualization software has allowed failover practices to become less reliant on physical hardware through the process referred to as migration in which a running virtual machine is moved from one physical host to another, with little or no disruption in service.

Failover and Failback technology are also regularly used in the Microsoft SQL Server database, in which SQL Server Failover Cluster Instance (FCI) is installed/configured on top of the Windows Server failover Cluster (WSFC). The SQL Server groups and resources running on WSFC can manually be failover to the second node for any planned maintenance on the first node OR automatically failover to the second node in case of any issues on the first node. In the same way, a failback operation can be performed to the first node once the issue is resolved or maintenance is done on it.

History

The term "failover", although probably in use by engineers much earlier, can be found in a 1962 declassified NASA report. [2] The term "switchover" can be found in the 1950s [3] when describing '"Hot" and "Cold" Standby Systems', with the current meaning of immediate switchover to a running system (hot) and delayed switchover to a system that needs starting (cold). A conference proceedings from 1957 describes computer systems with both Emergency Switchover (i.e. failover) and Scheduled Failover (for maintenance). [4]

See also

Related Research Articles

<span class="mw-page-title-main">Beowulf cluster</span> Type of computing cluster

A Beowulf cluster is a computer cluster of what are normally identical, commodity-grade computers networked into a small local area network with libraries and programs installed which allow processing to be shared among them. The result is a high-performance parallel computing cluster from inexpensive personal computer hardware.

Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, 911 systems, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss. The company was founded by Jimmy Treybig in 1974 in Cupertino, California. It remained independent until 1997, when it became a server division within Compaq. It is now a server division within Hewlett Packard Enterprise, following Hewlett-Packard's acquisition of Compaq and the split of Hewlett-Packard into HP Inc. and Hewlett Packard Enterprise.

MySQL Cluster is a technology providing shared-nothing clustering and auto-sharding for the MySQL database management system. It is designed to provide high availability and high throughput with low latency, while allowing for near linear scalability. MySQL Cluster is implemented through the NDB or NDBCLUSTER storage engine for MySQL.

Network Load Balancing Services (NLBS) is a Microsoft implementation of clustering and load balancing that is intended to provide high availability and high reliability, as well as high scalability. NLBS is intended for applications with relatively small data sets that rarely change, and do not have long-running in-memory states. These types of applications are called stateless applications, and typically include Web, File Transfer Protocol (FTP), and virtual private networking (VPN) servers. Every client request to a stateless application is a separate transaction, so it is possible to distribute the requests among multiple servers to balance the load. One attractive feature of NLBS is that all servers in a cluster monitor each other with a heartbeat signal, so there is no single point of failure.

Microsoft Cluster Server (MSCS) is a computer program that allows server computers to work together as a computer cluster, to provide failover and increased availability of applications, or parallel calculating power in case of high-performance computing (HPC) clusters.

High-availability clusters are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability software to harness redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.

<span class="mw-page-title-main">Solaris Cluster</span> High-availability cluster software

Oracle Solaris Cluster is a high-availability cluster software product for Solaris, originally created by Sun Microsystems, which was acquired by Oracle Corporation in 2010. It is used to improve the availability of software services such as databases, file sharing on a network, electronic commerce websites, or other applications. Sun Cluster operates by having redundant computers or nodes where one or more computers continue to provide service if another fails. Nodes may be located in the same data center or on different continents.

Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

A hot spare or warm spare or hot standby is used as a failover mechanism to provide reliability in system configurations. The hot spare is active and connected as part of a working system. When a key component fails, the hot spare is switched into operation. More generally, a hot standby can be used to refer to any device or system that is held in readiness to overcome an otherwise significant start-up delay.

The IBM SAN Volume Controller (SVC) is a block storage virtualization appliance that belongs to the IBM System Storage product family. SVC implements an indirection, or "virtualization", layer in a Fibre Channel storage area network (SAN).

The software which Oracle Corporation markets as Oracle Data Guard forms an extension to the Oracle relational database management system (RDBMS). It aids in establishing and maintaining secondary standby databases as alternative/supplementary repositories to production primary databases.

Switchover is the manual switch from one system to a redundant or standby computer server, system, or network upon the failure or abnormal termination of the previously active server, system, or network, or to perform system maintenance, such as installing patches, and upgrading software or hardware.

OpenSAF is an open-source service-orchestration system for automating computer application deployment, scaling, and management. OpenSAF is consistent with, and expands upon, Service Availability Forum (SAF) and SCOPE Alliance standards.

<span class="mw-page-title-main">Computer cluster</span> Set of computers configured in a distributed computing system

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.

A cluster-aware application is a software application designed to call cluster APIs in order to determine its running state, in case a manual failover is triggered between cluster nodes for planned technical maintenance, or an automatic failover is required, if a computing cluster node encounters hardware or software failure, to maintain business continuity. A cluster-aware application may be capable of failing over LAN or WAN.

CUBRID ( "cube-rid") is an open-source SQL-based relational database management system (RDBMS) with object extensions developed by CUBRID Corp. for OLTP. The name CUBRID is a combination of the two words cube and bridge, cube standing for a space for data and bridge standing for data bridge.

Device Mapper Multipath Input Output often shortened to DM-Multipathing and abbreviated as DM-MPIO provides input-output (I/O) fail-over and load-balancing by using multipath I/O within Linux for block devices. By utilizing device-mapper, the multipathd daemon provides the host-side logic to use multiple paths of a redundant network to provide continuous availability and higher-bandwidth connectivity between the host server and the block-level device. DM-MPIO handles the rerouting of block I/O to an alternate path in the event of a path failure. DM-MPIO can also balance the I/O load across all of the available paths that are typically utilized in Fibre Channel (FC) and iSCSI SAN environments. DM-MPIO is based on the device mapper, which provides the basic framework that maps one block device onto another.

HP ConvergedSystem is a portfolio of system-based products from Hewlett-Packard (HP) that integrates preconfigured IT components into systems for virtualization, cloud computing, big data, collaboration, converged management, and client virtualization. Composed of servers, storage, networking, and integrated software and services, the systems are designed to address the cost and complexity of data center operations and maintenance by pulling the IT components together into a single resource pool so they are easier to manage and faster to deploy. Where previously it would take three to six months from the time of order to get a system up and running, it now reportedly takes as few as 20 days with the HP ConvergedSystem.

In computer science, a heartbeat is a periodic signal generated by hardware or software to indicate normal operation or to synchronize other parts of a computer system. Heartbeat mechanism is one of the common techniques in mission critical systems for providing high availability and fault tolerance of network services by detecting the network or systems failures of nodes or daemons which belongs to a network cluster—administered by a master server—for the purpose of automatic adaptation and rebalancing of the system by using the remaining redundant nodes on the cluster to take over the load of failed nodes for providing constant services. Usually a heartbeat is sent between machines at a regular interval in the order of seconds; a heartbeat message. If the endpoint does not receive a heartbeat for a time—usually a few heartbeat intervals—the machine that should have sent the heartbeat is assumed to have failed. Heartbeat messages are typically sent non-stop on a periodic or recurring basis from the originator's start-up until the originator's shutdown. When the destination identifies a lack of heartbeat messages during an anticipated arrival period, the destination may determine that the originator has failed, shutdown, or is generally no longer available.

High availability software is software used to ensure that systems are running and available most of the time. High availability is a high percentage of time that the system is functioning. It can be formally defined as *100%. Although the minimum required availability varies by task, systems typically attempt to achieve 99.999% (5-nines) availability. This characteristic is weaker than fault tolerance, which typically seeks to provide 100% availability, albeit with significant price and performance penalties.

References

  1. For application-level failover, see for example Jayaswal, Kailash (2005). "27". Administering Data Centers: Servers, Storage, And Voice Over IP. Wiley-India. p. 364. ISBN   978-81-265-0688-0 . Retrieved 2009-08-07. Although it is impossible to prevent some data loss during an application failover, certain steps can [...] minimize it..
  2. NASA Postlaunch Memorandum Report for Mercury-Atlas, June 15, 1962.
  3. Petroleum Engineer for Management - Volume 31 - Page D-40
  4. Proceedings of the Western Joint Computer Conference, Macmillan 1957