Hot spare

A hot spare, warm spare, or hot standby is used as a failover mechanism to provide reliability in system configurations. The hot spare is active and connected as part of a working system. When a key component fails, the hot spare is switched into operation. More generally, a hot standby can refer to any device or system that is held in readiness to overcome an otherwise significant start-up delay.

Examples

Examples of hot spares are components such as A/V switches, computers, network printers, and hard disks. The equipment is powered on, or considered "hot," but is not actively functioning in (i.e. used by) the system.

Electrical generators may be held on hot standby, or a steam locomotive may be held at the shed fired up (literally hot), ready to replace an engine in service should it fail.

Explanation

In designing a reliable system, it is recognized that there will be failures. At the extreme, a complete system can be duplicated and kept up to date, so that if the primary system fails, the secondary system can be switched in with little or no interruption. More often, a hot spare is a single vital component without which the entire system would fail. The spare component is integrated into the system in such a way that, in the event of a problem, the system can be altered to use the spare component. This may be done automatically or manually, but in either case it is normal to have some means of error detection. A hot spare does not necessarily give 100% availability or protect against temporary loss of the system during the switching process; it is designed to significantly reduce the time that the system is unavailable.
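The switching behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference design: the `Component` and `FailoverController` names and the `is_healthy` probe are hypothetical stand-ins for whatever error detection a real system provides.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Component:
    """Hypothetical component exposing a health probe (the error detection)."""
    name: str
    is_healthy: Callable[[], bool]


@dataclass
class FailoverController:
    """Keeps using the active component until its health probe reports a fault."""
    active: Component
    spare: Component
    events: List[str] = field(default_factory=list)

    def check(self) -> Component:
        # Error detection: probe the component currently in service.
        if not self.active.is_healthy():
            if not self.spare.is_healthy():
                raise RuntimeError("active component failed and no healthy spare is available")
            self.events.append(f"failover: {self.active.name} -> {self.spare.name}")
            # Switch the spare into operation. The failed unit takes the spare
            # slot and stays unusable until it is repaired.
            self.active, self.spare = self.spare, self.active
        return self.active


if __name__ == "__main__":
    health = {"primary": True, "spare": True}
    controller = FailoverController(
        active=Component("primary", lambda: health["primary"]),
        spare=Component("spare", lambda: health["spare"]),
    )
    print(controller.check().name)   # primary is still in service
    health["primary"] = False        # simulate a failure of the key component
    print(controller.check().name)   # the spare has been switched in
    print(controller.events)
```

In this sketch the switch happens only when `check` is called, so the polling interval bounds how long a failure goes unnoticed. That interval is one reason a hot spare reduces downtime rather than eliminating it.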

Hot standby may have a slightly different connotation from hot spare: it describes a state of being active but not productive, rather than an object. For example, in a national power grid, the supply of power needs to be balanced against demand over the short term. It can take many hours to bring a coal-fired power station up to productive temperatures. To allow for load balancing, generator turbines may be kept running with the generators switched off so that, as peaks of demand occur, the generators can rapidly be switched on to balance the load. This state of being ready to run is known as hot standby. The practice is not a modern one: steam train operators might hold a spare steam engine at a terminus fired up, as starting an engine from cold would take a significant amount of time.

The spare may be a similar component or system, or it may be a system of reduced performance, designed to cope for the time needed to repair and recover the original component. In high-availability systems, it is common to design so that not only is there a spare that can quickly be switched in, but the failed component can also be repaired or replaced without stopping the system; this is known as hot swapping. Alternatively, the probability of a second failure may be considered low, in which case the system is designed simply to allow operation to continue until a suitable maintenance period. The appropriate solution is normally determined by balancing the cost of implementing the availability against the likelihood of a problem and the severity of that problem.

There are two types of hot standby:
1. Hot standby master-slave
2. Hot standby in sharing mode

Computer usage

A hot spare disk is a disk or group of disks used to replace a failing or failed disk in a RAID configuration, automatically or manually depending on the hot spare policy. The hot spare disk reduces the mean time to recovery (MTTR) for the RAID redundancy group, thus reducing the probability of a second disk failure and the resultant data loss that would occur in any singly redundant RAID (e.g., RAID-1, RAID-5, RAID-10). Typically, a hot spare is available to replace any of a number of different disks. Systems employing a hot spare normally require a redundant group, so that the data can be regenerated onto the spare disk while the system keeps running. During this rebuild the system is exposed to data loss from a subsequent failure, so automatic switching to a spare disk reduces the time of exposure to that risk compared to manual discovery and replacement.
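The effect of a shorter exposure window can be illustrated with a rough calculation. The sketch below assumes independent disks with a constant failure rate (an exponential model) and treats the failure of any remaining disk as fatal, which fits a singly redundant RAID-5 group but overstates the risk for mirrored layouts. The function name and all figures (MTBF, rebuild time, discovery delay) are hypothetical and chosen only for illustration.

```python
import math


def second_failure_probability(remaining_disks: int,
                               exposure_hours: float,
                               disk_mtbf_hours: float) -> float:
    """Probability that at least one remaining disk fails during the exposure window.

    Assumes independent disks with a constant failure rate of 1 / MTBF each.
    """
    combined_rate = remaining_disks / disk_mtbf_hours
    # 1 - exp(-rate * t), written with expm1 for accuracy at small probabilities.
    return -math.expm1(-combined_rate * exposure_hours)


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    disk_mtbf_hours = 1_000_000.0
    remaining_disks = 7          # an 8-disk group after one failure
    rebuild_hours = 10.0         # time to regenerate data onto the spare
    discovery_hours = 48.0       # extra delay when a person must notice and swap the disk

    p_hot_spare = second_failure_probability(
        remaining_disks, rebuild_hours, disk_mtbf_hours)
    p_manual = second_failure_probability(
        remaining_disks, discovery_hours + rebuild_hours, disk_mtbf_hours)

    print(f"with hot spare:      {p_hot_spare:.6%}")
    print(f"manual replacement:  {p_manual:.6%}")
```

For small probabilities the risk grows almost linearly with the exposure time, so removing the discovery delay cuts the risk roughly in proportion to the hours saved.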

The concept of hot spares is not limited to hardware; software systems can also be held in a state of readiness. For example, a database server may have a software copy on hot standby, possibly even on the same machine, to cope with the various factors that make a database unreliable, such as disk failure, poorly written queries, or database software errors.

Hot standby operation in railway signalling

In hot standby operation, at least two units of the same type are powered up, receive the same set of inputs, perform identical computations, and produce identical outputs in a nearly synchronous manner. The outputs are typically physical outputs (individual ON/OFF digital signals or analog signals) or serial data messages wrapped in suitable protocols, depending upon their intended use. Outputs from only one unit (designated as the master or on-line unit via application logic) are used to control external devices (such as switches, signals, and on-board propulsion/braking control devices) or simply to provide displays. The other unit is a hot standby or hot spare unit, ready to take over if the master unit fails. When the master unit fails, an automatic failover to the hot spare occurs within a very short time, and the outputs from the hot spare, now the master unit, are delivered to the controlled devices and displays. The controlled devices and displays may experience a short blip or disturbance during the failover time. However, they can be designed to tolerate or ignore such disturbances so that overall system operation is not affected.
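The arrangement above can be sketched as two units fed identical inputs on every cycle, with only the master's output forwarded. The class names are hypothetical, a failure is modelled simply as a unit producing no output, and the failover completes within the same cycle, so the short disturbance a real system may see is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Unit:
    """Hypothetical processing unit; a failed unit produces no output."""
    name: str
    compute: Callable[[Sequence[int]], int]
    failed: bool = False

    def step(self, inputs: Sequence[int]) -> Optional[int]:
        return None if self.failed else self.compute(inputs)


class HotStandbyPair:
    """Feeds both units the same inputs and forwards only the master's output."""

    def __init__(self, master: Unit, standby: Unit) -> None:
        self.master = master
        self.standby = standby

    def cycle(self, inputs: Sequence[int]) -> Optional[int]:
        # Both units run the identical computation on every cycle, so the
        # standby is always ready to take over.
        master_out = self.master.step(inputs)
        standby_out = self.standby.step(inputs)
        if master_out is None and standby_out is not None:
            # Automatic failover: the standby becomes the master and its
            # output is delivered to the controlled devices.
            self.master, self.standby = self.standby, self.master
            return standby_out
        return master_out


if __name__ == "__main__":
    # Toy application logic: drive one ON/OFF output from the parity of the inputs.
    def parity(inputs: Sequence[int]) -> int:
        return sum(inputs) % 2

    pair = HotStandbyPair(Unit("A", parity), Unit("B", parity))
    print(pair.master.name, pair.cycle([1, 0, 0]))
    pair.master.failed = True    # simulate a failure of the on-line unit
    print(pair.master.name, pair.cycle([1, 1, 0]))
    print(pair.master.name)      # the former standby is now the master
```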

Hot standby operation of vacuum tubes

In vacuum tube equipment, hot standby means that a device, or section of a device, that may need to be activated instantly is kept with its vacuum tubes (pre-)heated but with the anode voltage supply switched off. Operating tubes in this way causes normal cathode coatings to fail prematurely. [1]

References