Single point of failure

Figure: In this diagram the router is a single point of failure for the communication network between computers.

A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working. [1] SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system.


Overview

Systems can be made robust by adding redundancy at every potential SPOF. Redundancy can be achieved at various levels.

Assessing potential SPOFs involves identifying the critical components of a complex system whose malfunction would cause a total system failure. Highly reliable systems should not rely on any such individual component.

For instance, the owner of a small tree care company may own only one wood chipper. If the chipper breaks, they may be unable to complete their current job and may have to cancel future jobs until they can obtain a replacement. The owner may keep spare parts ready to repair the wood chipper in case it fails. At a higher level, they may have a second wood chipper that they can bring to the job site. Finally, at the highest level, they may have enough equipment available to completely replace everything at the work site in the case of multiple failures.

Computing

Fault tolerance in a computer system can be achieved at the internal component level, at the system level (multiple machines), or at the site level (replication).

One would normally deploy a load balancer to ensure high availability for a server cluster at the system level. In a high-availability server cluster, each individual server may attain internal component redundancy by having multiple power supplies, hard drives, and other components. System-level redundancy could be obtained by having spare servers waiting to take on the work of another server if it fails.
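
As a rough sketch of what system-level redundancy looks like from the client's side, the following Python snippet tries a list of redundant replicas in turn, so that no single server is a SPOF for the request. The host names and the "/status" path are hypothetical, and a real deployment would usually put this logic in a load balancer rather than in every client.

```python
import urllib.request
import urllib.error

# Hypothetical redundant replicas of the same service; no single one of
# them is a point of failure because the client can use any of them.
REPLICAS = [
    "http://app-server-1.example.internal:8080",
    "http://app-server-2.example.internal:8080",
    "http://app-server-3.example.internal:8080",
]

def fetch_with_failover(path: str, timeout: float = 2.0) -> bytes:
    """Try each replica in turn and return the first successful response."""
    last_error = None
    for base in REPLICAS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc  # this replica is down; try the next one
    # Only if every replica fails does the request itself fail.
    raise RuntimeError(f"all replicas failed: {last_error}")

# Example use (only meaningful against real servers):
# data = fetch_with_failover("/status")
```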

Since a data center is often a support center for other operations such as business logic, it represents a potential SPOF in itself. Thus, at the site level, the entire cluster may be replicated at another location, where it can be accessed in case the primary location becomes unavailable. This is typically addressed as part of an IT disaster recovery program.

Paul Baran and Donald Davies developed packet switching, a key part of "survivable communications networks". Such networks, including the ARPANET and the Internet, are designed to have no single point of failure. Multiple paths between any two points on the network allow those points to continue communicating with each other, the packets "routing around" damage, even after the failure of any single path or intermediate node.
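
One way to make this concrete is to test, for every node in a network graph, whether removing it disconnects the remaining nodes; any node whose removal does so is a single point of failure. The brute-force Python sketch below does exactly that for an invented star topology like the one in the diagram above, where every computer reaches the others only through one router.

```python
def is_connected(nodes, edges):
    """Breadth-first connectivity check on an undirected graph."""
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, frontier = set(), [next(iter(nodes))]
    while frontier:
        n = frontier.pop()
        if n not in seen:
            seen.add(n)
            frontier.extend(adjacency[n] - seen)
    return seen == set(nodes)

def single_points_of_failure(nodes, edges):
    """Return nodes whose removal disconnects the rest of the network."""
    spofs = []
    for n in nodes:
        remaining = [m for m in nodes if m != n]
        remaining_edges = [(a, b) for a, b in edges if a != n and b != n]
        if not is_connected(remaining, remaining_edges):
            spofs.append(n)
    return spofs

# A star topology: every computer reaches the others only through "router".
nodes = ["router", "pc1", "pc2", "pc3"]
edges = [("router", "pc1"), ("router", "pc2"), ("router", "pc3")]
print(single_points_of_failure(nodes, edges))  # ['router']
```

Adding a second router with links to each computer would make the list empty, which is the property the packet-switched designs above aim for.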

Software engineering

In software engineering, a bottleneck occurs when the capacity of an application or a computer system is limited by a single component. The bottleneck has the lowest throughput of all components on the transaction path.
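
As a minimal illustration with made-up numbers, the end-to-end throughput of a serial transaction path is bounded by its slowest stage:

```python
# Hypothetical throughput of each stage of a transaction path,
# in requests per second.
stage_throughput = {
    "load balancer": 10000,
    "web server": 4000,
    "application server": 1500,
    "database": 600,  # the bottleneck
}

# For stages in series, the whole path can sustain only as much
# traffic as its slowest stage.
end_to_end = min(stage_throughput.values())
bottleneck = min(stage_throughput, key=stage_throughput.get)
print(f"bottleneck: {bottleneck}, end-to-end throughput: {end_to_end} req/s")
```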

Performance engineering

Tracking down bottlenecks, sometimes known as hot spots (sections of the code that execute most frequently, i.e., that have the highest execution count), is called performance analysis. Reduction is usually achieved with the help of specialized tools known as performance analyzers or profilers. The objective is to make those particular sections of code perform as fast as possible to improve overall algorithmic efficiency.
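
As an illustrative sketch rather than a prescribed workflow, Python's standard-library profiler can be used to locate such hot spots; the workload function below is invented for the example.

```python
import cProfile
import pstats

def slow_sum(n):
    """Deliberately naive workload so something shows up as a hot spot."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def main():
    for _ in range(50):
        slow_sum(100_000)

# Profile the run and print the five entries with the most cumulative time.
cProfile.run("main()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(5)
```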

Computer security

A vulnerability or security exploit in just one component can compromise an entire system.

Other fields

The concept of a single point of failure has also been applied to fields outside of engineering, computers, and networking, such as corporate supply chain management [2] and transportation management. [3]

Design structures that create single points of failure include bottlenecks and series circuits (in contrast to parallel circuits).
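
The contrast can be quantified with the standard reliability formulas: components in series must all work, so their reliabilities multiply, while a parallel (redundant) arrangement fails only if every component fails. The sketch below uses made-up component reliabilities.

```python
from math import prod

def series_reliability(reliabilities):
    """A series arrangement works only if every component works."""
    return prod(reliabilities)

def parallel_reliability(reliabilities):
    """A parallel (redundant) arrangement fails only if every component fails."""
    return 1 - prod(1 - r for r in reliabilities)

components = [0.99, 0.99, 0.99]          # hypothetical per-component reliability
print(series_reliability(components))    # ~0.970: worse than any single component
print(parallel_reliability(components))  # ~0.999999: far better than one alone
```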

In transportation, noted recent examples of the concept's application include the Nipigon River Bridge in Canada, where a partial bridge failure in January 2016 entirely severed road traffic between Eastern Canada and Western Canada for several days, because the bridge is located along a portion of the Trans-Canada Highway with no alternative route for vehicles; [4] and the Norwalk River Railroad Bridge in Norwalk, Connecticut, an aging swing bridge that sometimes gets stuck when opening or closing, disrupting rail traffic on the Northeast Corridor line. [3]

The concept of a single point of failure has also been applied to the field of intelligence. Edward Snowden spoke of the dangers of being what he described as "the single point of failure", the sole repository of information. [5]

Life-support systems

Any component of a life-support system that constitutes a single point of failure must be extremely reliable.

Related Research Articles

<span class="mw-page-title-main">Backplane</span> Group of electrical connectors specifically aligned

A backplane or backplane system is a group of electrical connectors in parallel with each other, so that each pin of each connector is linked to the same relative pin of all the other connectors, forming a computer bus. It is used to connect several printed circuit boards together to make up a complete computer system. Backplanes commonly use a printed circuit board, but wire-wrapped backplanes have also been used in minicomputers and high-reliability applications.

<span class="mw-page-title-main">Safety engineering</span> Engineering discipline which assures that engineered systems provide acceptable levels of safety

Safety engineering is an engineering discipline which assures that engineered systems provide acceptable levels of safety. It is strongly related to industrial engineering/systems engineering, and the subset system safety engineering. Safety engineering assures that a life-critical system behaves as needed, even when components fail.

<span class="mw-page-title-main">Server (computing)</span> Computer to access a central resource or service on a network

In computing, a server is a piece of computer hardware or software that provides functionality for other programs or devices, called "clients". This architecture is called the client–server model. Servers can provide various functionalities, often called "services", such as sharing data or resources among multiple clients or performing computations for a client. A single server can serve multiple clients, and a single client can use multiple servers. A client process may run on the same device or may connect over a network to a server on a different device. Typical servers are database servers, file servers, mail servers, print servers, web servers, game servers, and application servers.

RAID is a data storage virtualization technology that combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy, performance improvement, or both. This is in contrast to the previous concept of highly reliable mainframe disk drives referred to as "single large expensive disk" (SLED).

<span class="mw-page-title-main">Load balancing (computing)</span> Set of techniques to improve the distribution of workloads across multiple computing resources

In computing, load balancing is the process of distributing a set of tasks over a set of resources, with the aim of making their overall processing more efficient. Load balancing can optimize the response time and avoid unevenly overloading some compute nodes while other compute nodes are left idle.

Scalability is the property of a system to handle a growing amount of work. One definition for software systems specifies that this may be done by adding resources to the system.

Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, 911 systems, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss. The company was founded by Jimmy Treybig in 1974 in Cupertino, California. It remained independent until 1997, when it became a server division within Compaq. It is now a server division within Hewlett Packard Enterprise, following Hewlett-Packard's acquisition of Compaq and the split of Hewlett-Packard into HP Inc. and Hewlett Packard Enterprise.

In database systems, durability is the ACID property that guarantees that the effects of transactions that have been committed will survive permanently, even in case of failures, including incidents and catastrophic events. For example, if a flight booking reports that a seat has successfully been booked, then the seat will remain booked even if the system crashes.

<span class="mw-page-title-main">Failover</span> Automatic switching from failed computer system to standby computers

Failover is switching to a redundant or standby computer server, system, hardware component or network upon the failure or abnormal termination of the previously active application, server, system, hardware component, or network in a computer network. Failover and switchover are essentially the same operation, except that failover is automatic and usually operates without warning, while switchover requires human intervention.

Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to function without failure. Reliability describes the ability of a system or component to function under stated conditions for a specified period of time. Reliability is closely related to availability, which is typically described as the ability of a component or system to function at a specified moment or interval of time.

<span class="mw-page-title-main">Redundancy (engineering)</span> Duplication of critical components to increase reliability of a system

In engineering and systems theory, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance, such as in the case of GNSS receivers, or multi-threaded computer processing.

Fault tolerance is the ability of a system to maintain proper operation in the event of failures or faults in one or more of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure can lead to total breakdown. Fault tolerance is particularly sought after in high-availability, mission-critical, or even life-critical systems. The ability to maintain functionality when portions of a system break down is referred to as graceful degradation.

High-availability clusters are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability software to harness redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers.

High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.

Chipkill is IBM's trademark for a form of advanced error checking and correcting (ECC) computer memory technology that protects computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of a single memory chip. One simple scheme to perform this function scatters the bits of a Hamming code ECC word across multiple memory chips, such that the failure of any single memory chip will affect only one ECC bit per word. This allows memory contents to be reconstructed despite the complete failure of one chip. Typical implementations use more advanced codes, such as a BCH code, that can correct multiple bits with less overhead.

<span class="mw-page-title-main">Triple modular redundancy</span> Method for increasing reliability

In computing, triple modular redundancy (TMR), sometimes called triple-mode redundancy, is a fault-tolerant form of N-modular redundancy in which three systems perform a process and the results are processed by a majority-voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the fault.
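
As a minimal sketch of the voting step only (real TMR implementations typically vote in hardware), three independently computed results feed a majority voter, which masks a single faulty result:

```python
from collections import Counter

def majority_vote(a, b, c):
    """Return the value produced by at least two of the three modules.

    If all three disagree there is no majority, which a real system
    would have to treat as an unmaskable fault.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one module has failed")
    return value

# One module returns a wrong result; the voter masks the fault.
print(majority_vote(42, 42, 7))  # 42
```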

Redundancy is a form of resilience that ensures system availability in the event of component failure. Components have at least one independent backup component (+1). The level of resilience is referred to as active/passive or standby as backup components do not actively participate within the system during normal operation. The level of transparency during failover is dependent on a specific solution, though degradation to system resilience will occur during failover.

A clustered file system (CFS) is a file system which is shared by being simultaneously mounted on multiple servers. There are several approaches to clustering, most of which do not employ a clustered file system. Clustered file systems can provide features like location-independent addressing and redundancy which improve reliability or reduce the complexity of the other parts of the cluster. Parallel file systems are a type of clustered file system that spread data across multiple storage nodes, usually for redundancy or performance.

<span class="mw-page-title-main">Computer cluster</span> Set of computers configured in a distributed computing system

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. The newest manifestation of cluster computing is cloud computing.

References

  1. Dooley, K. (2002). Designing Large-scale LANs. O'Reilly. p. 31.
  2. Lynch, Gary S. (October 7, 2009). Single Point of Failure: The 10 Essential Laws of Supply Chain Risk Management. Wiley. ISBN 978-0-470-42496-4.
  3. "Crucial, Century-Old, And Sometimes Stuck: Connecticut Bridge Is Key To Northeast Corridor". Connecticut Public Radio, August 8, 2017.
  4. "The Nipigon River Bridge and other Trans-Canada bottlenecks". Global News, January 11, 2016.
  5. "Edward Snowden: the true story behind his NSA leaks". Telegraph.co.uk. Archived from the original on 2022-01-12. Retrieved 2016-12-13.