Single point of failure

In this diagram the router is a single point of failure for the communication network between computers

A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working. [1] SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system.

Overview

Systems can be made robust by adding redundancy at all potential SPOFs. For instance, the owner of a small tree care company may own only one wood chipper. If the chipper breaks, he may be unable to complete his current job and may have to cancel future jobs until he can obtain a replacement.

Redundancy can be achieved at various levels. For instance, the owner of the tree care company may have spare parts ready for the repair of the wood chipper, in case it fails. At a higher level, he may have a second wood chipper that he can bring to the job site. Finally, at the highest level, he may have enough equipment available to completely replace everything at the work site in the case of multiple failures.

The assessment of a potential SPOF involves identifying the critical components of a complex system whose malfunction would cause a total system failure. Highly reliable systems should not rely on any such individual component.
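
The effect can be quantified with a simple reliability calculation, sketched below under the assumption of independent failures, where R_i is the probability that component i works over a given period (compare Lusser's law, listed under See also). A chain of components in series is only as reliable as the product of its parts, while a redundant group fails only if every copy fails.

    % Series system: every component must work.
    R_{\text{series}} = \prod_{i=1}^{n} R_i
    % Redundant (parallel) group of k copies: at least one must work.
    R_{\text{redundant}} = 1 - \prod_{j=1}^{k} (1 - R_j)
    % Worked example with invented numbers: three 0.99-reliable components
    % in series give 0.99^3 \approx 0.970, while duplicating one such
    % component raises its effective reliability to 1 - (1 - 0.99)^2 = 0.9999.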

Computing

In computing, redundancy can be achieved at the internal component level, at the system level (multiple machines), or at the site level (replication).

At the system level, a load balancer is normally deployed to ensure high availability for a server cluster.

In a high-availability server cluster, each individual server may attain internal component redundancy by having multiple power supplies, hard drives, and other components. System-level redundancy can be obtained by having spare servers waiting to take on the work of another server if it fails.
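
As a rough sketch of this failover idea, the following Python fragment tries a list of redundant servers in order and moves on to the next one when a health check fails; the hostnames, health-check URL, and timeout are hypothetical placeholders rather than part of any particular clustering product.

    # Client-side failover across redundant servers (illustrative sketch only;
    # the hostnames and health-check path below are hypothetical).
    import urllib.error
    import urllib.request

    REPLICAS = [
        "http://app-1.example.internal/health",  # primary
        "http://app-2.example.internal/health",  # hot spare
        "http://app-3.example.internal/health",  # second spare
    ]

    def first_healthy_replica(urls, timeout=2):
        """Return the first replica that answers its health check, or None."""
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return url
            except (urllib.error.URLError, OSError):
                continue  # this replica is down or unreachable; try the next one
        return None  # every replica failed: the cluster as a whole is still a SPOF

    if __name__ == "__main__":
        healthy = first_healthy_replica(REPLICAS)
        print("serving from:", healthy or "no replica available")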

Since a data center is often a support center for other operations such as business logic, it represents a potential SPOF in itself. Thus, at the site level, the entire cluster may be replicated at another location, where it can be accessed in case the primary location becomes unavailable. This is typically addressed as part of an IT disaster recovery (resiliency) program.

Paul Baran and Donald Davies developed packet switching, a key part of "survivable communications networks". Such networks, including the ARPANET and the Internet, are designed to have no single point of failure: multiple paths between any two points on the network allow those points to continue communicating, with packets "routing around" damage, even after the failure of any single path or intermediate node.

Network protocols used to prevent single points of failure include:

  • Virtual Router Redundancy Protocol (VRRP), which keeps a default gateway reachable when one router fails
  • Open Shortest Path First (OSPF) and Intermediate System to Intermediate System (IS-IS), link-state routing protocols that recompute routes around failed links and nodes
  • Shortest Path Bridging (IEEE 802.1aq), which provides multiple active paths through an Ethernet network
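
The "routing around damage" behaviour that these designs rely on can be sketched with a toy example in Python: on a small, made-up mesh topology, a breadth-first search still finds a path between two endpoints after a redundant intermediate node fails, but not when the failed node was itself a single point of failure.

    # Toy illustration of routing around a failed node (hypothetical topology).
    from collections import deque

    # A small mesh in which A and D are joined by two disjoint paths,
    # but E is reachable only through D.
    MESH = {
        "A": {"B", "C"},
        "B": {"A", "D"},
        "C": {"A", "D"},
        "D": {"B", "C", "E"},
        "E": {"D"},
    }

    def find_path(graph, src, dst, failed=frozenset()):
        """Breadth-first search that skips failed nodes; returns a path or None."""
        queue = deque([[src]])
        seen = {src}
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == dst:
                return path
            for nxt in graph[node] - set(failed) - seen:
                seen.add(nxt)
                queue.append(path + [nxt])
        return None

    print(find_path(MESH, "A", "D"))                # a two-hop route, e.g. ['A', 'B', 'D']
    print(find_path(MESH, "A", "D", failed={"B"}))  # routes around B: ['A', 'C', 'D']
    print(find_path(MESH, "A", "E", failed={"D"}))  # None: D is a single point of failure for E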

Software engineering

In software engineering, a bottleneck occurs when the capacity of an application or a computer system is severely limited by a single component. The bottleneck has the lowest throughput of all parts of the transaction path.
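
As a small illustration with invented numbers, the end-to-end capacity of a transaction path is bounded by the throughput of its slowest stage:

    # Hypothetical stage capacities (requests per second) along one transaction path.
    stage_throughput = {
        "load balancer": 50_000,
        "web tier": 12_000,
        "database": 1_500,   # lowest throughput: this stage is the bottleneck
        "cache": 80_000,
    }

    bottleneck = min(stage_throughput, key=stage_throughput.get)
    print(f"end-to-end capacity is at most {stage_throughput[bottleneck]} req/s,"
          f" limited by the {bottleneck}")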

Performance engineering

Tracking down bottlenecks (sometimes known as "hot spots": sections of the code that execute most frequently, i.e. have the highest execution count) is called performance analysis. Reduction is usually achieved with the help of specialized tools known as performance analyzers or profilers. The objective is to make those particular sections of code perform as fast as possible, improving overall algorithmic efficiency.
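
For instance, Python's standard-library profiler, cProfile, reports how often each function is called and how much time it accumulates, which is one common way to locate such hot spots; the workload below is invented purely for illustration.

    # Finding hot spots with Python's built-in profiler (illustrative workload).
    import cProfile
    import pstats

    def slow_sum(n):
        # Deliberately naive inner loop: this is the hot spot.
        total = 0
        for i in range(n):
            total += i * i
        return total

    def workload():
        return sum(slow_sum(50_000) for _ in range(200))

    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()

    # Show the functions with the highest cumulative time first;
    # slow_sum and its inner loop dominate the listing.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)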

Computer security

A vulnerability or misconfiguration in just one component can compromise the entire system; for example, if every service trusts a single authentication server, an attacker who compromises that server gains access to all of them.

Other fields

The concept of a single point of failure has also been applied to fields outside of engineering, computers, and networking, such as corporate supply chain management [2] and transportation management. [3]

Design structures that create single points of failure include bottlenecks and series circuits (in contrast to parallel circuits).

In transportation, notable recent examples of the concept's application include the Nipigon River Bridge in Canada, where a partial bridge failure in January 2016 entirely severed road traffic between Eastern Canada and Western Canada for several days, because the bridge lies on a portion of the Trans-Canada Highway with no alternate detour route for vehicles; [4] and the Norwalk River Railroad Bridge in Norwalk, Connecticut, an aging swing bridge that sometimes becomes stuck when opening or closing, disrupting rail traffic on the Northeast Corridor line. [3]

The concept of a single point of failure has also been applied to the field of intelligence. Edward Snowden spoke of the dangers of being what he described as "the single point of failure" – the sole repository of information. [5]

Life support systems

Any component of a life support system that would constitute a single point of failure is required to be extremely reliable.

See also

Concepts

  • Redundancy – Duplication of critical components to increase reliability of a system
  • Bus factor – A measurement of the risk of losing key technical experts
  • Lusser's law – The probability product law of series components

Applications

  • Kill switch – Safety mechanism to quickly shut down a system
  • Reliability engineering – Sub-discipline of systems engineering that emphasizes dependability in the lifecycle management of a product or a system
  • Safety engineering – Engineering discipline which assures that engineered systems provide acceptable levels of safety

In literature

  • Achilles' heel – Critical weakness which can lead to downfall in spite of overall strength
  • Hamartia – Protagonist's error in Greek dramatic theory

Related Research Articles

Backplane – PCB containing connectors for daughterboards, electrically linked pin-by-pin

A backplane is a group of electrical connectors in parallel with each other, so that each pin of each connector is linked to the same relative pin of all the other connectors, forming a computer bus. It is used as a backbone to connect several printed circuit boards together to make up a complete computer system. Backplanes commonly use a printed circuit board, but wire-wrapped backplanes have also been used in minicomputers and high-reliability applications.

Safety engineering – Engineering discipline which assures that engineered systems provide acceptable levels of safety

Safety engineering is an engineering discipline which assures that engineered systems provide acceptable levels of safety. It is strongly related to industrial engineering/systems engineering, and the subset system safety engineering. Safety engineering assures that a life-critical system behaves as needed, even when components fail.

Network topology – Arrangement of the various elements of a computer network; may be depicted physically or logically

Network topology is the arrangement of the elements of a communication network. Network topology can be used to define or describe the arrangement of various types of telecommunication networks, including command and control radio networks, industrial fieldbusses and computer networks.

Server (computing) – Computer or program that provides a central resource or service on a network

In computing, a server is a computer program or a device that provides functionality for other programs or devices, called "clients". This architecture is called the client–server model, and a single overall computation is distributed across multiple processes or devices. Servers can provide various functionalities, often called "services", such as sharing data or resources among multiple clients, or performing computation for a client. A single server can serve multiple clients, and a single client can use multiple servers. A client process may run on the same device or may connect over a network to a server on a different device. Typical servers are database servers, file servers, mail servers, print servers, web servers, game servers, and application servers.

Load balancing (computing) – Set of techniques to improve the distribution of workloads across multiple computing resources

In computing, load balancing refers to the process of distributing a set of tasks over a set of resources, with the aim of making their overall processing more efficient. Load balancing techniques can optimise the response time for each task, avoiding unevenly overloading compute nodes while other compute nodes are left idle.

System on a chip – Type of integrated circuit

A system on chip is an integrated circuit that integrates all or most components of a computer or other electronic system. These components almost always include a central processing unit (CPU), memory, input/output ports and secondary storage – all on a single substrate or microchip, the size of a coin. It must contain digital, analog, mixed-signal, and often radio frequency signal processing functions; otherwise it is considered only an application processor. As they are integrated on a single substrate, SoCs consume much less power and take up much less area than multi-chip designs with equivalent functionality. Because of this, SoCs are very common in the mobile computing and edge computing markets. Systems-on-chip are typically fabricated using metal–oxide–semiconductor (MOS) technology, and are commonly used in embedded systems and the Internet of Things. Higher-performance SoCs are often paired with dedicated and physically separate memory and secondary storage chips, which may be layered on top of the SoC in what is known as a package on package (PoP) configuration.

Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss. The company was founded by Jimmy Treybig in 1974 in Cupertino, California. It remained independent until 1997, when it became a server division within Compaq. It is now a server division within Hewlett Packard Enterprise, following Hewlett-Packard's acquisition of Compaq and the split of Hewlett Packard into HP Inc. and Hewlett Packard Enterprise.

Reliability engineering is a sub-discipline of systems engineering that emphasizes dependability in the lifecycle management of a product. Reliability describes the ability of a system or component to function under stated conditions for a specified period of time. Reliability is closely related to availability, which is typically described as the ability of a component or system to function at a specified moment or interval of time.

Distributed File System (DFS) is a set of client and server services that allow an organization using Microsoft Windows servers to organize many distributed SMB file shares into a distributed file system. DFS has two components to its service: Location transparency and Redundancy. Together, these components improve data availability in the case of failure or heavy load by allowing shares in multiple different locations to be logically grouped under one folder, the "DFS root".

Redundancy (engineering) – Duplication of critical components to increase reliability of a system

In engineering, redundancy is the duplication of critical components or functions of a system with the intention of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance, such as in the case of GNSS receivers, or multi-threaded computer processing.

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability or life-critical systems. The ability of maintaining functionality when portions of a system break down is referred to as graceful degradation.

High-availability clusters are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability software to harness redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers.

High availability (HA) is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.

A grid file system is a computer file system whose goal is improved reliability and availability by taking advantage of many smaller file storage areas.

N+1 redundancy is a form of resilience that ensures system availability in the event of component failure. Components have at least one independent backup component (+1). The level of resilience is referred to as active/passive or standby as backup components do not actively participate within the system during normal operation. The level of transparency during failover is dependent on a specific solution, though degradation to system resilience will occur during failover.

A clustered file system is a file system which is shared by being simultaneously mounted on multiple servers. There are several approaches to clustering, most of which do not employ a clustered file system. Clustered file systems can provide features like location-independent addressing and redundancy which improve reliability or reduce the complexity of the other parts of the cluster. Parallel file systems are a type of clustered file system that spread data across multiple storage nodes, usually for redundancy or performance.

Computer cluster – Group of computers

A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.

Storage area network – Network which provides access to consolidated, block-level data storage

A storage area network (SAN) or storage network is a computer network which provides access to consolidated, block-level data storage. SANs are primarily used to enhance accessibility of storage devices, such as disk arrays and tape libraries, to servers so that the devices appear to the operating system as locally-attached devices. A SAN typically is a dedicated network of storage devices not accessible through the local area network (LAN) by other devices, thereby preventing interference of LAN traffic in data transfer.

A defence in depth uses multi-layered protections, similar to redundant protections. The intention is to create a reliable system using the multiple layers, rather than making any one layer perfectly reliable.

References

  1. Dooley, K. (2002). Designing Large-scale LANs. O'Reilly. p. 31.
  2. Lynch, Gary S. (2009). Single Point of Failure: The 10 Essential Laws of Supply Chain Risk Management. Wiley. ISBN 978-0-470-42496-4.
  3. "Crucial, Century-Old, And Sometimes Stuck: Connecticut Bridge Is Key To Northeast Corridor". Connecticut Public Radio. August 8, 2017.
  4. "The Nipigon River Bridge and other Trans-Canada bottlenecks". Global News. January 11, 2016.
  5. "Edward Snowden: the true story behind his NSA leaks". Telegraph.co.uk. Retrieved 2016-12-13.