Fail-fast system

Last updated October 17, 2024

In systems design, a fail-fast system is one that immediately reports at its interface any condition that is likely to indicate a failure. Fail-fast systems are usually designed to stop normal operation rather than attempt to continue a possibly flawed process. Such designs often check the system's state at several points in an operation, so that any failures can be detected early. The responsibility of a fail-fast module is detecting errors, and then letting the next-highest level of the system handle them.

Hardware and software

Fail-fast systems or modules are desirable in several circumstances:

Fail-fast architectures are based on an error handling policy where any detected error or non-contemplated state makes the system fail (fast). In some sense, the error-handling policy is the opposite of that used in a fault-tolerant system. In a fault-tolerant system, an error handling policy is established to have redundant components and move computation requests to alive components when some component fails. Paradoxically fail-fast systems make fault-tolerant systems more resilient. We can have 10 redundant servers for a given database, but if the shared configuration for the 10 servers is updated with wrong authentication data for clients, all of them will "redundantly fail". In that sense, a fail-fast system will make sure that all the 10 redundant servers fail as soon as possible to make the DevOps react fast.
Fail-fast components are often used in situations where failure in one component might not be visible until it leads to failure in another component as a consequence of lazy initialization. e.g. "The system that is "doomed" to fail because a file-system path is wrongly set up, does it not fail at startup because the file-system path is not checked at startup. Only when a client-request arrives the system fails, at random, later on.
Finding the cause of a failure is easier in a fail-fast system because the system reports the failure with as much information as possible as close to the time of failure as possible. In a fault-tolerant system, the failure might go undetected, whereas in a system that is neither fault-tolerant nor fail-fast, the failure might be temporarily hidden until it causes some seemingly unrelated problem later.
A fail-fast system that is designed to halt as well as report the error on failure is less likely to erroneously perform an irreversible or costly operation.

Developers also refer to code as fail-fast if it tries to fail as soon as possible at a variable or object initialization. In object-oriented programming, a fail-fast-designed object initializes the internal state of the object in the constructor, launching an exception if something is wrong (rather than allowing non-initialized or partially initialized objects that will fail later due to a wrong "setter"). The object can then be made immutable if no more changes to the internal state are expected. In functions, fail-fast code will check input parameters in the precondition. In client-server architectures, fail-fast will check the client request just upon arrival, before processing or redirecting it to other internal components, returning an error if the request fails (incorrect parameters, ...). Fail-fast-designed code decreases the internal software entropy and reduces debugging effort.

Examples

A fail-fast application/system checks that all input/output resources needed for future computations are ready before any computation request arrives.
A fail-fast application/system checks that all immutable initial configurations are correct at startup.
A fail-fast function is a function that checks all input to the function in a precondition before proceeding with any computation or business logic in such function.
A fail-fast function will normally throw a runtime exception when some abnormal computation is found, making the system fail if no "catch" has been contemplated by any other, versus returning some error value without making any (optimistic) assumption about the correct management of the raised error.
From the field of software engineering, a fail-fast iterator is an iterator that attempts to raise an error if the sequence of elements processed by the iterator is changed during iteration.
Given an initial state in a state machine, a fail-fast system will check such a state and fail fast.
Given a state-change in a state machine, the fail-fast system will halt the machine if the state-change is forbidden. It could be the case that the forbidden state-change is due to a wrong external input. In that case, the fail-fast system will stop processing the request as soon as the wrong input is detected (vs. delegating to the state-machine implementation).

Business

The term has been widely employed as a metaphor in business, dating back to at least 2001,^[1] meaning that businesses should undertake bold experiments to determine the long-term viability of a product or strategy, rather than proceeding cautiously and investing years in a doomed approach. It became adopted as a kind of "mantra" within startup culture, i.e. "Fail fast, fail often."^[2]

Related Research Articles

In computing, load balancing is the process of distributing a set of tasks over a set of resources, with the aim of making their overall processing more efficient. Load balancing can optimize response time and avoid unevenly overloading some compute nodes while other compute nodes are left idle.

Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, 911 systems, and other similar commercial transaction processing applications requiring maximum uptime and no data loss. The company was founded by Jimmy Treybig in 1974 in Cupertino, California. It remained independent until 1997, when it became a server division within Compaq. It is now a server division within Hewlett Packard Enterprise, following Hewlett-Packard's acquisition of Compaq and the split of Hewlett-Packard into HP Inc. and Hewlett Packard Enterprise.

Lockstep systems are fault-tolerant computer systems that run the same set of operations at the same time in parallel. The redundancy (duplication) allows error detection and error correction: the output from lockstep operations can be compared to determine if there has been a fault if there are at least two systems, and the error can be automatically corrected if there are at least three systems, via majority vote. The term "lockstep" originates from army usage, where it refers to synchronized walking, in which marchers walk as closely together as physically practical.

In engineering and systems theory, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance, such as in the case of GNSS receivers, or multi-threaded computer processing.

Fault tolerance is the ability of a system to maintain proper operation despite failures or faults in one or more of its components. This capability is essential for high-availability, mission-critical, or even life-critical systems.

High-availability clusters are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability software to harness redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by IBM as a term to describe the robustness of their mainframe computers.

Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

In computer science, state machine replication (SMR) or state machine approach is a general method for implementing a fault-tolerant service by replicating servers and coordinating client interactions with server replicas. The approach also provides a framework for understanding and designing replication management protocols.

Paxos is a family of protocols for solving consensus in a network of unreliable or fallible processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communications may experience failures.

In reliability engineering, dual modular redundancy (DMR) is when components of a system are duplicated, providing redundancy in case one should fail. It is particularly applied to systems where the duplicated components work in parallel, particularly in fault-tolerant computer systems. A typical example is a complex computer system which has duplicated nodes, so that should one node fail, another is ready to carry on its work.

Stress testing is a software testing activity that determines the robustness of software by testing beyond the limits of normal operation. Stress testing is particularly important for "mission critical" software, but is used for all types of software. Stress tests commonly put a greater emphasis on robustness, availability, and error handling under a heavy load, than on what would be considered correct behavior under normal circumstances.

In computing, triple modular redundancy, sometimes called triple-mode redundancy, (TMR) is a fault-tolerant form of N-modular redundancy, in which three systems perform a process and that result is processed by a majority-voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the fault.

Fault Tolerant Messaging in the context of computer systems and networks, refers to a design approach and set of techniques aimed at ensuring reliable and continuous communication between components or nodes even in the presence of errors or failures. This concept is especially critical in distributed systems, where components may be geographically dispersed and interconnected through networks, making them susceptible to various potential points of failure.

Triconex is a Schneider Electric brand that supplies products, systems, and services for safety, critical control, and turbo-machinery applications. Triconex also use its name for its hardware devices that use its TriStation application software. Triconex products are based on patented Triple modular redundancy (TMR) industrial safety-shutdown technology. Today, Triconex TMR products operate globally in more than 11,500 installations.

A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system.

The Application Interface Specification (AIS) is a collection of open specifications that define the application programming interfaces (APIs) for high-availability application computer software. It is developed and published by the Service Availability Forum and made freely available. Besides reducing the complexity of high-availability applications and shortening development time, the specifications intended to ease the portability of applications between different middleware implementations and to admit third party developers to a field that was highly proprietary in the past.

Circuit breaker is a design pattern used in software development. It is used to detect failures and encapsulates the logic of preventing a failure from constantly recurring, during maintenance, temporary external system failure or unexpected system difficulties. Circuit breaker pattern prevents cascading failures particularly in distributed systems.

Software fault tolerance is the ability of computer software to continue its normal operation despite the presence of system or hardware faults. Fault-tolerant software has the ability to satisfy requirements despite failures.

Quantcast File System (QFS) is an open-source distributed file system software package for large-scale MapReduce or other batch-processing workloads. It was designed as an alternative to the Apache Hadoop Distributed File System (HDFS), intended to deliver better performance and cost-efficiency for large-scale processing clusters.

References

↑ Khanna, Rajat; Guler, Isin; Nerkar, Atul (2016-04-01). "Fail Often, Fail Big, and Fail Fast? Learning from Small Failures and R&D Performance in the Pharmaceutical Industry". Academy of Management Journal. 59 (2): 436–459. doi:10.5465/amj.2013.1109. ISSN 0001-4273.
↑ Surowiecki, James. "Epic Fails of the Startup World". The New Yorker. Retrieved 2017-08-14.

External links

Gray, Jim (1985). "Why Do Computers Stop and What Can Be Done About It?". CiteSeerX 10.1.1.110.9127 , introducing 'Fail Fast'
"Fail Fast" Article by Jim Shore explaining using 'Fail Fast' concept in software development (from 'columns for IEEE software' edited by Martin Fowler)

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Khanna, Rajat; Guler, Isin; Nerkar, Atul (2016-04-01). "Fail Often, Fail Big, and Fail Fast? Learning from Small Failures and R&D Performance in the Pharmaceutical Industry". Academy of Management Journal. 59 (2): 436–459. doi:10.5465/amj.2013.1109. ISSN 0001-4273.

[2] Surowiecki, James. "Epic Fails of the Startup World". The New Yorker. Retrieved 2017-08-14.

[1]

[2]