Software fault tolerance

Software fault tolerance is the ability of computer software to continue its normal operation despite the presence of system or hardware faults. Fault-tolerant software has the ability to satisfy requirements despite failures. [1] [2]

The following design patterns can be combined to make a system more fault tolerant: retry, fallback, timeout, circuit breaker, and bulkhead. [3] [4]
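
As an illustration, here is a minimal sketch of the retry and fallback patterns in C++; the two service functions are hypothetical stand-ins for real remote calls, not part of any cited source.

#include <chrono>
#include <iostream>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical services; a real system would make RPC or HTTP calls here.
std::string call_primary_service()  { throw std::runtime_error("primary unavailable"); }
std::string call_fallback_service() { return "cached/default response"; }

// Retry pattern: make a bounded number of attempts, backing off between them.
// Fallback pattern: if every attempt fails, degrade gracefully instead of crashing.
std::string resilient_call(int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; ++attempt)
    {
        try
        {
            return call_primary_service();
        }
        catch (const std::exception&)
        {
            if (attempt < max_attempts)
                std::this_thread::sleep_for(std::chrono::milliseconds(100 * attempt));
        }
    }
    return call_fallback_service();
}

int main()
{
    std::cout << resilient_call(3) << '\n';   // prints the fallback response
}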

A system can also be made more fault tolerant by measuring 99th-percentile latency and keeping the remaining 1% (the tail latencies) in check through self-healing mechanisms. [5]
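
For example, the 99th percentile can be computed from recorded per-request latencies with the nearest-rank method; the sample values below are invented for illustration.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Nearest-rank percentile of a set of latency samples (p between 0 and 100).
double percentile(std::vector<double> samples, double p)
{
    std::sort(samples.begin(), samples.end());
    std::size_t rank = static_cast<std::size_t>(p / 100.0 * samples.size());
    if (rank >= samples.size())
        rank = samples.size() - 1;
    return samples[rank];
}

int main()
{
    // Invented request latencies in milliseconds; note the two tail outliers.
    std::vector<double> latencies_ms = {12, 15, 11, 250, 14, 13, 12, 900, 15, 16};
    std::cout << "p99 latency: " << percentile(latencies_ms, 99) << " ms\n";
}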

Introduction

The only thing constant is change, and this is certainly more true of software systems than of almost any other phenomenon. [6] Not all software changes in the same way, so software fault tolerance methods are designed to overcome execution errors by modifying variable values to create an acceptable program state. [7] The need to control software faults is one of the fastest-growing challenges facing the software industry today, so fault tolerance must be a key consideration from the early stages of software development.

There exist different mechanisms for software fault tolerance, among them recovery blocks, N-version programming, and self-checking software.
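
As an illustration, here is a minimal sketch of a recovery block, in which an acceptance test checks the primary routine's result and an alternate routine is tried on failure; the routines here are illustrative stand-ins, not taken from any cited source.

#include <cmath>
#include <iostream>

// Hypothetical primary and alternate implementations of the same computation.
double fast_sqrt(double x)    { return x * 0.5; }        // stand-in: a buggy fast path
double careful_sqrt(double x) { return std::sqrt(x); }   // stand-in: a slower, trusted path

// Acceptance test: does the candidate answer actually satisfy the specification?
bool acceptable(double x, double root)
{
    return std::fabs(root * root - x) < 1e-6;
}

// Recovery block: run the primary, check its result, fall back to the alternate.
double sqrt_recovery_block(double x)
{
    double r = fast_sqrt(x);
    if (acceptable(x, r))
        return r;
    return careful_sqrt(x);   // alternate routine after the acceptance test fails
}

int main()
{
    std::cout << sqrt_recovery_block(2.0) << "\n";   // prints 1.41421
}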

Operating system failure

Computer applications make calls through the application programming interface (API) to access shared resources such as the keyboard, mouse, screen, disk drive, network, and printer. These calls can fail in two ways: as blocked calls or as faults.

Blocked calls

A blocked call is a request for services from the operating system that halts the computer program until results are available.

As an example, a TCP call blocks until a response becomes available from a remote server; this occurs every time a web browser requests a page. Intensive calculations cause lengthy delays with the same effect as a blocked API call.

There are two methods used to handle blocking.

Threading allows a separate sequence of execution for each API call that can block. This can prevent the overall application from stalling while waiting for a resource. This has the benefit that none of the information about the state of the API call is lost while other activities take place.
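
A minimal sketch of this approach using C++ standard threads; the blocking call is simulated with a sleep.

#include <chrono>
#include <future>
#include <iostream>
#include <string>
#include <thread>

// Simulated blocking API call, e.g. a network request.
std::string blocking_fetch()
{
    std::this_thread::sleep_for(std::chrono::seconds(2));
    return "response";
}

int main()
{
    // Run the blocking call on its own thread; the state of the call
    // lives in the future rather than being lost.
    std::future<std::string> reply = std::async(std::launch::async, blocking_fetch);

    // The main thread stays responsive while the call is outstanding.
    while (reply.wait_for(std::chrono::milliseconds(200)) != std::future_status::ready)
        std::cout << "still working...\n";

    std::cout << "got: " << reply.get() << "\n";
}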

Threaded languages include the following.

Ada, Afnix, C++, C#, CILK, Eiffel, Erlang, Java, Lisp, Magenta, Modula-3, Napier88, Oz, Presto, pSather, Perl 5.8.7+, PHP, Python, R, Ruby, Smalltalk, Tcl/Tk, V, Unicon, and Ballerina.

Timers allow a blocked call to be interrupted. A periodic timer allows the programmer to emulate threading. Interrupts typically destroy any information related to the state of a blocked API call or intensive calculation, so the programmer must keep track of this information separately.
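
On POSIX systems this can be sketched with an alarm signal that interrupts a blocked read; as the text notes, the interrupted call's state is lost and must be reconstructed by the program.

#include <cerrno>
#include <cstdio>
#include <signal.h>
#include <unistd.h>

// Empty handler: it exists only so the blocked call returns with errno == EINTR
// instead of waiting forever.
extern "C" void on_alarm(int) {}

int main()
{
    struct sigaction sa = {};
    sa.sa_handler = on_alarm;       // no SA_RESTART: interrupted calls are not restarted
    sigaction(SIGALRM, &sa, nullptr);

    alarm(5);                                          // fire a timer in 5 seconds
    char buf[256];
    ssize_t n = read(STDIN_FILENO, buf, sizeof buf);   // blocks until input or the alarm
    if (n < 0 && errno == EINTR)
        std::puts("blocked call interrupted by the timer; its state must be rebuilt");
    else
        std::puts("read completed before the timer fired");
    alarm(0);                                          // cancel any pending timer
}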

Un-threaded languages include the following.

Bash, JavaScript, SQL, and Visual Basic.

Corrupted state will occur with timers. This is avoided with the signal-handling mechanisms described below.

Faults

Faults are induced by signals in POSIX-compliant systems. These signals originate from API calls, from the operating system, and from other applications.

Any signal that does not have handler code becomes a fault that causes premature application termination.

A handler is a function that is performed on demand when the application receives a signal. This is called exception handling.

The kill and stop signals (SIGKILL and SIGSTOP) are the only signals that cannot be handled. All other signals can be directed to a handler function.

Handler functions come in two broad varieties.

Initialized handler functions are paired with each signal when the software starts. This causes the corresponding handler function to run when its signal arrives. This technique can be used with timers to emulate threading.
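
A minimal sketch of initialized handlers on a POSIX system, pairing handler functions with signals at startup and using an alarm timer to emulate threading; the handler behavior shown is illustrative.

#include <csignal>
#include <cstdio>
#include <unistd.h>

volatile std::sig_atomic_t stop_requested = 0;

// Handlers paired with signals at startup; each runs when its signal arrives.
extern "C" void on_terminate(int) { stop_requested = 1; }
extern "C" void on_tick(int)      { alarm(1); }   // re-arm: a periodic timer emulating a thread

int main()
{
    std::signal(SIGTERM, on_terminate);   // pair each signal with a handler at startup
    std::signal(SIGINT,  on_terminate);
    std::signal(SIGALRM, on_tick);        // the timer signal used to emulate threading
    alarm(1);                             // start the periodic tick

    while (!stop_requested)
        pause();                          // sleep until any signal is handled

    std::puts("termination signal handled; shutting down cleanly");
}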

In-line handler functions are associated with a call using specialized syntax. The most familiar is the try/catch construct used with C++ and Java:

try
{
    API_call();                     // may throw when the call fails
}
catch (const std::exception& e)     // in Java: catch (Exception e)
{
    signal_handler_code(e);         // respond to the failure
}

Hardware failure

Hardware fault tolerance for software requires the following.

Backup maintains information in the event that hardware must be replaced; copies may be taken manually or on an automatic schedule.

Backup requires an information-restore strategy to make backup information available on a replacement system. The restore process is usually time-consuming, and information will be unavailable until the restore process is complete.

Redundancy relies on replicating information on more than one computing device so that the recovery delay is brief. This can be achieved using continuous backup to a live system that remains inactive until needed (synchronized backup).

This can also be achieved by replicating information as it is created on multiple identical systems, which can eliminate recovery delay.
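
A minimal sketch of this kind of synchronous replication, writing each record to a primary and a replica store; ordinary files stand in here for what would be separate devices or systems.

#include <fstream>
#include <iostream>
#include <string>

// Write each record synchronously to a primary and a replica store.
bool replicated_write(const std::string& record)
{
    std::ofstream primary("primary.log", std::ios::app);
    std::ofstream replica("replica.log", std::ios::app);
    if (!primary || !replica)
        return false;
    primary << record << '\n';
    replica << record << '\n';
    // Report success only once both copies have been flushed.
    return primary.flush().good() && replica.flush().good();
}

int main()
{
    if (replicated_write("order 42 accepted"))
        std::cout << "record stored on both systems; recovery delay is minimal\n";
}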


See also

In computing, a context switch is the process of storing the state of a process or thread, so that it can be restored and resume execution at a later point, and then restoring a different, previously saved, state. This allows multiple processes to share a single central processing unit (CPU), and is an essential feature of a multiprogramming or multitasking operating system. In a traditional CPU, each process (a program in execution) utilizes the various CPU registers to store data and hold the current state of the running process. However, in a multitasking operating system, the operating system switches between processes or threads to allow the execution of multiple processes simultaneously. For every switch, the operating system must save the state of the currently running process, followed by loading the next process state, which will run on the CPU. This sequence of operations that stores the state of the running process and loads the following running process is called a context switch.

<span class="mw-page-title-main">Interrupt</span> Signal to a computer processor emitted by hardware or software

In digital computers, an interrupt is a request for the processor to interrupt currently executing code, so that the event can be processed in a timely manner. If the request is accepted, the processor will suspend its current activities, save its state, and execute a function called an interrupt handler to deal with the event. This interruption is often temporary, allowing the software to resume normal activities after the interrupt handler finishes, although the interrupt could instead indicate a fatal error.

<span class="mw-page-title-main">Embedded system</span> Computer system with a dedicated function

An embedded system is a computer system—a combination of a computer processor, computer memory, and input/output peripheral devices—that has a dedicated function within a larger mechanical or electronic system. It is embedded as part of a complete device often including electrical or electronic hardware and mechanical parts. Because an embedded system typically controls physical operations of the machine that it is embedded within, it often has real-time computing constraints. Embedded systems control many devices in common use. In 2009, it was estimated that ninety-eight percent of all microprocessors manufactured were used in embedded systems.

Checkpointing is a technique that provides fault tolerance for computing systems. It basically consists of saving a snapshot of the application's state, so that applications can restart from that point in case of failure. This is particularly important for long running applications that are executed in failure-prone computing systems.

In computer systems programming, an interrupt handler, also known as an interrupt service routine or ISR, is a special block of code associated with a specific interrupt condition. Interrupt handlers are initiated by hardware interrupts, software interrupt instructions, or software exceptions, and are used for implementing device drivers or transitions between protected modes of operation, such as system calls.

NonStop is a series of server computers introduced to market in 1976 by Tandem Computers Inc., beginning with the NonStop product line. It was followed by the Tandem Integrity NonStop line of lock-step fault-tolerant computers, now defunct. The original NonStop product line is currently offered by Hewlett Packard Enterprise since Hewlett-Packard Company's split in 2015. Because NonStop systems are based on an integrated hardware/software stack, Tandem and later HPE also developed the NonStop OS operating system for them.

<span class="mw-page-title-main">Watchdog timer</span> Electronic timer used to detect and recover from computer malfunctions

A watchdog timer, sometimes called a computer operating properly timer, is an electronic or software timer that is used to detect and recover from computer malfunctions. Watchdog timers are widely used in computers to facilitate automatic correction of temporary hardware faults, and to prevent errant or malevolent software from disrupting system operation.

Signals are standardized messages sent to a running program to trigger specific behavior, such as quitting or error handling. They are a limited form of inter-process communication (IPC), typically used in Unix, Unix-like, and other POSIX-compliant operating systems.

In computer science, asynchronous I/O is a form of input/output processing that permits other processing to continue before the I/O operation has finished. A name used for asynchronous I/O in the Windows API is overlapped I/O.

<span class="mw-page-title-main">Redundancy (engineering)</span> Duplication of critical components to increase reliability of a system

In engineering and systems theory, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance, such as in the case of GNSS receivers, or multi-threaded computer processing.

Fault tolerance is the ability of a system to maintain proper operation in the event of failures or faults in one or more of its components. Any decrease in operating quality is proportional to the severity of the failure, unlike a naively designed system in which even a small failure can lead to total breakdown. Fault tolerance is particularly sought after in high-availability, mission-critical, or even life-critical systems. The ability to maintain functionality when portions of a system break down is referred to as graceful degradation.

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers.

In programming and software design, an event is an action or occurrence recognized by software, often originating asynchronously from the external environment, that may be handled by the software. Computer events can be generated or triggered by the system, by the user, or in other ways. Typically, events are handled synchronously with the program flow. That is, the software may have one or more dedicated places where events are handled, frequently an event loop.

In computer science, fault injection is a testing technique for understanding how computing systems behave when stressed in unusual ways. This can be achieved using physical- or software-based means, or using a hybrid approach. Widely studied physical fault injections include the application of high voltages, extreme temperatures and electromagnetic pulses on electronic components, such as computer memory and central processing units. By exposing components to conditions beyond their intended operating limits, computing systems can be coerced into mis-executing instructions and corrupting critical data.

<span class="mw-page-title-main">Triple modular redundancy</span> Method for increasing reliability

In computing, triple modular redundancy (TMR), sometimes called triple-mode redundancy, is a fault-tolerant form of N-modular redundancy, in which three systems perform a process and the result is processed by a majority-voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the fault.

Hardware virtualization is the virtualization of computers as complete hardware platforms, certain logical abstractions of their componentry, or only the functionality required to run various operating systems. Virtualization hides the physical characteristics of a computing platform from the users, presenting instead an abstract computing platform. At its origins, the software that controlled virtualization was called a "control program", but the terms "hypervisor" or "virtual machine monitor" became preferred over time.

Fault-tolerant messaging, in the context of computer systems and networks, refers to a design approach and set of techniques aimed at ensuring reliable and continuous communication between components or nodes even in the presence of errors or failures. This concept is especially critical in distributed systems, where components may be geographically dispersed and interconnected through networks, making them susceptible to various potential points of failure.

The Hardware Platform Interface (HPI) is an open specification that defines an application programming interface (API) for platform management of computer systems. The API supports tasks including reading temperature or voltage sensors built into a processor, configuring hardware registers, accessing system inventory information like model numbers and serial numbers, and performing more complex activities, such as upgrading system firmware or diagnosing system failures.

The Application Interface Specification (AIS) is a collection of open specifications that define the application programming interfaces (APIs) for high-availability application computer software. It is developed and published by the Service Availability Forum and made freely available. Besides reducing the complexity of high-availability applications and shortening development time, the specifications are intended to ease the portability of applications between different middleware implementations and to admit third-party developers to a field that was highly proprietary in the past.

This is a list of the individual topics in electronics, mathematics, and integrated circuits that together make up the computer engineering field. The organization is by topic, to create an effective study guide for this field. The contents match the full body of topics and detail expected of a person identifying as a computer engineering expert, as laid out by the National Council of Examiners for Engineering and Surveying. It is a comprehensive list and superset of the computer engineering topics generally dealt with at any one time.

References

  1. "Software Fault Tolerance". Carnegie Mellon University.
  2. "Portable and Fault Tolerant Software Systems" (PDF). Massachusetts Institute of Technology.
  3. Kubernetes Native Microservices with Quarkus and MicroProfile. Manning. 2022. ISBN   9781638357155.
  4. Acing the System Design Interview. Manning. 2024. ISBN   9781638355915.
  5. Understanding Distributed Systems: What every developer should know about large distributed applications. 2021. ISBN   978-1838430207.
  6. Eckhardt, D. E., "Fundamental Differences in the Reliability of N-Modular Redundancy and N-Version Programming", The Journal of Systems and Software, 8, 1988, pp. 313–318.
  7. Ray Giguette and Johnette Hassell, “Toward A Resourceful Method of Software Fault Tolerance”, ACM Southeast regional conference, April, 1999.
