Software fault tolerance

Software fault tolerance is the ability of computer software to continue its normal operation despite the presence of system or hardware faults. Fault-tolerant software has the ability to satisfy requirements despite failures. [1] [2]

Introduction

The only thing constant is change. This is more true of software systems than of almost any other phenomenon. [3] Not all software changes in the same way, so software fault tolerance methods are designed to overcome execution errors by modifying variable values to create an acceptable program state. [4] Controlling software faults is one of the fastest-growing challenges facing the software industry today, so fault tolerance must be a key consideration in the early stages of software development.

There exist different mechanisms for software fault tolerance, among which:

Operating system failure

Computer applications make calls through the application programming interface (API) to access shared resources, such as the keyboard, mouse, screen, disk drive, network, and printer. These calls can fail in two ways.

Blocked calls

A blocked call is a request for services from the operating system that halts the computer program until results are available.

As an example, a TCP receive call blocks until a response becomes available from a remote server. This occurs every time you perform an action with a web browser. Intensive calculations cause lengthy delays with the same effect as a blocked API call.

There are two methods used to handle blocking.

Threading allows a separate sequence of execution for each API call that can block. This can prevent the overall application from stalling while waiting for a resource. This has the benefit that none of the information about the state of the API call is lost while other activities take place.
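As a minimal sketch of this idea in Python, the example below moves a blocking call into its own thread so the main thread can keep working. The function names and the use of `time.sleep` as a stand-in for a real blocking API call are assumptions made for illustration.

```python
import threading
import time

def blocking_call(results, key, delay):
    """Stand-in for a blocking API call (hypothetical; sleeps instead of doing I/O)."""
    time.sleep(delay)
    results[key] = f"{key} finished"

def run_without_stalling():
    """Run the blocking call in a worker thread while the main thread keeps going."""
    results = {}
    worker = threading.Thread(target=blocking_call, args=(results, "download", 0.2))
    worker.start()               # the blocking call now runs in its own thread

    ticks = 0
    while worker.is_alive():     # the main thread stays free for other work
        ticks += 1
        time.sleep(0.01)

    worker.join()                # the call's state (its result) is still intact
    return results, ticks

if __name__ == "__main__":
    results, ticks = run_without_stalling()
    print(results, "main loop iterations:", ticks)
```

The worker keeps its own stack and local variables for the duration of the call, which is why nothing about the call's state has to be saved by hand.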

Threaded languages include the following.

Ada, Afnix, C++, C#, CILK, Eiffel, Erlang,
Java, Lisp, Magenta, Modula 3, Napier 88, Oz, Presto,
pSather, Perl 5.8.7+, PHP, Python, R, Ruby, Smalltalk,
Tcl/Tk, V, Unicon, Ballerina

Timers allow a blocked call to be interrupted. A periodic timer allows the programmer to emulate threading. Interrupts typically destroy any information related to the state of a blocked API call or intensive calculation, so the programmer must keep track of this information separately.
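One way to picture a timer interrupting a blocked call is the Python sketch below, which uses the POSIX interval timer exposed by the standard `signal` module. The names `CallTimedOut` and `call_with_timeout`, and the `state` dictionary used for separate bookkeeping, are hypothetical. The sketch assumes a POSIX system and that the code runs in the main thread, which CPython requires for signal handlers.

```python
import signal
import time

class CallTimedOut(Exception):
    """Raised by the timer handler to break out of a blocked call."""

def _on_timer(signum, frame):
    raise CallTimedOut()

def call_with_timeout(func, seconds, state):
    """Run func(), interrupting it if it blocks longer than `seconds`.

    POSIX only; must be called from the main thread.
    """
    state["status"] = "running"           # state is tracked outside the call
    previous = signal.signal(signal.SIGALRM, _on_timer)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        result = func()
        state["status"] = "done"
        return result
    except CallTimedOut:
        state["status"] = "timed out"     # the interrupted call's own state is gone
        return None
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)   # cancel any pending timer
        signal.signal(signal.SIGALRM, previous)   # restore the old handler

if __name__ == "__main__":
    state = {}
    print(call_with_timeout(lambda: time.sleep(5), 0.1, state), state)
```

Because the interrupted call cannot be resumed, anything worth keeping has to live in `state` (or a similar structure) rather than in the call's local variables.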

Un-threaded languages include the following.

Bash, JavaScript, SQL, Visual Basic

Corrupted state will occur with timers. This is avoided with the following.

Faults

Faults are induced by signals in POSIX-compliant systems, and these signals originate from API calls, from the operating system, and from other applications.

Any signal that has no handler code, and whose default action is to terminate the process, becomes a fault that causes premature application termination.

The handler is a function that is performed on-demand when the application receives a signal. This is called exception handling.

The kill signal (SIGKILL) and the stop signal (SIGSTOP) are the only signals that cannot be handled. All other signals, including the termination signal (SIGTERM), can be directed to a handler function.

Handler functions come in two broad varieties.

Initialized handler functions are paired with each signal when the software starts. This causes the handler function to run when the corresponding signal arrives. This technique can be used with timers to emulate threading.
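A rough Python sketch of an initialized handler follows. It registers one handler for two signals at startup and records each arrival. The names `received`, `on_signal`, and `install_handlers` are hypothetical, and the choice of SIGUSR1 and SIGHUP is only for illustration. It assumes a POSIX system and registration from the main thread.

```python
import os
import signal

received = []   # names of the signals seen so far

def on_signal(signum, frame):
    """Handler: record the signal instead of letting it end the program."""
    received.append(signal.Signals(signum).name)

def install_handlers():
    """Pair the handler with each signal of interest when the program starts."""
    for sig in (signal.SIGUSR1, signal.SIGHUP):
        signal.signal(sig, on_signal)

if __name__ == "__main__":
    install_handlers()
    os.kill(os.getpid(), signal.SIGUSR1)   # send a signal to this process
    print(received)
```

Without the call to `install_handlers()`, the default action for SIGUSR1 would terminate the process.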

In-line handler functions are associated with a call using specialized syntax. The most familiar is the following used with C++ and Java.

try
{
    API_call();
}
catch (...) // C++ catch-all; in Java, write catch (Exception e)
{
    signal_handler_code();
}
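The same in-line pattern can be sketched in Python, where `try`/`except` plays the role of `try`/`catch`. In this sketch the guarded call is a socket read with a timeout. The helper name `read_with_handler` is hypothetical, and a local socket pair stands in for a remote server. Note that the condition handled here is a language-level exception raised by the call, not a POSIX signal.

```python
import socket

def read_with_handler(sock, timeout):
    """Attempt a read; the handler code sits right next to the call it guards."""
    sock.settimeout(timeout)
    try:
        return sock.recv(1024)       # may block until data arrives
    except socket.timeout:
        return None                  # in-line handler: report that nothing arrived

if __name__ == "__main__":
    a, b = socket.socketpair()       # two connected local sockets
    print(read_with_handler(a, 0.05))    # nothing has been sent yet
    b.sendall(b"ping")
    print(read_with_handler(a, 0.05))
    a.close()
    b.close()
```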

Hardware failure

Hardware fault tolerance for software requires the following.

Backup maintains information in the event that hardware must be replaced. This can be done in one of two ways.

Backup requires an information-restore strategy to make backup information available on a replacement system. The restore process is usually time-consuming, and information will be unavailable until the restore process is complete.
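A minimal sketch of a backup and restore pair in Python is shown below, assuming the application state fits in a JSON-serializable dictionary. The function names and the file layout are hypothetical. A real restore onto replacement hardware would also involve moving the backup file to the new system, which is the time-consuming part.

```python
import json
import os
import tempfile

def backup(state, path):
    """Write state to path atomically: write a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())         # make sure the bytes reach the disk
    os.replace(tmp_path, path)       # an old backup is never left half-written

def restore(path):
    """Load the most recent backup, for example on a replacement system."""
    with open(path) as f:
        return json.load(f)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "state.json")
        backup({"orders": [1, 2, 3]}, target)
        print(restore(target))
```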

Redundancy relies on replicating information on more than one computing device so that the recovery delay is brief. This can be achieved using continuous backup to a live system that remains inactive until needed (synchronized backup).

This can also be achieved by replicating information as it is created on multiple identical systems, which can eliminate recovery delay.
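Replicating information as it is created can be sketched in Python as follows. The `Replica` and `ReplicatedStore` classes are hypothetical, in-memory stand-ins for separate machines, and a simulated `alive` flag stands in for hardware failure. Every write goes to all reachable replicas, and a read is served by the first replica that responds.

```python
class Replica:
    """In-memory stand-in for one machine holding a copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True            # set to False to simulate a hardware failure

    def write(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def read(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]

class ReplicatedStore:
    """Writes to every replica; reads from the first one that answers."""

    def __init__(self, replicas):
        self.replicas = replicas

    def put(self, key, value):
        acks = 0
        for replica in self.replicas:
            try:
                replica.write(key, value)
                acks += 1
            except ConnectionError:
                pass                 # skip failed replicas
        if acks == 0:
            raise ConnectionError("no replica accepted the write")
        return acks

    def get(self, key):
        for replica in self.replicas:
            try:
                return replica.read(key)
            except ConnectionError:
                continue             # fail over to the next replica
        raise ConnectionError("no replica available")

if __name__ == "__main__":
    store = ReplicatedStore([Replica("a"), Replica("b"), Replica("c")])
    store.put("balance", 100)
    store.replicas[0].alive = False  # first machine fails
    print(store.get("balance"))      # still served by another replica
```

This sketch does not resynchronize a replica that missed writes while it was down. A practical system would need some form of catch-up before such a replica serves reads again.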


References

  1. "Software Fault Tolerance". Carnegie Mellon University.
  2. "Portable and Fault Tolerant Software Systems" (PDF). Massachusetts Institute of Technology.
  3. Eckhardt, D. E. "Fundamental Differences in the Reliability of N-Modular Redundancy and N-Version Programming". The Journal of Systems and Software, 8, 1988, pp. 313–318.
  4. Giguette, Ray; Hassell, Johnette. "Toward A Resourceful Method of Software Fault Tolerance". ACM Southeast Regional Conference, April 1999.

Further reading