Fault tolerance

Last updated

Fault tolerance is the ability of a system to maintain proper operation in the event of failures or faults in one or more of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure can lead to total breakdown. Fault tolerance is particularly sought after in high-availability, mission-critical, or even life-critical systems. The ability of maintaining functionality when portions of a system break down is referred to as graceful degradation. [1]

Contents

A fault-tolerant design enables a system to continue its intended operation, possibly at a reduced level, rather than failing completely, when some part of the system fails. [2] The term is most commonly used to describe computer systems designed to continue more or less fully operational with, perhaps, a reduction in throughput or an increase in response time in the event of some partial failure. That is, the system as a whole is not stopped due to problems either in the hardware or the software. An example in another field is a motor vehicle designed so it will continue to be drivable if one of the tires is punctured, or a structure that is able to retain its integrity in the presence of damage due to causes such as fatigue, corrosion, manufacturing flaws, or impact.

Within the scope of an individual system, fault tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and, in general, aiming for self-stabilization so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making it sufficiently reliable is very high, a better solution may be to use some form of duplication. In any case, if the consequence of a system failure is so catastrophic, the system must be able to use reversion to fall back to a safe mode. This is similar to roll-back recovery but can be a human action if humans are present in the loop.

History

The first known fault-tolerant computer was SAPO, built in 1951 in Czechoslovakia by Antonín Svoboda. [3] :155 Its basic design was magnetic drums connected via relays, with a voting method of memory error detection (triple modular redundancy). Several other machines were developed along this line, mostly for military use. Eventually, they separated into three distinct categories: machines that would last a long time without any maintenance, such as the ones used on NASA space probes and satellites; computers that were very dependable but required constant monitoring, such as those used to monitor and control nuclear power plants or supercollider experiments; and finally, computers with a high amount of runtime which would be under heavy use, such as many of the supercomputers used by insurance companies for their probability monitoring.

Most of the development in the so-called LLNM (Long Life, No Maintenance) computing was done by NASA during the 1960s, [4] in preparation for Project Apollo and other research aspects. NASA's first machine went into a space observatory, and their second attempt, the JSTAR computer, was used in Voyager. This computer had a backup of memory arrays to use memory recovery methods and thus it was called the JPL Self-Testing-And-Repairing computer. It could detect its own errors and fix them or bring up redundant modules as needed. The computer is still working, as of early 2022. [5]

Hyper-dependable computers were pioneered mostly by aircraft manufacturers, [3] :210 nuclear power companies, and the railroad industry in the US. These needed computers with massive amounts of uptime that would fail gracefully enough with a fault to allow continued operation while relying on the fact that the computer output would be constantly monitored by humans to detect faults. Again, IBM developed the first computer of this kind for NASA for guidance of Saturn V rockets, but later on BNSF, Unisys, and General Electric built their own. [3] :223

In the 1970s, much work happened in the field. [6] [7] [8] For instance, F14 CADC had built-in self-test and redundancy. [9]

In general, the early efforts at fault-tolerant designs were focused mainly on internal diagnosis, where a fault would indicate something was failing and a worker could replace it. SAPO, for instance, had a method by which faulty memory drums would emit a noise before failure. [10] Later efforts showed that to be fully effective, the system had to be self-repairing and diagnosing – isolating a fault and then implementing a redundant backup while alerting a need for repair. This is known as N-model redundancy, where faults cause automatic fail-safes and a warning to the operator, and it is still the most common form of level one fault-tolerant design in use today.

Voting was another initial method, as discussed above, with multiple redundant backups operating constantly and checking each other's results, with the outcome that if, for example, four components reported an answer of 5 and one component reported an answer of 6, the other four would "vote" that the fifth component was faulty and have it taken out of service. This is called M out of N majority voting.

Historically, the motion has always been to move further from N-model and more to M out of N due to the fact that the complexity of systems and the difficulty of ensuring the transitive state from fault-negative to fault-positive did not disrupt operations.

Tandem Computers, in 1976 [11] and Stratus were among the first companies specializing in the design of fault-tolerant computer systems for online transaction processing.

Examples

"M2 Mobile Web", the original mobile web front end of Twitter, later served as fallback legacy version to clients without JavaScript support and/or incompatible browsers until December 2020. Twitter M2 mobile website.png
"M2 Mobile Web", the original mobile web front end of Twitter, later served as fallback legacy version to clients without JavaScript support and/or incompatible browsers until December 2020.

Hardware fault tolerance sometimes requires that broken parts be taken out and replaced with new parts while the system is still operational (in computing known as hot swapping ). Such a system implemented with a single backup is known as single point tolerant and represents the vast majority of fault-tolerant systems. In such systems the mean time between failures should be long enough for the operators to have sufficient time to fix the broken devices (mean time to repair) before the backup also fails. It is helpful if the time between failures is as long as possible, but this is not specifically required in a fault-tolerant system.

Fault tolerance is notably successful in computer applications. Tandem Computers built their entire business on such machines, which used single-point tolerance to create their NonStop systems with uptimes measured in years.

Fail-safe architectures may encompass also the computer software, for example by process replication.

Data formats may also be designed to degrade gracefully. HTML for example, is designed to be forward compatible, allowing Web browsers to ignore new and unsupported HTML entities without causing the document to be unusable. Additionally, some sites, including popular platforms such as Twitter (until December 2020), provide an optional lightweight front end that does not rely on JavaScript and has a minimal layout, to ensure wide accessibility and outreach, such as on game consoles with limited web browsing capabilities. [12] [13]

Terminology

An example of graceful degradation by design in an image with transparency. Each of the top two images is the result of viewing the composite image in a viewer that recognises transparency. The bottom two images are the result in a viewer with no support for transparency. Because the transparency mask (center bottom) is discarded, only the overlay (center top) remains; the image on the left has been designed to degrade gracefully, hence is still meaningful without its transparency information. Graceful Degradation of Transparency.png
An example of graceful degradation by design in an image with transparency. Each of the top two images is the result of viewing the composite image in a viewer that recognises transparency. The bottom two images are the result in a viewer with no support for transparency. Because the transparency mask (center bottom) is discarded, only the overlay (center top) remains; the image on the left has been designed to degrade gracefully, hence is still meaningful without its transparency information.

A highly fault-tolerant system might continue at the same level of performance even though one or more components have failed. For example, a building with a backup electrical generator will provide the same voltage to wall outlets even if the grid power fails.

A system that is designed to fail safe, or fail-secure, or fail gracefully, whether it functions at a reduced level or fails completely, does so in a way that protects people, property, or data from injury, damage, intrusion, or disclosure. In computers, a program might fail-safe by executing a graceful exit (as opposed to an uncontrolled crash) in order to prevent data corruption after experiencing an error. [14] A similar distinction is made between "failing well" and "failing badly".

A system that is designed to experience graceful degradation, or to fail soft (used in computing, similar to "fail safe" [15] ) operates at a reduced level of performance after some component failures. For example, a building may operate lighting at reduced levels and elevators at reduced speeds if grid power fails, rather than either trapping people in the dark completely or continuing to operate at full power. In computing an example of graceful degradation is that if insufficient network bandwidth is available to stream an online video, a lower-resolution version might be streamed in place of the high-resolution version. Progressive enhancement is an example in computing, where web pages are available in a basic functional format for older, small-screen, or limited-capability web browsers, but in an enhanced version for browsers capable of handling additional technologies or that have a larger display available.

In fault-tolerant computer systems, programs that are considered robust are designed to continue operation despite an error, exception, or invalid input, instead of crashing completely. Software brittleness is the opposite of robustness. Resilient networks continue to transmit data despite the failure of some links or nodes; resilient buildings and infrastructure are likewise expected to prevent complete failure in situations like earthquakes, floods, or collisions.

A system with high failure transparency will alert users that a component failure has occurred, even if it continues to operate with full performance, so that failure can be repaired or imminent complete failure anticipated. [16] Likewise, a fail-fast component is designed to report at the first point of failure, rather than allow downstream components to fail and generate reports then. This allows easier diagnosis of the underlying problem, and may prevent improper operation in a broken state.

A single fault condition is a situation where one means for protection against a hazard is defective. If a single fault condition results unavoidably in another single fault condition, the two failures are considered as one single fault condition. [17] A source offers the following example:

A single-fault condition is a condition when a single means for protection against hazard in equipment is defective or a single external abnormal condition is present, e.g. short circuit between the live parts and the applied part. [18]

Criteria

Providing fault-tolerant design for every component is normally not an option. Associated redundancy brings a number of penalties: increase in weight, size, power consumption, cost, as well as time to design, verify, and test. Therefore, a number of choices have to be examined to determine which components should be fault tolerant: [19]

An example of a component that passes all the tests is a car's occupant restraint system. While the primary occupant restraint system is not normally thought of, it is gravity. If the vehicle rolls over or undergoes severe g-forces, then this primary method of occupant restraint may fail. Restraining the occupants during such an accident is absolutely critical to safety, so the first test is passed. Accidents causing occupant ejection were quite common before seat belts, so the second test is passed. The cost of a redundant restraint method like seat belts is quite low, both economically and in terms of weight and space, so the third test is passed. Therefore, adding seat belts to all vehicles is an excellent idea. Other "supplemental restraint systems", such as airbags, are more expensive and so pass that test by a smaller margin.

Another excellent and long-term example of this principle being put into practice is the braking system: whilst the actual brake mechanisms are critical, they are not particularly prone to sudden (rather than progressive) failure, and are in any case necessarily duplicated to allow even and balanced application of brake force to all wheels. It would also be prohibitively costly to further double-up the main components and they would add considerable weight. However, the similarly critical systems for actuating the brakes under driver control are inherently less robust, generally using a cable (can rust, stretch, jam, snap) or hydraulic fluid (can leak, boil and develop bubbles, absorb water and thus lose effectiveness). Thus in most modern cars the footbrake hydraulic brake circuit is diagonally divided to give two smaller points of failure, the loss of either only reducing brake power by 50% and not causing as much dangerous brakeforce imbalance as a straight front-back or left-right split, and should the hydraulic circuit fail completely (a relatively very rare occurrence), there is a failsafe in the form of the cable-actuated parking brake that operates the otherwise relatively weak rear brakes, but can still bring the vehicle to a safe halt in conjunction with transmission/engine braking so long as the demands on it are in line with normal traffic flow. The cumulatively unlikely combination of total foot brake failure with the need for harsh braking in an emergency will likely result in a collision, but still one at lower speed than would otherwise have been the case.

In comparison with the foot pedal activated service brake, the parking brake itself is a less critical item, and unless it is being used as a one-time backup for the footbrake, will not cause immediate danger if it is found to be nonfunctional at the moment of application. Therefore, no redundancy is built into it per se (and it typically uses a cheaper, lighter, but less hardwearing cable actuation system), and it can suffice, if this happens on a hill, to use the footbrake to momentarily hold the vehicle still, before driving off to find a flat piece of road on which to stop. Alternatively, on shallow gradients, the transmission can be shifted into Park, Reverse or First gear, and the transmission lock / engine compression used to hold it stationary, as there is no need for them to include the sophistication to first bring it to a halt.

On motorcycles, a similar level of fail-safety is provided by simpler methods; first, the front and rear brake systems are entirely separate, regardless of their method of activation (that can be cable, rod or hydraulic), allowing one to fail entirely while leaving the other unaffected. Second, the rear brake is relatively strong compared to its automotive cousin, being a powerful disc on some sports models, even though the usual intent is for the front system to provide the vast majority of braking force; as the overall vehicle weight is more central, the rear tire is generally larger and has better traction, so that the rider can lean back to put more weight on it, therefore allowing more brake force to be applied before the wheel locks. On cheaper, slower utility-class machines, even if the front wheel should use a hydraulic disc for extra brake force and easier packaging, the rear will usually be a primitive, somewhat inefficient, but exceptionally robust rod-actuated drum, thanks to the ease of connecting the footpedal to the wheel in this way and, more importantly, the near impossibility of catastrophic failure even if the rest of the machine, like a lot of low-priced bikes after their first few years of use, is on the point of collapse from neglected maintenance.

Requirements

The basic characteristics of fault tolerance require:

  1. No single point of failure – If a system experiences a failure, it must continue to operate without interruption during the repair process.
  2. Fault isolation to the failing component – When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation. Recovery from a fault condition requires classifying the fault or failing component. The National Institute of Standards and Technology (NIST) categorizes faults based on locality, cause, duration, and effect.[ where? ][ clarification needed ]
  3. Fault containment to prevent propagation of the failure – Some failure mechanisms can cause a system to fail by propagating the failure to the rest of the system. An example of this kind of failure is the "rogue transmitter" that can swamp legitimate communication in a system and cause overall system failure. Firewalls or other mechanisms that isolate a rogue transmitter or failing component to protect the system are required.
  4. Availability of reversion modes [ clarification needed ]

In addition, fault-tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at a hardware level. The figure of merit is called availability and is expressed as a percentage. For example, a five nines system would statistically provide 99.999% availability.

Fault-tolerant systems are typically based on the concept of redundancy.

Fault tolerance techniques

Research into the kinds of tolerances needed for critical systems involves a large amount of interdisciplinary work. The more complex the system, the more carefully all possible interactions have to be considered and prepared for. Considering the importance of high-value systems in transport, public utilities and the military, the field of topics that touch on research is very wide: it can include such obvious subjects as software modeling and reliability, or hardware design, to arcane elements such as stochastic models, graph theory, formal or exclusionary logic, parallel processing, remote data transmission, and more. [20]

Replication

Spare components address the first fundamental characteristic of fault tolerance in three ways:

All implementations of RAID, redundant array of independent disks, except RAID 0, are examples of a fault-tolerant storage device that uses data redundancy.

A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each replication, and the same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed dual modular redundant (DMR). The voting circuit can then only detect a mismatch and recovery relies on other methods. A machine with three replications of each element is termed triple modular redundant (TMR). The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result, and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications.

Lockstep fault-tolerant machines are most easily made fully synchronous, with each gate of each replication making the same state transition on the same edge of the clock, and the clocks to the replications being exactly in phase. However, it is possible to build lockstep systems without this requirement.

Bringing the replications into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replica can be copied to another replica.

One variant of DMR is pair-and-spare. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. Another pair operates exactly the same way. A final circuit selects the output of the pair that does not proclaim that it is in error. Pair-and-spare requires four replicas rather than the three of TMR, but has been used commercially.

Failure-oblivious computing

Failure-oblivious computing is a technique that enables computer programs to continue executing despite errors. [21] The technique can be applied in different contexts. It can handle invalid memory reads by returning a manufactured value to the program, [22] which in turn, makes use of the manufactured value and ignores the former memory value it tried to access, this is a great contrast to typical memory checkers, which inform the program of the error or abort the program.

The approach has performance costs: because the technique rewrites code to insert dynamic checks for address validity, execution time will increase by 80% to 500%. [23]

Recovery shepherding

Recovery shepherding is a lightweight technique to enable software programs to recover from otherwise fatal errors such as null pointer dereference and divide by zero. [24] Comparing to the failure oblivious computing technique, recovery shepherding works on the compiled program binary directly and does not need to recompile to program.

It uses the just-in-time binary instrumentation framework Pin. It attaches to the application process when an error occurs, repairs the execution, tracks the repair effects as the execution continues, contains the repair effects within the application process, and detaches from the process after all repair effects are flushed from the process state. It does not interfere with the normal execution of the program and therefore incurs negligible overhead. [24] For 17 of 18 systematically collected real world null-dereference and divide-by-zero errors, a prototype implementation enables the application to continue to execute to provide acceptable output and service to its users on the error-triggering inputs. [24]

Circuit breaker

The circuit breaker design pattern is a technique to avoid catastrophic failures in distributed systems.

Redundancy

Redundancy is the provision of functional capabilities that would be unnecessary in a fault-free environment. [25] This can consist of backup components that automatically "kick in" if one component fails. For example, large cargo trucks can lose a tire without any major consequences. They have many tires, and no one tire is critical (with the exception of the front tires, which are used to steer, but generally carry less load, each and in total, than the other four to 16, so are less likely to fail). The idea of incorporating redundancy in order to improve the reliability of a system was pioneered by John von Neumann in the 1950s. [26]

Two kinds of redundancy are possible: [27] space redundancy and time redundancy. Space redundancy provides additional components, functions, or data items that are unnecessary for fault-free operation. Space redundancy is further classified into hardware, software and information redundancy, depending on the type of redundant resources added to the system. In time redundancy the computation or data transmission is repeated and the result is compared to a stored copy of the previous result. The current terminology for this kind of testing is referred to as 'In Service Fault Tolerance Testing or ISFTT for short.

Disadvantages

Fault-tolerant design's advantages are obvious, while many of its disadvantages are not:

There is a difference between fault tolerance and systems that rarely have problems. For instance, the Western Electric crossbar systems had failure rates of two hours per forty years, and therefore were highly fault resistant. But when a fault did occur they still stopped operating completely, and therefore were not fault tolerant.

See also

Related Research Articles

<span class="mw-page-title-main">Safety engineering</span> Engineering discipline which assures that engineered systems provide acceptable levels of safety

Safety engineering is an engineering discipline which assures that engineered systems provide acceptable levels of safety. It is strongly related to industrial engineering/systems engineering, and the subset system safety engineering. Safety engineering assures that a life-critical system behaves as needed, even when components fail.

In engineering, a fail-safe is a design feature or practice that, in the event of a failure of the design feature, inherently responds in a way that will cause minimal or no harm to other equipment, to the environment or to people. Unlike inherent safety to a particular hazard, a system being "fail-safe" does not mean that failure is impossible or improbable, but rather that the system's design prevents or mitigates unsafe consequences of the system's failure. That is, if and when a "fail-safe" system fails, it remains at least as safe as it was before the failure. Since many types of failure are possible, failure mode and effects analysis is used to examine failure situations and recommend safety design and procedures.

Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, 911 systems, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss. The company was founded by Jimmy Treybig in 1974 in Cupertino, California. It remained independent until 1997, when it became a server division within Compaq. It is now a server division within Hewlett Packard Enterprise, following Hewlett-Packard's acquisition of Compaq and the split of Hewlett-Packard into HP Inc. and Hewlett Packard Enterprise.

<span class="mw-page-title-main">Safety-critical system</span> System whose failure would be serious

A safety-critical system or life-critical system is a system whose failure or malfunction may result in one of the following outcomes:

NonStop is a series of server computers introduced to market in 1976 by Tandem Computers Inc., beginning with the NonStop product line. It was followed by the Tandem Integrity NonStop line of lock-step fault tolerant computers, now defunct. The original NonStop product line is currently offered by Hewlett Packard Enterprise since Hewlett-Packard Company's split in 2015. Because NonStop systems are based on an integrated hardware/software stack, Tandem and later HPE also developed the NonStop OS operating system for them.

A Byzantine fault is a condition of a computer system, particularly distributed computing systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an allegory, the "Byzantine generals problem", developed to describe a situation in which, to avoid catastrophic failure of the system, the system's actors must agree on a concerted strategy, but some of these actors are unreliable.

TACL is the scripting programming language which acts as the shell in Tandem Computers/NonStop computers.

<span class="mw-page-title-main">Redundancy (engineering)</span> Duplication of critical components to increase reliability of a system

In engineering and systems theory, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance, such as in the case of GNSS receivers, or multi-threaded computer processing.

High-availability clusters are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability software to harness redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers.

Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.

In reliability engineering, dual modular redundancy (DMR) is when components of a system are duplicated, providing redundancy in case one should fail. It is particularly applied to systems where the duplicated components work in parallel, particularly in fault-tolerant computer systems. A typical example is a complex computer system which has duplicated nodes, so that should one node fail, another is ready to carry on its work.

<span class="mw-page-title-main">Brake-by-wire</span> Automotive technology

Brake-by-wire technology in the automotive industry is the ability to control brakes through electronic means, without a mechanical connection that transfers force to the physical braking system from a driver input apparatus such as a pedal or lever.

<span class="mw-page-title-main">Triple modular redundancy</span> Method for increasing reliability

In computing, triple modular redundancy, sometimes called triple-mode redundancy, (TMR) is a fault-tolerant form of N-modular redundancy, in which three systems perform a process and that result is processed by a majority-voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the fault.

Fault Tolerant Messaging in the context of computer systems and networks, refers to a design approach and set of techniques aimed at ensuring reliable and continuous communication between components or nodes even in the presence of errors or failures. This concept is especially critical in distributed systems, where components may be geographically dispersed and interconnected through networks, making them susceptible to various potential points of failure.

<span class="mw-page-title-main">Single point of failure</span> A part whose failure will disrupt the entire system

A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system.

In computer science, robustness is the ability of a computer system to cope with errors during execution and cope with erroneous input. Robustness can encompass many areas of computer science, such as robust programming, robust machine learning, and Robust Security Network. Formal techniques, such as fuzz testing, are essential to showing robustness since this type of testing involves invalid or unexpected inputs. Alternatively, fault injection can be used to test robustness. Various commercial products perform robustness testing of software analysis.

Software fault tolerance is the ability of computer software to continue its normal operation despite the presence of system or hardware faults. Fault-tolerant software has the ability to satisfy requirements despite failures.

High availability software is software used to ensure that systems are running and available most of the time. High availability is a high percentage of time that the system is functioning. It can be formally defined as *100%. Although the minimum required availability varies by task, systems typically attempt to achieve 99.999% (5-nines) availability. This characteristic is weaker than fault tolerance, which typically seeks to provide 100% availability, albeit with significant price and performance penalties.

References

  1. Adaptive Fault Tolerance and Graceful Degradation, Oscar González et al., 1997, University of Massachusetts - Amherst
  2. Johnson, B. W. (1984). "Fault-Tolerant Microprocessor-Based Systems", IEEE Micro, vol. 4, no. 6, pp. 6–21
  3. 1 2 3 Daniel P. Siewiorek; C. Gordon Bell; Allen Newell (1982). Computer Structures: Principles and Examples . McGraw-Hill. ISBN   0-07-057302-6.
  4. Algirdas Avižienis; George C. Gilley; Francis P. Mathur; David A. Rennels; John A. Rohr; David K. Rubin. "The STAR (Self-Testing And Repairing) Computer: An Investigation Of the Theory and Practice Of Fault-tolerant Computer Design" (PDF).
  5. "Voyager Mission state (more often than not at least three months out of date)". NASA. Retrieved 2022-04-01.
  6. Randell, Brian; Lee, P.A.; Treleaven, P. C. (June 1978). "Reliability Issues in Computing System Design". ACM Computing Surveys . 10 (2): 123–165. doi:10.1145/356725.356729. ISSN   0360-0300. S2CID   16909447.
  7. P. J. Denning (December 1976). "Fault tolerant operating systems". ACM Computing Surveys. 8 (4): 359–389. doi:10.1145/356678.356680. ISSN   0360-0300. S2CID   207736773.
  8. Theodore A. Linden (December 1976). "Operating System Structures to Support Security and Reliable Software". ACM Computing Surveys. 8 (4): 409–445. doi:10.1145/356678.356682. hdl: 2027/mdp.39015086560037 . ISSN   0360-0300. S2CID   16720589.
  9. Ray Holt. "The F14A Central Air Data Computer, and the LSI Technology State-of-the-Art in 1968".
  10. Fault tolerant computing in computer design Neilforoshan, M.R Journal of Computing Sciences in Colleges archive Volume 18, Issue 4 (April 2003) Pages: 213 – 220, ISSN   1937-4771
  11. "History of TANDEM COMPUTERS, INC". FundingUniverse. Retrieved 2023-03-01.
  12. Nathaniel (17 March 2021). "Why your website should work without JavaScript". DEV Community. Retrieved 2021-05-16.
  13. Fairfax, Zackerie (2020-11-28). "Legacy Twitter Shutdown Means You Can't Tweet From The 3DS Anymore". Screen Rant . Retrieved 2021-07-01.
  14. Hudak, J.J.; Suh, B.-H.; Siewiorek, D.P.; Segall, Z. (1993). "Evaluation and comparison of fault-tolerant software techniques". IEEE Transactions on Reliability. 42 (2): 190–204. doi:10.1109/24.229487. ISSN   1558-1721.
  15. Stallings, W (2009): Operating Systems. Internals and Design Principles, sixth edition
  16. Thampi, Sabu M. (2009-11-23). "Introduction to Distributed Systems". arXiv: 0911.4395 [cs.DC].
  17. "Control". Grouper.ieee.org. Archived from the original on 1999-10-08. Retrieved 2016-04-06.
  18. Baha Al-Shaikh, Simon G. Stacey, Essentials of Equipment in Anaesthesia, Critical Care, and Peri-Operative Medicine (2017), p. 247.
  19. Dubrova, E. (2013). "Fault-Tolerant Design", Springer, 2013, ISBN   978-1-4614-2112-2
  20. Reliability evaluation of some fault-tolerant computer architectures. Springer-Verlag. November 1980. ISBN   978-3-540-10274-8.
  21. Herzberg, Amir; Shulman, Haya (2012). "Oblivious and Fair Server-Aided Two-Party Computation". 2012 Seventh International Conference on Availability, Reliability and Security. IEEE. pp. 75–84. doi:10.1109/ares.2012.28. ISBN   978-1-4673-2244-7. S2CID   6579295.
  22. Rigger, Manuel; Pekarek, Daniel; Mössenböck, Hanspeter (2018), "Context-Aware Failure-Oblivious Computing as a Means of Preventing Buffer Overflows", Network and System Security, Lecture Notes in Computer Science, Cham: Springer International Publishing, vol. 11058, pp. 376–390, arXiv: 1806.09026 , doi:10.1007/978-3-030-02744-5_28, ISBN   978-3-030-02743-8 , retrieved 2020-10-07
  23. Keromytis, Angelos D. (2007), "Characterizing Software Self-Healing Systems", in Gorodetski, Vladimir I.; Kotenko, Igor; Skormin, Victor A. (eds.), Characterizing Software Self-Healing Systems, Computer network security: Fourth International Conference on Mathematical Methods, Models, and Architectures for Computer Network Security, Springer, ISBN   978-3-540-73985-2
  24. 1 2 3 Long, Fan; Sidiroglou-Douskos, Stelios; Rinard, Martin (2014). "Automatic Runtime Error Repair and Containment via Recovery Shepherding". Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. PLDI '14'. New York, NY, US: ACM. pp. 227–238. doi: 10.1145/2594291.2594337 . ISBN   978-1-4503-2784-8. S2CID   6252501 .
  25. Laprie, J. C. (1985). "Dependable Computing and Fault Tolerance: Concepts and Terminology", Proceedings of 15th International Symposium on Fault-Tolerant Computing (FTSC-15), pp. 2–11
  26. von Neumann, J. (1956). "Probabilistic Logics and Synthesis of Reliable Organisms from Unreliable Components", in Automata Studies, eds. C. Shannon and J. McCarthy, Princeton University Press, pp. 43–98
  27. Avizienis, A. (1976). "Fault-Tolerant Systems", IEEE Transactions on Computers, vol. 25, no. 12, pp. 1304–1312