Triple modular redundancy

Last updated
Triple Modular Redundancy. Three identical logic circuits (logic gates) are used to compute the specified Boolean function. The set of data at the input of the first circuit are identical to the input of the second and third gates. Triple Modular Redundancy.JPG
Triple Modular Redundancy. Three identical logic circuits (logic gates) are used to compute the specified Boolean function. The set of data at the input of the first circuit are identical to the input of the second and third gates.
3-input majority gate using 4 NAND gates Majority Logic.png
3-input majority gate using 4 NAND gates

In computing, triple modular redundancy, sometimes called triple-mode redundancy, [1] (TMR) is a fault-tolerant form of N-modular redundancy, in which three systems perform a process and that result is processed by a majority-voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the fault.

Contents

The TMR concept can be applied to many forms of redundancy, such as software redundancy in the form of N-version programming, and is commonly found in fault-tolerant computer systems.

Space satellite systems often use TMR, [2] [3] although satellite RAM usually uses Hamming error correction. [4]

Some ECC memory uses triple modular redundancy hardware (rather than the more common Hamming code), because triple modular redundancy hardware is faster than Hamming error correction hardware. [5] Called repetition code, some communication systems use N-modular redundancy as a simple form of forward error correction. For example, 5-modular redundancy communication systems (such as FlexRay) use the majority of 5 samples – if any 2 of the 5 results are erroneous, the other 3 results can correct and mask the fault.

Modular redundancy is a basic concept, dating to antiquity, while the first use of TMR in a computer was the Czechoslovak computer SAPO, in the 1950s.

General case

The general case of TMR is called N-modular redundancy, in which any positive number of replications of the same action is used. The number is typically taken to be at least three, so that error correction by majority vote can take place; it is also usually taken to be odd, so that no ties may happen. [6]

Majority logic gate

Truth table of 3-input voter
INPUTOUTPUT
ABC〈A, B, C〉
0000
0010
0100
1000
0111
1011
1101
1111

In TMR, three identical logic circuits (logic gates) are used to compute the same set of specified Boolean function. If there are no circuit failures, the outputs of the three circuits are identical. But due to circuit failures, the outputs of the three circuits may be different.

A majority logic gate is used to decide which of the circuits' outputs is the correct output. The majority gate output is 1 if two or more of the inputs of the majority gate are 1; output is 0 if two or more of the majority gate's inputs are 0.

The majority logic gate is a simple AND–OR circuit: if the inputs to the majority gate are denoted by x, y and z, then the output of the majority gate is

Thus, the majority gate is the carry output of a full adder, i.e., the majority gate is a voting machine. [7]

TMR operation

Assuming the Boolean function computed by the three identical logic gates has value 1, then: (a) if no circuit has failed, all three circuits produce an output of value 1, and the majority gate output has value 1. (b) if one circuit fails and produces an output of 0, while the other two are working correctly and produce an output of 1, the majority gate output is 1, i.e., it still has the correct value. And similarly for the case when the Boolean function computed by the three identical circuits has value 0. Thus, the majority gate output is guaranteed to be correct as long as no more than one of the three identical logic circuits has failed. [7]

For a TMR system with a single voter of reliability (probability of working) Rv and three components of reliability Rm, the probability of it being correct can be shown to be RTMR = Rv (3 Rm2 – 2 Rm3). [6]

TMR systems should use data scrubbing – rewrite flip-flops periodically – in order to avoid accumulation of errors. [8]

Voter

Triple modular redundancy with one voter (top) and three voters (bottom) Triple Modular Redundancy et sa variante amelioree.png
Triple modular redundancy with one voter (top) and three voters (bottom)

The majority gate itself could fail. This can be protected against by applying triple redundancy to the voters themselves. [9]

In a few TMR systems, such as the Saturn Launch Vehicle Digital Computer and functional triple modular redundancy (FTMR) systems, the voters are also triplicated. Three voters are used – one for each copy of the next stage of TMR logic. In such systems there is no single point of failure. [10] [11]

Even though only using a single voter brings a single point of failure a failed voter will bring down the entire system most TMR systems do not use triplicated voters. This is because the majority gates are much less complex than the systems that they guard against, so they are much more reliable. [7] By using the reliability calculations, it is possible to find the minimum reliability of the voter for TMR to be a win. [6]

Chronometers

To use triple modular redundancy, a ship must have at least three chronometers; two chronometers provided dual modular redundancy, allowing a backup if one should cease to work, but not allowing any error correction if the two displayed a different time, since in case of contradiction between the two chronometers, it would be impossible to know which one was wrong (the error detection obtained would be the same of having only one chronometer and checking it periodically). Three chronometers provided triple modular redundancy, allowing error correction if one of the three was wrong, so the pilot would take the average of the two with closer reading (vote for average precision).

There is an old adage to this effect, stating: "Never go to sea with two chronometers; take one or three." [12]

Mainly this means that if two chronometers contradict, how do you know which one is correct? At one time this observation or rule was an expensive one as the cost of three sufficiently accurate chronometers was more than the cost of many types of smaller merchant vessels. [13] Some vessels carried more than three chronometers – for example, HMS Beagle carried 22 chronometers. [14] However, such a large number was usually only carried on ships undertaking survey work as was the case with the Beagle.

In the modern era, ships at sea use GNSS navigation receivers (with GPS, GLONASS & WAAS etc. support) – mostly running with WAAS or EGNOS support so as to provide accurate time (and location).

See also

Related Research Articles

<span class="mw-page-title-main">Logic gate</span> Device performing a Boolean function

A logic gate is an idealized or physical device that performs a Boolean function, a logical operation performed on one or more binary inputs that produces a single binary output.

In Boolean logic, the majority function is the Boolean function that evaluates to false when half or more arguments are false and true otherwise, i.e. the value of the function equals the value of the majority of the inputs. Representing true values as 1 and false values as 0, we may use the (real-valued) formula:

<span class="mw-page-title-main">Digital electronics</span> Electronic circuits that utilize digital signals

Digital electronics is a field of electronics involving the study of digital signals and the engineering of devices that use or produce them. This is in contrast to analog electronics and analog signals.

Electronic design automation (EDA), also referred to as electronic computer-aided design (ECAD), is a category of software tools for designing electronic systems such as integrated circuits and printed circuit boards. The tools work together in a design flow that chip designers use to design and analyze entire semiconductor chips. Since a modern semiconductor chip can have billions of components, EDA tools are essential for their design; this article in particular describes EDA specifically with respect to integrated circuits (ICs).

Quantum error correction (QEC) is used in quantum computing to protect quantum information from errors due to decoherence and other quantum noise. Quantum error correction is theorised as essential to achieve fault tolerant quantum computing that can reduce the effects of noise on stored quantum information, faulty quantum gates, faulty quantum preparation, and faulty measurements. This would allow algorithms of greater circuit depth.

Radiation hardening is the process of making electronic components and circuits resistant to damage or malfunction caused by high levels of ionizing radiation, especially for environments in outer space, around nuclear reactors and particle accelerators, or during nuclear accidents or nuclear warfare.

Formal equivalence checking process is a part of electronic design automation (EDA), commonly used during the development of digital integrated circuits, to formally prove that two representations of a circuit design exhibit exactly the same behavior.

In electronics and computing, a soft error is a type of error where a signal or datum is wrong. Errors may be caused by a defect, usually understood either to be a mistake in design or construction, or a broken component. A soft error is also a signal or datum which is wrong, but is not assumed to imply such a mistake or breakage. After observing a soft error, there is no implication that the system is any less reliable than before. One cause of soft errors is single event upsets from cosmic rays.

LEON is a radiation-tolerant 32-bit central processing unit (CPU) microprocessor core that implements the SPARC V8 instruction set architecture (ISA) developed by Sun Microsystems. It was originally designed by the European Space Research and Technology Centre (ESTEC), part of the European Space Agency (ESA), without any involvement by Sun. Later versions have been designed by Gaisler Research, under a variety of owners. It is described in synthesizable VHSIC Hardware Description Language (VHDL). LEON has a dual license model: An GNU Lesser General Public License (LGPL) and GNU General Public License (GPL) free and open-source software (FOSS) license that can be used without licensing fee, or a proprietary license that can be purchased for integration in a proprietary product. The core is configurable through VHDL generics, and is used in system on a chip (SOC) designs both in research and commercial settings.

Lockstep systems are fault-tolerant computer systems that run the same set of operations at the same time in parallel. The redundancy (duplication) allows error detection and error correction: the output from lockstep operations can be compared to determine if there has been a fault if there are at least two systems, and the error can be automatically corrected if there are at least three systems, via majority vote. The term "lockstep" originates from army usage, where it refers to synchronized walking, in which marchers walk as closely together as physically practical.

<span class="mw-page-title-main">Redundancy (engineering)</span> Duplication of critical components to increase reliability of a system

In engineering, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance, such as in the case of GNSS receivers, or multi-threaded computer processing.

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of one or more faults within some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability, mission-critical, or even life-critical systems. The ability of maintaining functionality when portions of a system break down is referred to as graceful degradation.

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers.

<span class="mw-page-title-main">ECC memory</span> Self-correcting computer data storage

Error correction code memory is a type of computer data storage that uses an error correction code (ECC) to detect and correct n-bit data corruption which occurs in memory. ECC memory is used in most computers where data corruption cannot be tolerated, like industrial control applications, critical databases, and infrastructural memory caches.

In reliability engineering, dual modular redundancy (DMR) is when components of a system are duplicated, providing redundancy in case one should fail. It is particularly applied to systems where the duplicated components work in parallel, particularly in fault-tolerant computer systems. A typical example is a complex computer system which has duplicated nodes, so that should one node fail, another is ready to carry on its work.

N-version programming (NVP), also known as multiversion programming or multiple-version dissimilar software, is a method or process in software engineering where multiple functionally equivalent programs are independently generated from the same initial specifications. The concept of N-version programming was introduced in 1977 by Liming Chen and Algirdas Avizienis with the central conjecture that the "independence of programming efforts will greatly reduce the probability of identical software faults occurring in two or more versions of the program". The aim of NVP is to improve the reliability of software operation by building in fault tolerance or redundancy.

Triconex is both the name of a Schneider Electric brand that supplies products, systems, and services for safety, critical control, and turbomachinery applications and the name of its hardware devices that utilize its TriStation application software. Triconex products are based on patented Triple modular redundancy (TMR) industrial safety-shutdown technology. Today, Triconex TMR products operate globally in more than 11,500 installations.

In quantum computing, the threshold theorem states that a quantum computer with a physical error rate below a certain threshold can, through application of quantum error correction schemes, suppress the logical error rate to arbitrarily low levels. This shows that quantum computers can be made fault-tolerant, as an analogue to von Neumann's threshold theorem for classical computation. This result was proven by the groups of Dorit Aharanov and Michael Ben-Or; Emanuel Knill, Raymond Laflamme, and Wojciech Zurek; and Alexei Kitaev independently. These results built off a paper of Peter Shor, which proved a weaker version of the threshold theorem.

Software fault tolerance is the ability of computer software to continue its normal operation despite the presence of system or hardware faults. Fault-tolerant software has the ability to satisfy requirements despite failures.

This is a list of the individual topics in Electronics, Mathematics, and Integrated Circuits that together make up the Computer Engineering field. The organization is by topic to create an effective Study Guide for this field. The contents match the full body of topics and detail information expected of a person identifying themselves as a Computer Engineering expert as laid out by the National Council of Examiners for Engineering and Surveying. It is a comprehensive list and superset of the computer engineering topics generally dealt with at any one time.

References

  1. "David Ratter. "FPGAs on Mars"" (PDF). Retrieved May 30, 2020.
  2. "Actel engineers use triple-module redundancy in new rad-hard FPGA". Military & Aerospace Electronics. Retrieved 2017-04-09.
  3. ECSS-Q-HB-60-02A  : Techniques for radiation effects mitigation in ASICs and FPGAs handbook
  4. "Commercial Microelectronics Technologies for Applications in the Satellite Radiation Environment". radhome.gsfc.nasa.gov. Archived from the original on March 4, 2001. Retrieved May 30, 2020.
  5. "Using StrongArm SA-1110 in the On-Board Computer of Nanosatellite". Tsinghua Space Center, Tsinghua University, Beijing. Archived from the original on 2011-10-02. Retrieved 2009-02-16.
  6. 1 2 3 Shooman, Martin L. (2002). "N-Modular Redundancy". Reliability of computer systems and networks: fault tolerance, analysis and design . Wiley-Interscience. pp.  145–201. doi:10.1002/047122460X.ch4. ISBN   9780471293422. Course notes
  7. 1 2 3 Dilip V. Sarwate, Lecture Notes for ECE 413 – Probability with Engineering Applications, Department of Electrical and Computer Engineering (ECE), UIUC College of Engineering, University of Illinois at Urbana-Champaign
  8. Zabolotny, Wojciech M.; Kudla, Ignacy M.; Pozniak, Krzysztof T.; Bunkowski, Karol; Kierzkowski, Krzysztof; Wrochna, Grzegorz; Krolikowski, Jan (2005-09-16). "Radiation tolerant design of RLBCS system for RPC detector in LHC experiment". In Romaniuk, Ryszard S.; Simrock, Stefan; Lutkovski, Vladimir M. (eds.). Photonics Applications in Industry and Research IV. Vol. 5948. Warsaw, Poland. pp. 59481E. doi:10.1117/12.622864. S2CID   15987757.{{cite book}}: CS1 maint: location missing publisher (link)
  9. A.W. Krings."Redundancy".2007
  10. Sandi Habinc (2002). "Functional Triple Modular Redundancy (FTMR): VHDL Design Methodology for Redundancy in Combinatorial and Sequential Logic" (PDF). Archived from the original (PDF) on 2012-06-05.
  11. Lyons, R. E.; Vanderkulk, W. (April 1962). "The Use of Triple-Modular Redundancy to Improve Computer Reliability" (PDF). IBM Journal of Research and Development. 6 (2): 200–209. doi:10.1147/rd.62.0200.
  12. Brooks, Frederick J. (1995) [1975]. The Mythical Man-Month. Addison-Wesley. p.  64. ISBN   978-0-201-83595-3.
  13. "Re: Longitude as a Romance". Irbs.com, Navigation mailing list. 2001-07-12. Archived from the original on 2011-05-20. Retrieved 2009-02-16.
  14. R. Fitzroy. "Volume II: Proceedings of the Second Expedition". p. 18.