Redundancy (engineering)

Common redundant power supply
Redundant subsystem "B"
Extensively redundant rear lighting installation on a Thai tour bus

In engineering and systems theory, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance, such as in the case of GNSS receivers, or multi-threaded computer processing.

In many safety-critical systems, such as fly-by-wire and hydraulic systems in aircraft, some parts of the control system may be triplicated, [1] which is formally termed triple modular redundancy (TMR). An error in one component may then be out-voted by the other two. In a triply redundant system, the system has three subcomponents, all three of which must fail before the system fails. Since each one rarely fails, and the subcomponents are designed to preclude common failure modes (so that their failures can be modelled as independent), the probability of all three failing is calculated to be extraordinarily small; it is often outweighed by other risk factors, such as human error. Electrical surges arising from lightning strikes are an example of a failure mode that is difficult to fully isolate, unless the components are powered from independent power buses and have no direct electrical pathway in their interconnect (communication by some means is required for voting). Redundancy may also be known by the terms "majority voting systems" [2] or "voting logic". [3]

A suspension bridge's numerous cables are a form of redundancy.

Redundancy sometimes produces less, rather than greater, reliability: it creates a more complex system that is prone to various issues, it may lead to human neglect of duty, and it may lead to higher production demands which, by overstressing the system, may make it less safe. [4]

Redundancy is one form of robustness as practiced in computer science.

Geographic redundancy has become important in the data center industry, to safeguard data against natural disasters and political instability (see below).

Forms of redundancy

In computer science, there are four major forms of redundancy: [5]

A modified form of software redundancy, applied to hardware may be:

Structures are usually designed with redundant parts as well, ensuring that if one part fails, the entire structure will not collapse. A structure without redundancy is called fracture-critical, meaning that a single broken component can cause the collapse of the entire structure. Bridges that failed due to lack of redundancy include the Silver Bridge and the Interstate 5 bridge over the Skagit River.

Parallel and combined systems demonstrate different levels of redundancy. These models are the subject of studies in reliability and safety engineering. [6]

Dissimilar redundancy

Unlike traditional redundancy, which uses more than one of the same thing, dissimilar redundancy uses different things. The idea is that the different things are unlikely to contain identical flaws. The voting method may involve additional complexity if the two things take different amounts of time. Dissimilar redundancy is often used with software, because identical software contains identical flaws.
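The idea can be illustrated with a minimal Python sketch. It assumes two independently written square-root routines stand in for the "different things"; the function names and the tolerance are hypothetical and chosen only for illustration. With only two versions, a disagreement can be detected but not resolved.

```python
import math


def sqrt_newton(x: float, iterations: int = 30) -> float:
    """Square root via Newton's method (first, hand-written implementation)."""
    if x < 0:
        raise ValueError("negative input")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


def sqrt_library(x: float) -> float:
    """Square root via the standard library (second, independent implementation)."""
    return math.sqrt(x)


def dissimilar_sqrt(x: float, tolerance: float = 1e-9) -> float:
    """Return a result only if both dissimilar implementations agree."""
    a = sqrt_newton(x)
    b = sqrt_library(x)
    if not math.isclose(a, b, rel_tol=tolerance, abs_tol=tolerance):
        # Two versions can detect a fault but cannot tell which one is wrong.
        raise RuntimeError(f"implementations disagree: {a} vs {b}")
    return a
```

The fixed iteration count keeps the sketch short and is adequate only for moderately sized inputs; a third dissimilar version would be needed to out-vote a faulty one.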

The chance of failure is reduced by using at least two different types of each of the following

Geographic redundancy

Geographic redundancy corrects the vulnerabilities of redundant devices deployed in one place by geographically separating backup devices. It reduces the likelihood that events such as power outages, floods, HVAC failures, lightning strikes, tornadoes, building fires, wildfires, and mass shootings will disable most, if not all, of the system.

Geographic redundancy locations can be

The following methods can reduce the risks of damage by a fire conflagration:

Geographic redundancy is used by Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, Netflix, Dropbox, Salesforce, LinkedIn, PayPal, Twitter, Facebook, Apple iCloud, Cisco Meraki, and many others to provide high availability and fault tolerance for their cloud services. [15]

As another example, to minimize risk of damage from severe windstorms or water damage, buildings can be located at least 2 miles (3.2 km) away from the shore, with an elevation of at least 5 feet (1.5 m) above sea level. For additional protection, they can be located at least 100 feet (30 m) away from flood plain areas. [16] [17]

Functions of redundancy

The two functions of redundancy are passive redundancy and active redundancy. Both use extra capacity to prevent performance decline from exceeding specification limits without human intervention.

Passive redundancy uses excess capacity to reduce the impact of component failures. One common form of passive redundancy is the extra strength of cabling and struts used in bridges. This extra strength allows some structural components to fail without bridge collapse. The extra strength used in the design is called the margin of safety.

Eyes and ears provide working examples of passive redundancy. Vision loss in one eye does not cause blindness but depth perception is impaired. Hearing loss in one ear does not cause deafness but directionality is lost. Performance decline is commonly associated with passive redundancy when a limited number of failures occur.
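As a rough illustration of passive redundancy in a structure, the following Python sketch assumes a deliberately simplified model in which the total load is shared equally among the remaining cables and every cable has the same capacity. Real structural analysis is far more involved; the function names and figures are hypothetical.

```python
def load_per_cable(total_load: float, cables: int, failed: int) -> float:
    """Load carried by each remaining cable, assuming equal load sharing."""
    remaining = cables - failed
    if remaining <= 0:
        raise ValueError("no cables remain")
    return total_load / remaining


def max_tolerable_failures(total_load: float, cables: int, capacity: float) -> int:
    """Largest number of failed cables for which no remaining cable is overloaded.

    Returns -1 if the structure is overloaded even with every cable intact.
    """
    failed = 0
    while failed < cables and total_load / (cables - failed) <= capacity:
        failed += 1
    return failed - 1


# Ten cables, each rated for 250 units, carrying 1000 units in total.
print(max_tolerable_failures(1000.0, 10, 250.0))
```

Each failure raises the load on the surviving cables, which is the performance decline described above: the structure keeps standing, but with a shrinking margin.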

Active redundancy eliminates performance declines by monitoring the performance of individual devices, and this monitoring is used in voting logic. The voting logic is linked to switching that automatically reconfigures the components. Error detection and correction and the Global Positioning System (GPS) are two examples of active redundancy.

Electrical power distribution provides an example of active redundancy. Several power lines connect each generation facility with customers. Each power line includes monitors that detect overload. Each power line also includes circuit breakers. The combination of power lines provides excess capacity. Circuit breakers disconnect a power line when monitors detect an overload. Power is redistributed across the remaining lines. At the Toronto Airport, there are 4 redundant electrical lines. Each of the 4 lines supplies enough power for the entire airport. A spot network substation uses reverse current relays to open breakers to lines that fail, but lets power continue to flow to the airport.
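The monitor-and-breaker behaviour can be sketched in Python under strong simplifying assumptions: demand is split equally across the lines in service, and a breaker trips whenever a line's share exceeds its capacity. The function name and the equal-split rule are assumptions for illustration, not a model of any real grid.

```python
def redistribute(demand: float, capacities: list[float], failed: set[int]) -> dict[int, float]:
    """Split demand equally across in-service lines, tripping any overloaded line.

    Returns a mapping from line index to the load it carries.
    Raises RuntimeError if every line ends up disconnected.
    """
    in_service = [i for i in range(len(capacities)) if i not in failed]
    while in_service:
        share = demand / len(in_service)
        # Monitors: find lines whose share would exceed their capacity.
        overloaded = [i for i in in_service if share > capacities[i]]
        if not overloaded:
            return {i: share for i in in_service}
        # Breakers: disconnect overloaded lines, then redistribute again.
        in_service = [i for i in in_service if i not in overloaded]
    raise RuntimeError("no line can carry the demand")


# Four lines, each able to carry the whole demand on its own.
print(redistribute(100.0, [100.0, 100.0, 100.0, 100.0], failed={0, 1, 2}))
```

When each line can carry the full demand alone, as in the airport example, three of the four lines can be lost without interruption. With smaller capacities, the same loop shows how one trip can cascade into others.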

Electrical power systems use power scheduling to reconfigure active redundancy. Computing systems adjust the production output of each generating facility when other generating facilities are suddenly lost. This prevents blackout conditions during major events such as an earthquake.

Disadvantages

Charles Perrow, author of Normal Accidents, has said that sometimes redundancies backfire and produce less, not more, reliability. This may happen in three ways: First, redundant safety devices result in a more complex system, more prone to errors and accidents. Second, redundancy may lead to shirking of responsibility among workers. Third, redundancy may lead to increased production pressures, resulting in a system that operates at higher speeds, but less safely. [4]

Voting logic

Voting logic uses performance monitoring to determine how to reconfigure individual components so that operation continues without violating specification limitations of the overall system. Voting logic often involves computers, but systems composed of items other than computers may be reconfigured using voting logic. Circuit breakers are an example of a form of non-computer voting logic.

The simplest voting logic in computing systems involves two components: primary and alternate. They both run similar software, but the output from the alternate remains inactive during normal operation. The primary monitors itself and periodically sends an activity message to the alternate as long as everything is OK. All outputs from the primary stop, including the activity message, when the primary detects a fault. The alternate activates its output and takes over from the primary after a brief delay when the activity message ceases. Errors in voting logic can cause both outputs to be active or inactive at the same time, or cause outputs to flutter on and off.
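A minimal Python sketch of the alternate's side of this arrangement follows. It assumes time is passed in explicitly as a number so the behaviour can be simulated; the class name, method names, and timeout value are hypothetical.

```python
class Alternate:
    """Standby unit that activates when the primary's activity messages stop."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout      # the brief delay before takeover
        self.last_heartbeat = 0.0   # time of the most recent activity message
        self.active = False         # output stays inactive in normal operation

    def receive_heartbeat(self, now: float) -> None:
        """Record an activity message from the primary."""
        self.last_heartbeat = now

    def tick(self, now: float) -> bool:
        """Take over if the primary has been silent for longer than the timeout."""
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.active = True
        return self.active


alternate = Alternate(timeout=3.0)
for t in (0.0, 1.0, 2.0):           # primary is healthy and reports in
    alternate.receive_heartbeat(t)
print(alternate.tick(2.5))          # primary still considered alive
print(alternate.tick(5.5))          # silence exceeded the timeout
```

The sketch never hands control back to the primary. It also cannot distinguish a failed primary from a lost message, which is one way both outputs could end up active at the same time.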

A more reliable form of voting logic involves an odd number of devices, three or more. All perform identical functions, and the outputs are compared by the voting logic. The voting logic establishes a majority when there is a disagreement, and the majority will act to deactivate the output from other device(s) that disagree. A single fault will not interrupt normal operation. This technique is used with avionics systems, such as those responsible for operation of the Space Shuttle.
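Majority voting of this kind can be sketched in a few lines of Python. The example assumes device outputs are directly comparable values; the function name is hypothetical, and real voters often compare within a tolerance rather than for exact equality.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> tuple[Hashable, list[int]]:
    """Return the majority output and the indices of devices that disagree with it."""
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among device outputs")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


# Three devices; the third has a fault and is out-voted.
value, to_deactivate = majority_vote([42, 42, 41])
print(value, to_deactivate)
```

The returned list of dissenting indices corresponds to the devices whose outputs the majority would deactivate.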

Calculating the probability of system failure

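Each duplicate component added to the system decreases the probability of system failure according to the formula:

$$ p = \prod_{i=1}^{n} p_i $$

where:

$n$ is the number of components
$p_i$ is the probability of component $i$ failing
$p$ is the probability of all components failing (system failure)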

This formula assumes independence of failure events. That means that the probability of a component B failing given that a component A has already failed is the same as that of B failing when A has not failed. There are situations where this is unreasonable, such as using two power supplies connected to the same socket in such a way that if one power supply failed, the other would too.

It also assumes that only one component is needed to keep the system running.
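Under the two assumptions just stated (independent failures, and one surviving component being enough), the probability of system failure is the product of the individual failure probabilities. The Python sketch below computes that product. As an extension not taken from the text above, it also computes the case where at least k of n identical components must survive. Function names are hypothetical.

```python
from math import comb, prod
from typing import Iterable


def all_fail_probability(failure_probs: Iterable[float]) -> float:
    """Probability that every component fails, assuming independent failures."""
    return prod(failure_probs)


def k_of_n_failure_probability(p: float, n: int, k: int) -> float:
    """Probability that fewer than k of n identical, independent components survive.

    p is the failure probability of a single component.
    """
    return sum(comb(n, s) * (1 - p) ** s * p ** (n - s) for s in range(k))


# Three components, each failing with probability 0.1.
print(all_fail_probability([0.1, 0.1, 0.1]))    # one survivor is enough
print(k_of_n_failure_probability(0.1, 3, 2))    # a two-out-of-three majority is required
```

Requiring a majority rather than a single survivor gives a noticeably higher failure probability for the same components. The second assumption therefore matters when applying the formula to voting systems.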

Redundancy and high availability


You can achieve higher availability through redundancy. Suppose you have three redundant components: A, B and C. You can use the following formula to calculate the availability of the overall system:

Availability of redundant components = 1 - (1 - availability of component A) × (1 - availability of component B) × (1 - availability of component C) [18] [19]

As a corollary, if you have N parallel components each having X availability, then:

Availability of parallel components = 1 - (1 - X)^N

10 hosts, each having 50% availability. But if they are used in parallel and fail independently, they can provide high availability.

Using redundant components can exponentially increase the availability of the overall system. [19] For example, if each of your hosts has only 50% availability, by using 10 hosts in parallel you can achieve 99.9023% availability.
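The parallel-availability calculation can be checked with a short Python sketch. It assumes, as the formula does, that components fail independently; the function name is hypothetical.

```python
from math import prod
from typing import Iterable


def parallel_availability(availabilities: Iterable[float]) -> float:
    """Availability of independent components in parallel (any one suffices)."""
    return 1 - prod(1 - a for a in availabilities)


# Ten hosts, each with 50% availability.
print(f"{parallel_availability([0.5] * 10):.4%}")  # prints 99.9023%
```

Each added host multiplies the remaining unavailability by (1 - X), which is why the improvement is exponential in the number of hosts.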

Note that redundancy doesn’t always lead to higher availability. In fact, redundancy increases complexity, which in turn can reduce availability. According to Marc Brooker, to take advantage of redundancy, ensure that: [20]

  1. You achieve a net-positive improvement in the overall availability of your system
  2. Your redundant components fail independently
  3. Your system can reliably detect healthy redundant components
  4. Your system can reliably scale out and scale in redundant components.
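The first condition can be illustrated with a small Python sketch. It assumes that adding redundancy also introduces one shared dependency in series with the redundant hosts (for example a load balancer or failover mechanism), and that all failures are independent. This series-parallel model and the numbers used are illustrative assumptions, not taken from the cited source.

```python
def availability_with_shared_dependency(
    host_availability: float, hosts: int, shared_availability: float
) -> float:
    """Availability of N parallel hosts placed behind one shared component in series."""
    redundant_part = 1 - (1 - host_availability) ** hosts
    return shared_availability * redundant_part


single_host = 0.99
# Two hosts behind a highly available shared component: a net improvement.
print(availability_with_shared_dependency(0.99, 2, 0.999) > single_host)
# Two hosts behind a less reliable shared component: worse than one host alone.
print(availability_with_shared_dependency(0.99, 2, 0.98) > single_host)
```

In this model the shared component caps the availability of the whole arrangement, so redundancy pays off only when the added machinery is more reliable than the hosts it protects.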

References

  1. Redundancy Management Technique for Space Shuttle Computers (PDF), IBM Research
  2. R. Jayapal (2003-12-04). "Analog Voting Circuit Is More Flexible Than Its Digital Version". elecdesign.com. Archived from the original on 2007-03-03. Retrieved 2014-06-01.
  3. "The Aerospace Corporation | Assuring Space Mission Success". Aero.org. 2014-05-20. Retrieved 2014-06-01.
  4. Scott D. Sagan (March 2004). "Learning from Normal Accidents" (PDF). Organization & Environment. Archived from the original (PDF) on 2004-07-14.
  5. Koren, Israel; Krishna, C. Mani (2007). Fault-Tolerant Systems. San Francisco, CA: Morgan Kaufmann. p. 3. ISBN 978-0-12-088525-1.
  6. Smithsonian Institution | Office of Safety, Health, and Environmental Management | Fire Protection and Life Safety Design Manual | Independent Sources | Facilities with a maximum possible fire loss exceeding $50 million must have two independent sources of fire protection water.
  7. Why Dissimilar Redundant Architectures Are a Necessity for DAL A | Curtis Wright Defense Systems
  8. Fire Alarm Circuits | A Class X circuit will continue to work with a single open or a single short-circuit by use of a redundant path.
  9. Protecting against the power of lightning | to protect against induced surges rather than direct lightning strikes. Feb 1st, 2005 Twisted pair
  10. Data Center Site Redundancy | H. M. Brotherton and J. Eric Dietz | Computer Information Technology, Purdue University
  11. Factory Mutual Insurance Company | 1-20 Protection Against Exterior Fire Exposure
  12. National Research Council | Canada | Division Of Building Research | Spatial Separation Of Buildings | November 1959
  13. Tall Building Design Guidelines | City of Toronto | March 2013 | Page 52 | the separation distance between towers on the same site of 25 meters or more
  14. Protecting Residences From Wildfires | by Howard E. Moore (General Technical Report PSW-50) | page 30, item 10.
  15. On-Premises Cloud Is a Failure. Google Has the Fix | Elias Khnaser | 05/17/2023
  16. https://www.archives.gov/files/records-mgmt/storage-standards-toolkit/file3.pdf Facility Standards for Records Storage Facilities
  17. https://www.archives.gov/preservation/storage/presidential-library-standards.html Standards for Permanent Records Storage and Presidential Libraries
  18. Sandborn, Peter; Lucyshyn, William (2022). System Sustainment: Acquisition And Engineering Processes For The Sustainment Of Critical And Legacy Systems. World Scientific. ISBN 9789811256868.
  19. Trivedi, Kishor S.; Bobbio, Andrea (2017). Reliability and Availability Engineering: Modeling, Analysis, and Applications. Cambridge University Press. ISBN 978-1107099500.
  20. Vitillo, Roberto (23 February 2022). Understanding Distributed Systems, Second Edition: What every developer should know about large distributed applications. Roberto Vitillo. ISBN 978-1838430214.