Active redundancy

Active redundancy is a design concept that increases operational availability and that reduces operating cost by automating most critical maintenance actions.

This concept is related to condition-based maintenance and fault reporting. [1]

History

The initial requirement began with military combat systems during World War I. The approach used for survivability was to install thick armor plate to resist gun fire and install multiple guns.

This became unaffordable and impractical during the Cold War when aircraft and missile systems became common.

The new approach was to build distributed systems that continue to work when components are damaged. This depends upon very crude forms of artificial intelligence that perform reconfiguration by obeying specific rules. An example of this approach is the AN/UYK-43 computer.

Formal design philosophies involving active redundancy are required for critical systems where corrective maintenance labor is undesirable or impractical during normal operation.

Commercial aircraft are required to have multiple redundant computing systems, hydraulic systems, and propulsion systems so that a single in-flight equipment failure will not cause loss of life.

A more recent outcome of this work is the Internet, which relies on a backbone of routers that provide the ability to automatically re-route communication without human intervention when failures occur.

Satellites placed into orbit around the Earth must include massive active redundancy to ensure that operation continues for a decade or longer despite normal component failures, radiation-induced failures, and thermal shock.

This strategy now dominates space systems, aircraft, and missile systems.

Principle

Maintenance requires three actions, which usually involve down time and high priority labor costs:

Active redundancy eliminates down time and reduces manpower requirements by automating all three actions. This requires some amount of automated artificial intelligence.

In "N+1" notation, N stands for the equipment needed to carry the load. The amount of excess capacity beyond N affects overall system reliability by limiting the effects of failure.

For example, if it takes two generators to power a city, then "N+1" would be three generators to allow a single failure. Similarly, "N+2" would be four generators, which would allow one generator to fail while a second generator has already failed.
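
The generator example can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the availability calculation assumes that every unit has the same availability and that units fail independently, which the article does not state.

```python
from math import comb


def units_installed(needed: int, spares: int) -> int:
    """Total units in an N+M arrangement (N needed, M spare)."""
    return needed + spares


def survives(needed: int, spares: int, failed: int) -> bool:
    """True if enough units remain to carry the load after `failed` failures."""
    return units_installed(needed, spares) - failed >= needed


def system_availability(needed: int, spares: int, unit_availability: float) -> float:
    """Probability that at least `needed` of the installed units are working.

    Assumption (not from the article): identical units, independent failures.
    """
    total = units_installed(needed, spares)
    return sum(
        comb(total, working)
        * unit_availability ** working
        * (1.0 - unit_availability) ** (total - working)
        for working in range(needed, total + 1)
    )


# Two generators power the city: N+1 installs three, N+2 installs four.
print(units_installed(2, 1), units_installed(2, 2))
# N+2 tolerates two failed generators, N+1 does not.
print(survives(2, 2, failed=2), survives(2, 1, failed=2))
```

With an illustrative unit availability of 0.9, `system_availability(2, 0, 0.9)` gives about 0.81 while `system_availability(2, 1, 0.9)` gives about 0.972, which shows how one spare limits the effect of a single failure.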

Active redundancy improves operational availability as follows.

Passive components

Active redundancy in passive components requires redundant components that share the burden when failure occurs, as in cabling and piping.

This allows forces to be redistributed across a bridge to prevent failure if a vehicle ruptures a cable. [2]

This allows water flow to be redistributed through pipes when a limited number of valves are shut or pumps shut down. [3]
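
Burden sharing among passive components can be illustrated with simple arithmetic. The sketch below assumes that the surviving components share the load equally and that each has a fixed rated capacity; real bridges and pipe networks redistribute load unevenly, so the names and the model are purely illustrative.

```python
def load_per_component(total_load: float, working: int) -> float:
    """Load carried by each working component under equal sharing (an assumption)."""
    if working <= 0:
        raise ValueError("no working components left to carry the load")
    return total_load / working


def tolerates_failures(total_load: float, installed: int, failed: int,
                       rated_capacity: float) -> bool:
    """True if the remaining components can each carry their share of the load."""
    working = installed - failed
    return working > 0 and load_per_component(total_load, working) <= rated_capacity


# Four cables rated at 300 units each carry a total load of 900 units.
print(tolerates_failures(900.0, installed=4, failed=1, rated_capacity=300.0))
print(tolerates_failures(900.0, installed=4, failed=2, rated_capacity=300.0))
```

In this hypothetical case the structure tolerates one ruptured cable but not two, because two surviving cables would each have to carry 450 units.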

Active components

Active redundancy in active components requires reconfiguration when failure occurs. Computer programming must recognize the failure and automatically reconfigure to restore operation.
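
A minimal sketch of this behavior, assuming each unit exposes a health probe, is shown below. The `Unit` and `RedundantGroup` classes and the `healthy` callback are hypothetical names chosen for illustration; they do not describe any particular system mentioned in this article.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Unit:
    name: str
    healthy: Callable[[], bool]  # hypothetical health probe supplied by the caller


@dataclass
class RedundantGroup:
    units: List[Unit]
    active: Optional[Unit] = None
    isolated: List[str] = field(default_factory=list)

    def reconfigure(self) -> Optional[Unit]:
        """Recognize a failed active unit, isolate it, and switch to a healthy spare."""
        # Recognize the failure and take the failed unit out of service.
        if self.active is not None and not self.active.healthy():
            self.isolated.append(self.active.name)
            self.active = None
        # Reconfigure: select the first unit that is neither isolated nor failing.
        if self.active is None:
            for unit in self.units:
                if unit.name not in self.isolated and unit.healthy():
                    self.active = unit
                    break
        return self.active


# Simulated health status for two units; a real system would probe hardware.
status = {"primary": True, "spare": True}
group = RedundantGroup([
    Unit("primary", lambda: status["primary"]),
    Unit("spare", lambda: status["spare"]),
])
group.reconfigure()          # selects "primary"
status["primary"] = False    # simulate a failure
group.reconfigure()          # isolates "primary" and switches to "spare"
```

Calling `reconfigure` periodically, or whenever a fault is reported, restores operation without an operator as long as a healthy spare remains; it returns `None` once every unit has failed.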

All modern computers provide the following when an existing feature is enabled via fault reporting.

Mechanical devices must also reconfigure. One example is the transmission settings on hybrid vehicles that have redundant propulsion systems: the petroleum engine starts when battery power fails.

Electrical power systems must perform two actions to prevent total system failure when smaller failures occur, such as when a tree falls across a power line. Power systems incorporate communication, switching, and automatic scheduling that allows these actions to be automated.

Benefits

This is the only known strategy that can achieve high availability.

Detriments

This maintenance philosophy requires custom development with extra components.

References

  1. "Opnav Instruction 4790.16: Condition Based Maintenance". US Navy Operations. Archived from the original on 2013-02-15. Retrieved 2012-08-15.
  2. "Bridge System Safety and Redundancy". Transportation Research Board.
  3. "Water Systems". Boston Water and Sewer Commission. Archived from the original on 2012-09-21. Retrieved 2012-08-15.