Cascading failure

An animation demonstrating how a single failure may result in other failures throughout a network.

A cascading failure is a failure in a system of interconnected parts in which the failure of one or a few parts leads to the failure of other parts, growing progressively as a result of positive feedback. This can occur when a single part fails, increasing the probability that other portions of the system fail. [1] [2] Such a failure may happen in many types of systems, including power transmission, computer networking, finance, transportation systems, organisms, the human body, and ecosystems.

Cascading failures may occur when one part of the system fails. When this happens, other parts must then compensate for the failed component. This in turn overloads these nodes, causing them to fail as well, prompting additional nodes to fail one after another.

In power transmission

Cascading failure is common in power grids: when one element fails (completely or partially), it shifts its load to nearby elements in the system. Those nearby elements are then pushed beyond their capacity, become overloaded, and shift their load onto other elements. The effect is typical of high-voltage systems, where a single point of failure (SPOF) on a fully loaded or slightly overloaded system results in a sudden current spike across all nodes of the system. This surge current can push the already overloaded nodes into failure, setting off more overloads and thereby taking down the entire system in a very short time.

This failure process cascades through the elements of the system like a ripple on a pond and continues until substantially all of the elements in the system are compromised and/or the system becomes functionally disconnected from the source of its load. For example, under certain conditions a large power grid can collapse after the failure of a single transformer.
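The load-shifting mechanism described above can be illustrated with a deliberately simple toy model. The sketch below is not a power-flow calculation. It assumes that a fixed total load is shared equally by all surviving elements (rather than only by nearby ones) and that an element trips as soon as its share exceeds its capacity. The function name `parallel_lines_cascade` and all numbers are hypothetical.

```python
def parallel_lines_cascade(total_load, capacities, tripped):
    """Return the set of failed element indices once the cascade settles.

    Toy assumption: the total load is split equally among all surviving
    elements, and any element whose share exceeds its capacity fails.
    """
    failed = set(tripped)
    while True:
        alive = [i for i in range(len(capacities)) if i not in failed]
        if not alive:
            return failed            # total collapse: nothing left to carry load
        share = total_load / len(alive)
        newly_failed = {i for i in alive if share > capacities[i]}
        if not newly_failed:
            return failed            # survivors can carry the load; cascade stops
        failed |= newly_failed


# Five identical elements, each able to carry 100 units (hypothetical values).
capacities = [100] * 5

# Lightly loaded system: losing one element is absorbed by the other four.
print(sorted(parallel_lines_cascade(300, capacities, {0})))   # [0]

# Nearly fully loaded system: the same single failure takes down everything.
print(sorted(parallel_lines_cascade(420, capacities, {0})))   # [0, 1, 2, 3, 4]
```

In the second call each element initially carries 84 units, which is within its capacity. After one element trips, the remaining four would each need to carry 105 units, so all of them fail in the next step.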

Real-time monitoring of a system and judicious disconnection of parts can help stop a cascade. Another common technique is to calculate a safety margin for the system by computer simulation of possible failures, to establish safe operating levels below which none of the calculated scenarios is predicted to cause cascading failure, and to identify the parts of the network which are most likely to cause cascading failures. [3]
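The simulation-based approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the cited work: the network is an adjacency dictionary, a failed element hands its load in equal shares to its surviving neighbours, and an element fails when its load exceeds its capacity. Real studies use power-flow simulations instead. All names (`cascade_size`, `screen_single_failures`, `highest_safe_level`) and the example network are hypothetical.

```python
def cascade_size(neighbors, loads, capacities, start):
    """Number of elements lost when `start` fails (including `start`)."""
    loads = dict(loads)              # work on a copy
    failed = set()
    pending = [start]
    while pending:
        node = pending.pop()
        if node in failed:
            continue
        failed.add(node)
        alive = [m for m in neighbors[node] if m not in failed]
        if not alive:
            continue                 # no surviving neighbour: the load is dropped
        share = loads[node] / len(alive)
        for m in alive:
            loads[m] += share
            if loads[m] > capacities[m]:
                pending.append(m)
    return len(failed)


def screen_single_failures(neighbors, loads, capacities):
    """Simulate every single-element failure and report the resulting cascade size."""
    return {n: cascade_size(neighbors, loads, capacities, n) for n in neighbors}


def highest_safe_level(neighbors, base_loads, capacities, levels):
    """Largest load scaling in `levels` at which no single failure spreads."""
    safe = None
    for level in sorted(levels):
        scaled = {n: level * load for n, load in base_loads.items()}
        sizes = screen_single_failures(neighbors, scaled, capacities)
        if any(size > 1 for size in sizes.values()):
            break
        safe = level
    return safe


# Hypothetical four-element network; D hangs off C as a leaf.
neighbors = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
base_loads = {n: 10.0 for n in neighbors}
capacities = {n: 20.0 for n in neighbors}

print(highest_safe_level(neighbors, base_loads, capacities, [0.5, 0.75, 1.0, 1.25, 1.5]))
# 1.0

overloaded = {n: 12.5 for n in neighbors}
print(screen_single_failures(neighbors, overloaded, capacities))
# {'A': 1, 'B': 1, 'C': 1, 'D': 4}
```

The second result shows how such a screening identifies the critical part: at the higher load level only the failure of D, which dumps its entire load onto C, brings down the whole network.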

One of the primary problems in preventing electrical grid failures is that the control signal travels no faster than the propagating power overload. Because both the control signal and the electrical power move at the same speed, it is not possible to isolate the outage by sending a warning ahead to isolate the element.

Examples

Cascading failure caused the following power outages:

In computer networks

Cascading failures can also occur in computer networks (such as the Internet), where failing or disconnected hardware or software severely impairs or halts network traffic to or between large sections of the network. In this context, a cascading failure is known as a cascade failure. A cascade failure can affect large groups of people and systems.

The cause of a cascade failure is usually the overloading of a single, crucial router or node, which causes the node to go down, even briefly. It can also be caused by taking a node down for maintenance or upgrades. In either case, traffic is routed to or through another (alternative) path. This alternative path, as a result, becomes overloaded, causing it to go down, and so on. The failure also affects systems which depend on the node for regular operation.

Symptoms

The symptoms of a cascade failure include packet loss and high network latency, not just to single systems, but to whole sections of a network or the internet. The high latency and packet loss are caused by nodes that fail to operate due to congestion collapse: they are still present in the network, but little or no useful communication passes through them. As a result, routes can still be considered valid without actually providing communication.

If enough routes go down because of a cascade failure, a complete section of the network or internet can become unreachable. Although undesired, this can help speed up the recovery from this failure as connections will time out, and other nodes will give up trying to establish connections to the section(s) that have become cut off, decreasing load on the involved nodes.

A common occurrence during a cascade failure is a walking failure, where sections go down, causing the next section to fail, after which the first section comes back up. This ripple can make several passes through the same sections or connecting nodes before stability is restored.

History

Cascade failures are a relatively recent development, a consequence of the massive increase in traffic and the high interconnectivity between systems and networks. The term was first applied in this context in the late 1990s by a Dutch IT professional and has slowly become a relatively common term for this kind of large-scale failure. [citation needed]

Example

Network failures typically start when a single network node fails. Initially, the traffic that would normally go through the node is stopped. Systems and users get errors about not being able to reach hosts. Usually, the redundant systems of an ISP respond very quickly, choosing another path through a different backbone. This alternative route is longer, with more hops, and therefore passes through more systems, which normally do not process the amount of traffic suddenly offered.

This can cause one or more systems along the alternative route to go down, creating similar problems of their own.

Related systems are also affected in this case. For example, DNS resolution might fail, so that connections which would normally be established break, even between systems that are not directly involved with the ones that went down. This, in turn, may cause seemingly unrelated nodes to develop problems, which can cause another cascade failure all on its own.

In December 2012, a partial loss (40%) of Gmail service occurred globally for 18 minutes. The loss of service was caused by a routine update of load-balancing software that contained faulty logic: the software used an inappropriate 'all' where 'some' would have been appropriate. [4] The cascading error was fixed by fully updating a single node in the network instead of partially updating all nodes at one time.

Cascading structural failure

Certain load-bearing structures with discrete structural components can be subject to the "zipper effect", where the failure of a single structural member increases the load on adjacent members. In the case of the Hyatt Regency walkway collapse, a suspended walkway (which was already overstressed due to an error in construction) failed when a single vertical suspension rod failed, overloading the neighboring rods which failed sequentially (i.e. like a zipper). A bridge that can have such a failure is called fracture critical, and numerous bridge collapses have been caused by the failure of a single part. Properly designed structures use an adequate factor of safety and/or alternate load paths to prevent this type of mechanical cascade failure. [5]
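The role of the factor of safety in arresting a zipper-type failure can be illustrated with a toy calculation. The sketch below is a hypothetical simplification and not a model of the Hyatt Regency collapse or of any real structure: identical rods in a row each carry the same load, each rod's capacity is the factor of safety times that load, and a failed rod's load moves in equal shares to the nearest intact rod on each side. The function name `zipper_failure` and its parameters are assumptions made for illustration.

```python
def zipper_failure(n_rods, load_per_rod, factor_of_safety, first_failed):
    """Return rod indices in the order in which they fail.

    Toy assumptions: identical rods in a row, capacity equal to
    factor_of_safety * load_per_rod, and a failed rod's load passed to the
    nearest intact rod on each side (all of it to one side if the other
    side has no intact rod left).
    """
    capacity = factor_of_safety * load_per_rod
    loads = [float(load_per_rod)] * n_rods
    intact = [True] * n_rods
    order = []
    queue = [first_failed]
    while queue:
        i = queue.pop(0)
        if not intact[i]:
            continue
        intact[i] = False
        order.append(i)
        left = next((j for j in range(i - 1, -1, -1) if intact[j]), None)
        right = next((j for j in range(i + 1, n_rods) if intact[j]), None)
        targets = [j for j in (left, right) if j is not None]
        if not targets:
            break                    # every rod has failed
        for j in targets:
            loads[j] += loads[i] / len(targets)
        loads[i] = 0.0
        queue.extend(j for j in targets if loads[j] > capacity)
    return order


# Factor of safety 2.0: the neighbours carry 1.5x their design load and hold.
print(zipper_failure(5, 1.0, 2.0, first_failed=2))   # [2]

# Factor of safety 1.2: the failure spreads outward from the first rod.
print(zipper_failure(5, 1.0, 1.2, first_failed=2))   # [2, 1, 3, 0, 4]
```

Under these assumptions an interior failure is arrested only if each rod can carry at least 1.5 times its design load, which is one way to see why an adequate factor of safety or an alternate load path is needed.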

Fracture cascade

Chain reaction of osteoporotic fractures

A fracture cascade is a phenomenon, studied in the context of geology, in which a single fracture triggers a chain reaction of subsequent fractures. [6] The initial fracture leads to the propagation of additional fractures, causing a cascading effect throughout the material.

Fracture cascades can occur in various materials, including rocks, ice, metals, and ceramics. [7] A common example is the bending of dry spaghetti, which in most cases breaks into more than two pieces, as first observed by Richard Feynman. [7]

In the context of osteoporosis, a fracture cascade is the increased risk of subsequent bone fractures after an initial one. [8]

Other examples

Biology

Biochemical cascades exist in biology, where a small reaction can have system-wide implications. One negative example is the ischemic cascade, in which a small ischemic attack releases toxins which kill off far more cells than the initial damage, resulting in more toxins being released. Current research seeks a way to block this cascade in stroke patients to minimize the damage.

In the study of extinction, sometimes the extinction of one species will cause many other extinctions to happen. Such a species is known as a keystone species.

Electronics

The Cockcroft–Walton generator can also experience cascade failures, in which one failed diode can result in all the diodes failing in a fraction of a second.

Yet another example of this effect in a scientific experiment was the implosion in 2001 of several thousand fragile glass photomultiplier tubes used in the Super-Kamiokande experiment, where the shock wave caused by the failure of a single detector appears to have triggered the implosion of the other detectors in a chain reaction.

Finance

In finance, the risk of cascading failures of financial institutions is referred to as systemic risk: the failure of one financial institution may cause other financial institutions (its counterparties) to fail, cascading throughout the system. Institutions that are believed to pose systemic risk are deemed either "too big to fail" (TBTF) or "too interconnected to fail" (TICTF), depending on why they appear to pose a threat.

Note however that systemic risk is not due to individual institutions per se, but due to the interconnections. Frameworks to study and predict the effects of cascading failures have been developed in the research literature. [9] [10] [11]

A related (though distinct) type of cascading failure in finance occurs in the stock market, exemplified by the 2010 Flash Crash. [11]

Interdependent cascading failures

Fig. 1. Illustration of the interdependent relationship among different infrastructures

Diverse infrastructures such as water supply, transportation, fuel and power stations are coupled together and depend on each other for functioning (see Fig. 1). Owing to this coupling, interdependent networks are extremely sensitive to random failures, and in particular to targeted attacks, such that a failure of a small fraction of nodes in one network can trigger an iterative cascade of failures in several interdependent networks. [12] [13] Electrical blackouts frequently result from a cascade of failures between interdependent networks, and the problem has been dramatically exemplified by the several large-scale blackouts that have occurred in recent years. Blackouts demonstrate the important role played by the dependencies between networks. For example, the 2003 Italy blackout resulted in a widespread failure of the railway network, health care systems, and financial services and, in addition, severely affected the telecommunication networks. The partial failure of the communication system in turn further impaired the electrical grid management system, thus producing positive feedback on the power grid. [14] This example emphasizes how interdependence can significantly magnify the damage in an interacting network system.
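The iterative nature of such a cascade can be sketched with a small dependency model. This is a hypothetical illustration, not one of the models from the cited literature: each node lists the nodes in other infrastructures that it needs in order to function, and failures are propagated until nothing changes. The node names and the `require_all` switch are assumptions. With `require_all=True` a node fails as soon as any of its supports fails, and with `require_all=False` it fails only when all of its supports are gone.

```python
def interdependent_cascade(depends_on, initial_failures, require_all=True):
    """Propagate failures through a dependency map until a fixed point.

    depends_on maps each node to the set of nodes it needs in order to work.
    """
    failed = set(initial_failures)
    changed = True
    while changed:
        changed = False
        for node, supports in depends_on.items():
            if node in failed or not supports:
                continue
            down = supports & failed
            lost = bool(down) if require_all else down == supports
            if lost:
                failed.add(node)
                changed = True
    return failed


# Hypothetical coupling between power, communication and rail nodes.
depends_on = {
    "power_1": {"comm_1"},
    "power_2": {"comm_1", "comm_2"},
    "comm_1": {"power_1"},
    "comm_2": {"power_1", "power_2"},
    "rail": {"power_1", "power_2"},
}

# Tightly coupled: losing one power node brings down every infrastructure.
print(sorted(interdependent_cascade(depends_on, {"power_1"})))
# ['comm_1', 'comm_2', 'power_1', 'power_2', 'rail']

# Redundant supports: the same failure stays contained.
print(sorted(interdependent_cascade(depends_on, {"power_1"}, require_all=False)))
# ['comm_1', 'power_1']
```

The first run shows the feedback loop described above: the power failure disables communication nodes, whose loss in turn disables the remaining power node.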

Model for overload cascading failures

A model for cascading failures due to overload propagation is the Motter–Lai model. [15]
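In the Motter–Lai model, the load of a node is the number of shortest paths passing through it, each node has a capacity proportional to its initial load, C = (1 + α)L with tolerance parameter α ≥ 0, and a node fails when its recomputed load exceeds its capacity. The sketch below is a minimal illustration under stated assumptions, not a reference implementation: it uses shortest-path betweenness (endpoints excluded, computed with Brandes' algorithm on an unweighted, undirected graph) as the load, and the function names and the example graph are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Shortest-path betweenness of every node (Brandes' algorithm, unweighted).

    Each unordered pair is counted twice; only ratios of loads matter here.
    """
    load = {v: 0.0 for v in adj}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}    # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:                     # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):        # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                load[w] += delta[w]
    return load


def motter_lai(adj, alpha, removed):
    """Return the set of failed nodes after removing `removed` from the graph."""
    capacity = {v: (1.0 + alpha) * l for v, l in betweenness(adj).items()}
    alive = set(adj)
    failed = set()
    frontier = {removed}
    while frontier:
        failed |= frontier
        alive -= frontier
        sub = {v: [w for w in adj[v] if w in alive] for v in alive}
        load = betweenness(sub)          # loads are recomputed on the damaged graph
        # Small tolerance guards against floating-point noise (an implementation choice).
        frontier = {v for v in alive if load[v] > capacity[v] + 1e-9}
    return failed


# Hypothetical example: a ring of six nodes.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

print(sorted(motter_lai(ring, alpha=1.0, removed=0)))   # [0]
print(sorted(motter_lai(ring, alpha=0.2, removed=0)))   # [0, 2, 3, 4]
```

Removing one node turns the ring into a chain, which concentrates shortest paths on the middle nodes. With a large tolerance the chain survives, while with a small tolerance the middle nodes are overloaded and fail in the next step.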

References

  1. "Cascading Failure - an overview | ScienceDirect Topics". www.sciencedirect.com.
  2. Ulrich, Mike. "Chapter 22 - Addressing Cascading Failures". Google - Site Reliability Engineering.
  3. Zhai, Chao (2017). "Modeling and Identification of Worst-Case Cascading Failures in Power Systems". arXiv:1703.05232 [cs.SY].
  4. "Why Gmail went down: Google misconfigured load balancing servers (Updated)". 11 December 2012.
  5. Petroski, Henry (1992). To Engineer Is Human: The Role of Failure in Structural Design. Vintage. ISBN 978-0-679-73416-1.
  6. Baveye, P.; Boast, C. W. (1998). "Fractal Geometry, Fragmentation Processes and the Physics of Scale-Invariance: An Introduction". Revival: Fractals in Soil Science (1998). CRC Press. doi:10.1201/9781315151052. ISBN 9781315151052.
  7. Heisser, Ronald H.; Patil, Vishal P.; Stoop, Norbert; Villermaux, Emmanuel; Dunkel, Jörn (28 August 2018). "Controlling fracture cascades through twisting and quenching". Proceedings of the National Academy of Sciences. 115 (35): 8665–8670. arXiv:1802.05402. Bibcode:2018PNAS..115.8665H. doi:10.1073/pnas.1802831115. ISSN 0027-8424. PMC 6126751. PMID 30104353.
  8. Melton, L Joseph; Amin, Shreyasee (26 June 2013). "Is there a specific fracture 'cascade'?". BoneKEy Reports. 2: 367. doi:10.1038/bonekey.2013.101. PMC 3935254. PMID 24575296.
  9. Acemoglu, Daron; Ozdaglar, Asuman; Tahbaz-Salehi, Alireza (2015). "Systemic Risk and Stability in Financial Networks". American Economic Review. 105 (2). American Economic Association: 564–608. doi:10.1257/aer.20130456. hdl:1721.1/100979. ISSN 0002-8282. S2CID 7447939.
  10. Gai, Prasanna; Kapadia, Sujit (8 August 2010). "Contagion in financial networks". Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences. 466 (2120): 2401–2423. Bibcode:2010RSPSA.466.2401G. doi:10.1098/rspa.2009.0410. ISSN 1364-5021. S2CID 9945658.
  11. Elliott, Matthew; Golub, Benjamin; Jackson, Matthew O. (1 October 2014). "Financial Networks and Contagion". American Economic Review. 104 (10): 3115–3153. doi:10.1257/aer.104.10.3115. ISSN 0002-8282.
  12. "Report of the Commission to Assess the Threat to the United States from Electromagnetic Pulse (EMP) Attack" (PDF).
  13. Rinaldi, S. M.; Peerenboom, J. P.; Kelly, T. K. (2001). "Identifying, understanding, and analyzing critical infrastructure interdependencies". IEEE Control Systems Magazine. 21 (6): 11–25. doi:10.1109/37.969131.
  14. Rosato, V.; Issacharoff, L.; Tiriticco, F.; Meloni, S.; Porcellinis, S. D.; Setola, R. (2008). "Modelling interdependent infrastructures using interacting dynamical models". International Journal of Critical Infrastructures. 4: 63–79. doi:10.1504/IJCIS.2008.016092.
  15. Motter, A. E.; Lai, Y. C. (2002). "Cascade-based attacks on complex networks". Phys. Rev. E. 66 (6 Pt 2): 065102. arXiv:cond-mat/0301086. Bibcode:2002PhRvE..66f5102M. doi:10.1103/PhysRevE.66.065102. PMID 12513335. S2CID 17189308.
