System accident

A system accident (or normal accident) is an "unanticipated interaction of multiple failures" in a complex system. [1] This complexity can lie in the technology or in the human organization, and is frequently in both. A system accident can be easy to see in hindsight, but extremely difficult to foresee, because there are simply too many possible action pathways to seriously consider all of them. Charles Perrow first developed these ideas in the mid-1980s. [2] Safety systems themselves are sometimes the added complexity that leads to this type of accident. [3]

Pilot and author William Langewiesche used Perrow's concept in his analysis of the factors at play in a 1996 aviation disaster. He wrote in The Atlantic in 1998: "the control and operation of some of the riskiest technologies require organizations so complex that serious failures are virtually guaranteed to occur." [4] [note 1]

Characteristics and overview

In 2012 Charles Perrow wrote, "A normal accident [system accident] is where everyone tries very hard to play safe, but unexpected interaction of two or more failures (because of interactive complexity), causes a cascade of failures (because of tight coupling)." Perrow uses the term normal accident to emphasize that, given the current level of technology, such accidents are highly likely over a number of years or decades. [5] James Reason extended this approach with human reliability [6] and the Swiss cheese model, now widely accepted in aviation safety and healthcare.
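As a rough illustration of why Perrow expects such coincidences "over a number of years or decades," the following Python sketch (a toy model, not drawn from Perrow or Reason; every parameter value is a hypothetical assumption chosen only for illustration) simulates several independent safety layers that must all fail on the same day, Swiss-cheese style. The per-day probability of that alignment is about one in eight thousand, yet over a 30-year operating life the cumulative probability comes out near 75 percent.

```python
import random

# Toy sketch (not Perrow's or Reason's own model): an "accident" here requires
# several independent defensive layers to fail on the same day. The per-day
# chance of that coincidence is tiny, but over decades it becomes likely.
# All parameter values below are hypothetical assumptions.

LAYERS = 3              # independent defences that must all fail at once (assumed)
P_LAYER_PER_DAY = 0.05  # chance any one layer is compromised on a given day (assumed)
DAYS = 30 * 365         # a 30-year operating life
TRIALS = 2_000          # Monte Carlo runs

def life_has_accident(rng: random.Random) -> bool:
    """True if every layer fails on the same day at least once during DAYS days."""
    p_day = P_LAYER_PER_DAY ** LAYERS
    return any(rng.random() < p_day for _ in range(DAYS))

def main() -> None:
    rng = random.Random(0)
    hits = sum(life_has_accident(rng) for _ in range(TRIALS))
    p_day = P_LAYER_PER_DAY ** LAYERS
    analytic = 1 - (1 - p_day) ** DAYS
    print(f"Per-day coincidence probability: {p_day:.2e}")    # about 1.3e-04
    print(f"30-year probability (analytic): {analytic:.2f}")  # about 0.75
    print(f"30-year probability (simulated): {hits / TRIALS:.2f}")

if __name__ == "__main__":
    main()
```

The point of the sketch is only the arithmetic of accumulation: a failure mode rare enough to be dismissed on any given day can still be the expected outcome over a facility's lifetime.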

These accidents often resemble Rube Goldberg devices in the way that small errors of judgment, flaws in technology, and insignificant damage combine to form an emergent disaster. Langewiesche writes of "an entire pretend reality that includes unworkable chains of command, unlearnable training programs, unreadable manuals, and the fiction of regulations, checks, and controls." [4] Greater formality and effort to get it exactly right can at times actually make failure more likely. [4] [note 2] For example, when the organizational procedures for adjusting to changing conditions are complex, difficult, or laborious, employees are more likely to delay reporting changes, problems, and unexpected conditions.

A contrasting idea is that of the high reliability organization. [7] Scott Sagan, for example, has examined in multiple publications the reliability of complex systems, especially regarding nuclear weapons. His The Limits of Safety (1993) provided an extensive review of close calls during the Cold War that could have resulted in a nuclear war by accident. [8]

System accident examples

Apollo 13

The Apollo 13 Review Board stated in the introduction to chapter five of their report: [emphasis added] [9]

... It was found that the accident was not the result of a chance malfunction in a statistical sense, but rather resulted from an unusual combination of mistakes, coupled with a somewhat deficient and unforgiving design...

  • (g): In reviewing these procedures before the flight, officials of NASA, ER, and Beech did not recognize the possibility of damage due to overheating. Many of these officials were not aware of the extended heater operation. In any event, adequate thermostatic switches might have been expected to protect the tank.

Three Mile Island accident

Perrow considered the Three Mile Island accident normal: [10]

It resembled other accidents in nuclear plants and in other high risk, complex and highly interdependent operator-machine systems; none of the accidents were caused by management or operator ineptness or by poor government regulation, though these characteristics existed and should have been expected. I maintained that the accident was normal, because in complex systems there are bound to be multiple faults that cannot be avoided by planning and that operators cannot immediately comprehend.

ValuJet Flight 592

On May 11, 1996, ValuJet Flight 592, a regularly scheduled ValuJet Airlines flight from Miami International Airport to Hartsfield–Jackson Atlanta International Airport, crashed about 10 minutes after takeoff as a result of a fire in the cargo compartment caused by improperly stored and labeled hazardous cargo. All 110 people on board died. The airline had a poor safety record before the crash, and the accident brought widespread attention to its management problems, including inadequate training of employees in the proper handling of hazardous materials. The maintenance manual for the MD-80 aircraft documented the necessary procedures and was "correct" in a sense; however, it was so huge that it was neither helpful nor informative. [4]

Financial crises and investment losses

In a 2014 monograph, economist Alan Blinder stated that complicated financial instruments made it hard for potential investors to judge whether the price was reasonable. In a section entitled "Lesson # 6: Excessive complexity is not just anti-competitive, it's dangerous", he further stated, "But the greater hazard may come from opacity. When investors don't understand the risks that inhere in the securities they buy (examples: the mezzanine tranche of a CDO-Squared; a CDS on a synthetic CDO  ...), big mistakes can be made–especially if rating agencies tell you they are triple-A, to wit, safe enough for grandma. When the crash comes, losses may therefore be much larger than investors dreamed imaginable. Markets may dry up as no one knows what these securities are really worth. Panic may set in. Thus complexity per se is a source of risk." [11]
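Blinder's point about opacity can be illustrated with a deliberately simplified toy model: a single securitized loan pool with one mezzanine tranche, far simpler than the CDO-squared and synthetic instruments he names, and with every figure a hypothetical assumption. In the sketch below (a one-factor Gaussian copula, a standard textbook device rather than anything from Blinder's monograph), the pool's overall expected loss barely changes with the default correlation, while the thin mezzanine tranche's expected loss grows by roughly an order of magnitude when the assumed correlation turns out to be wrong, which is one way "losses may be much larger than investors dreamed imaginable."

```python
import random
from statistics import NormalDist

# Toy illustration of tranche opacity (hypothetical parameters throughout).
# The pool's expected loss is essentially fixed by the per-loan default
# probability, but a thin mezzanine tranche is highly sensitive to default
# correlation -- a quantity investors may not understand or may mis-estimate.

N_LOANS = 100
P_DEFAULT = 0.02              # per-loan default probability (assumed)
ATTACH, DETACH = 0.05, 0.10   # mezzanine tranche absorbs pool losses in this band
TRIALS = 20_000

_THRESHOLD = NormalDist().inv_cdf(P_DEFAULT)

def pool_loss(rho: float, rng: random.Random) -> float:
    """Fraction of the pool lost in one scenario of a one-factor Gaussian copula."""
    market = rng.gauss(0.0, 1.0)
    defaults = 0
    for _ in range(N_LOANS):
        idiosyncratic = rng.gauss(0.0, 1.0)
        asset = (rho ** 0.5) * market + ((1 - rho) ** 0.5) * idiosyncratic
        if asset < _THRESHOLD:
            defaults += 1
    return defaults / N_LOANS

def tranche_loss(loss: float) -> float:
    """Fraction of the mezzanine tranche wiped out by a given pool loss."""
    return min(max(loss - ATTACH, 0.0), DETACH - ATTACH) / (DETACH - ATTACH)

def expected_losses(rho: float):
    rng = random.Random(1)
    losses = [pool_loss(rho, rng) for _ in range(TRIALS)]
    return (sum(losses) / TRIALS,
            sum(tranche_loss(l) for l in losses) / TRIALS)

if __name__ == "__main__":
    for rho in (0.05, 0.40):   # investor's assumed correlation vs. a higher actual one
        pool, mezz = expected_losses(rho)
        print(f"rho={rho:.2f}  expected pool loss={pool:.3%}  "
              f"expected mezzanine loss={mezz:.3%}")
```

The sketch is not a pricing model; it only shows that the riskiness of a structured claim can depend on an input that is invisible in the headline statistics of the underlying pool.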

Continuing challenges

Air transport safety

Despite a significant increase in airplane safety since the 1980s, there is concern that automated flight systems have become so complex that they both add to the risks arising from overcomplication and are incomprehensible to the crews who must work with them. As an example, professionals in the aviation industry note that such systems sometimes switch modes or engage on their own; the crew in the cockpit are not necessarily privy to the rationale for the auto-engagement, causing perplexity. Langewiesche cites industrial engineer Nadine Sarter, who writes about "automation surprises," often related to system modes the pilot does not fully understand or that the system switches to on its own. In fact, one of the more common questions asked in cockpits today is, "What's it doing now?" In response to this, Langewiesche points to the fivefold increase in aviation safety and writes, "No one can rationally advocate a return to the glamour of the past." [12]

In an article entitled "The Human Factor", Langewiesche discusses the 2009 crash of Air France Flight 447 over the mid-Atlantic. He points out that, since the 1980s, when the transition to automated cockpit systems began, safety has improved fivefold. Langewiesche writes, "In the privacy of the cockpit and beyond public view, pilots have been relegated to mundane roles as system managers." He quotes engineer Earl Wiener, who takes the humorous statement attributed to the Duchess of Windsor that one can never be too rich or too thin and adds, "or too careful about what you put into a digital flight-guidance system." Wiener says that the effect of automation is typically to reduce the workload when it is light, but to increase it when it is heavy.

Boeing engineer Delmar Fadden said that once capacities are added to flight management systems, they become impossibly expensive to remove because of certification requirements; if unused, they may in a sense lurk in the depths unseen. [12]

Theory and practice interplay

Human factors in the implementation of safety procedures play a role in the overall effectiveness of safety systems. Maintenance problems are common with redundant systems. Maintenance crews can fail to restore a redundant system to active status. They may be overworked, or maintenance may be deferred because of budget cuts, since managers know that the system will continue to operate without fixing the backup system. [3] Steps in procedures may be changed and adapted in practice from the formal safety rules, often in ways that seem appropriate and rational, and that may be essential in meeting time constraints and work demands. In a 2004 Safety Science article, reporting on research partially supported by the National Science Foundation and NASA, Nancy Leveson writes: [13]

However, instructions and written procedures are almost never followed exactly as operators strive to become more efficient and productive and to deal with time pressures ... even in such highly constrained and high-risk environments as nuclear power plants, modification of instructions is repeatedly found and the violation of rules appears to be quite rational, given the actual workload and timing constraints under which the operators must do their job. In these situations, a basic conflict exists between error as seen as a deviation from the normative procedure and error as seen as a deviation from the rational and normally used effective procedure.

Notes

  1. In the same article, Langewiesche continued: [emphasis added] [4]
    Charles Perrow's thinking is more difficult for pilots like me to accept. Perrow came unintentionally to his theory about normal accidents after studying the failings of large organizations. His point is not that some technologies are riskier than others, which is obvious, but that the control and operation of some of the riskiest technologies require organizations so complex that serious failures are virtually guaranteed to occur. Those failures will occasionally combine in unforeseeable ways, and if they induce further failures in an operating environment of tightly interrelated processes, the failures will spin out of control, defeating all interventions.
    William Langewiesche (March 1998), "The Lessons of Valujet 592", p. 23 [Section: "A 'Normal Accident'"], The Atlantic
  2. See especially the last three paragraphs of this 30-plus-page Atlantic article: "... Understanding why might keep us from making the system even more complex, and therefore perhaps more dangerous, too." [4]

References

  1. Perrow 1999, p. 70.
  2. Perrow 1984.
  3. Perrow 1999.
  4. Langewiesche, William (1 March 1998). "The Lessons of ValuJet 592". The Atlantic.
  5. Perrow, Charles (December 2012). "Getting to Catastrophe: Concentrations, Complexity and Coupling". The Montréal Review.
  6. Reason, James (1990). Human Error. Cambridge University Press. ISBN 0-521-31419-4.
  7. Christianson, Marlys K.; Sutcliffe, Kathleen M.; Miller, Melissa A.; Iwashyna, Theodore J. (2011). "Becoming a high reliability organization". Critical Care. 15 (6): 314. doi:10.1186/cc10360. PMC 3388695. PMID 22188677.
  8. Sagan, Scott D. (1993). The Limits of Safety: Organizations, Accidents, and Nuclear Weapons. Princeton University Press. ISBN 0-691-02101-5.
  9. Cortright, Edgar M. (chair). "Chapter 5: Findings, Determinations, and Recommendations". Report of Apollo 13 Review Board ("Cortright Report") (Report).
  10. Perrow, Charles (1982). "16. The President's Commission and the Normal Accident". In David L. Sills; C. P. Wolf; Vivien B. Shelanski (eds.). Accident at Three Mile Island: The Human Dimensions. Boulder, Colorado: Westview Press. pp. 173–184. ISBN 978-0-86531-165-7.
  11. Blinder, Alan S. (November 2014). "What Did We Learn from the Financial Crisis, the Great Recession, and the Pathetic Recovery?" (PDF). Griswold Center for Economic Policy Studies Working Paper No. 243. Princeton University.
  12. Langewiesche, William (September 17, 2014). "The Human Factor - Should Airplanes Be Flying Themselves?". Vanity Fair. ... pilots have been relegated to mundane roles as system managers ... Since the 1980s, when the shift began, the safety record has improved fivefold, to the current one fatal accident for every five million departures. No one can rationally advocate a return to the glamour of the past.
  13. Leveson, Nancy (April 2004). "A New Accident Model for Engineering Safer Systems" (PDF). Safety Science. 42 (4): 237–270. doi:10.1016/S0925-7535(03)00047-X. ... In fact, a common way for workers to apply pressure to management without actually going out on strike is to 'work to rule,' which can lead to a breakdown in productivity and even chaos ...
    • Citing: Rasmussen, Jens; Pejtersen, Annelise Mark; Goodstein, L. P. (1994). Cognitive Systems Engineering. New York: Wiley. ISBN 978-0-471-01198-9.
