High reliability organization

A high reliability organization (HRO) is an organization that has succeeded in avoiding catastrophes in an environment where normal accidents can be expected due to risk factors and complexity.

Important case studies in HRO research include both studies of disasters (e.g., the Three Mile Island nuclear incident, the Challenger and Columbia disasters, the Bhopal chemical leak, the Chernobyl disaster, the Tenerife air crash, the Mann Gulch forest fire, and the Black Hawk friendly fire incident in Iraq) and studies of HROs such as the air traffic control system, naval aircraft carriers, and nuclear power operations.

History

HRO theory is derived from normal accident theory, which led a group of researchers at the University of California, Berkeley (Todd LaPorte, Gene Rochlin, and Karlene Roberts) to study how organizations working with complex and hazardous systems operated error free. [1] [2] They researched three organizations: United States nuclear aircraft carriers (in partnership with Rear Admiral (ret.) Tom Mercer on the USS Carl Vinson), the Federal Aviation Administration's Air Traffic Control system (and commercial aviation more generally), and nuclear power operations (Pacific Gas and Electric's Diablo Canyon reactor).

The result of this initial work was a set of defining characteristics that HROs hold in common: [3]

  1. "Hypercomplexity" – extreme variety of components, systems, and levels.
  2. Tight coupling – reciprocal interdependence across many units and levels.
  3. Distinguishable hierarchy – multiple levels, each with its own elaborate control and regulating mechanisms.
  4. Large numbers of decision makers in complex communication networks – characterized by a thorough network of peer-reviewed control and informational systems.
  5. Discernible degree of accountability that reinforces organizational commitment to high-quality work – strict adherence to a set performance standard, in which deviation results in additional training or corrective action.
  6. High frequency of immediate feedback about decisions.
  7. Compressed time factors – cycles of major activities are measured in seconds.
  8. More than one critical outcome that must happen simultaneously – simultaneity signifies both the complexity of operations as well as the inability to withdraw or modify operations decisions.

While many organizations display some of these characteristics, HROs display them all simultaneously.
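The "all simultaneously" condition can be read as a simple conjunction over the eight characteristics. The following is a minimal illustrative sketch, not a method taken from the HRO literature: the enum names and the `displays_all` helper are hypothetical labels introduced here, and deciding whether an organization actually displays a given characteristic remains a qualitative judgment.

```python
from enum import Enum
from typing import Iterable


class Characteristic(Enum):
    """Hypothetical labels for the eight characteristics listed above."""
    HYPERCOMPLEXITY = 1
    TIGHT_COUPLING = 2
    DISTINGUISHABLE_HIERARCHY = 3
    MANY_DECISION_MAKERS = 4
    ACCOUNTABILITY = 5
    IMMEDIATE_FEEDBACK = 6
    COMPRESSED_TIME = 7
    SIMULTANEOUS_CRITICAL_OUTCOMES = 8


def displays_all(observed: Iterable[Characteristic]) -> bool:
    """Return True only if every one of the eight characteristics is observed."""
    return set(Characteristic) <= set(observed)


# An organization showing only some characteristics does not qualify.
partial = [Characteristic.TIGHT_COUPLING, Characteristic.COMPRESSED_TIME]
print(displays_all(partial))              # False
print(displays_all(list(Characteristic)))  # True
```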

Normal accident and HRO theorists agree that interactive complexity and tight coupling can, in theory, lead to a system accident. However, they hold different opinions on whether such system accidents are inevitable or manageable. HRO theorists hold that serious accidents in high-risk, hazardous operations can be prevented through a combination of organizational design, culture, management, and human choice. Theorists of both schools place great emphasis on human interaction with the system as either the cause (Normal Accident Theory, NAT) or the prevention (HRO) of a system accident. [4]

High reliability organization theory and HROs are often contrasted with Charles Perrow's Normal Accident Theory [5] (see Sagan [6] for a comparison of HRO and NAT). NAT represents Perrow's attempt to translate his understanding of the disaster at the Three Mile Island nuclear facility into a more general formulation of accidents and disasters. Perrow's 1984 book also included chapters on petrochemical plants, aviation accidents, naval accidents, "earth-based system" accidents (dam breaks, earthquakes), and "exotic" accidents (genetic engineering, military operations, and space flight). [7] At Three Mile Island the technology was tightly coupled due to time-dependent processes, invariant sequences, and limited slack. The technological deficiencies were a result of unforeseen concatenations that ultimately resulted in the conjoined collapse of a complex system. Perrow hypothesized that regardless of the effectiveness of management and operations, accidents in systems that are characterized by tight coupling and interactive complexity will be normal or inevitable, as they often cannot be foreseen or prevented. This view, described by some theorists as unashamedly technologically deterministic, contrasts with the view of HRO proponents, who argued that high-risk, high-hazard organizations can function safely despite the hazards of complex systems.

Despite their differences, NAT and HRO theory share a focus on the social and organizational underpinnings of system safety and accident causation and prevention. As research continued, a body of knowledge emerged from the study of a variety of organizations. For example, a fire incident command system, [8] Loma Linda Hospital's Pediatric Intensive Care Unit, [9] and the California Independent System Operator [10] were all studied as examples of HROs.

Although they may seem diverse, these organizations have a number of similarities. First, they operate in unforgiving social and political environments. Second, their technologies are risky and present the potential for error. Third, the severity and scale of possible consequences from errors or mistakes preclude learning through experimentation. Finally, these organizations all use complex processes to manage complex technologies and complex work to avoid failure. HROs share many properties with other high-performing organizations, including highly trained personnel, continuous training, effective reward systems, frequent process audits, and continuous improvement efforts. Other properties are more distinctive: an organization-wide sense of vulnerability; a widely distributed sense of responsibility and accountability for reliability; concern about misperception, misconception, and misunderstanding that is generalized across a wide set of tasks, operations, and assumptions; pessimism about possible failures; and redundancy and a variety of checks and counter-checks as a precaution against potential mistakes. [11]

Defining high reliability and specifying what constitutes an HRO has presented some challenges. Roberts [12] initially proposed that high reliability organizations are a subset of hazardous organizations that have enjoyed a record of high safety over long periods of time. Specifically, she argued that: “One can identify this subset by answering the question, ‘how many times could this organization have failed resulting in catastrophic consequences that it did not?’ If the answer is on the order of tens of thousands of times the organization is ‘high reliability’” [12] (p. 160). More recent definitions have built on this starting point but emphasized the dynamic nature of producing reliability (i.e., constantly seeking to improve reliability and intervening both to prevent errors and failures and to cope and recover quickly should errors become manifest). Some researchers view HROs as reliability-seeking rather than reliability-achieving. Reliability-seeking organizations are not distinguished by their absolute error or accident rate, but rather by their “effective management of innately risky technologies through organizational control of both hazard and probability” [13] (p. 14). Consequently, the phrase "high reliability" has come to mean that high risk and high effectiveness can co-exist for organizations that must perform well under trying conditions, and that it takes intensive effort to do so.
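Roberts's original criterion is essentially an order-of-magnitude test, and it can be sketched in a few lines. This is only an illustration under stated assumptions: the function name is hypothetical, and the cutoff of 10,000 is one reading of "on the order of tens of thousands", not a figure given by Roberts. Estimating how many times an organization could have failed catastrophically is itself a research judgment that the code does not address, and the later reliability-seeking definitions deliberately move away from any such count.

```python
def meets_roberts_criterion(avoided_catastrophes: int, threshold: int = 10_000) -> bool:
    """Check an estimated count of avoided catastrophic failures against a cutoff.

    avoided_catastrophes: estimated number of times the organization could have
        failed with catastrophic consequences but did not.
    threshold: assumed cutoff for "on the order of tens of thousands"
        (an illustrative choice, not a value from the source).
    """
    if avoided_catastrophes < 0:
        raise ValueError("count of avoided catastrophes cannot be negative")
    return avoided_catastrophes >= threshold


print(meets_roberts_criterion(45_000))  # True
print(meets_roberts_criterion(300))     # False
```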

While the early research focused on high-risk industries, others expressed interest in HROs and sought to emulate their success. A key turning point was Karl Weick, Kathleen M. Sutcliffe, and David Obstfeld's [14] reconceptualization of the literature on high reliability. These researchers systematically reviewed the case study literature on HROs and illustrated how the infrastructure of high reliability was grounded in processes of collective mindfulness, which are indicated by a preoccupation with failure, reluctance to simplify interpretations, sensitivity to operations, commitment to resilience, and deference to expertise. In other words, HROs are distinctive because of their efforts to organize in ways that increase the quality of attention across the organization, thereby enhancing people's alertness to and awareness of details so that they can detect subtle ways in which contexts vary and call for contingent responding (i.e., collective mindfulness). This construct was elaborated and refined as mindful organizing in Weick and Sutcliffe's 2001 and 2007 editions of their book Managing the Unexpected. [15] [16] Mindful organizing forms a basis for individuals to interact continuously as they develop, refine, and update a shared understanding of the situation they face and their capabilities to act on that understanding. Mindful organizing proactively triggers actions that forestall and contain errors and crises, and requires leaders and employees to pay close attention to shaping the social and relational infrastructure of the organization. They establish a set of interrelated organizing processes and practices, which jointly contribute to the system's (e.g., team, unit, organization) overall safety culture.

Characteristics

Successful organizations in high-risk industries continually "reinvent" themselves. For example, when an incident command team realizes what they thought was a garage fire has now changed into a hazardous material incident, they completely restructure their response organization.

There are five characteristics of HROs that have been identified [17] as responsible for the "mindfulness" that keeps them working well when facing unexpected situations.

Preoccupation with failure
HROs treat anomalies as symptoms of a problem with the system. The latent organizational weaknesses that contribute to small errors can also contribute to larger problems, so errors are reported promptly, allowing problems to be found and fixed.
Reluctance to simplify interpretations
HROs take deliberate steps to comprehensively understand the work environment as well as a specific situation. They are cognizant that the operating environment is very complex, so they look across system boundaries to determine the path of problems (where they started, where they may end up) and value a diversity of experience and opinions.
Sensitivity to operations
HROs are continuously sensitive to unexpected changes in conditions. They monitor the systems’ safety and security barriers and controls to ensure they remain in place and operate as intended. Situational awareness is extremely important to HROs.
Commitment to resilience
HROs develop the capability to detect, contain, and recover from errors. Errors will happen, but HROs are not paralyzed by them.
Deference to expertise
HROs follow typical communication hierarchy during routine operations, but defer to the person with the expertise to solve the problem during upset conditions. During a crisis, decisions are made at the front line and authority migrates to the person who can solve the problem, regardless of their hierarchical rank.

Although the original research and early application of HRO theory in practice occurred in high-risk industries, research covers a wide variety of applications and settings. Health care has been the largest practitioner area for the past several years. [4] The application of Crew Resource Management is another area of focus for leaders in HROs, requiring competent measurement of and intervention in behavior systems. [18] Wildfires create complex and very dynamic mega-crisis situations across the globe every year. U.S. wildland firefighters, often organized using the Incident Command System into flexible inter-agency incident management teams, are not only called upon to "bring order to chaos" in today's mega-fires; they are also requested for "all-hazard events" like hurricanes, floods, and earthquakes. The U.S. Wildland Fire Lessons Learned Center has been providing education and training to the wildland fire community on high reliability since 2002.

HRO behaviors can be developed into high-functioning skills of anticipation and resilience. Learning organizations that strive for high performance in things they can plan for can become HROs that are better able to manage unexpected events that, by definition, cannot be planned for.

Notes

  1. Rochlin, Gene I. (1996-06-01). "Reliable Organizations: Present Research and Future Directions". Journal of Contingencies and Crisis Management. 4 (2): 55–59. doi:10.1111/j.1468-5973.1996.tb00077.x. ISSN 1468-5973.
  2. Roberts, K.H. (1989). "New challenges in organizational research: High reliability organizations". Organization & Environment. 3 (2): 111–125. doi:10.1177/108602668900300202.
  3. Roberts, K.H.; Rousseau, D.M. (1989). "Research in nearly failure-free, high-reliability organizations: having the bubble". IEEE Transactions on Engineering Management. 36 (2): 132–139. doi:10.1109/17.18830.
  4. Tolk, J.N.; Cantu, J.; Beruvides, M.G. (2015). "High Reliability Organization Research: A Literature Review for Health Care". Engineering Management Journal. 27 (4): 218–237. doi:10.1080/10429247.2015.1105087.
  5. Perrow, C. (1984). Normal Accidents: Living with High-Risk Technologies. New York: Basic Books.
  6. Sagan, S. D. (1993). The Limits of Safety: Organizations, Accidents, and Nuclear Weapons. Princeton, N.J.: Princeton University Press.
  7. HRO research shares interest in complexity and errors with other work including Michael Cohen, James March, and Johan Olsen's study of garbage-can decision-making processes, Barry Turner's work on man-made disasters, and Barry Staw, Lance Sandelands, and Jane Dutton's research on "threat-rigidity cycles".
  8. Bigley, G. A., & Roberts, K. H. (2001). The Incident Command System: High-Reliability Organizing for Complex and Volatile Task Environments. Academy of Management Journal, 44(6), 1281-1300.
  9. Madsen, P. M., Desai, V. M., Roberts, K. H., & Wong, D. (2006). Mitigating Hazards Through Continuing Design: The Birth and Evolution of a Pediatric Intensive Care Unit. Organization Science, 17(2), 239-248.
  10. Roe, E., & Schulman, P. R. (2008). High Reliability Management: Operating on the Edge. Palo Alto, CA: Stanford University Press.
  11. Schulman, P. R. (2004). General attributes of safe organizations. Quality and Safety in Health Care. 13, Supplement II, ii39-ii44.
  12. Roberts, K. H. (1990). Some Characteristics of High-Reliability Organizations. Organization Science, 1, 160-177.
  13. Rochlin, G. I. (1993). Defining high reliability organizations in practice: A taxonomic prologue. In K. H. Roberts (Ed.). New challenges to understanding organizations (pp. 11-32). New York: Macmillan.
  14. Weick, K. E., Sutcliffe, K. M., & Obstfeld, D. (1999). Organizing for High Reliability: Processes of Collective Mindfulness. In B. M. Staw & L. L. Cummings (Eds.), Research in Organizational Behavior (Vol. 21, pp. 81-123). Greenwich, CT: JAI Press, Inc.
  15. Weick, K. E., & Sutcliffe, K. M. (2001). Managing the Unexpected: Assuring High Performance in an Age of Complexity (1st ed.). San Francisco: Jossey-Bass.
  16. Weick, K. E., & Sutcliffe, K. M. (2007). Managing the Unexpected: Resilient Performance in an Age of Uncertainty, Second Edition. San Francisco, CA: Jossey-Bass.
  17. Weick, Karl E.; Kathleen M. Sutcliffe (2001). Managing the Unexpected - Assuring High Performance in an Age of Complexity. San Francisco, CA, USA: Jossey-Bass. pp. 10–17. ISBN 978-0-7879-5627-1.
  18. Alavosius, M.P.; Houmanfar, R.A; Anbro, S.J.; Burleigh, K.; Hebein, C. (2017). "Leadership and Crew Resource Management in High-Reliability Organizations: A Competency Framework for Measuring Behaviors". Journal of Organizational Behavior Management. 37 (2): 142–170. doi:10.1080/01608061.2017.1325825.
