Type of variation | Synonyms |
---|---|
Common cause | Chance cause, Non-assignable cause, Noise, Natural pattern, Random effects, Random error |
Special cause | Assignable cause, Signal, Unnatural pattern, Systematic effects, Systematic error |
Common and special causes are the two distinct origins of variation in a process, as defined in the statistical thinking and methods of Walter A. Shewhart and W. Edwards Deming. Briefly, "common causes", also called natural patterns, are the usual, historical, quantifiable variation in a system, while "special causes" are unusual, not previously observed, non-quantifiable variation.
The distinction is fundamental in the philosophy of statistics and the philosophy of probability, and how to treat it is a classic question in the interpretation of probability. It was recognised and discussed as early as 1703 by Gottfried Leibniz, and various alternative names have been used over the years. The distinction has been particularly important in the thinking of economists Frank Knight, John Maynard Keynes and G. L. S. Shackle.
In 1703, Jacob Bernoulli wrote to Gottfried Leibniz to discuss their shared interest in applying mathematics and probability to games of chance. Bernoulli speculated whether it would be possible to gather mortality data from gravestones and thereby calculate, by their existing practice, the probability of a man currently aged 20 years outliving a man aged 60 years. Leibniz replied that he doubted this was possible:
Nature has established patterns originating in the return of events but only for the most part. New illnesses flood the human race, so that no matter how many experiments you have done on corpses, you have not thereby imposed a limit on the nature of events so that in the future they could not vary.
This captures the central idea that some variation is predictable, at least approximately in frequency. This common-cause variation is evident from the experience base. However, new, unanticipated, emergent or previously neglected phenomena (e.g. "new diseases") result in variation outside the historical experience base. Shewhart and Deming argued that such special-cause variation is fundamentally unpredictable in frequency of occurrence or in severity.
John Maynard Keynes emphasised the importance of special-cause variation when he wrote:
By "uncertain" knowledge ... I do not mean merely to distinguish what is known for certain from what is only probable. The game of roulette is not subject, in this sense, to uncertainty ... The sense in which I am using the term is that in which the prospect of a European war is uncertain, or the price of copper and the rate of interest twenty years hence, or the obsolescence of a new invention ... About these matters there is no scientific basis on which to form any calculable probability whatever. We simply do not know!
Common-cause variation is characterised by:
The outcomes of a perfectly balanced roulette wheel are a good example of common-cause variation. Common-cause variation is the noise within the system.
Walter A. Shewhart originally used the term chance cause. [1] The term common cause was coined by Harry Alpert in 1947. The Western Electric Company used the term natural pattern. [2] Shewhart described a process that features only common-cause variation as being in statistical control. Some modern statisticians deprecate this term, preferring the phrase stable and predictable.
Special-cause variation is characterised by:
Special-cause variation always arrives as a surprise. It is the signal within a system.
Walter A. Shewhart originally used the term assignable cause. [3] The term special-cause was coined by W. Edwards Deming. The Western Electric Company used the term unnatural pattern. [2]
In economics, this circle of ideas is analysed under the rubric of "Knightian uncertainty". John Maynard Keynes and Frank Knight both discussed the inherent unpredictability of economic systems in their work and used it to criticise the mathematical approach to economics, in terms of expected utility, developed by Ludwig von Mises and others. Keynes in particular argued that economic systems did not automatically tend to the equilibrium of full employment owing to their agents' inability to predict the future. As he remarked in The General Theory of Employment, Interest and Money:
... as living and moving beings, we are forced to act ... [even when] our existing knowledge does not provide a sufficient basis for a calculated mathematical expectation.
Keynes's thinking was at odds with the classical liberalism of the Austrian School of economists, but G. L. S. Shackle recognised the importance of Keynes's insight and sought to formalise it within a free-market philosophy.
In financial economics, the black swan theory is based on the significance and unpredictability of special causes.
A special-cause failure is a failure that can be corrected by changing a component or process, whereas a common-cause failure is equivalent to noise in the system, and no specific action can be taken to prevent it.
Harry Alpert observed:
A riot occurs in a certain prison. Officials and sociologists turn out a detailed report about the prison, with a full explanation of why and how it happened here, ignoring the fact that the causes were common to a majority of prisons, and that the riot could have happened anywhere.
Alpert recognises that there is a temptation to react to an extreme outcome and to see it as significant, even where its causes are common to many situations and the distinctive circumstances surrounding its occurrence, the results of mere chance. Such behaviour has many implications within management, often leading to ad hoc interventions that merely increase the level of variation and frequency of undesirable outcomes.
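The claim that ad hoc interventions increase variation can be illustrated with a small simulation. The sketch below is not taken from the sources cited here: it assumes a hypothetical process aimed at a target of 0 whose only variation is Gaussian common-cause noise, and a hypothetical operator who shifts the process setting after every observation by the negative of the last deviation. The function name `simulate` and its parameters are illustrative.

```python
import random
import statistics


def simulate(n, sigma=1.0, tamper=False, seed=0):
    """Return n outputs of a stable process aimed at target 0.

    With tamper=True, a hypothetical operator reacts to every outcome by
    shifting the process setting by the negative of the last deviation.
    """
    rng = random.Random(seed)
    setting = 0.0
    outputs = []
    for _ in range(n):
        y = setting + rng.gauss(0.0, sigma)  # common-cause noise only
        outputs.append(y)
        if tamper:
            setting -= y  # "correct" for a deviation that was only noise
    return outputs


left_alone = simulate(20000, tamper=False)
tampered = simulate(20000, tamper=True)
ratio = statistics.pvariance(tampered) / statistics.pvariance(left_alone)
print(f"variance ratio (tampered / left alone): {ratio:.2f}")
```

Under these assumptions each tampered output is the difference of two independent noise terms, so its variance is about twice that of the process left alone.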
Deming and Shewhart both advocated the control chart as a means of managing a business process in an economically efficient manner.
Within the frequency probability framework, there is no process whereby a probability can be attached to the future occurrence of a special cause. One might naively ask whether the Bayesian approach does allow such a probability to be specified. The existence of special-cause variation led Keynes and Deming to an interest in Bayesian probability, but no formal synthesis emerged from their work. Most statisticians of the Shewhart-Deming school take the view that special causes are not embedded in either experience or current thinking (that is why they come as a surprise: their prior probability has been neglected and, in effect, assigned the value zero), so that any subjective probability is doomed to be hopelessly badly calibrated in practice.
It is immediately apparent from the Leibniz quote above that there are implications for sampling. Deming observed that in any forecasting activity, the population is that of future events while the sampling frame is, inevitably, some subset of historical events. Deming held that the disjoint nature of population and sampling frame was inherently problematic once the existence of special-cause variation was admitted, rejecting the general use of probability and conventional statistics in such situations. He articulated the difficulty as the distinction between analytic and enumerative statistical studies.
Shewhart argued that, as processes subject to special-cause variation were inherently unpredictable, the usual techniques of probability could not be used to separate special-cause from common-cause variation. He developed the control chart as a statistical heuristic to distinguish the two types of variation. Both Deming and Shewhart advocated the control chart as a means of assessing a process's state of statistical control and as a foundation for forecasting.
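As a rough illustration of how a control chart separates the two types of variation, the following sketch computes limits for an individuals chart. It relies on a widely used convention that the text above does not spell out: three-sigma limits estimated from the average moving range, using the constant 2.66 (3 divided by 1.128). The function names and the sample data are hypothetical.

```python
from statistics import mean


def individuals_chart_limits(baseline):
    """Return (lower, centre, upper) control limits for an individuals chart."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline observations")
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    centre = mean(baseline)
    # 2.66 = 3 / 1.128, where 1.128 is the d2 constant for ranges of size 2.
    half_width = 2.66 * mean(moving_ranges)
    return centre - half_width, centre, centre + half_width


def signals(values, lower, upper):
    """Indices of points outside the limits (candidate special causes)."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical baseline measurements from a period believed to be stable.
baseline = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0, 10.1, 9.9]
lower, centre, upper = individuals_chart_limits(baseline)

# Hypothetical new observations to be judged against those limits.
new_values = [10.2, 9.6, 11.4, 10.0]
print(signals(new_values, lower, upper))
```

Points inside the limits are treated as common-cause variation and left alone; a point outside them is a prompt to look for an assignable cause, not proof that one exists.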
Keynes identified three domains of probability: [5]
and sought to base a probability theory thereon.
Common mode failure has a more specific meaning in engineering. It refers to events which are not statistically independent. Failures in multiple parts of a system may be caused by a single fault, particularly random failures due to environmental conditions or aging. An example is when all of the pumps for a fire sprinkler system are located in one room. If the room becomes too hot for the pumps to operate, they will all fail at essentially the same time, from one cause (the heat in the room). [6] Another example is an electronic system wherein a fault in a power supply injects noise onto a supply line, causing failures in multiple subsystems.
This is particularly important in safety-critical systems using multiple redundant channels. If the probability of failure in one subsystem is p and failures are independent, an N-channel system would be expected to fail with probability p^N. In practice, however, the probability of failure is much higher because the channel failures are not statistically independent; for example, ionizing radiation or electromagnetic interference (EMI) may affect all the channels. [7]
The principle of redundancy states that, when events of failure of a component are statistically independent, the probabilities of their joint occurrence multiply. [8] Thus, for instance, if the probability of failure of a component of a system is one in one thousand per year, the probability of the joint failure of two of them is one in one million per year, provided that the two events are statistically independent. This principle favors the strategy of the redundancy of components. One place this strategy is implemented is in RAID 1, where two hard disks store a computer's data redundantly.
But even so, a system can have many common modes of failure. For example, consider the common modes of failure of a RAID1 where two disks are purchased from an online store and installed in a computer:
Also, if the events of failure of two components are maximally statistically dependent, the probability of the joint failure of both is identical to the probability of failure of them individually. In such a case, the advantages of redundancy are negated. Strategies for the avoidance of common mode failures include keeping redundant components physically isolated.
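The arithmetic behind these statements can be sketched in a few lines. The independent and fully dependent cases follow directly from the text. The middle case is an illustrative assumption rather than a model from the cited sources: a shared event with hypothetical probability `q` disables every channel at once, and otherwise the channels fail independently.

```python
def joint_failure_independent(p, n):
    """All n channels fail, assuming statistically independent failures."""
    return p ** n


def joint_failure_with_common_shock(p, n, q):
    """All n channels fail under an assumed common-shock model.

    q is the hypothetical probability of a shared event (heat, flooding,
    a faulty power supply) that disables every channel at once.
    """
    return q + (1.0 - q) * p ** n


def joint_failure_fully_dependent(p, n):
    """Maximal dependence: if one channel fails, all fail, so n is irrelevant."""
    return p


p = 0.001  # one-in-a-thousand chance of failure per component per year
print(joint_failure_independent(p, 2))              # about one in a million
print(joint_failure_with_common_shock(p, 2, 1e-4))  # dominated by the shared event
print(joint_failure_fully_dependent(p, 2))          # redundancy buys nothing
```

Even a small common-mode probability dominates the result: with q at one in ten thousand, the duplicated system is roughly a hundred times more likely to fail than the independence calculation suggests.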
A prime example of redundancy with isolation is a nuclear power plant. [9] [10] The new ABWR has three divisions of Emergency Core Cooling Systems, each with its own generators and pumps and each isolated from the others. The new European Pressurized Reactor has two containment buildings, one inside the other. However, even here it is possible for a common mode failure to occur (for example, in the Fukushima Daiichi Nuclear Power Plant, mains power was severed by the Tōhoku earthquake, then the thirteen backup diesel generators were all simultaneously disabled by the subsequent tsunami that flooded the basements of the turbine halls).