Failure rate

Failure rate is the frequency with which any system or component fails, expressed in failures per unit of time. It thus depends on the system conditions, the time interval, and the total number of systems under study. [1] It can describe electronic, mechanical, or biological systems, in fields such as systems and reliability engineering, medicine and biology, or insurance and finance. It is usually denoted by the Greek letter λ (lambda).

In real-world applications, the failure probability of a system usually varies over time; failures occur more frequently in early life ("burn-in") or as a system ages ("wear-out"). This is known as the bathtub curve, where the middle region is called the "useful life period".

Mean time between failures (MTBF)

The mean time between failures (MTBF, 1/λ) is often reported instead of the failure rate, as numbers such as "2,000 hours" are more intuitive than numbers such as "0.0005 per hour".

However, this is only valid if the failure rate is actually constant over time, such as within the flat region of the bathtub curve. In many cases where MTBF is quoted, it refers only to this region; thus it cannot be used to give an accurate calculation of the average lifetime of a system, as it ignores the "burn-in" and "wear-out" regions.
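As a minimal sketch of this reciprocal relationship, using the figures quoted above and assuming a constant failure rate:

```python
# Reciprocal relation between a constant failure rate and MTBF,
# using the figures quoted in the text above.
failure_rate = 0.0005          # failures per hour (constant-rate assumption)
mtbf = 1.0 / failure_rate      # hours

print(f"MTBF = {mtbf:.0f} hours")   # prints: MTBF = 2000 hours
```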

MTBF appears frequently in engineering design requirements, and governs the frequency of required system maintenance and inspections. A similar ratio used in the transport industries, especially in railways and trucking, is "mean distance between failures" - allowing maintenance to be scheduled based on distance travelled, rather than at regular time intervals.

Mathematical definition

The simplest definition of failure rate λ is the number of failures per time interval Δt:

    \lambda = \frac{\text{number of failures}}{\Delta t}

which depends on the number of systems under study and the conditions over the time period.
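As a minimal sketch of this book-keeping (the counts below are hypothetical, chosen only to illustrate the units):

```python
# Failure rate as observed failures divided by accumulated operating time.
# All values are hypothetical illustrations, not measured data.
failures = 3          # failures observed during the interval
systems = 50          # identical systems under study
hours_each = 400      # operating hours per system during the interval

total_hours = systems * hours_each        # 20,000 system-hours
failure_rate = failures / total_hours     # failures per hour

print(f"lambda = {failure_rate:.2e} failures per hour")   # 1.50e-04
```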

Failures over time

Figure: cumulative distribution function F(t) of the exponential distribution, often used as the cumulative failure function.

To accurately model failures over time, a cumulative failure distribution F(t) must be defined, which can be any cumulative distribution function (CDF) that gradually increases from 0 to 1. In the case of many identical systems, F(t) may be thought of as the fraction of systems that have failed by time t, after all start operation at time 0; or, in the case of a single system, as the probability of the system having its failure time T before time t:

    F(t) = P(T \le t)

As CDFs are defined by integrating a probability density function, the failure probability density f(t) is defined such that:

    F(t) = \int_0^t f(\tau)\, d\tau

Figure: exponential probability density functions, often used as the failure probability density f(t).

where τ is a dummy integration variable. Here f(t) can be thought of as the instantaneous failure rate, i.e. the fraction of failures per unit time, as the size of the time interval tends towards 0.
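The relationship between F(t) and f(t) can be checked numerically. The sketch below assumes an exponential failure model with an arbitrarily chosen λ and compares a numerical integral of f(t) with the closed-form F(t):

```python
import math

# Numerically integrate the failure density f and compare with the cumulative
# failure distribution F, for an assumed exponential model (lam is arbitrary).
lam = 0.001                                  # failures per hour (assumed)
f = lambda t: lam * math.exp(-lam * t)       # failure probability density
F = lambda t: 1 - math.exp(-lam * t)         # cumulative failure distribution

t_end, steps = 2000.0, 200_000
dt = t_end / steps
# Trapezoidal integration of f over [0, t_end]
integral = sum(0.5 * (f(i * dt) + f((i + 1) * dt)) * dt for i in range(steps))

print(f"integral of f over [0, {t_end:.0f}] h = {integral:.6f}")
print(f"F({t_end:.0f}) = {F(t_end):.6f}")    # the two values should agree closely
```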

Hazard rate

A concept closely related to, but different from, [2] the instantaneous failure rate is the hazard rate (or hazard function), h(t).

In the many-system case, this is defined as the proportional failure rate of the systems still functioning at time t (as opposed to f(t), which is expressed as a proportion of the initial number of systems).

For convenience, we first define the reliability (or survival function) as:

    R(t) = 1 - F(t)

then the hazard rate is simply the instantaneous failure rate, scaled by the fraction of surviving systems at time t:

    h(t) = \frac{f(t)}{R(t)}

In the probabilistic sense, for a single system this can be interpreted as the conditional probability of failure in the time interval t to t + Δt, given that the system or component has already survived to time t:

    h(t) = \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t \mid T > t)}{\Delta t}
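The conditional-probability interpretation can be illustrated with a short numerical sketch. A uniform failure time on [0, 1000] hours is assumed purely for illustration: its density f(t) is flat, yet the hazard rate rises as fewer systems remain in service.

```python
# Hazard rate estimated as a conditional failure probability per unit time:
# h(t) ~ P(t < T <= t + dt | T > t) / dt, for an assumed uniform failure time.
def F(t):
    """Assumed cumulative failure distribution: uniform failure time on [0, 1000] h."""
    return min(max(t / 1000.0, 0.0), 1.0)

t, dt = 500.0, 1e-6
cond_prob = (F(t + dt) - F(t)) / (1 - F(t))    # P(fail in (t, t+dt] | survived to t)

print(f"h({t:.0f}) ~ {cond_prob / dt:.6f} per hour")   # ~0.002, twice the flat density of 0.001
```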

Conversion to cumulative failure rate

To convert between h(t) and F(t), we can solve the differential equation

    h(t) = \frac{f(t)}{1 - F(t)} = \frac{F'(t)}{1 - F(t)}

with initial condition F(0) = 0, which yields [2]

    F(t) = 1 - e^{-\int_0^t h(\tau)\, d\tau}

Thus for a collection of identical systems, only one of the hazard rate h(t), the failure probability density f(t), or the cumulative failure distribution F(t) need be defined.
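As a sketch of this conversion, the relation F(t) = 1 − exp(−∫₀ᵗ h(τ) dτ) can be verified numerically for an assumed linearly increasing hazard rate (the constant a is arbitrary):

```python
import math

# Verify F(t) = 1 - exp(-integral of h) for an assumed hazard rate h(t) = a*t.
a = 2e-6                                            # arbitrary constant, per hour^2
h = lambda t: a * t                                 # assumed hazard rate
F_closed = lambda t: 1 - math.exp(-a * t**2 / 2)    # closed form of the same relation

t_end, steps = 1000.0, 100_000
dt = t_end / steps
# Trapezoidal estimate of the cumulative hazard over [0, t_end]
H = sum(0.5 * (h(i * dt) + h((i + 1) * dt)) * dt for i in range(steps))

print(f"1 - exp(-H)   = {1 - math.exp(-H):.6f}")
print(f"closed-form F = {F_closed(t_end):.6f}")     # should match
```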

Confusion can occur because the notation λ for "failure rate" often refers to the function h(t) rather than f(t). [3]

Constant hazard rate model

There are many possible functions that could be chosen to represent the failure probability density f(t) or hazard rate h(t), based on empirical or theoretical evidence, but the most common and most easily understood choice is to set

    f(t) = \lambda e^{-\lambda t},

an exponential function with scaling constant λ. As seen in the figures above, this represents a gradually decreasing failure probability density.

The CDF is then calculated as:

    F(t) = \int_0^t \lambda e^{-\lambda \tau}\, d\tau = 1 - e^{-\lambda t}

which can be seen to gradually approach 1 as t → ∞, representing the fact that eventually all systems under study will fail.

The hazard rate function is then:

    h(t) = \frac{f(t)}{R(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda

In other words, in this particular case only, the hazard rate is constant over time.

This illustrates the difference between the hazard rate and the failure probability density: as the number of systems surviving at time t gradually reduces, the total failure rate also reduces, but the hazard rate remains constant. In other words, the probability of each individual system failing does not change over time as the systems age; they are "memoryless".
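A brief numerical sketch of this constant-hazard behaviour, with an arbitrarily chosen λ:

```python
import math

# For the exponential model f(t) = lam * exp(-lam * t), the hazard rate
# h(t) = f(t) / (1 - F(t)) reduces to the constant lam at every t.
lam = 0.0005                                 # assumed constant, failures per hour
f = lambda t: lam * math.exp(-lam * t)
F = lambda t: 1 - math.exp(-lam * t)
h = lambda t: f(t) / (1 - F(t))

for t in (0.0, 100.0, 1000.0, 10000.0):
    print(f"t = {t:7.0f} h:  h(t) = {h(t):.6f} per hour")   # identical at every t
```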

Other models

Figure: hazard function h(t) plotted for a selection of log-logistic distributions, any of which could be used as a hazard rate, depending on the system under study.

For many systems, a constant hazard function may not be a realistic approximation; the chance of failure of an individual component may depend on its age. Therefore, other distributions are often used.

For example, the deterministic distribution increases hazard rate over time (for systems where wear-out is the most important factor), while the Pareto distribution decreases it (for systems where early-life failures are more common). The commonly-used Weibull distribution combines both of these effects, as do the log-normal and hypertabastic distributions.

After choosing a distribution and parameters for h(t), the failure probability density f(t) and cumulative failure distribution F(t) can be predicted using the equations given above.
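As a sketch of how a shape parameter controls this behaviour, the Weibull hazard rate h(t) = (k/s)(t/s)^(k−1) is evaluated below for a few arbitrary parameter values; k < 1 gives a decreasing hazard, k = 1 a constant one, and k > 1 an increasing one:

```python
# Weibull hazard rate for illustrative (arbitrary) shape k and scale s values.
def weibull_hazard(t, k, s):
    return (k / s) * (t / s) ** (k - 1)

for k in (0.5, 1.0, 2.0):
    rates = [weibull_hazard(t, k, s=1000.0) for t in (100.0, 500.0, 2000.0)]
    print(f"k = {k}: " + ", ".join(f"{r:.5f}" for r in rates))
```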

Measuring failure rate

Failure rate data can be obtained in several ways. The most common means are:

Estimation
From field failure rate reports, statistical analysis techniques can be used to estimate failure rates. For accurate failure rates the analyst must have a good understanding of equipment operation, procedures for data collection, the key environmental variables impacting failure rates, how the equipment is used at the system level, and how the failure data will be used by system designers.
Historical data about the device or system under consideration
Many organizations maintain internal databases of failure information on the devices or systems that they produce, which can be used to calculate failure rates for those devices or systems. For new devices or systems, the historical data for similar devices or systems can serve as a useful estimate.
Government and commercial failure rate data
Handbooks of failure rate data for various components are available from government and commercial sources. MIL-HDBK-217F, Reliability Prediction of Electronic Equipment, is a military standard that provides failure rate data for many military electronic components. Several failure rate data sources are available commercially that focus on commercial components, including some non-electronic components.
Prediction
Time lag is one of the serious drawbacks of all failure rate estimations. Often, by the time the failure rate data are available, the devices under study have become obsolete. Due to this drawback, failure-rate prediction methods have been developed. These methods may be used on newly designed devices to predict their failure rates and failure modes. Two approaches have become well known: cycle testing and FMEDA.
Life Testing
The most accurate source of data is to test samples of the actual devices or systems in order to generate failure data. This is often prohibitively expensive or impractical, so that the previous data sources are often used instead.
Cycle Testing
Mechanical movement is the predominant failure mechanism causing mechanical and electromechanical devices to wear out. For many devices, the wear-out failure point is measured by the number of cycles performed before the device fails, and can be discovered by cycle testing. In cycle testing, a device is cycled as rapidly as practical until it fails. When a collection of these devices is tested, the test runs until 10% of the units fail dangerously.
FMEDA
Failure modes, effects, and diagnostic analysis (FMEDA) is a systematic analysis technique to obtain subsystem/product-level failure rates, failure modes, and design strength. The FMEDA technique considers the failure modes of each component, their effects at the product level, and the ability of any diagnostics to detect them.

Given a component database calibrated with reasonably accurate field failure data, [4] the method can predict product-level failure rate and failure mode data for a given application. The predictions have been shown to be more accurate [5] than field warranty return analysis, or even typical field failure analysis, given that these methods depend on reports that typically do not have sufficiently detailed information in failure records. [6]

Examples

Decreasing failure rates

A decreasing failure rate describes cases where early-life failures are common [7] and corresponds to the situation where h(t) is a decreasing function.

This can describe, for example, the period of infant mortality in humans, or the early failure of transistors due to manufacturing defects.

Decreasing failure rates have been found in the lifetimes of spacecraft, with Baker and Baker commenting that "those spacecraft that last, last on and on." [8] [9]

The hazard rate of aircraft air conditioning systems was found to have an exponentially decreasing distribution. [10]

Renewal processes

In special processes called renewal processes, where the time to recover from failure can be neglected, the likelihood of failure remains constant with respect to time.

For a renewal process with a DFR renewal function, inter-renewal times are concave.[clarification needed] [11] [12] Brown conjectured the converse, that DFR is also necessary for the inter-renewal times to be concave, [13] but it has been shown that this conjecture holds neither in the discrete case [12] nor in the continuous case. [14]

Coefficient of variation

When the failure rate is decreasing, the coefficient of variation is ⩾ 1, and when the failure rate is increasing, the coefficient of variation is ⩽ 1.[clarification needed] [15] Note that this result only holds when the failure rate is defined for all t ⩾ 0, [16] and that the converse result (the coefficient of variation determining the nature of the failure rate) does not hold.

Units

Failure rates can be expressed using any measure of time, but hours is the most common unit in practice. Other units, such as miles, revolutions, etc., can also be used in place of "time" units.

Failure rates are often expressed in engineering notation as failures per million, or 10⁻⁶, especially for individual components, since their failure rates are often very low.

The Failures In Time (FIT) rate of a device is the number of failures that can be expected in one billion (10⁹) device-hours of operation [17] (e.g. 1,000 devices for 1,000,000 hours each, or 1,000,000 devices for 1,000 hours each, or some other combination). This term is used particularly by the semiconductor industry.
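A small sketch of the unit conversion, using an arbitrary 5 FIT figure:

```python
# Convert FIT (failures per 1e9 device-hours) to failures per hour and MTBF.
fit = 5.0                                  # arbitrary illustrative value
failures_per_hour = fit / 1e9
mtbf_hours = 1.0 / failures_per_hour

print(f"{fit} FIT = {failures_per_hour:.1e} failures/hour, MTBF = {mtbf_hours:.1e} hours")
```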

Combinations of failure types

If a complex system consists of many parts, and the failure of any single part means the failure of the entire system, then the total failure rate is simply the sum of the individual failure rates of its parts:

    \lambda_{\text{total}} = \lambda_1 + \lambda_2 + \cdots + \lambda_n

However, this assumes that the failure rate is constant and that the units are consistent (e.g. failures per million hours), rather than expressed as ratios or probability densities. This is useful for estimating the failure rate of a system when its individual components or subsystems have already been tested. [18] [19]
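A minimal sketch of this summation for a hypothetical series system (the part failure rates below are illustrative values, all in failures per million hours):

```python
# Series system: every part must work, so (under the constant-rate assumption)
# the system failure rate is the sum of the part failure rates.
part_rates = [2.5, 10.0, 0.5, 30.0]        # failures per 1e6 hours, hypothetical
system_rate = sum(part_rates)              # failures per 1e6 hours
system_mtbf = 1e6 / system_rate            # hours

print(f"system failure rate = {system_rate} per million hours")
print(f"system MTBF         = {system_mtbf:.0f} hours")
```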

Adding "redundant" components to eliminate a single point of failure may thus actually increase the total failure rate; however, it reduces the "mission failure" rate (or, equivalently, increases the "mean time between critical failures", MTBCF). [20]

Combining failure or hazard rates that are time-dependent is more complicated. For example, mixtures of Decreasing Failure Rate (DFR) variables are also DFR. [11] Mixtures of exponentially distributed failure rates are hyperexponentially distributed.

Simple example

Suppose it is desired to estimate the failure rate of a certain component. Ten identical components are each tested until they either fail or reach 1,000 hours, at which time the test is terminated. A total of 7,502 component-hours of testing is performed, and 6 failures are recorded.

The estimated failure rate is:

    \hat{\lambda} = \frac{6\ \text{failures}}{7502\ \text{hours}} \approx 0.0008\ \text{failures per hour}

which could also be expressed as an MTBF of roughly 1,250 hours, or approximately 800 failures for every million hours of operation.
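The arithmetic of this example can be reproduced directly:

```python
# Reproducing the worked example: 6 failures over 7,502 accumulated component-hours.
failures = 6
component_hours = 7502

rate = failures / component_hours                   # failures per hour
print(f"failure rate = {rate:.6f} per hour")        # about 0.0008 per hour
print(f"MTBF         = {1 / rate:.0f} hours")       # about 1,250 hours
print(f"             = {rate * 1e6:.0f} failures per million hours")   # about 800
```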

See also

Related Research Articles

Unavailability, in mathematical terms, is the probability that an item will not operate correctly at a given time and under specified conditions. It is the complement of availability.

Exponential distribution

In probability theory and statistics, the exponential distribution or negative exponential distribution is the probability distribution of the distance between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate; the distance parameter could be any meaningful mono-dimensional measure of the process, such as time between production errors, or length along a roll of fabric in the weaving manufacturing process. It is a particular case of the gamma distribution. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless. In addition to being used for the analysis of Poisson point processes it is found in various other contexts.

Mean time between failures (MTBF) is the predicted elapsed time between inherent failures of a mechanical or electronic system during normal system operation. MTBF can be calculated as the arithmetic mean (average) time between failures of a system. The term is used for repairable systems while mean time to failure (MTTF) denotes the expected time to failure for a non-repairable system.

In physics, a Langevin equation is a stochastic differential equation describing how a system evolves when subjected to a combination of deterministic and fluctuating ("random") forces. The dependent variables in a Langevin equation typically are collective (macroscopic) variables changing only slowly in comparison to the other (microscopic) variables of the system. The fast (microscopic) variables are responsible for the stochastic nature of the Langevin equation. One application is to Brownian motion, which models the fluctuating motion of a small particle in a fluid.

Weibull distribution

In probability theory and statistics, the Weibull distribution is a continuous probability distribution. It models a broad range of random variables, largely in the nature of a time to failure or time between events. Examples are maximum one-day rainfalls and the time a user spends on a web page.

Survival analysis is a branch of statistics for analyzing the expected duration of time until one event occurs, such as death in biological organisms and failure in mechanical systems. This topic is called reliability theory, reliability analysis or reliability engineering in engineering, duration analysis or duration modelling in economics, and event history analysis in sociology. Survival analysis attempts to answer certain questions, such as what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival?

Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:

  1. To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
  2. To derive a lower bound for the marginal likelihood of the observed data. This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data.

A cyclostationary process is a signal having statistical properties that vary cyclically with time. A cyclostationary process can be viewed as multiple interleaved stationary processes. For example, the maximum daily temperature in New York City can be modeled as a cyclostationary process: the maximum temperature on July 21 is statistically different from the temperature on December 20; however, it is a reasonable approximation that the temperature on December 20 of different years has identical statistics. Thus, we can view the random process composed of daily maximum temperatures as 365 interleaved stationary processes, each of which takes on a new value once per year.

In actuarial science, force of mortality represents the instantaneous rate of mortality at a certain age measured on an annualized basis. It is identical in concept to failure rate, also called hazard function, in reliability theory.

Proportional hazards models are a class of survival models in statistics. Survival models relate the time that passes before some event occurs to one or more covariates that may be associated with that quantity of time. In a proportional hazards model, the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate. The hazard rate at time t is the probability per short time dt that an event will occur between t and t + dt, given that up to time t no event has occurred yet. For example, taking a drug may halve one's hazard rate for a stroke occurring, or changing the material from which a manufactured component is constructed may double its hazard rate for failure. Other types of survival models, such as accelerated failure time models, do not exhibit proportional hazards. The accelerated failure time model describes a situation where the biological or mechanical life history of an event is accelerated.

In probability theory and statistics, the normal-gamma distribution is a bivariate four-parameter family of continuous probability distributions. It is the conjugate prior of a normal distribution with unknown mean and precision.

In statistics, censoring is a condition in which the value of a measurement or observation is only partially known.

In actuarial science and applied probability, ruin theory uses mathematical models to describe an insurer's vulnerability to insolvency/ruin. In such models key quantities of interest are the probability of ruin, distribution of surplus immediately prior to ruin and deficit at time of ruin.

Poisson distribution

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time if these events occur with a known constant mean rate and independently of the time since the last event. It can also be used for the number of events in other types of intervals than time, and in dimension greater than 1.

In statistics, the exponentiated Weibull family of probability distributions was introduced by Mudholkar and Srivastava (1993) as an extension of the Weibull family obtained by adding a second shape parameter.

The survival function is a function that gives the probability that a patient, device, or other object of interest will survive past a certain time. The survival function is also known as the survivor function or reliability function. The term reliability function is common in engineering while the term survival function is used in a broader range of applications, including human mortality. The survival function is the complementary cumulative distribution function of the lifetime. Sometimes complementary cumulative distribution functions are called survival functions in general.

Exponentially modified Gaussian distribution

In probability theory, an exponentially modified Gaussian distribution describes the sum of independent normal and exponential random variables. An exGaussian random variable Z may be expressed as Z = X + Y, where X and Y are independent, X is Gaussian with mean μ and variance σ2, and Y is exponential of rate λ. It has a characteristic positive skew from the exponential component.

High-temperature operating life

High-temperature operating life (HTOL) is a reliability test applied to integrated circuits (ICs) to determine their intrinsic reliability. This test stresses the IC at an elevated temperature, high voltage and dynamic operation for a predefined period of time. The IC is usually monitored under stress and tested at intermediate intervals. This reliability stress test is sometimes referred to as a lifetime test, device life test or extended burn in test and is used to trigger potential failure modes and assess IC lifetime.

In queueing theory, a discipline within the mathematical theory of probability, an M/D/1 queue represents the queue length in a system having a single server, where arrivals are determined by a Poisson process and job service times are fixed (deterministic). The model name is written in Kendall's notation. Agner Krarup Erlang first published on this model in 1909, starting the subject of queueing theory. An extension of this model with more than one server is the M/D/c queue.

Mean time to dangerous failure (MTTFD): in a safety system, the MTTFD relates to the portion of failure modes that can lead to failures resulting in hazards to personnel, the environment, or equipment.

References

Further reading