Mean time between failures (MTBF) is the predicted elapsed time between inherent failures of a mechanical or electronic system during normal system operation. MTBF can be calculated as the arithmetic mean (average) time between failures of a system. The term is used for repairable systems while mean time to failure (MTTF) denotes the expected time to failure for a non-repairable system. [1]
The definition of MTBF depends on the definition of what is considered a failure. For complex, repairable systems, failures are considered to be those out-of-design conditions which place the system out of service and into a state for repair. Failures that can be left or maintained in an unrepaired condition without placing the system out of service are not considered failures under this definition. [2] In addition, units that are taken down for routine scheduled maintenance or inventory control are not considered within the definition of failure. [3] The higher the MTBF, the longer a system is likely to work before failing.
Mean time between failures (MTBF) describes the expected time between two failures for a repairable system. For example, three identical systems starting to function properly at time 0 are working until all of them fail. The first system fails after 100 hours, the second after 120 hours and the third after 130 hours. The MTBF of the systems is the average of the three failure times, which is 116.667 hours. If the systems were non-repairable, then their MTTF would be 116.667 hours.
In general, MTBF is the "up-time" between two failure states of a repairable system during operation as outlined here:
For each observation, the "down time" is the instant at which the system went down, which comes after (i.e., is greater than) the instant it came up, the "up time". The difference ("down time" minus "up time") is the length of time it was operating between these two events.
Accordingly, the MTBF of a component is the sum of the lengths of the operational periods divided by the number of observed failures:

$$\text{MTBF} = \frac{\sum (\text{start of downtime} - \text{start of uptime})}{\text{number of failures}}.$$
In a similar manner, the mean down time (MDT) can be defined as

$$\text{MDT} = \frac{\sum (\text{start of uptime} - \text{start of downtime})}{\text{number of failures}}.$$
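As a brief illustration of these two definitions, here is a minimal Python sketch (not from the cited sources; the timestamps are hypothetical, chosen so that the operating periods reproduce the 100, 120 and 130 hour example above) that computes MTBF and MDT from a log of up/down times:

```python
# Hypothetical observation log: (time the system came up, time it next went down), in hours.
observations = [(0.0, 100.0), (106.0, 226.0), (232.0, 362.0)]

# Operating periods: down time minus the preceding up time.
uptimes = [down - up for up, down in observations]
# Repair periods: the next up time minus the preceding down time.
downtimes = [next_up - down for (_, down), (next_up, _) in zip(observations, observations[1:])]

mtbf = sum(uptimes) / len(uptimes)       # sum of operational periods / number of failures
mdt = sum(downtimes) / len(downtimes)    # sum of down periods / number of repairs

print(f"MTBF = {mtbf:.1f} h, MDT = {mdt:.1f} h")
```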
The MTBF is the expected value of the random variable $T$ indicating the time until failure. Thus, it can be written as [4]

$$\text{MTBF} = \mathbb{E}[T] = \int_0^\infty t \, f_T(t) \, dt,$$
where $f_T(t)$ is the probability density function of $T$. Equivalently, the MTBF can be expressed in terms of the reliability function $R_T(t) = \Pr(T > t)$ as

$$\text{MTBF} = \int_0^\infty R_T(t) \, dt.$$
The MTBF and $T$ have units of time (e.g., hours).
Any practically-relevant calculation of the MTBF assumes that the system is working within its "useful life period", which is characterized by a relatively constant failure rate (the middle part of the "bathtub curve") when only random failures are occurring. [1] In other words, it is assumed that the system has survived initial setup stresses and has not yet approached its expected end of life, both of which often increase the failure rate.
Assuming a constant failure rate implies that $T$ has an exponential distribution with parameter $\lambda$. Since the MTBF is the expected value of $T$, it is given by the reciprocal of the failure rate of the system, [1] [4]

$$\text{MTBF} = \frac{1}{\lambda}.$$
Once the MTBF of a system is known, and assuming a constant failure rate, the probability that any one particular system will be operational for a given duration can be inferred [1] from the reliability function of the exponential distribution, $R_T(t) = e^{-\lambda t}$. In particular, the probability that a particular system will survive to its MTBF is $1/e$, or about 37% (i.e., it will fail earlier with probability 63%). [5]
The MTBF value can be used as a system reliability parameter or to compare different systems or designs. This value should only be understood conditionally as the “mean lifetime” (an average value), and not as a quantitative identity between working and failed units. [1]
Since MTBF can be expressed as "average life (expectancy)", many engineers assume that 50% of items will have failed by time $t = \text{MTBF}$. This inaccuracy can lead to bad design decisions. Furthermore, probabilistic failure prediction based on MTBF implies the total absence of systematic failures (i.e., a constant failure rate with only intrinsic, random failures), which is not easy to verify. [4] Assuming no systematic errors, the probability that the system survives a duration $T$ is $e^{-T/\text{MTBF}}$. Hence the probability that a system fails during a duration $T$ is $1 - e^{-T/\text{MTBF}}$.
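To make the arithmetic concrete, the following Python sketch (illustrative only; the 100,000 hour MTBF is a hypothetical figure) evaluates the exponential reliability function:

```python
import math

def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability that a system with a constant failure rate
    (exponential lifetime) is still operating after t_hours."""
    return math.exp(-t_hours / mtbf_hours)

mtbf = 100_000  # hypothetical MTBF in hours

# Probability of surviving one full MTBF: 1/e, about 37%.
print(f"Survive to MTBF:      {reliability(mtbf, mtbf):.1%}")

# Probability of failing within one year of continuous operation.
one_year = 24 * 365
print(f"Fail within one year: {1 - reliability(one_year, mtbf):.1%}")
```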
MTBF value prediction is an important element in the development of products. Reliability engineers and design engineers often use reliability software to calculate a product's MTBF according to various methods and standards (MIL-HDBK-217F, Telcordia SR332, Siemens SN 29500, FIDES, UTE 80-810 (RDF2000), etc.). The MIL-HDBK-217 reliability calculator manual, in combination with RelCalc software (or another comparable tool), enables MTBF reliability rates to be predicted based on design.
A concept which is closely related to MTBF, and is important in computations involving MTBF, is the mean down time (MDT). MDT can be defined as the mean time for which the system is down after a failure. Usually, MDT is considered different from MTTR (mean time to repair); in particular, MDT usually includes organizational and logistical factors (such as business days or waiting for components to arrive), while MTTR is usually understood to be narrower and more technical.
MTBF serves as a crucial metric for managing machinery and equipment reliability. Its application is particularly significant in the context of total productive maintenance (TPM), a comprehensive maintenance strategy aimed at maximizing equipment effectiveness. MTBF provides a quantitative measure of the time elapsed between failures of a system during normal operation, offering insights into the reliability and performance of manufacturing equipment. [6]
By integrating MTBF with TPM principles, manufacturers can achieve a more proactive maintenance approach. This synergy allows for the identification of patterns and potential failures before they occur, enabling preventive maintenance and reducing unplanned downtime. As a result, MTBF becomes a key performance indicator (KPI) within TPM, guiding decisions on maintenance schedules, spare parts inventory, and ultimately, optimizing the lifespan and efficiency of machinery. [7] This strategic use of MTBF within TPM frameworks enhances overall production efficiency, reduces costs associated with breakdowns, and contributes to the continuous improvement of manufacturing processes.
Two components (for instance hard drives, servers, etc.) may be arranged in a network, in series or in parallel. The terminology is here used by close analogy to electrical circuits, but has a slightly different meaning. We say that the two components are in series if the failure of either causes the failure of the network, and that they are in parallel if only the failure of both causes the network to fail. The MTBF of the resulting two-component network with repairable components can be computed according to the following formulae, assuming that the MTBF of both individual components is known: [8] [9]
$$\text{mtbf}(c_1 ; c_2) = \frac{1}{\frac{1}{\text{mtbf}(c_1)} + \frac{1}{\text{mtbf}(c_2)}} = \frac{\text{mtbf}(c_1) \times \text{mtbf}(c_2)}{\text{mtbf}(c_1) + \text{mtbf}(c_2)},$$

where $c_1 ; c_2$ is the network in which the components are arranged in series.
For a network containing parallel repairable components, finding the MTBF of the whole system requires, in addition to the component MTBFs, their respective MDTs. Then, assuming that the MDTs are negligible compared to the MTBFs (which usually holds in practice), the MTBF for a parallel system consisting of two parallel repairable components can be written as follows: [8] [9]
$$\text{mtbf}(c_1 \parallel c_2) = \frac{1}{\frac{1}{\text{mtbf}(c_1)} \, \text{PF}(c_2, \text{mdt}(c_1)) + \frac{1}{\text{mtbf}(c_2)} \, \text{PF}(c_1, \text{mdt}(c_2))} \approx \frac{\text{mtbf}(c_1) \times \text{mtbf}(c_2)}{\text{mdt}(c_1) + \text{mdt}(c_2)},$$

where $c_1 \parallel c_2$ is the network in which the components are arranged in parallel, and $\text{PF}(c, t)$ is the probability of failure of component $c$ during the "vulnerability window" $t$.
Intuitively, both of these formulae can be explained in terms of failure probabilities. First, note that the probability of a system failing within a sufficiently short interval is approximately the length of that interval divided by the MTBF (i.e., the failure rate is the inverse of the MTBF). Then, for components in series, the failure of any component leads to the failure of the whole system, so (assuming that failure probabilities are small, which is usually the case) the probability of the whole system failing within a given interval can be approximated as the sum of the failure probabilities of the components. With parallel components the situation is a bit more complicated: the whole system fails only if, after one of the components fails, the other component fails while the first is being repaired; this is where the MDT comes into play: the faster the first component is repaired, the smaller the "vulnerability window" for the other component to fail.
Using similar logic, the MDT for a system of two serial components can be calculated as: [8]

$$\text{mdt}(c_1 ; c_2) = \frac{\text{mtbf}(c_1) \times \text{mdt}(c_2) + \text{mtbf}(c_2) \times \text{mdt}(c_1)}{\text{mtbf}(c_1) + \text{mtbf}(c_2)},$$
and for a system of two parallel components the MDT can be calculated as: [8]

$$\text{mdt}(c_1 \parallel c_2) = \frac{\text{mdt}(c_1) \times \text{mdt}(c_2)}{\text{mdt}(c_1) + \text{mdt}(c_2)}.$$
Through successive application of these four formulae, the MTBF and MDT of any network of repairable components can be computed, provided that the MTBF and MDT are known for each component. In the special but all-important case of several serial components, the MTBF calculation can be easily generalised into

$$\text{mtbf}(c_1 ; \dots ; c_n) = \left( \sum_{k=1}^{n} \frac{1}{\text{mtbf}(c_k)} \right)^{-1},$$
which can be shown by induction, [10] and likewise

$$\text{mdt}(c_1 \parallel \dots \parallel c_n) = \left( \sum_{k=1}^{n} \frac{1}{\text{mdt}(c_k)} \right)^{-1},$$
since the formula for the mdt of two components in parallel is identical to that of the mtbf for two components in series.
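These composition rules are straightforward to turn into code. The following Python sketch (an illustration with hypothetical component figures, not part of the cited sources) combines two repairable components in series and in parallel, using the approximation above for the parallel MTBF:

```python
def series_mtbf(m1: float, m2: float) -> float:
    """MTBF of two repairable components in series."""
    return 1.0 / (1.0 / m1 + 1.0 / m2)

def series_mdt(m1: float, d1: float, m2: float, d2: float) -> float:
    """MDT of two repairable components in series."""
    return (m1 * d2 + m2 * d1) / (m1 + m2)

def parallel_mtbf(m1: float, d1: float, m2: float, d2: float) -> float:
    """MTBF of two repairable components in parallel,
    assuming MDTs are negligible compared to MTBFs."""
    return (m1 * m2) / (d1 + d2)

def parallel_mdt(d1: float, d2: float) -> float:
    """MDT of two repairable components in parallel."""
    return (d1 * d2) / (d1 + d2)

# Hypothetical figures: components with 10,000 h and 20,000 h MTBF,
# each repaired in 24 h on average.
m1, d1, m2, d2 = 10_000.0, 24.0, 20_000.0, 24.0

print(f"Series   MTBF: {series_mtbf(m1, m2):,.0f} h, MDT: {series_mdt(m1, d1, m2, d2):.1f} h")
print(f"Parallel MTBF: {parallel_mtbf(m1, d1, m2, d2):,.0f} h, MDT: {parallel_mdt(d1, d2):.1f} h")
```

As expected, the series arrangement has a lower MTBF than either component, while the parallel arrangement has a far higher MTBF and a shorter MDT.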
There are many variations of MTBF, such as mean time between system aborts (MTBSA), mean time between critical failures (MTBCF) or mean time between unscheduled removal (MTBUR). Such nomenclature is used when it is desirable to differentiate among types of failures, such as critical and non-critical failures. For example, in an automobile, the failure of the FM radio does not prevent the primary operation of the vehicle.
It is recommended to use Mean time to failure (MTTF) instead of MTBF in cases where a system is replaced after a failure ("non-repairable system"), since MTBF denotes time between failures in a system which can be repaired. [1]
MTTFd is an extension of MTTF that considers only failures which would result in a dangerous condition. It can be calculated as follows:

$$\text{MTTF} \approx \frac{B_{10}}{0.1 \times n_{\text{op}}}, \qquad \text{MTTF}_d \approx \frac{B_{10d}}{0.1 \times n_{\text{op}}},$$

where $B_{10}$ is the number of operations after which 10% of a sample of those devices would be expected to fail, and $n_{\text{op}}$ is the number of operations/cycles in one year. $B_{10d}$ is the same quantity, but counting only failures to a dangerous state. [11]
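As a minimal illustration of this conversion (the B10d rating and the duty cycle below are hypothetical), the calculation can be sketched in Python:

```python
def mttfd_years(b10d_cycles: float, cycles_per_year: float) -> float:
    """MTTFd in years, estimated from a B10d rating and the yearly number of operations."""
    return b10d_cycles / (0.1 * cycles_per_year)

# Hypothetical: a switching device rated B10d = 1,300,000 cycles,
# operated 200 times a day, 365 days a year.
print(f"MTTFd ≈ {mttfd_years(1_300_000, 200 * 365):.0f} years")
```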
In fact, an MTBF computed by counting only observed failures while some systems are still operating underestimates the MTBF, because it fails to include in the computation the partial lifetimes of the systems that have not yet failed. For such lifetimes, all we know is that the time to failure exceeds the time they have been running; this is called censoring. With a parametric model of the lifetime, the likelihood for the experience on any given day is as follows:

$$L = \prod_i \lambda(u_i)^{\delta_i} \, S(u_i),$$
where $u_i$ is the observed running time of the $i$th unit (its time to failure, or its running time so far if censored), $\delta_i$ equals 1 if unit $i$ has failed and 0 if it is censored (still operating), $\lambda(\cdot)$ is the hazard function, and $S(\cdot)$ is the survival (reliability) function.
For the exponential distribution, the hazard, $\lambda$, is constant. In this case, the MTBF is

$$\text{MTBF} = \frac{1}{\hat{\lambda}} = \frac{\sum_i u_i}{k},$$
where $\hat{\lambda}$ is the maximum likelihood estimate of $\lambda$, maximizing the likelihood given above, and $k$ is the number of uncensored observations.
We see that the difference between the MTBF considering only failures and the MTBF including censored observations is that the censoring times add to the numerator but not the denominator in computing the MTBF. [12]
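Under the constant failure rate assumption, this difference is easy to demonstrate. The Python sketch below uses made-up run times (an assumption of this rewrite) to contrast the failures-only estimate with the estimate that credits the censored running times:

```python
# Run times in hours; True means the unit failed, False means it is still running (censored).
observations = [(1_200.0, True), (3_400.0, True), (5_000.0, False), (2_800.0, False)]

failed_times = [t for t, failed in observations if failed]
all_times = [t for t, _ in observations]
num_failures = len(failed_times)

naive_mtbf = sum(failed_times) / num_failures   # ignores the units still running
censored_mtbf = sum(all_times) / num_failures   # censored times add to the numerator only

print(f"Failures-only MTBF:   {naive_mtbf:,.0f} h")
print(f"Censoring-aware MTBF: {censored_mtbf:,.0f} h")
```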
In reliability engineering, availability is the degree to which a system or item of equipment is operational and able to perform its required function when it is required for use.
Unavailability, in mathematical terms, is the probability that an item will not operate correctly at a given time and under specified conditions. It is the complement of availability.
In probability theory and statistics, the negative binomial distribution is a discrete probability distribution that models the number of failures in a sequence of independent and identically distributed Bernoulli trials before a specified (non-random) number of successes occurs. For example, we can define rolling a 6 on some dice as a success, and rolling any other number as a failure, and ask how many failure rolls will occur before we see the third success. In such a case, the probability distribution of the number of failures that appear will be a negative binomial distribution.
In probability theory and statistics, the exponential distribution or negative exponential distribution is the probability distribution of the distance between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate; the distance parameter could be any meaningful mono-dimensional measure of the process, such as time between production errors, or length along a roll of fabric in the weaving manufacturing process. It is a particular case of the gamma distribution. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless. In addition to being used for the analysis of Poisson point processes it is found in various other contexts.
In probability theory and statistics, the Weibull distribution is a continuous probability distribution. It models a broad range of random variables, largely in the nature of a time to failure or time between events. Examples are maximum one-day rainfalls and the time a user spends on a web page.
In physics, a partition function describes the statistical properties of a system in thermodynamic equilibrium. Partition functions are functions of the thermodynamic state variables, such as the temperature and volume. Most of the aggregate thermodynamic variables of the system, such as the total energy, free energy, entropy, and pressure, can be expressed in terms of the partition function or its derivatives. The partition function is dimensionless.
Survival analysis is a branch of statistics for analyzing the expected duration of time until one event occurs, such as death in biological organisms and failure in mechanical systems. This topic is called reliability theory, reliability analysis or reliability engineering in engineering, duration analysis or duration modelling in economics, and event history analysis in sociology. Survival analysis attempts to answer certain questions, such as what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival?
In mathematical statistics, the Kullback–Leibler (KL) divergence, denoted $D_{\text{KL}}(P \parallel Q)$, is a type of statistical distance: a measure of how one probability distribution P differs from a second, reference probability distribution Q. For discrete distributions, it is defined as

$$D_{\text{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}.$$
Failure rate is the frequency with which an engineered system or component fails, expressed in failures per unit of time. It is usually denoted by the Greek letter λ (lambda) and is often used in reliability engineering.
In statistics, censoring is a condition in which the value of a measurement or observation is only partially known.
In queueing theory, a discipline within the mathematical theory of probability, the Pollaczek–Khinchine formula states a relationship between the queue length and service time distribution Laplace transforms for an M/G/1 queue. The term is also used to refer to the relationships between the mean queue length and mean waiting/service time in such a model.
In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time if these events occur with a known constant mean rate and independently of the time since the last event. It can also be used for the number of events in other types of intervals than time, and in dimension greater than 1.
A reliability block diagram (RBD) is a diagrammatic method for showing how component reliability contributes to the success or failure of a redundant system. RBD is also known as a dependence diagram (DD).
A prediction of reliability is an important element in the process of selecting equipment for use by telecommunications service providers and other buyers of electronic equipment, and it is essential during the design stage of engineering systems life cycle. Reliability is a measure of the frequency of equipment failures as a function of time. Reliability has a major impact on maintenance and repair costs and on the continuity of service.
Availability is the probability that a system will work as required when required during the period of a mission. The mission could be the 18-hour span of an aircraft flight. The mission period could also be the 3 to 15-month span of a military deployment. Availability includes non-operational periods associated with reliability, maintenance, and logistics.
Maintenance Philosophy is the mix of strategies that ensure an item works as expected when needed.
Software reliability testing is a field of software-testing that relates to testing a software's ability to function, given environmental conditions, for a particular amount of time. Software reliability testing helps discover many problems in the software design and functionality.
In queueing theory, a discipline within the mathematical theory of probability, the M/M/∞ queue is a multi-server queueing model where every arrival experiences immediate service and does not wait. In Kendall's notation it describes a system where arrivals are governed by a Poisson process and there are infinitely many servers, so jobs do not need to wait for a server. Each job has an exponentially distributed service time. It is a limit of the M/M/c queue model where the number of servers c becomes very large.
High-temperature operating life (HTOL) is a reliability test applied to integrated circuits (ICs) to determine their intrinsic reliability. This test stresses the IC at an elevated temperature, high voltage and dynamic operation for a predefined period of time. The IC is usually monitored under stress and tested at intermediate intervals. This reliability stress test is sometimes referred to as a lifetime test, device life test or extended burn in test and is used to trigger potential failure modes and assess IC lifetime.
Mean time to dangerous failure (MTTFD). In a safety system, MTTFD considers only the portion of failure modes that can lead to failures which may result in hazards to personnel, the environment, or equipment.