Mean time to recovery


Mean time to recovery (MTTR) [1] [2] [3] is the average time a device takes to recover from a failure. Such devices range from self-resetting fuses, where the MTTR is very short (probably seconds), to whole systems that have to be repaired or replaced.

The MTTR is usually specified in a maintenance contract, where the user pays more for a system with an MTTR of 24 hours than for one with an MTTR of, say, 7 days. This does not mean the supplier guarantees to have the system up and running again within 24 hours (or 7 days) of being notified of the failure. It does mean that the average repair time will tend towards 24 hours (or 7 days). A more useful measure for a maintenance contract is the maximum time to recovery, which can be easily measured and for which the supplier can be held accountable.
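The difference between an average and a maximum can be illustrated with a minimal Python sketch. The function name `summarize_recovery` and the incident durations are hypothetical, chosen only so that the mean works out to the 24-hour figure used above.

```python
from statistics import mean


def summarize_recovery(hours):
    """Return (mean, maximum) recovery time for a list of incident durations in hours."""
    if not hours:
        raise ValueError("at least one recovery time is required")
    return mean(hours), max(hours)


# Hypothetical recovery times, in hours, for five incidents (illustrative data only).
recovery_hours = [6, 12, 20, 30, 52]

mttr, worst = summarize_recovery(recovery_hours)
print(f"MTTR: {mttr:.1f} h, maximum time to recovery: {worst} h")
```

With these made-up figures the mean is 24 hours, so a 24-hour MTTR is met even though one incident took 52 hours to resolve. A contract based on maximum time to recovery would have flagged that incident.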

Note that some suppliers interpret MTTR as "mean time to respond", while others take it to mean "mean time to replace/repair/recover/resolve". The former indicates only that the supplier will acknowledge a problem and initiate mitigation within a certain timeframe. Some systems may have an MTTR of zero, which means that they have redundant components that can take over the instant the primary one fails (see RAID, for example). However, the failed device in such a redundant configuration still needs to be returned to service, so the device itself has a non-zero MTTR even if the system as a whole, through redundancy, has an MTTR of zero. As long as service is maintained, this is a minor issue.


Related Research Articles

In reliability engineering, the term availability has the following meanings:

Unavailability, in mathematical terms, is the probability that an item will not operate correctly at a given time and under specified conditions. It opposes availability.

RAID is a data storage virtualization technology that combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy, performance improvement, or both. This is in contrast to the previous concept of highly reliable mainframe disk drives referred to as "single large expensive disk" (SLED).

Mean time between failures (MTBF) is the predicted elapsed time between inherent failures of a mechanical or electronic system during normal system operation. MTBF can be calculated as the arithmetic mean (average) time between failures of a system. The term is used for repairable systems while mean time to failure (MTTF) denotes the expected time to failure for a non-repairable system.

A service-level agreement (SLA) is an agreement between a service provider and a customer. Particular aspects of the service – quality, availability, responsibilities – are agreed between the service provider and the service user. The most common component of an SLA is that the services should be provided to the customer as agreed upon in the contract. As an example, Internet service providers and telcos will commonly include service level agreements within the terms of their contracts with customers to define the level(s) of service being sold in plain language terms. In this case, the SLA will typically have a technical definition of mean time between failures (MTBF), mean time to repair or mean time to recovery (MTTR); identifying which party is responsible for reporting faults or paying fees; responsibility for various data rates; throughput; jitter; or similar measurable details.

Failure rate is the frequency with which an engineered system or component fails, expressed in failures per unit of time. It is usually denoted by the Greek letter λ (lambda) and is often used in reliability engineering.

Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to function without failure. Reliability describes the ability of a system or component to function under stated conditions for a specified period of time. Reliability is closely related to availability, which is typically described as the ability of a component or system to function at a specified moment or interval of time.

Fault tolerance is the ability of a system to maintain proper operation in the event of failures or faults in one or more of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure can lead to total breakdown. Fault tolerance is particularly sought after in high-availability, mission-critical, or even life-critical systems. The ability to maintain functionality when portions of a system break down is referred to as graceful degradation.

In computing, data recovery is a process of retrieving deleted, inaccessible, lost, corrupted, damaged, or formatted data from secondary storage, removable media or files, when the data stored in them cannot be accessed in a usual way. The data is most often salvaged from storage media such as internal or external hard disk drives (HDDs), solid-state drives (SSDs), USB flash drives, magnetic tapes, CDs, DVDs, RAID subsystems, and other electronic devices. Recovery may be required due to physical damage to the storage devices or logical damage to the file system that prevents it from being mounted by the host operating system (OS).

Mean time to repair (MTTR) is a basic measure of the maintainability of repairable items. It represents the average time required to repair a failed component or device. Expressed mathematically, it is the total corrective maintenance time for failures divided by the total number of corrective maintenance actions for failures during a given period of time. It generally does not include lead time for parts not readily available or other Administrative or Logistic Downtime (ALDT).
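The definition above can be written as a formula. The symbols are assumptions introduced here for illustration: $t_i$ is the corrective maintenance time of the $i$-th action and $n$ is the number of corrective maintenance actions in the period, with administrative and logistic delays excluded from each $t_i$.

```latex
\[
  \mathrm{MTTR} = \frac{\sum_{i=1}^{n} t_i}{n}
\]
```

For example, three repairs taking 2, 3 and 4 hours give an MTTR of $(2 + 3 + 4)/3 = 3$ hours.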

A hot spare or warm spare or hot standby is used as a failover mechanism to provide reliability in system configurations. The hot spare is active and connected as part of a working system. When a key component fails, the hot spare is switched into operation. More generally, a hot standby can be used to refer to any device or system that is held in readiness to overcome an otherwise significant start-up delay.

High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.

In organizational management, mean down time (MDT) is the average time that a system is non-operational. This includes all downtime associated with repair, corrective and preventive maintenance, self-imposed downtime, and any logistics or administrative delays.

The term downtime is used to refer to periods when a system is unavailable. The unavailability is the proportion of a time-span that a system is unavailable or offline. This is usually a result of the system failing to function because of an unplanned event, or because of routine maintenance.
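As a rough sketch of the proportion described above, the following Python snippet divides total downtime by the length of the observation period. The function name `unavailability`, the outage durations and the 30-day period are hypothetical values chosen for illustration.

```python
def unavailability(downtime_hours, period_hours):
    """Return the fraction of the period during which the system was down."""
    if period_hours <= 0:
        raise ValueError("period must be positive")
    total_down = sum(downtime_hours)
    if total_down > period_hours:
        raise ValueError("downtime cannot exceed the observation period")
    return total_down / period_hours


# Hypothetical outages (hours) observed over a 30-day period.
outages = [2.0, 0.5, 4.5]
period = 30 * 24

u = unavailability(outages, period)
print(f"unavailability: {u:.4%}, availability: {1 - u:.4%}")
```

The complement, `1 - u`, is the corresponding availability over the same period.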

A spare part, spare, service part, repair part, or replacement part, is an interchangeable part that is kept in an inventory and used for the repair or refurbishment of defective equipment/units. Spare parts are an important feature of logistics engineering and supply chain management, often comprising dedicated spare parts management systems.

A prediction of reliability is an important element in the process of selecting equipment for use by telecommunications service providers and other buyers of electronic equipment, and it is essential during the design stage of engineering systems life cycle. Reliability is a measure of the frequency of equipment failures as a function of time. Reliability has a major impact on maintenance and repair costs and on the continuity of service.

Availability is the probability that a system will work as required when required during the period of a mission. The mission could be the 18-hour span of an aircraft flight. The mission period could also be the 3 to 15-month span of a military deployment. Availability includes non-operational periods associated with reliability, maintenance, and logistics.

Maintenance philosophy is the mix of strategies that ensure an item works as expected when needed.

Software reliability testing is a field of software testing concerned with the ability of software to function, under given environmental conditions, for a particular amount of time. Software reliability testing helps discover many problems in software design and functionality.

RAMP Simulation Software for Modelling Reliability, Availability and Maintainability (RAM) is a computer software application developed by WS Atkins specifically for the assessment of the reliability, availability, maintainability and productivity characteristics of complex systems that would otherwise prove too difficult, cost too much or take too long to study analytically. The name RAMP is an acronym standing for Reliability, Availability and Maintainability of Process systems.

References

  1. "Also refer to 'Mean Time To Repair' or 'Mean Time To Restore'".
  2. Intel call for Mean-Time-to-Repair on page 4, left. "Unknown" (PDF). Retrieved 2020-05-07. [dead link]
  3. Atlassian. "MTBF, MTTR, MTTF, MTTA: Understanding incident metrics". Retrieved 2023-09-26.