Availability (system)

Availability is the probability that a system will work as required, when required, during the period of a mission. The mission could be the 18-hour span of an aircraft flight. It could also be the 3- to 15-month span of a military deployment. Availability includes non-operational periods associated with reliability, maintenance, and logistics.

Availability is measured in terms of nines. Five nines (99.999%) means the system is not operating correctly for a total of just over 5 minutes over the span of one year.

Availability is only meaningful for supportable systems. As an example, availability of 99.9% means nothing after the only known source stops manufacturing a critical replacement part.

Definition

There are two kinds of availability: operational availability and predicted availability.

Operational availability is presumed to be the same as predicted availability until after operational metrics become available.

Availability

Operational availability is based on observations after at least one system has been built. This usually begins with the brassboard system that is used to complete system development, and continues with the first of kind used for live fire test and evaluation (LFTE). Organizations responsible for maintenance use this to evaluate the effectiveness of the maintenance philosophy.

Predicted availability is based on a model of the system before it is built.

Downtime is the total of all of the different contributions that compromise operation. For modeling, these are different aspects of the model, such as human-system interface for MTTR and reliability modeling for MTBF. For observation, these reflect the different areas of the organization, such as maintenance personnel and documentation for MTTR, and manufacturers and shippers for MLDT.
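The two kinds of availability can be sketched numerically. The formulas below are assumptions, since the text above does not state them explicitly: operational availability is taken as the observed fraction of time the system was up, predicted availability as MTB divided by MTB plus downtime, and downtime as the simple sum of MTTR, MLDT, and MAMDT. The function names and sample figures are hypothetical, and all times must share the same unit.

```python
def operational_availability(total_time: float, down_time: float) -> float:
    """Observed availability: fraction of the period the system was up."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    if not 0 <= down_time <= total_time:
        raise ValueError("down_time must lie between 0 and total_time")
    return (total_time - down_time) / total_time


def predicted_availability(mtb: float, mttr: float, mldt: float, mamdt: float) -> float:
    """Modeled availability: MTB / (MTB + downtime)."""
    if mtb <= 0:
        raise ValueError("mtb must be positive")
    # Assumption: the downtime contributions simply add together.
    down_time = mttr + mldt + mamdt
    return mtb / (mtb + down_time)


if __name__ == "__main__":
    # Hypothetical figures, all in hours.
    print(operational_availability(total_time=8760.0, down_time=8.76))
    print(predicted_availability(mtb=1000.0, mttr=2.0, mldt=24.0, mamdt=1.0))
```

In this sketch the logistics delay dominates the predicted figure, which illustrates why downtime is broken into separate contributions: each one points at a different part of the model or organization.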

MTB

Mean Time Between (MTB) depends upon the maintenance philosophy.

If a system is designed with automatic fault bypass, then MTB is the anticipated lifespan of the system, provided these features cover all possible failure modes (infinity for all practical purposes). Such systems continue without noticeable interruption as long as these conditions are satisfied. This is called active redundancy, which requires no maintenance to prevent mission failure. Active redundancy is required for systems that cannot be maintained, such as satellites.

If a system has no redundancy, then MTB is the inverse of the failure rate, λ: MTB = 1/λ.
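As a worked example for the non-redundant case, assuming a constant failure rate, MTB is the reciprocal of the failure rate λ. The value of λ below is hypothetical and chosen only to make the arithmetic easy to follow.

```latex
\mathrm{MTB} = \frac{1}{\lambda},
\qquad
\lambda = 2 \times 10^{-4}\ \text{failures/hour}
\;\Rightarrow\;
\mathrm{MTB} = \frac{1}{2 \times 10^{-4}\ \text{failures/hour}} = 5000\ \text{hours}
```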

Systems with spare parts that are energized but that lack automatic fault bypass are not actually redundant, because human action is required to restore operation after every failure. This depends upon condition-based maintenance (CBM) and Planned Maintenance System (PMS) support.

MTTR

Mean Time To Recover (MTTR) is the length of time required to restore operation to specification.

This includes three values: Mean Time To Discover, Mean Time To Isolate, and Mean Time To Repair.

Mean Time To Discover is the length of time that transpires between when a failure occurs and when the system users become aware of the failure. There are two maintenance philosophies associated with Mean Time To Discover: condition-based maintenance (CBM) and the Planned Maintenance System (PMS).

CBM works like your car, where an oil indicator tells you when oil pressure is too low and a temperature indicator tells you when engine temperature is too high. The time to discover a failure is effectively zero when an indicator is placed in front of a system operator.

PMS is required for silent failures that lack CBM. PMS is periodic maintenance, like when you perform diagnostic tests on your car every 90 days (or 3,000 miles). A failure, such as a broken light, may occur at any time during the 90 days, but you will not become aware of it until you perform the diagnostic test.

Mean Time To Discover is statistical when PMS is the dominant maintenance philosophy. For example, if faults are discovered by a PMS diagnostic procedure that is run every 10 days, the average fault duration before discovery will be 5 days. This creates a dependency between availability performance and labor costs: running the procedure more often shortens the time to discover but requires more labor. There is no such dependency associated with CBM.
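The 10-day example can be reproduced with a minimal sketch. It assumes that a silent failure is equally likely at any moment between two scheduled checks, so the expected delay is half the interval. The function names and the list of intervals are hypothetical.

```python
def mean_time_to_discover(pms_interval_days: float) -> float:
    """Expected delay between a silent failure and the next PMS check.

    Assumes failures are uniformly distributed over the interval.
    """
    if pms_interval_days < 0:
        raise ValueError("pms_interval_days must be non-negative")
    return pms_interval_days / 2.0


def checks_per_year(pms_interval_days: float, days_per_year: float = 365.0) -> float:
    """Number of diagnostic runs per year, a rough proxy for labor cost."""
    if pms_interval_days <= 0:
        raise ValueError("pms_interval_days must be positive")
    return days_per_year / pms_interval_days


if __name__ == "__main__":
    # Shorter intervals discover faults sooner but need more diagnostic runs.
    for interval in (90.0, 30.0, 10.0):
        delay = mean_time_to_discover(interval)
        runs = checks_per_year(interval)
        print(f"interval={interval:5.1f} d  mean delay={delay:5.1f} d  runs/year={runs:5.1f}")
```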

Mean Time To Isolate is the average length of time required to identify a setting that needs to be adjusted or a component that needs to be replaced. This is dependent on documentation, training, and technical support. This tends to be less on systems that have CBM because users can begin with the list of items connected to the indicator used to notify users about the fault. This also tends to be less on fully documented systems.

Mean Time To Repair is the average length of time required to restore operation once the fault has been isolated. For mission-critical systems, this is generally estimated by dividing the total time required to replace all parts by the number of replaceable parts.
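A minimal sketch of how these values might be combined follows. It assumes that Mean Time To Recover is the simple sum of the discover, isolate, and repair times, which the text implies but does not state, and it estimates Mean Time To Repair as the average replacement time per part as described above. The function names and the replacement times are hypothetical.

```python
from typing import Sequence


def mean_time_to_repair(replacement_hours: Sequence[float]) -> float:
    """Total time to replace every part divided by the number of parts."""
    if not replacement_hours:
        raise ValueError("at least one replaceable part is required")
    return sum(replacement_hours) / len(replacement_hours)


def mean_time_to_recover(discover: float, isolate: float, repair: float) -> float:
    """Assumed composition: the three contributions simply add together."""
    return discover + isolate + repair


if __name__ == "__main__":
    # Hypothetical replacement times, in hours, for three replaceable parts.
    repair = mean_time_to_repair([0.5, 1.5, 4.0])
    # With CBM the discover time is taken as zero; isolation takes one hour here.
    print(mean_time_to_recover(discover=0.0, isolate=1.0, repair=repair))
```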

MLDT

Mean Logistics Delay Time is the average time required to obtain replacement parts from the manufacturer and transport those parts to the work site.

MAMDT

Mean Active Maintenance Down Time is associated with the Planned Maintenance System support philosophy. This is the average amount of time during which the system is not 100% operational because of diagnostic testing that requires down time.

For example, an automobile that requires 1 day of maintenance every 90 days has a Mean Active Maintenance Down Time of just over 1%.

This is separate from the type of down time associated with repair activities.
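The automobile figure can be checked with a short calculation. This sketch assumes the 1 day of maintenance falls within each 90-day cycle, so the fraction is 1/90. The function name is hypothetical.

```python
def maintenance_downtime_fraction(down_days: float, cycle_days: float) -> float:
    """Fraction of each maintenance cycle spent in planned diagnostic downtime."""
    if cycle_days <= 0:
        raise ValueError("cycle_days must be positive")
    return down_days / cycle_days


if __name__ == "__main__":
    fraction = maintenance_downtime_fraction(down_days=1.0, cycle_days=90.0)
    print(f"{fraction:.2%}")  # prints 1.11%
```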

Nines

Availability expectations are described in terms of nines.

The following table shows the anticipated down-time for different availabilities for a mission time of one year. This is the typical time-span used with commercial systems.

Availability    Approximate down-time per year
90%             40 days
99%             4 days
99.9%           9 hours
99.99%          50 minutes
99.999%         5 minutes
99.9999%        30 seconds
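The rounded figures in the table can be derived directly from the availability percentage. The sketch below assumes a 365-day year; the function name and constant are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Expected down-time per year, in seconds, for a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return (1.0 - availability_percent / 100.0) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for percent in (90.0, 99.0, 99.9, 99.99, 99.999, 99.9999):
        seconds = downtime_seconds_per_year(percent)
        print(f"{percent:>8}%  {seconds / 86400:10.4f} days  {seconds / 60:12.2f} minutes")
```

Computed this way, 90% corresponds to 36.5 days and 99.999% to about 5.26 minutes, so the table entries are best read as round-number approximations.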

Supportability

Systems that require maintenance are said to be supportable if they satisfy the following criteria.

Systems that lack any of these requirements are said to be unsupportable.

Mission Failure

Mission failure is the result of trying to use a system in its normal mode when it is not working.

Apart from human error, mission failure results from the following causes.

References

This article incorporates public domain material from Federal Standard 1037C. General Services Administration. (in support of MIL-STD-188).