Maintenance philosophy

Maintenance philosophy is the mix of strategies that ensure an item works as expected when needed.[1][2]

Definition

Maintenance is a form of risk management that is required if and only if an item fails to satisfy the minimum level of specification performance when the item or system is required.

Maintenance is optional and may not be required if the partially failed item still satisfies the minimum level of specification performance or if the item is not required for a span of time.

Maintenance takes place in four phases: failure detection, fault isolation, corrective action, and operational verification.

An item is said to be degraded when faults exist but normal operation can continue.

Automatic recovery is used to avoid the need for maintenance.

Automatic recovery from failure is required for systems and resources that cannot be accessed during deployment, such as rockets, missiles, satellites, submersibles, and items that are buried or encapsulated. There are multiple approaches.

Redundant items increase failure rate and reduce reliability if recovery is not automatic.

Failure Detection

Failure Detection involves two different maintenance strategies that interact with life-cycle cost and availability.

Conditional

Conditional maintenance relies on indicators that tell users when an item has failed or is degraded.

  • System is totally failed and cannot operate as expected
  • System will function as expected but is degraded

This requires automatic fault detection and reporting.

Condition Based Maintenance (CBM) requires clearly observable or audible notification that is suitable for unsophisticated and untrained users, such as the following.

  • Colored indicator (red or yellow light)
  • Display showing the word "failed" or "degraded" next to the item name
  • Gauge with clearly defined green, yellow, and red bands for normal versus faulted
  • Audible indications, such as a buzzer, bell, or synthesized voice

Recovery maintenance actions begin after notification occurs.

Items are said to be instrumented when notification takes place automatically upon failure. There are two approaches.

  • End-To-End (ETE)
  • Self-reporting devices

ETE testing involves an automated process that periodically injects a test stimulus into the item, then examines the outputs to determine whether they satisfy the level of performance required by the specification. This may be intrusive and can briefly interfere with normal operation.
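A minimal sketch of such a periodic end-to-end check, assuming hypothetical inject_test_signal and read_outputs hooks and an illustrative specification threshold; the actual stimulus and measurement depend entirely on the item being tested.

```python
import time

def ete_check(inject_test_signal, read_outputs, spec_minimum):
    """Inject a known stimulus, then compare the measured output to the specification."""
    inject_test_signal()              # may briefly interfere with normal operation
    measured = read_outputs()
    return measured >= spec_minimum

def ete_monitor(inject_test_signal, read_outputs, spec_minimum, period_seconds=3600):
    """Run the end-to-end check once per period and report any failure (CBM notification)."""
    while True:
        if not ete_check(inject_test_signal, read_outputs, spec_minimum):
            print("ETE test failed: item is below specification")
        time.sleep(period_seconds)
```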

Self-reporting devices include automatic built-in-test (BIT) features that are less intrusive.

Items without the kinds of notifications suitable for CBM have silent failure modes that require periodic preventative maintenance actions.

Periodic

Probability of operational failure accumulates as time passes until diagnostic or preventative maintenance actions eliminate any actual failures.

Operational failure will eventually occur when an item is used in its normal mode of operation if there is no intervention. The procedures associated with periodic maintenance are generally called a Periodic Maintenance System (PMS).

There is risk that the system will not work as expected, and this risk grows as time passes due to increasing possibility of silent faults that cause operational failure.

Periodic maintenance actions control the risk of operational failure. This relies on invasive procedures that render a system inoperable for a brief period while users run manual diagnostic or preventative procedures. The following are a few examples.

  • Calibration
  • Built In Test (BIT)
  • External Diagnostics (instrumentation)
  • System Operational Test (SOT)

The item is down and unavailable for normal operation while a periodic maintenance procedure is being performed.

Failure is statistical. There is a random chance that the system or item will not function when required. Reliability declines as time passes, and the probability of failure increases until action is taken.

The item will eventually fail if there is no intervention.

Periodic maintenance increasingly reduces operational failure risk as the procedures are used more often. Average reliability improves as the time between maintenance actions is reduced.

As an example, an item with no CBM features will work as expected about 90% of the time if periodic maintenance is performed about five times within each MTBF interval.
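A small check of that figure, assuming the exponential failure model used elsewhere in this article: the average reliability over a maintenance interval T is (MTBF/T)(1 − e^(−T/MTBF)), which is roughly 0.91 when T = MTBF/5. The MTBF value below is only illustrative.

```python
import math

def average_reliability(mtbf_hours, interval_hours):
    """Mean of R(t) = exp(-t/MTBF) over one maintenance interval (exponential model)."""
    ratio = interval_hours / mtbf_hours
    return (1 - math.exp(-ratio)) / ratio

# Periodic maintenance performed five times per MTBF interval (interval = MTBF / 5)
print(average_reliability(mtbf_hours=1000, interval_hours=200))  # ~0.906
```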

Fault Isolation

Fault Isolation is the strategy used to identify the root cause for a failure. There are two methods.

Automatic Fault Isolation

Automatic Fault Isolation identifies the root cause for failure with no manual intervention.

This is generally used to control redundant items when it is necessary to automatically bypass failures.

Manual Fault Isolation

Manual Fault Isolation is when maintenance personnel must identify root cause for a failure. This usually requires the following.

  • Manual diagnostic tests
  • Test equipment
  • Spare parts
  • Documentation
  • Training

Device instrumentation used with CBM is generally used to reduce the time and effort required to isolate root cause.

Corrective Action

Corrective Action is the activity that restores performance for the item or system after a failure.

There are two kinds of corrective action.

Automatic Corrective Action

Automatic correction is possible for redundant systems when fault-detection, fault-isolation, and fault-bypass are all automatic.

Automatic corrective action is also called Active Recovery and Self Healing.

This technique can be used to increase the effective MTBF to match the length of time an item will be required to operate without maintenance.

As an example, part failures are expected for space vehicles that may be required to operate correctly for as much as 10 years in a hostile environment.

Redundancy can be achieved by launching a large number of satellites, which is a practical solution for things like the Global Positioning System (GPS) because each vehicle occupies a slightly different orbit.

This is not possible for geosynchronous orbit, where all functions must be accomplished by one vehicle that maintains a stable position over one specific spot on the Earth's surface. Satellites intended to operate in geosynchronous orbit must incorporate active recovery that prevents total failure when one or more parts fail.

Automatic Corrective Action incorporates all of the spare parts into the design to accommodate all of the failures that can be anticipated during a specific period of time.
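A hedged sketch of how such a built-in spare count might be sized, assuming failures arrive as a Poisson process with rate 1/MTBF; the MTBF and mission length below are illustrative assumptions, not taken from the text.

```python
import math

def spares_needed(mtbf_years, mission_years, confidence=0.99):
    """Smallest spare count k such that P(failures <= k) >= confidence under a Poisson model."""
    expected = mission_years / mtbf_years          # expected failures during the mission
    cumulative, k = 0.0, 0
    while True:
        cumulative += math.exp(-expected) * expected ** k / math.factorial(k)
        if cumulative >= confidence:
            return k
        k += 1

# Example: a 10-year mission using a part with an assumed 4-year MTBF
print(spares_needed(mtbf_years=4, mission_years=10))  # 7 spares for 99% confidence
```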

Manual Corrective Action

Manual corrective action is when trained maintenance personnel perform a calibration or replacement action to restore operation.

Corrective actions for redundant items include manual reconfiguration when automatic fault bypass is not available, which depends upon maintenance coverage.

Failed part replacement depends upon the Lowest Replaceable Unit (LRU). This could be a part inside an item, or it could be the whole item. This decision is made based on which is less expensive to replace.

As an example, a new disk drive costs about $200 to purchase, the technical assistance to replace the disk drive is $500, and a refurbished computer costs about $600. If you replace your own disk drive and install your own operating system, then it is less expensive to purchase the disk drive. If you need technical help then it is less expensive to replace the whole computer.
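A small worked version of that comparison; the dollar figures come straight from the example above, and the decision rule is simply whichever total is cheaper.

```python
def cheaper_repair(part_cost, labor_cost, whole_unit_cost, do_it_yourself):
    """Compare replacing the failed part against replacing the whole item (LRU decision)."""
    part_total = part_cost + (0 if do_it_yourself else labor_cost)
    return "replace part" if part_total < whole_unit_cost else "replace whole item"

print(cheaper_repair(200, 500, 600, do_it_yourself=True))   # replace part ($200 vs $600)
print(cheaper_repair(200, 500, 600, do_it_yourself=False))  # replace whole item ($700 vs $600)
```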

Operational Verification

Operational Verification is any action that is performed to verify that the item or system is operational.

This generally involves using the system in its normal mode of operation, which could involve actual operation or simulated operation.

Reliability

Maintenance is closely associated with reliability because maintenance is required to restore capability that has been lost due to failure.

Electronic devices decay in a way that is mathematically equivalent to radioactive decay processes for unstable atoms.

Electronic failure is governed by random processes, where Mean Time Between Failure (MTBF) identifies the average number of hours until failure occurs. Lambda (λ), the failure rate, identifies the number of failures expected per hour and is the reciprocal of the MTBF.

Reliability is the probability that a failure will not occur during a specific span of time.
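Under this random-failure (exponential) model, the reliability over a span t is e^(−t/MTBF); a brief illustration with an assumed MTBF:

```python
import math

def reliability(mtbf_hours, span_hours):
    """Probability that no failure occurs during span_hours under the exponential model."""
    lam = 1.0 / mtbf_hours              # expected failures per hour
    return math.exp(-lam * span_hours)

print(reliability(mtbf_hours=10000, span_hours=500))  # ~0.95
```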

Failure rate calculations rely on logarithmic math that simplifies the arithmetic, in a way that is very similar to the type of analysis used for electronic circuits.

Overall failure rate for a complex item is the sum of all the failure rates for all of the individual components in the item. This applies to situations where failure of one component causes the entire item to fail. The type of calculation is similar to a series electronic circuit.

Overall failure rate for items with full redundant overlap is the inverse of the sum of MTBF for all of the individual redundant items. This applies to situations where all of the components in the item must all fail before the item fails. The type of calculation is similar to a parallel electronic circuit.
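A short sketch implementing the two rules stated above; the component values are illustrative. Series components add failure rates, while fully overlapping redundant components use the inverse of the summed MTBF values as described above.

```python
def series_failure_rate(failure_rates):
    """Item fails when any one component fails: failure rates add (series circuit analogue)."""
    return sum(failure_rates)

def redundant_failure_rate(mtbfs):
    """Item fails only when every redundant component has failed (parallel circuit analogue),
    using the rule stated above: the inverse of the summed MTBF values."""
    return 1.0 / sum(mtbfs)

print(series_failure_rate([1e-5, 2e-5, 5e-6]))  # 3.5e-05 failures per hour
print(redundant_failure_rate([10000, 10000]))   # 5e-05 failures per hour
```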

A reliability block diagram is used to construct a model for large items. This provides traceability when funding and manpower requirements are identified using reliability calculations.

Failure rate for silicon and carbon devices doubles for each 10 °C temperature rise, so an electronic device operating 60 °C hotter than another of the same kind will fail about 64 times more frequently. This relationship holds true above normal room temperature.
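A small calculation following the doubling rule above; the base failure rate and temperatures are arbitrary illustrations.

```python
def failure_rate_at_temperature(base_rate, base_temp_c, temp_c, doubling_interval_c=10):
    """Failure rate doubles for each doubling_interval_c rise above base_temp_c."""
    return base_rate * 2 ** ((temp_c - base_temp_c) / doubling_interval_c)

# A 60 C rise is six doublings: 2 ** 6 = 64 times the base failure rate
print(failure_rate_at_temperature(base_rate=1e-6, base_temp_c=25, temp_c=85))  # 6.4e-05
```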

Transportation reliability is similar, but values are expressed in terms of distance, such as faults per mile or faults per kilometer.

Failure rate can be expressed in terms of the number of cycles. Thermal shock caused by heating and cooling can induce failure when power is cycled on and off. Most mechanical switches are built to operate 10,000 cycles before failure, which is about 30 years for a cycle rate of 1 action per day.

Distance, cycle, and decay reliability all make separate contributions that affect the overall failure rate.

Availability

Availability is generally used with systems that incorporate periodic maintenance.

Availability is the probability that an item will operate correctly during a period of time when used at random times during that period.

Available time is the time while the system is fully operational. Down time is the time while the system is unavailable for normal use, and it consists of the time while periodic maintenance is being performed plus the amount of time while the system is faulted.
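A minimal calculation using the quantities defined above, where down time covers both periodic maintenance and faulted time; the hour values are illustrative only.

```python
def availability(available_hours, maintenance_down_hours, faulted_hours):
    """Fraction of the period during which the system was fully operational."""
    total = available_hours + maintenance_down_hours + faulted_hours
    return available_hours / total

# Example year: 8,400 hours up, 200 hours of periodic maintenance, 160 hours faulted
print(availability(8400, 200, 160))  # ~0.96
```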

Availability calculations are meaningful for items with replaceable parts only when failure modes have adequate coverage.

Readiness

Readiness is meaningful when the item does not require down time for periodic maintenance. This is a useful measurement for items that incorporate automatic recovery or condition based maintenance.

Readiness is the probability that an item will operate as expected when used at any random time while the item is in the correct mode of operation.

Mean Time To Recover from manual actions is generally measured or estimated. The following is an example of the kind of values that could be used for estimating the mechanical portion of the recovery time associated with replacing a failed circuit card; a summation sketch follows the list.

  • 120 seconds
  • remove 15 seconds; replace 30 seconds
  • remove 30 seconds; replace 60 seconds
  • disconnect 15 seconds; reconnect 60 seconds
  • remove 30 seconds; insert 120 seconds
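A small sketch that totals the mechanical times listed above into a hands-on recovery estimate and converts it into a readiness figure using the common steady-state approximation MTBF / (MTBF + MTTR); the formula and the MTBF value are assumptions for illustration, not taken from the text.

```python
# Mechanical recovery steps from the list above, in seconds
steps = [120, 15 + 30, 30 + 60, 15 + 60, 30 + 120]
mttr_hours = sum(steps) / 3600.0        # about 0.13 hours of hands-on time

def readiness(mtbf_hours, mttr_hours):
    """Steady-state approximation: fraction of time the item is ready for use."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(readiness(mtbf_hours=5000, mttr_hours=mttr_hours))  # ~0.99997
```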

Readiness calculations are meaningful for items with replaceable parts only when failure modes have adequate coverage.

Coverage

Maintenance coverage evaluates the proportion of faults detected by CBM and PMS.

A rough estimate of coverage can be made by observing the ratio between operational failures and maintenance actions.
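One way to interpret that ratio, offered as an assumption rather than a standard definition: count the failures first caught by CBM or PMS maintenance actions against the failures that first appeared as operational failures.

```python
def coverage_estimate(detected_by_maintenance, operational_failures):
    """Rough coverage: share of all failures that maintenance detected before operations did."""
    total = detected_by_maintenance + operational_failures
    return detected_by_maintenance / total if total else 0.0

print(coverage_estimate(detected_by_maintenance=47, operational_failures=3))  # 0.94
```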

Availability calculations, readiness calculations, and related claims are only valid if coverage exceeds availability.

Military Versus Commercial

Military maintenance philosophy versus commercial.

Military systems and large commercial systems share reliability constraints.

The ability of a military system to continue operating after battle damage is called survivability.

Military Maintenance Policy (MMP) is required for defense systems. Designs typically include redundancy with automatic fault detection, automatic fault isolation, and automatic fault bypass. These features reconfigure the system without human intervention after combat damage or normal failure.

Most Commercial Off The Shelf (COTS) items are deployed in a benign environment, but electronic devices fail much like constant random battle damage. This effect grows worse as system size grows.

Excessive down-time is a type of design defect that impacts all large systems.

As an example, if a system is built from 1,000 individual computers each with a 3-year Mean Time Between Failure (MTBF), then the whole system will have an MTBF of 1 day. If Mean Time To Repair (MTTR) is 3 days, then the system will never work.
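Checking the arithmetic of that example with the series rule from the Reliability section and the conventional approximation availability ≈ MTBF / (MTBF + MTTR):

```python
unit_mtbf_days = 3 * 365                      # 3-year MTBF per computer
system_mtbf_days = unit_mtbf_days / 1000      # series of 1,000 computers: ~1.1 days
mttr_days = 3

availability = system_mtbf_days / (system_mtbf_days + mttr_days)
print(system_mtbf_days, availability)         # ~1.1 days between failures, ~0.27 availability
```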

If the same system includes 1,010 computers, then failure will be rare if the system includes automatic fault detection, automatic fault isolation, and automatic fault bypass.

This shows why large commercial systems require the same kind of maintenance philosophy as military systems.


References

  1. "New Reliability Policy Issued". Defense Acquisition University.
  2. "DoN Issuances". Department of the Navy. Archived from the original on 2013-03-17.