Intermittent fault

An intermittent fault, often called simply an "intermittent" (or, anecdotally, an "interfailing"), is a malfunction that occurs at intervals, usually irregular, in a device or system that functions normally at other times. Intermittent faults are common to all branches of technology, including computer software. An intermittent fault is caused by several contributing factors, some of which may be effectively random, occurring simultaneously. The more complex the system or mechanism involved, the greater the likelihood of an intermittent fault.

Intermittent faults are not easily repeatable because of their complicated behavioral patterns. They are sometimes referred to as "soft" failures, since they do not manifest themselves all the time and disappear in an unpredictable manner. In contrast, "hard" failures are permanent failures that develop over a period of time (or are sometimes instantaneous). They have a specific failure site (location of failure), mode (how the failure manifests itself), and mechanism, and there is no unpredictable recovery for the failed system. Since intermittent faults are not easily repeatable, it is more difficult to conduct a failure analysis for them, understand their root causes, or isolate their failure site than it is for permanent failures.[1]

Intermittent failures can be a cause of no-fault-found (NFF) occurrences in electronic products and systems. NFF implies that a failure (fault) occurred, or was reported to have occurred, during a product's use; the product was analyzed or tested to confirm the failure, but no failure or fault could be found. A common example of the NFF phenomenon occurs when a computer "hangs": clearly a failure has occurred, yet if the computer is rebooted, it often works again. The impact of NFF and intermittent failures can be profound. Because of their elusive nature, manufacturers may assume a cause rather than spend the time and cost to determine the root cause. For example, a hard drive supplier claimed NFFs were not failures and allowed all NFF products to be returned to the field. It was later determined that these products had a significantly higher return rate, suggesting that the NFF condition was actually a result of intermittent failures in the product. The result was increased maintenance costs, decreased equipment availability, increased customer inconvenience, reduced customer confidence, damaged company reputation, and, in some cases, potential safety hazards.[2]

A simple example of an effectively random cause in a physical system is a borderline electrical connection in the wiring or a component of a circuit, where (cause 1, the cause that must be identified and rectified) two conductors may touch subject to (cause 2, which need not be identified) a minor change in temperature, vibration, orientation, voltage, etc. (Sometimes this is described as an "intermittent connection" rather than "fault".) In computer software a program may (cause 1) fail to initialise a variable which is required to be initially zero; if the program is run in circumstances such that memory is almost always clear before it starts, it will malfunction on the rare occasions that (cause 2) the memory where the variable is stored happens to be non-zero beforehand.
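The software case above can be sketched in Python. This is a hypothetical illustration, not from any real codebase: the code works whenever its reused storage happens to start at zero (cause 1 stays hidden), and misbehaves when unrelated activity has left that storage non-zero (cause 2).

```python
# Hypothetical sketch of the two co-occurring causes described above.
# "scratch" stands in for memory the program reuses without clearing.
scratch = {"count": 0}  # happens to be zero before the first run

def count_to_five():
    # Cause 1 (the bug to find and fix): the code assumes the stored
    # count is already zero instead of initialising it explicitly.
    for _ in range(5):
        scratch["count"] += 1
    return scratch["count"]

print(count_to_five())  # storage happened to be clear: prints 5

# Cause 2 (effectively random): unrelated activity leaves the
# storage non-zero before the next run.
scratch["count"] = 100
print(count_to_five())  # the same code now prints 105
```

The fix for cause 1 is simply to zero the count at the start of `count_to_five()`; once that is done, cause 2 no longer matters.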

Intermittent faults are notoriously difficult to identify and repair ("troubleshoot") because no individual contributing factor creates the problem alone, so the factors can only be identified while the malfunction is actually occurring. The person capable of identifying and solving the problem is seldom the usual operator. Because the timing of the malfunction is unpredictable, and both device or system downtime and engineers' time incur cost, the fault is often simply tolerated if it is not too frequent, unless it causes unacceptable problems or dangers. For example, an intermittent fault in critical equipment such as medical life-support systems could kill a patient, and in aeronautics it could cause a flight to be aborted or, in some cases, an aircraft to crash.

If an intermittent fault occurs for long enough during troubleshooting, it can be identified and resolved in the usual way.


Recent efforts in U.S. military weapon-system testing use a technique known as Certification Test Protocols (CTP). The U.S. Army has implemented 4-wire Kelvin resistance measurements, stimulating the wiring paths with a set decade method using current. With automated testing, the testing event takes seconds to minutes for multiple wiring paths. The results are then compared against each other to locate the degraded condition. Over more than six years of use across the U.S. Department of Defense, the method has effectively detected the root cause of intermittent system faults. The CTP-type measurement method does not require environmental chambers or vibration of the weapon system under test.
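The comparison step described above might be sketched as follows. The path names, readings, and tolerance are illustrative assumptions, not taken from any CTP specification: repeated low-resistance readings for several wiring paths are compared against each other, and a path whose resistance deviates sharply from its peers is flagged as degraded.

```python
from statistics import median

def flag_degraded(readings_mohm, tolerance=5.0):
    """Flag paths whose resistance exceeds the group median by more
    than `tolerance` milliohms (illustrative threshold)."""
    m = median(readings_mohm.values())
    return {path: r for path, r in readings_mohm.items() if r - m > tolerance}

# Hypothetical 4-wire Kelvin readings, in milliohms, for four wiring paths.
readings = {"P1": 12.1, "P2": 11.9, "P3": 47.3, "P4": 12.4}
print(flag_degraded(readings))  # P3 stands out: {'P3': 47.3}
```

Comparing paths against each other, rather than against an absolute limit, is what lets this style of measurement catch a degraded connection without an environmental chamber: the healthy paths supply the baseline.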

Troubleshooting techniques

Various techniques are used to resolve intermittent faults; several of them are described in the related articles below.
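One generic approach is a soak test: exercise the suspect operation in a tight loop and record when it fails, so the fault can be caught "in the act". The sketch below is a hypothetical stand-in for the operation under test, with a failure rate of roughly 1% chosen for illustration.

```python
import random

# Minimal soak-test sketch. "device_responds" stands in for the real
# operation under test; here it fails roughly 1% of the time.
def device_responds(rng):
    return rng.random() >= 0.01  # False means "the fault occurred"

def soak_test(trials=100_000, seed=42):
    rng = random.Random(seed)  # fixed seed so a failing run can be replayed
    return [i for i in range(trials) if not device_responds(rng)]

failures = soak_test()
print(f"{len(failures)} failures in 100000 trials, first at trial {failures[0]}")
```

With a fixed seed this run is repeatable, which is exactly the property a real intermittent fault lacks; in practice the loop would log timestamps and environmental conditions (temperature, vibration, supply voltage) alongside each failure, so that the contributing factors can be correlated with the moments the malfunction occurs.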

Related Research Articles

<span class="mw-page-title-main">Electromagnetic compatibility</span> Electrical engineering concept

Electromagnetic compatibility (EMC) is the ability of electrical equipment and systems to function acceptably in their electromagnetic environment, by limiting the unintentional generation, propagation and reception of electromagnetic energy which may cause unwanted effects such as electromagnetic interference (EMI) or even physical damage to operational equipment. The goal of EMC is the correct operation of different equipment in a common electromagnetic environment. It is also the name given to the associated branch of electrical engineering.

<span class="mw-page-title-main">Time-domain reflectometer</span> Electronic instrument

A time-domain reflectometer (TDR) is an electronic instrument used to determine the characteristics of electrical lines by observing reflected pulses.

<span class="mw-page-title-main">Overclocking</span> Practice of increasing the clock rate of a computer to exceed that certified by the manufacturer

In computing, overclocking is the practice of increasing the clock rate of a computer to exceed that certified by the manufacturer. Commonly, operating voltage is also increased to maintain a component's operational stability at accelerated speeds. Semiconductor devices operated at higher frequencies and voltages increase power consumption and heat. An overclocked device may be unreliable or fail completely if the additional heat load is not removed or power delivery components cannot meet increased power demands. Many device warranties state that overclocking or over-specification voids any warranty, but some manufacturers allow overclocking as long as it is done (relatively) safely.

Common and special causes are the two distinct origins of variation in a process, as defined in the statistical thinking and methods of Walter A. Shewhart and W. Edwards Deming. Briefly, "common causes", also called natural patterns, are the usual, historical, quantifiable variation in a system, while "special causes" are unusual, not previously observed, non-quantifiable variation.

<span class="mw-page-title-main">Arc-fault circuit interrupter</span> Circuit breaker that protects against intermittent faults associated with arcing

An arc-fault circuit interrupter (AFCI) or arc-fault detection device (AFDD) is a circuit breaker that breaks the circuit when it detects the electric arcs that are a signature of loose connections in home wiring. Loose connections, which can develop over time, can sometimes become hot enough to ignite house fires. An AFCI selectively distinguishes between a harmless arc and a potentially dangerous arc.

Failure mode and effects analysis is the process of reviewing as many components, assemblies, and subsystems as possible to identify potential failure modes in a system and their causes and effects. For each component, the failure modes and their resulting effects on the rest of the system are recorded in a specific FMEA worksheet. There are numerous variations of such worksheets. An FMEA can be a qualitative analysis, but may be put on a quantitative basis when mathematical failure rate models are combined with a statistical failure mode ratio database. It was one of the first highly structured, systematic techniques for failure analysis. It was developed by reliability engineers in the late 1950s to study problems that might arise from malfunctions of military systems. An FMEA is often the first step of a system reliability study.

Troubleshooting is a form of problem solving, often applied to repair failed products or processes on a machine or a system. It is a logical, systematic search for the source of a problem in order to solve it, and make the product or process operational again. Troubleshooting is needed to identify the symptoms. Determining the most likely cause is a process of elimination—eliminating potential causes of a problem. Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.

Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to function without failure. Reliability describes the ability of a system or component to function under stated conditions for a specified period of time. Reliability is closely related to availability, which is typically described as the ability of a component or system to function at a specified moment or interval of time.

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers.

Failure analysis is the process of collecting and analyzing data to determine the cause of a failure, often with the goal of determining corrective actions or liability. According to Bloch and Geitner, "machinery failures reveal a reaction chain of cause and effect… usually a deficiency commonly referred to as the symptom…". Failure analysis can save money, lives, and resources if done correctly and acted upon. It is an important discipline in many branches of manufacturing industry, such as the electronics industry, where it is a vital tool used in the development of new products and for the improvement of existing products. The failure analysis process relies on collecting failed components for subsequent examination of the cause or causes of failure using a wide array of methods, especially microscopy and spectroscopy. Nondestructive testing (NDT) methods are valuable because the failed products are unaffected by analysis, so inspection sometimes starts using these methods.

Condition monitoring is the process of monitoring a parameter of condition in machinery, in order to identify a significant change which is indicative of a developing fault. It is a major component of predictive maintenance. The use of condition monitoring allows maintenance to be scheduled, or other actions to be taken, to prevent consequential damage. Condition monitoring has a unique benefit in that conditions that would shorten normal lifespan can be addressed before they develop into a major failure. Condition monitoring techniques are normally used on rotating equipment, auxiliary systems and other machinery such as belt-driven equipment, while periodic inspection using non-destructive testing (NDT) techniques and fitness-for-service (FFS) evaluation are used for static plant equipment such as steam boilers, piping and heat exchangers.

<span class="mw-page-title-main">Predictive maintenance</span> Method to predict when equipment should be maintained

Predictive maintenance techniques are designed to help determine the condition of in-service equipment in order to estimate when maintenance should be performed. This approach promises cost savings over routine or time-based preventive maintenance, because tasks are performed only when warranted. Thus, it is regarded as condition-based maintenance carried out as suggested by estimations of the degradation state of an item.

In an electric power system, a fault or fault current is any abnormal electric current. For example, a short circuit is a fault in which a live wire touches a neutral or ground wire. An open-circuit fault occurs if a circuit is interrupted by a failure of a current-carrying wire or a blown fuse or circuit breaker. In three-phase systems, a fault may involve one or more phases and ground, or may occur only between phases. In a "ground fault" or "earth fault", current flows into the earth. The prospective short-circuit current of a predictable fault can be calculated for most situations. In power systems, protective devices can detect fault conditions and operate circuit breakers and other devices to limit the loss of service due to a failure.

In electrical engineering, electrical safety testing is essential to make sure electrical products and installations are safe. To meet this goal, governments and various technical bodies have developed electrical safety standards. All countries have their own electrical safety standards that must be complied with. To meet these standards, electrical products and installations must pass electrical safety tests.

Fault detection, isolation, and recovery (FDIR) is a subfield of control engineering which concerns itself with monitoring a system, identifying when a fault has occurred, and pinpointing the type of fault and its location. Two approaches can be distinguished: A direct pattern recognition of sensor readings that indicate a fault and an analysis of the discrepancy between the sensor readings and expected values, derived from some model. In the latter case, it is typical that a fault is said to be detected if the discrepancy or residual goes above a certain threshold. It is then the task of fault isolation to categorize the type of fault and its location in the machinery. Fault detection and isolation (FDI) techniques can be broadly classified into two categories. These include model-based FDI and signal processing based FDI.
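The residual-threshold idea in model-based FDI described above can be illustrated with a minimal sketch; the model, readings, and threshold below are invented for illustration, not drawn from any particular system.

```python
# Illustrative sketch of model-based fault detection: compare a sensor
# reading with the value a model predicts, and declare a fault when the
# residual exceeds a threshold. Values and threshold are made up.
def detect_fault(measured, expected, threshold=0.5):
    residual = abs(measured - expected)
    return residual > threshold

setpoint = 10.0  # expected value from a (trivial) model
print(detect_fault(10.2, setpoint))  # small residual: False (no fault)
print(detect_fault(13.7, setpoint))  # large residual: True (fault detected)
```

In a real system, the expected value would come from a dynamic model rather than a fixed setpoint, and the threshold would be chosen to balance missed detections against false alarms; fault isolation would then examine which residuals fired to localise the fault.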

ISO 26262, titled "Road vehicles – Functional safety", is an international standard for functional safety of electrical and/or electronic systems that are installed in serial production road vehicles, defined by the International Organization for Standardization (ISO) in 2011, and revised in 2018.

Spread-spectrum time-domain reflectometry (SSTDR) is a measurement technique to identify faults, usually in electrical wires, by observing reflected spread-spectrum signals. This type of time-domain reflectometry can be used in various high-noise and live environments. SSTDR systems also have the benefit of being able to precisely locate the position of the fault: SSTDR is accurate to within a few centimeters for wires carrying 400 Hz aircraft signals as well as MIL-STD-1553 data bus signals. An SSTDR system can be run on a live wire because the spread-spectrum signals can be isolated from the system noise and activity.

Noise-domain reflectometry is a type of reflectometry where the reflectometer exploits existing data signals on wiring and does not have to generate any signals itself. Noise-domain reflectometry, like time-domain and spread-spectrum time domain reflectometers, is most often used in identifying the location of wire faults in electrical lines.

No fault found (NFF), no trouble found (NTF) or no defect found (NDF) are terms used in the field of maintenance, where a unit is removed from service following a complaint of a perceived fault by operators or an alarm from its BIT equipment. The unit is then checked, but no anomaly is detected by the maintainer. Consequently, the unit is returned to service with no repair performed.

Reflectometry is a general term for the use of the reflection of waves or pulses at surfaces and interfaces to detect or characterize objects, sometimes to detect anomalies as in fault detection and medical diagnosis.

References

  1. Bakhshi, Roozbeh; Kunche, Surya; Pecht, Michael (2014-02-18). "Intermittent Failures in Hardware and Software". Journal of Electronic Packaging. 136 (1): 011014. doi:10.1115/1.4026639. ISSN 1043-7398.
  2. Qi, H.; Ganesan, S.; Pecht, M. (May 2008). "No-fault-found and Intermittent Failures in Electronic Products". Microelectronics Reliability. 48 (5): 663–674. doi:10.1016/j.microrel.2008.02.003.
  3. Example of an intermittent TV fault in a database: "Highlandelectrix PANASONI.TV". Archived from the original on 2009-04-13. Retrieved 2010-07-19: "Z3T CHASSIS - NO START UP - INTERMITTENT. D1124 (5.1V) ZENER LEAKY".
  4. Furse, Cynthia; Smith, Paul (December 2005). "Spread Spectrum Time Domain Reflectometry for Locating Intermittent Faults". IEEE Sensors Journal. 5 (6). Archived 2010-05-01 at archive.today.
  5. Khan, Samir; Phillips, Paul; Hockley, Chris; Jennions, Ian. "No Fault Found, Retest OK, Cannot Duplicate or Fault Not Found? – Towards a standardised taxonomy".
  6. GEN-2023-SOMM-I-001 (Automatic Wire Test Set (AWTS) Implementation) Final.pdf; MSG DTG 2000Z 2 MAY 2023 FROM COMMANDER, TAPO, JOINT BASE LANGLEY-EUSTIS, VA //AMSAM-SPT/