Intermittent fault

Last updated June 05, 2024

An intermittent fault, often called simply an "intermittent"^[citation needed] (or anecdotally "interfailing"^[citation needed]), is a malfunction that occurs at intervals, usually irregular, in a device or system that functions normally at other times. Intermittent faults are common to all branches of technology, including computer software. An intermittent fault is caused by several contributing factors occurring simultaneously, some of which may be effectively random. The more complex the system or mechanism involved, the greater the likelihood of an intermittent fault.

Intermittent faults are not easily repeatable because of their complicated behavioral patterns. These are also sometimes referred to as “soft” failures, since they do not manifest themselves all the time and disappear in an unpredictable manner. In contrast, “hard” failures are permanent failures that occur over a period of time (or are sometimes instantaneous). They have a specific failure site (location of failure), mode (how the failure manifests itself), and mechanism, and there is no unpredictable recovery for the failed system. Since intermittent faults are not easily repeatable, it is more difficult to conduct a failure analysis for them, understand their root causes, or isolate their failure site than it is for permanent failures.^[1]

Intermittent failures can be a cause of no-fault-found (NFF) occurrences in electronic products and systems. NFF means that a failure (fault) occurred, or was reported to have occurred, during a product’s use, and the product was analyzed or tested to confirm the failure, but no failure or fault could be found. A common example of the NFF phenomenon is a computer that “hangs”: clearly a failure has occurred, yet the computer often works again after being rebooted. The impact of NFF and intermittent failures can be profound. Because such failures are hard to reproduce, manufacturers may assume a cause rather than spend the time and money needed to determine the root cause. For example, a hard drive supplier claimed NFFs were not failures and allowed all NFF products to be returned to the field. Later it was determined that these products had a significantly higher return rate, suggesting that the NFF condition was actually a result of intermittent failures in the product. The result was increased maintenance costs, decreased equipment availability, increased customer inconvenience, reduced customer confidence, damaged company reputation, and in some cases potential safety hazards.^[2]

A simple example of an effectively random cause in a physical system is a borderline electrical connection in the wiring or a component of a circuit, where (cause 1, the cause that must be identified and rectified) two conductors may touch subject to (cause 2, which need not be identified) a minor change in temperature, vibration, orientation, voltage, etc. (Sometimes this is described as an "intermittent connection" rather than "fault".) In computer software a program may (cause 1) fail to initialise a variable which is required to be initially zero; if the program is run in circumstances such that memory is almost always clear before it starts, it will malfunction on the rare occasions that (cause 2) the memory where the variable is stored happens to be non-zero beforehand.
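The software example above can be imitated with a small simulation. This is only a sketch: Python has no uninitialised variables, so a one-element list stands in for the memory cell, and the names `run_program`, `trial` and `dirty_probability`, the 1% figure and the fixed random seed are all illustrative assumptions rather than anything taken from a real program.

```python
import random

def run_program(memory, slot):
    """Hypothetical buggy program: it uses memory[slot] as a counter
    but never sets it to zero first (cause 1)."""
    # BUG: a correct program would do `memory[slot] = 0` here.
    for _ in range(3):
        memory[slot] += 1
    return memory[slot]  # the correct answer is 3

def trial(rng, dirty_probability):
    """Run the program once on a memory cell that is usually clear."""
    memory = [0]
    if rng.random() < dirty_probability:
        # Cause 2: stale, non-zero data happens to be left in the cell.
        memory[0] = rng.randint(1, 255)
    return run_program(memory, 0) == 3

rng = random.Random(0)  # fixed seed so the run is repeatable
results = [trial(rng, dirty_probability=0.01) for _ in range(10_000)]
failures = results.count(False)
print(f"{failures} failures in {len(results)} runs")
```

Most runs succeed, so testing the program a handful of times would probably not reveal the bug; only the rare coincidence of both causes produces a wrong result.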

Intermittent faults are notoriously difficult to identify and repair ("troubleshoot") because no individual factor creates the problem alone, so the factors can only be identified while the malfunction is actually occurring. The person capable of identifying and solving the problem is seldom the usual operator. Because the timing of the malfunction is unpredictable, and both device or system downtime and engineers' time incur cost, the fault is often simply tolerated if it is not too frequent, unless it causes unacceptable problems or dangers. For example, an intermittent fault in critical equipment such as medical life support could kill a patient, and one in an aircraft could cause a flight to be aborted or, in some cases, a crash.

Troubleshooting techniques

Some techniques to resolve intermittent faults are:

Automatic logging of relevant parameters over a long enough time for the fault to manifest can help; parameter values at the time of the fault may identify the cause so that appropriate remedial action can be taken.
Changing operating circumstances while the fault is present to see if the fault temporarily clears or changes, for example by tapping components, cooling them with freezer spray, or heating them. Striking the cabinet may temporarily clear the fault.
Consulting a database of similar faults that have been resolved in identical or similar equipment.^[3]
Making precautionary changes without attempting to pinpoint the fault. For example, electrolytic capacitors subject to high ripple currents can be replaced as a routine measure, without troubleshooting the fault at all, and connectors can be disconnected and reseated. This is sometimes a measure of desperation: things are changed until the fault stops happening, in the hope that it has actually been resolved rather than merely become dormant.
In electrical systems and cable systems, time-domain reflectometry techniques can be used: pulses are sent down electric wiring and the reflected pulses are examined for anomalies, for example intermittent leakage during the stresses of aircraft operation. This can only be done for one test channel at a time and is generally limited to intermittent faults lasting longer than 100 milliseconds.^[4]
In complex, multiple-channel systems where the fault or faults might be in an interconnection, the ideal method of finding an intermittent fault is to monitor, detect and isolate all channels or electrical paths continuously and simultaneously. This gives the system under test continuous and complete test coverage while any environmental stressing of the system is performed. This type of testing cannot be performed by scanning test technology; it needs some form of electronic neural network that can perform these tests without any scanning or digital averaging. This testing regime is covered by the DoD's MIL-PRF-32516, published in March 2015, which calls for testing technology to operate in the Class 1 category in order to combat intermittent faults effectively.^[5]
Three main methodologies to mitigate intermittent behavior in integrated circuits are dynamic instruction delaying, core frequency scaling, and thread migration. When the processor takes longer than expected to execute a process, delays and timing violations occur. This fault may be avoided by using techniques such as dynamic instruction delaying, a type of algorithm that calculates scheduling priorities while the system is executing. The objective is to respond dynamically to changing conditions and form a self-sustained, optimized configuration. Another approach to mitigating delay is core frequency scaling, which scales the CPU down to a lower frequency when less performance is needed and up to a higher frequency when more is needed. Thread migration is another technique used to overcome intermittent failure. A thread is an ordered set of instructions that tells a computer exactly what to do. When a specific thread encounters failures, the content of the thread within the faulty core is transferred to another thread within an idle core, where the problem is addressed and solved.^[1]
Automatic testing using Certification Test Protocols (CTP) is effective at detecting precursors to intermittent-event failure modes in Electrical Wiring Interconnect Systems (EWIS). CTP uses a circuit analyzer to apply multiple current stimuli to EWIS companion wiring and compares the results for anomalous measurements. CTP does not require flight emulation, shaking or vibration, or physical movement to be successful, which makes it less costly than the methods posed in MIL-PRF-32516.^[citation needed]^[6]
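The automatic-logging technique in the list above can be sketched as a small ring buffer that keeps only the most recent samples and saves a copy whenever a fault is flagged, so that the parameter values leading up to the fault are preserved. This is a minimal illustration, not a description of any particular logging product: the `FaultRecorder` class, the sample readings and the 4.75 V fault threshold are all hypothetical.

```python
from collections import deque

class FaultRecorder:
    """Keep the most recent samples and save a copy when a fault is seen."""

    def __init__(self, history=5):
        # A deque with maxlen drops the oldest sample automatically.
        self.recent = deque(maxlen=history)
        self.snapshots = []

    def record(self, sample, fault):
        self.recent.append(sample)
        if fault:
            # Preserve the run-up to the fault for later analysis.
            self.snapshots.append(list(self.recent))

# Hypothetical readings: (time in s, temperature in deg C, supply voltage in V)
readings = [
    (0, 41.0, 5.02), (1, 41.5, 5.01), (2, 43.0, 5.00),
    (3, 47.5, 4.97), (4, 52.0, 4.71), (5, 46.0, 4.99),
]

recorder = FaultRecorder(history=3)
for t, temp, volts in readings:
    sample = {"t": t, "temp_c": temp, "volts": volts}
    # Assumed fault condition: supply voltage sags below 4.75 V.
    recorder.record(sample, fault=volts < 4.75)

for snapshot in recorder.snapshots:
    print(snapshot)
```

In this made-up trace the single snapshot shows the temperature rising in the samples just before the voltage sag, which is the kind of correlation such logging is meant to expose.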

Related Research Articles

Time-domain reflectometer: Electronic instrument

A time-domain reflectometer (TDR) is an electronic instrument used to determine the characteristics of electrical lines by observing reflected pulses.
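As a rough illustration of how a reflected pulse locates a fault, the distance to the reflecting point follows from the round-trip time and the propagation speed in the cable, which is the speed of light multiplied by the cable's velocity factor. The function name `distance_to_reflection` and the example values (a 100 ns round trip and a velocity factor of 0.66) are assumptions chosen for the sketch.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def distance_to_reflection(round_trip_s, velocity_factor):
    """Distance from the instrument to the point that reflected the pulse.

    The pulse travels to the reflection point and back, hence the
    division by two.
    """
    if not 0.0 < velocity_factor <= 1.0:
        raise ValueError("velocity factor must be in (0, 1]")
    return velocity_factor * SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0

# Assumed example: reflection seen 100 ns after launch, velocity factor 0.66.
print(f"{distance_to_reflection(100e-9, 0.66):.2f} m")  # prints "9.89 m"
```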

Overclocking: Practice of increasing the clock rate of a computer to exceed that certified by the manufacturer

In computing, overclocking is the practice of increasing the clock rate of a computer to exceed that certified by the manufacturer. Commonly, operating voltage is also increased to maintain a component's operational stability at accelerated speeds. Semiconductor devices operated at higher frequencies and voltages increase power consumption and heat. An overclocked device may be unreliable or fail completely if the additional heat load is not removed or power delivery components cannot meet increased power demands. Many device warranties state that overclocking or over-specification voids any warranty, but some manufacturers allow overclocking as long as it is done (relatively) safely.

A power outage is the loss of the electrical power network supply to an end user.

A glitch is a short-lived fault in a system, such as a transient fault that corrects itself, making it difficult to troubleshoot. The term is particularly common in the computing and electronics industries, in circuit bending, as well as among players of video games. More generally, all types of systems including human organizations and nature experience glitches.

Common and special causes are the two distinct origins of variation in a process, as defined in the statistical thinking and methods of Walter A. Shewhart and W. Edwards Deming. Briefly, "common causes", also called natural patterns, are the usual, historical, quantifiable variation in a system, while "special causes" are unusual, not previously observed, non-quantifiable variation.

A residual-current device (RCD), residual-current circuit breaker (RCCB) or ground fault circuit interrupter (GFCI) is an electrical safety device that interrupts an electrical circuit when the current passing through a conductor is not equal and opposite in both directions, therefore indicating an improper flow of current such as leakage current to ground or current flowing to another powered conductor. The device's purpose is to reduce the severity of injury caused by an electric shock. Injury from shock is limited to the time before the electrical circuit is interrupted, but the victim may also sustain further injury, e.g. by falling after receiving a shock. This type of circuit interrupter cannot distinguish current flowing through the power-carrying conductors that passes through a person from current that passes through electrical equipment, and it offers no protection when a person touches both conductors at the same time.

An arc-fault circuit interrupter (AFCI) or arc-fault detection device (AFDD) is a circuit breaker that breaks the circuit when it detects the electric arcs that are a signature of loose connections in home wiring. Loose connections, which can develop over time, can sometimes become hot enough to ignite house fires. An AFCI selectively distinguishes between a harmless arc, and a potentially dangerous arc.

Failure mode and effects analysis: Analysis of potential system failures

Failure mode and effects analysis is the process of reviewing as many components, assemblies, and subsystems as possible to identify potential failure modes in a system and their causes and effects. For each component, the failure modes and their resulting effects on the rest of the system are recorded in a specific FMEA worksheet. There are numerous variations of such worksheets. An FMEA can be a qualitative analysis, but may be put on a quantitative basis when mathematical failure rate models are combined with a statistical failure mode ratio database. It was one of the first highly structured, systematic techniques for failure analysis. It was developed by reliability engineers in the late 1950s to study problems that might arise from malfunctions of military systems. An FMEA is often the first step of a system reliability study.

Troubleshooting is a form of problem solving, often applied to repair failed products or processes on a machine or a system. It is a logical, systematic search for the source of a problem in order to solve it, and make the product or process operational again. Troubleshooting is needed to identify the symptoms. Determining the most likely cause is a process of elimination—eliminating potential causes of a problem. Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.

Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to function without failure. Reliability describes the ability of a system or component to function under stated conditions for a specified period. Reliability is closely related to availability, which is typically described as the ability of a component or system to function at a specified moment or interval of time.

Failure analysis is the process of collecting and analyzing data to determine the cause of a failure, often with the goal of determining corrective actions or liability. According to Bloch and Geitner, “machinery failures reveal a reaction chain of cause and effect… usually a deficiency commonly referred to as the symptom…”. Failure analysis can save money, lives, and resources if done correctly and acted upon. It is an important discipline in many branches of manufacturing industry, such as the electronics industry, where it is a vital tool used in the development of new products and for the improvement of existing products. The failure analysis process relies on collecting failed components for subsequent examination of the cause or causes of failure using a wide array of methods, especially microscopy and spectroscopy. Nondestructive testing (NDT) methods are valuable because the failed products are unaffected by analysis, so inspection sometimes starts using these methods.

Condition monitoring is the process of monitoring a parameter of condition in machinery in order to identify a significant change that indicates a developing fault. It is a major component of predictive maintenance. The use of condition monitoring allows maintenance to be scheduled, or other actions to be taken, to prevent consequential damage. Condition monitoring has a unique benefit in that conditions that would shorten normal lifespan can be addressed before they develop into a major failure. Condition monitoring techniques are normally used on rotating equipment, auxiliary systems and other machinery such as belt-driven equipment, while periodic inspection using non-destructive testing (NDT) techniques and fit for service (FFS) evaluation are used for static plant equipment such as steam boilers, piping and heat exchangers.

In an electric power system, a fault or fault current is any abnormal electric current. For example, a short circuit is a fault in which a live wire touches a neutral or ground wire. An open-circuit fault occurs if a circuit is interrupted by a failure of a current-carrying wire or a blown fuse or circuit breaker. In three-phase systems, a fault may involve one or more phases and ground, or may occur only between phases. In a "ground fault" or "earth fault", current flows into the earth. The prospective short-circuit current of a predictable fault can be calculated for most situations. In power systems, protective devices can detect fault conditions and operate circuit breakers and other devices to limit the loss of service due to a failure.

In electrical engineering, electrical safety testing is essential to make sure electrical products and installations are safe. To meet this goal, governments and various technical bodies have developed electrical safety standards. All countries have their own electrical safety standards that must be complied with. To meet these standards, electrical products and installations must pass electrical safety tests.

Fault detection, isolation, and recovery (FDIR) is a subfield of control engineering which concerns itself with monitoring a system, identifying when a fault has occurred, and pinpointing the type of fault and its location. Two approaches can be distinguished: A direct pattern recognition of sensor readings that indicate a fault and an analysis of the discrepancy between the sensor readings and expected values, derived from some model. In the latter case, it is typical that a fault is said to be detected if the discrepancy or residual goes above a certain threshold. It is then the task of fault isolation to categorize the type of fault and its location in the machinery. Fault detection and isolation (FDI) techniques can be broadly classified into two categories. These include model-based FDI and signal processing based FDI.
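The residual-threshold idea described above can be shown with a few lines of code. This is a sketch under simple assumptions: the model prediction is a constant, the sensor trace is invented, and the function name `detect_faults` and the threshold of 1.0 are illustrative only.

```python
def detect_faults(measured, expected, threshold):
    """Return the indices where |measured - expected| exceeds the threshold."""
    if len(measured) != len(expected):
        raise ValueError("measured and expected must have the same length")
    return [
        i
        for i, (m, e) in enumerate(zip(measured, expected))
        if abs(m - e) > threshold  # residual above threshold: fault detected
    ]

# Hypothetical sensor trace and the values a model predicts for it.
measured = [10.1, 9.9, 10.0, 12.6, 10.2, 7.1, 10.0]
expected = [10.0] * len(measured)

print(detect_faults(measured, expected, threshold=1.0))  # prints [3, 5]
```

Detection of this kind only says when the readings depart from the model; deciding what type of fault occurred and where remains the task of fault isolation.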

ISO 26262, titled "Road vehicles – Functional safety", is an international standard for functional safety of electrical and/or electronic systems that are installed in serial production road vehicles, defined by the International Organization for Standardization (ISO) in 2011, and revised in 2018.

Spread-spectrum time-domain reflectometry (SSTDR) is a measurement technique to identify faults, usually in electrical wires, by observing reflected spread spectrum signals. This type of time-domain reflectometry can be used in various high-noise and live environments. SSTDR systems also have the benefit of being able to locate the position of the fault precisely. Specifically, SSTDR is accurate to within a few centimeters for wires carrying 400 Hz aircraft signals as well as MIL-STD-1553 data bus signals. An SSTDR system can be run on a live wire because the spread spectrum signals can be isolated from the system noise and activity.

Noise-domain reflectometry is a type of reflectometry where the reflectometer exploits existing data signals on wiring and does not have to generate any signals itself. Noise-domain reflectometry, like time-domain and spread-spectrum time domain reflectometers, is most often used in identifying the location of wire faults in electrical lines.

An arc fault is a high power discharge of electricity between two or more conductors. This discharge generates heat, which can break down the wire's insulation and trigger an electrical fire. Arc faults can range in current from a few amps up to thousands of amps, and are highly variable in strength and duration.

No fault found (NFF), no trouble found (NTF) or no defect found (NDF) are terms used in the field of maintenance, where a unit is removed from service following a complaint of a perceived fault by operators or an alarm from its BIT equipment. The unit is then checked, but no anomaly is detected by the maintainer. Consequently, the unit is returned to service with no repair performed.

References

[1] Bakhshi, Roozbeh; Kunche, Surya; Pecht, Michael (2014-02-18). "Intermittent Failures in Hardware and Software". Journal of Electronic Packaging. 136 (1): 011014. doi:10.1115/1.4026639. ISSN 1043-7398.
[2] Qi, H.; Ganesan, S.; Pecht, M. (May 2008). "No-fault-found and Intermittent Failures in Electronic Products". Microelectronics Reliability. 48 (5): 663–674. doi:10.1016/j.microrel.2008.02.003.
[3] Example of an intermittent TV fault in a database: "Highlandelectrix PANASONI.TV". Archived from the original on 2009-04-13. Retrieved 2010-07-19. "Z3T CHASSIS - NO START UP - INTERMITTENT. D1124 (5.1V) ZENER LEAKY"
[4] Furse, Cynthia; Smith, Paul (December 2005). "Spread Spectrum Time Domain Reflectometry for Locating Intermittent Faults". IEEE Sensors Journal. 5 (6). Archived 2010-05-01 at archive.today.
[5] Khan, Samir; Phillips, Paul; Hockley, Chris; Jennions, Ian. "No Fault Found, Retest OK, Cannot Duplicate or Fault Not Found? - Towards a standardised taxonomy".
[6] MSG DTG 2000Z 2 MAY 2023 FROM COMMANDER, TAPO, JOINT BASE LANGLEY-EUSTIS, VA //AMSAM-SPT/

External links

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
