An intermittent fault, often called simply an "intermittent" (or anecdotally an "interfailing"), is a malfunction of a device or system that occurs at intervals, usually irregular, while the device or system functions normally at other times. Intermittent faults are common to all branches of technology, including computer software. An intermittent fault is caused by several contributing factors, some of which may be effectively random, occurring simultaneously. The more complex the system or mechanism involved, the greater the likelihood of an intermittent fault.
Intermittent faults are not easily repeatable because of their complicated behavioral patterns. They are also sometimes referred to as “soft” failures, since they do not manifest themselves all the time and disappear in an unpredictable manner. In contrast, “hard” failures are permanent failures that develop over a period of time (or are sometimes instantaneous). They have a specific failure site (location of the failure), mode (how the failure manifests itself), and mechanism, and there is no unpredictable recovery for the failed system. Since intermittent faults are not easily repeatable, it is more difficult to conduct a failure analysis for them, understand their root causes, or isolate their failure site than it is for permanent failures. [1]
Intermittent failures can be a cause of no-fault-found (NFF) occurrences in electronic products and systems. NFF implies that a failure (fault) occurred, or was reported to have occurred, during a product’s use; the product was then analyzed or tested to confirm the failure, but no failure or fault could be found. A common example of the NFF phenomenon occurs when a computer “hangs”: clearly, a “failure” has occurred, yet if the computer is rebooted, it often works again. The impact of NFF and intermittent failures can be profound. Because of their characteristics, manufacturers may assume a cause rather than spend the time and cost to determine the root cause. For example, a hard drive supplier claimed NFFs were not failures and allowed all NFF products to be returned to the field. It was later determined that these products had a significantly higher return rate, suggesting that the NFF condition was actually a result of intermittent failures in the product. The result was increased maintenance costs, decreased equipment availability, increased customer inconvenience, reduced customer confidence, damaged company reputation, and in some cases potential safety hazards. [2]
A simple example of an effectively random cause in a physical system is a borderline electrical connection in the wiring or a component of a circuit, where (cause 1, the cause that must be identified and rectified) two conductors may touch subject to (cause 2, which need not be identified) a minor change in temperature, vibration, orientation, voltage, etc. (Sometimes this is described as an "intermittent connection" rather than a "fault".) In computer software, a program may (cause 1) fail to initialise a variable that is required to be initially zero; if the program is run in circumstances such that memory is almost always clear before it starts, it will malfunction on the rare occasions that (cause 2) the memory where the variable is stored happens to be non-zero beforehand.
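A minimal C sketch of the software case just described; the function, data, and values are illustrative, not drawn from any particular program:

```c
#include <stdio.h>

/* Hypothetical illustration of the uninitialized-variable fault
 * described above: `count` is assumed to start at zero (cause 1),
 * but is never initialized. Whether the bug manifests depends on
 * whatever value happens to occupy that memory (cause 2). */
static int tally_positive(const int *data, size_t n) {
    int count; /* BUG: should be `int count = 0;` */
    for (size_t i = 0; i < n; i++) {
        if (data[i] > 0)
            count++;
    }
    return count; /* correct only when the stack slot happened to be zero */
}

int main(void) {
    int data[] = {1, -2, 3};
    /* Often prints 2, but may print garbage depending on prior
     * stack contents -- an intermittent, hard-to-reproduce fault. */
    printf("%d\n", tally_positive(data, 3));
    return 0;
}
```

Whether the bug appears depends entirely on the prior contents of that memory, which is exactly what makes the fault intermittent and hard to reproduce.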
Intermittent faults are notoriously difficult to identify and repair ("troubleshoot") because each individual factor does not create the problem alone, so the factors can only be identified while the malfunction is actually occurring. The person capable of identifying and solving the problem is seldom the usual operator. Because the timing of the malfunction is unpredictable, and both device or system downtime and engineers' time incur cost, the fault is often simply tolerated if it is not too frequent, unless it causes unacceptable problems or dangers. For example, an intermittent fault in critical equipment can have severe consequences: in medical life-support equipment it could kill a patient, and in aeronautics it could cause a flight to be aborted or, in some cases, an aircraft to crash.
The following concepts and techniques are relevant to understanding, diagnosing, and resolving intermittent faults:
Digital electronics is a field of electronics involving the study of digital signals and the engineering of devices that use or produce them, in contrast to analog electronics, which works primarily with analog signals. Despite the name, digital electronics design includes important analog design considerations.
A time-domain reflectometer (TDR) is an electronic instrument used to determine the characteristics of electrical lines by observing reflected pulses.
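As a sketch of the underlying arithmetic: the distance to a reflection (such as an intermittent connection) is half the measured round-trip time multiplied by the propagation velocity of the line. The velocity factor and delay below are assumed example values:

```c
#include <stdio.h>

/* Sketch of the basic TDR distance calculation: a pulse travels to a
 * discontinuity and reflects back, so the one-way distance is half
 * the round-trip time times the propagation velocity.
 * All values below are illustrative assumptions. */
int main(void) {
    const double c = 299792458.0;        /* speed of light, m/s */
    const double velocity_factor = 0.66; /* typical for solid-PE coax (assumed) */
    const double round_trip_s = 120e-9;  /* measured reflection delay (example) */

    double distance_m = velocity_factor * c * round_trip_s / 2.0;
    printf("fault at approximately %.1f m\n", distance_m); /* ~11.9 m */
    return 0;
}
```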
In computing, overclocking is the practice of increasing the clock rate of a computer to exceed that certified by the manufacturer. Commonly, operating voltage is also increased to maintain a component's operational stability at accelerated speeds. Semiconductor devices operated at higher frequencies and voltages consume more power and generate more heat. An overclocked device may be unreliable or fail completely if the additional heat load is not removed or if power delivery components cannot meet increased power demands. Many device warranties state that overclocking or over-specification voids the warranty, but some manufacturers allow overclocking as long as it is done (relatively) safely.
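The heat problem follows from the standard first-order approximation for CMOS dynamic power, P ≈ C·V²·f: power grows linearly with frequency but with the square of voltage. A sketch with illustrative numbers:

```c
#include <stdio.h>

/* Rough CMOS dynamic-power estimate, P = C * V^2 * f, showing why a
 * modest overclock plus a voltage bump raises heat sharply.
 * The capacitance, voltage, and clock figures are assumed examples. */
int main(void) {
    const double c_eff = 1.0e-9; /* effective switched capacitance, F (assumed) */
    double stock_p = c_eff * 1.20 * 1.20 * 3.5e9; /* 1.20 V @ 3.5 GHz */
    double oc_p    = c_eff * 1.35 * 1.35 * 4.2e9; /* 1.35 V @ 4.2 GHz */
    printf("stock: %.2f W, overclocked: %.2f W (+%.0f%%)\n",
           stock_p, oc_p, 100.0 * (oc_p / stock_p - 1.0));
    return 0;
}
```

A roughly 20 percent clock increase with a matching voltage increase yields about 50 percent more dynamic power in this sketch, which is why cooling and power delivery become the limiting factors.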
A glitch is a short-lived technical fault, such as a transient one that corrects itself, making it difficult to troubleshoot. The term is particularly common in the computing and electronics industries, in circuit bending, as well as among players of video games. More generally, all types of systems including human organizations and nature experience glitches.
Common and special causes are the two distinct origins of variation in a process, as defined in the statistical thinking and methods of Walter A. Shewhart and W. Edwards Deming. Briefly, "common causes", also called natural patterns, are the usual, historical, quantifiable variation in a system, while "special causes" are unusual, not previously observed, non-quantifiable variation.
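A minimal sketch of the Shewhart-style three-sigma rule often used to separate the two: points within three standard deviations of the historical mean are treated as common-cause variation, while points outside are flagged as possible special causes. The process data here are made up:

```c
#include <stdio.h>
#include <math.h>

/* Minimal Shewhart-style check: points within mean +/- 3 sigma of a
 * historical baseline are treated as common-cause variation; points
 * outside are flagged as possible special causes. Data are made up. */
int main(void) {
    double baseline[] = {10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9};
    int n = 8;
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += baseline[i];
    mean /= n;
    for (int i = 0; i < n; i++) var += (baseline[i] - mean) * (baseline[i] - mean);
    double sigma = sqrt(var / (n - 1)); /* sample standard deviation */

    double new_points[] = {10.05, 11.2};
    for (int i = 0; i < 2; i++) {
        double z = (new_points[i] - mean) / sigma;
        printf("%.2f -> %s (z = %.1f)\n", new_points[i],
               fabs(z) > 3.0 ? "possible special cause"
                             : "common-cause variation", z);
    }
    return 0;
}
```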
Failure mode and effects analysis (FMEA) is the process of reviewing as many components, assemblies, and subsystems as possible to identify potential failure modes in a system and their causes and effects. For each component, the failure modes and their resulting effects on the rest of the system are recorded in a specific FMEA worksheet; there are numerous variations of such worksheets. An FMEA can be a qualitative analysis, but may be put on a quantitative basis when mathematical failure rate models are combined with a statistical failure mode ratio database. It was one of the first highly structured, systematic techniques for failure analysis, developed by reliability engineers in the late 1950s to study problems that might arise from malfunctions of military systems. An FMEA is often the first step of a system reliability study.
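One common way to put an FMEA on a semi-quantitative footing is the risk priority number (RPN), the product of severity, occurrence, and detection ratings for each failure mode. A sketch with hypothetical failure modes and ratings:

```c
#include <stdio.h>

/* Sketch of the common FMEA risk priority number (RPN): the product of
 * severity, occurrence, and detection ratings (each typically 1-10).
 * The failure modes and ratings below are hypothetical. */
struct failure_mode {
    const char *name;
    int severity;   /* 1 = negligible ... 10 = catastrophic */
    int occurrence; /* 1 = rare ... 10 = frequent */
    int detection;  /* 1 = certain to detect ... 10 = undetectable */
};

int main(void) {
    struct failure_mode modes[] = {
        {"solder joint crack (intermittent open)", 7, 4, 8},
        {"connector corrosion",                    5, 3, 6},
    };
    for (int i = 0; i < 2; i++) {
        int rpn = modes[i].severity * modes[i].occurrence * modes[i].detection;
        printf("%-42s RPN = %d\n", modes[i].name, rpn);
    }
    return 0;
}
```

Note that intermittent failure modes tend to score high on the detection rating, which is precisely what drives their RPN up.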
Troubleshooting is a form of problem solving, often applied to repair failed products or processes on a machine or a system. It is a logical, systematic search for the source of a problem in order to solve it, and make the product or process operational again. Troubleshooting is needed to identify the symptoms. Determining the most likely cause is a process of elimination—eliminating potential causes of a problem. Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.
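One common process-of-elimination strategy, sometimes called half-splitting, tests the midpoint of a signal chain and discards the half shown to be working. A sketch in C, where signal_ok_through() stands in for a real measurement such as probing a test point:

```c
#include <stdio.h>

/* Sketch of "half-splitting", a process-of-elimination strategy:
 * repeatedly test the midpoint of a chain of stages and discard the
 * half shown to be good. signal_ok_through() is a stand-in for a
 * real measurement. */
static int first_bad_stage = 5; /* hypothetical: stage 5 is faulty */

static int signal_ok_through(int stage) {
    return stage < first_bad_stage; /* signal is good before the fault */
}

int main(void) {
    int lo = 0, hi = 9; /* stages 0..9; fault known to lie in this range */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (signal_ok_through(mid))
            lo = mid + 1; /* fault is downstream of mid */
        else
            hi = mid;     /* fault is at or upstream of mid */
    }
    printf("fault isolated to stage %d\n", lo); /* prints 5 */
    return 0;
}
```

For an intermittent fault, the catch is that each such test is only conclusive while the malfunction is actually occurring.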
Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to function without failure. Reliability is defined as the probability that a product, system, or service will perform its intended function adequately for a specified period of time, or will operate in a defined environment without failure. Reliability is closely related to availability, which is typically described as the ability of a component or system to function at a specified moment or interval of time.
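Two textbook quantities make these definitions concrete: under the common constant-failure-rate assumption, reliability over a mission time t is R(t) = e^(−λt), and steady-state availability is MTBF/(MTBF + MTTR). The figures below are illustrative:

```c
#include <stdio.h>
#include <math.h>

/* Two textbook reliability quantities, under the common
 * constant-failure-rate assumption; all figures are illustrative.
 *   R(t) = exp(-lambda * t)       reliability over a mission time
 *   A    = MTBF / (MTBF + MTTR)   steady-state availability */
int main(void) {
    double mtbf_h = 50000.0;   /* mean time between failures, hours */
    double mttr_h = 8.0;       /* mean time to repair, hours */
    double mission_h = 8760.0; /* one year of continuous operation */

    double lambda = 1.0 / mtbf_h;
    double reliability = exp(-lambda * mission_h);
    double availability = mtbf_h / (mtbf_h + mttr_h);

    printf("R(1 year) = %.3f, availability = %.5f\n",
           reliability, availability);
    return 0;
}
```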
Fault tolerance is the ability of a system to maintain proper operation despite failures or faults in one or more of its components. This capability is essential for high-availability, mission-critical, or even life-critical systems.
Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by IBM as a term to describe the robustness of their mainframe computers.
Failure analysis is the process of collecting and analyzing data to determine the cause of a failure, often with the goal of determining corrective actions or liability. According to Bloch and Geitner, "machinery failures reveal a reaction chain of cause and effect… usually a deficiency commonly referred to as the symptom…". Failure analysis can save money, lives, and resources if done correctly and acted upon. It is an important discipline in many branches of manufacturing industry, such as the electronics industry, where it is a vital tool used in the development of new products and for the improvement of existing products. The failure analysis process relies on collecting failed components for subsequent examination of the cause or causes of failure using a wide array of methods, especially microscopy and spectroscopy. Nondestructive testing (NDT) methods are valuable because the failed products are unaffected by analysis, so inspection sometimes starts using these methods.
Condition monitoring is the process of monitoring a parameter of condition in machinery in order to identify a significant change that is indicative of a developing fault. It is a major component of predictive maintenance. The use of condition monitoring allows maintenance to be scheduled, or other actions to be taken, to prevent consequential damage. Condition monitoring has a unique benefit in that conditions that would shorten normal lifespan can be addressed before they develop into a major failure. Condition monitoring techniques are normally used on rotating equipment, auxiliary systems, and other machinery like belt-driven equipment, while periodic inspection using non-destructive testing (NDT) techniques and fitness-for-service (FFS) evaluation are used for static plant equipment such as steam boilers, piping, and heat exchangers.
In engineering, a fault is a defect or problem in a system that causes it to fail or act abnormally.
The term downtime is used to refer to periods when a system is unavailable. The unavailability is the proportion of a time-span that a system is unavailable or offline. This is usually a result of the system failing to function because of an unplanned event, or because of routine maintenance.
In computing, a hang or freeze occurs when either a process or system ceases to respond to inputs. A typical example is when a computer's graphical user interface no longer responds to the user typing on the keyboard or moving the mouse. The term covers a wide range of behaviors in both clients and servers, and is not limited to graphical user interface issues.
Memory testers are specialized test equipment used to test and verify memory modules.
In engineering, debugging is the process of finding the root cause of a bug and developing workarounds and possible fixes for it.
ISO 26262, titled "Road vehicles – Functional safety", is an international standard for functional safety of electrical and/or electronic systems that are installed in serial production road vehicles, defined by the International Organization for Standardization (ISO) in 2011, and revised in 2018.
High-performance computing applications running on massively parallel supercomputers consist of concurrent programs designed using multi-threaded, multi-process models. The applications may consist of various constructs with varying degrees of parallelism. Although high-performance concurrent programs use similar design patterns, models, and principles as sequential programs, unlike sequential programs they typically demonstrate non-deterministic behavior. The probability of bugs increases with the number of interactions between the various parallel constructs. Race conditions, data races, deadlocks, missed signals, and livelock are common error types.
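A data race is a canonical source of intermittent, non-deterministic failure in concurrent code. In this minimal pthreads sketch, two threads increment a shared counter without synchronization, so some increments are lost, and how many varies from run to run:

```c
#include <pthread.h>
#include <stdio.h>

/* Minimal data-race demonstration: two threads increment a shared
 * counter without synchronization, so increments are intermittently
 * lost. The final total is usually below 2,000,000 and varies run to
 * run -- a non-deterministic, intermittent fault. */
static long counter = 0; /* BUG: unsynchronized shared state */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++; /* unprotected read-modify-write */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("expected 2000000, got %ld\n", counter);
    return 0;
}
```

The fix is to protect the shared counter with a mutex or use an atomic type; the point of the sketch is that the unprotected version often passes testing and fails only occasionally in the field.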
No fault found (NFF), no trouble found (NTF), or no defect found (NDF) are terms used in the field of maintenance, where a unit is removed from service following a complaint of a perceived fault by operators or an alarm from its built-in test (BIT) equipment. The unit is then checked, but no anomaly is detected by the maintainer. Consequently, the unit is returned to service with no repair performed.