Troubleshooting

Troubleshooting is a form of problem solving, often applied to repair failed products or processes on a machine or a system. It is a logical, systematic search for the source of a problem in order to solve it and make the product or process operational again. Troubleshooting starts with identifying the symptoms. Determining the most likely cause is then a process of elimination, ruling out potential causes of the problem one by one. Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.

Diagnostics

In general, troubleshooting is the identification or diagnosis of "trouble" in the management flow of a system caused by a failure of some kind. The problem is initially described as symptoms of malfunction, and troubleshooting is the process of determining and remedying the causes of these symptoms.

A system can be described in terms of its expected, desired or intended behavior (usually, for artificial systems, its purpose). Events or inputs to the system are expected to generate specific results or outputs. (For example, selecting the "print" option from various computer applications is intended to result in a hardcopy emerging from some specific device.) Any unexpected or undesirable behavior is a symptom. Troubleshooting is the process of isolating the specific cause or causes of the symptom. Frequently the symptom is a failure of the product or process to produce any results. (Nothing was printed, for example.) Corrective action can then be taken to prevent further failures of a similar kind.

The methods of forensic engineering are useful in tracing problems in products or processes, and a wide range of analytical techniques is available to determine the cause or causes of specific failures. Corrective action can then be taken to prevent further failure of a similar kind. Preventive action is possible using failure mode and effects analysis (FMEA) and fault tree analysis (FTA) before full-scale production, and these methods can also be used for failure analysis.

There are two major elements required to enable a troubleshooting diagnosis to take place: a priori domain knowledge and search strategies. [1] These are interdependent, and together they distinguish two fundamentally different types of problem, with matching approaches to their diagnosis. Rasmussen [2] suggested that there is a strategy guided by the characteristics of the correct functioning of the device (topographic strategy), and a strategy guided by the characteristics of abnormal functioning (symptomatic strategy). The second is really asking “what’s wrong?”; the first is asking “what’s happening?”

A strategy is an organized set of activities expressing a plausible way of achieving a goal. Strategies should not be viewed as algorithms, inflexibly followed to solutions. Problem solvers behave opportunistically, adjusting activities within a strategy and changing strategies and tactics in response to information and ideas. [3]

A symptomatic strategy (also known as case-based reasoning, or shallow reasoning) requires a priori domain knowledge gleaned from past experience that has established connections between symptoms and causes. This knowledge is referred to as shallow, compiled, evidential, history-based or case-based knowledge. This is the strategy most associated with diagnosis by experts. Diagnosis of a problem transpires as a rapid recognition process in which symptoms evoke appropriate situation categories. [4] An expert knows the cause by virtue of having previously encountered similar cases. Case-based reasoning is the most powerful strategy, and the one used most commonly. However, the strategy won’t work on its own with truly novel problems, or where a deeper understanding of whatever is taking place is sought.

A topographic strategy falls into the category of deep reasoning. With deep reasoning, in-depth knowledge of a system is used. Topography in this context means a description or an analysis of a structured entity, showing the relations among its elements. [5] Also known as reasoning from first principles, [6] deep reasoning is applied to novel faults when experience-based approaches aren’t viable. The topographic strategy is therefore linked to a priori domain knowledge that is developed from a more fundamental understanding of a system, possibly using first-principles knowledge. Such knowledge is referred to as deep, causal or model-based knowledge. [7]

Hoc [8] noted that symptomatic approaches may need to be supported by topographic approaches because symptoms can be defined in diverse terms. The converse is also true – shallow reasoning can be used abductively to generate causal hypotheses, and deductively to evaluate those hypotheses, in a topographical search.

Aspects

Usually troubleshooting is applied to something that has suddenly stopped working, since its previously working state forms the expectations about its continued behavior. So the initial focus is often on recent changes to the system or to the environment in which it exists. (For example, a printer that "was working when it was plugged in over there".) However, there is a well-known principle that correlation does not imply causality. (For example, the failure of a device shortly after it has been plugged into a different outlet doesn't necessarily mean that the events were related. The failure could have been a matter of coincidence.) Therefore, troubleshooting demands critical thinking rather than magical thinking.

It is useful to consider the common experiences we have with light bulbs. Light bulbs "burn out" more or less at random; eventually the repeated heating and cooling of the filament, and fluctuations in the power supplied to it, cause the filament to crack or vaporize. The same principle applies to most other electronic devices, and similar principles apply to mechanical devices. Some failures are part of the normal wear-and-tear of components in a system.

The first basic principle in troubleshooting is to be able to reproduce the problem at will. The second basic principle is to reduce the "system" to its simplest form that still shows the problem. The third basic principle is to "know what you are looking for": in other words, to fully understand the way the system is supposed to work, so that you can "spot" the error when it happens.
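The second principle, reducing the system to the simplest form that still shows the problem, can be illustrated with a minimal sketch. It assumes the troubleshooter has a repeatable test, here the hypothetical callback `still_fails`, that reports whether the symptom appears for a given set of parts; the part names and the fault condition are invented for illustration.

```python
from typing import Callable, List, Sequence


def reduce_to_minimal(parts: Sequence[str],
                      still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily remove parts for as long as the symptom keeps reproducing.

    Assumes part names are unique and that still_fails() is repeatable,
    i.e. the problem can be reproduced at will.
    """
    current = list(parts)
    if not still_fails(current):
        raise ValueError("symptom does not reproduce on the full system")

    changed = True
    while changed:
        changed = False
        for part in list(current):
            candidate = [p for p in current if p != part]
            if still_fails(candidate):
                # The part was not needed to show the problem: leave it out.
                current = candidate
                changed = True
    return current


# Hypothetical fault: the symptom needs both the hub and the old driver.
def symptom(parts: List[str]) -> bool:
    return "usb_hub" in parts and "old_driver" in parts


full_system = ["printer", "usb_hub", "scanner", "old_driver", "webcam"]
minimal = reduce_to_minimal(full_system, symptom)
print(minimal)  # ['usb_hub', 'old_driver']
```

The result is only as trustworthy as the test: if the symptom is intermittent, a part may be removed because the problem happened not to show up during that one trial.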

A troubleshooter could check each component in a system one by one, substituting known good components for each potentially suspect one. However, this process of "serial substitution" can be considered degenerate when components are substituted without regard to a hypothesis concerning how their failure could result in the symptoms being diagnosed.

Simple and intermediate systems are characterized by lists or trees of dependencies among their components or subsystems. More complex systems contain cyclical dependencies or interactions (feedback loops). Such systems are less amenable to "bisection" troubleshooting techniques.

It also helps to start from a known good state, the best example being a computer reboot. A cognitive walkthrough is also a good thing to try. Comprehensive documentation produced by proficient technical writers is very helpful, especially if it provides a theory of operation for the subject device or system.

A common cause of problems is bad design, for example bad human factors design, where a device could be inserted backward or upside down due to the lack of an appropriate forcing function (behavior-shaping constraint), or a lack of error-tolerant design. This is especially bad if accompanied by habituation, where the user just doesn't notice the incorrect usage, for instance if two parts have different functions but share a common case so that it is not apparent on a casual inspection which part is being used.

Troubleshooting can also take the form of a systematic checklist, troubleshooting procedure, flowchart or table that is made before a problem occurs. Developing troubleshooting procedures in advance allows sufficient thought about the steps to take and about how to organize them into the most efficient process. Troubleshooting tables can be computerized to make them more efficient for users.

Some computerized troubleshooting services (such as Primefax, later renamed MaxServ) immediately show the top 10 solutions with the highest probability of fixing the underlying problem. The technician can either answer additional questions to advance through the troubleshooting procedure, each step narrowing the list of solutions, or immediately implement the solution they feel will fix the problem. These services give a rebate if the technician takes an additional step after the problem is solved: reporting back the solution that actually fixed the problem. The computer uses these reports to update its estimates of which solutions have the highest probability of fixing that particular set of symptoms. [9] [10]
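The feedback loop described above can be sketched in a few lines. The source does not say how such services computed their probabilities, so this sketch simply assumes relative frequencies of reported fixes per exact symptom set; the class name `SolutionRanker`, its methods, and the sample reports are all hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Tuple


class SolutionRanker:
    """Rank solutions by how often technicians reported that they worked.

    Assumption: the probability is estimated as a plain relative frequency
    over reports that share exactly the same set of symptoms.
    """

    def __init__(self) -> None:
        self._reports: Dict[FrozenSet[str], Counter] = {}

    def report_fix(self, symptoms: Iterable[str], solution: str) -> None:
        """Record that `solution` actually fixed this set of symptoms."""
        key = frozenset(symptoms)
        self._reports.setdefault(key, Counter())[solution] += 1

    def top_solutions(self, symptoms: Iterable[str],
                      n: int = 10) -> List[Tuple[str, float]]:
        """Return up to n (solution, estimated probability) pairs."""
        counts = self._reports.get(frozenset(symptoms), Counter())
        total = sum(counts.values())
        if total == 0:
            return []
        return [(sol, c / total) for sol, c in counts.most_common(n)]


ranker = SolutionRanker()
symptoms = ["no picture", "no sound"]
for _ in range(3):
    ranker.report_fix(symptoms, "replace power supply")
ranker.report_fix(symptoms, "reseat cable")

ranking = ranker.top_solutions(symptoms)
print(ranking)  # [('replace power supply', 0.75), ('reseat cable', 0.25)]
```

Each new report shifts the estimates, which is why such a service has an incentive to reward technicians for reporting the fix that actually worked.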

Half-splitting

Efficient methodical troubleshooting starts with a clear understanding of the expected behavior of the system and the symptoms being observed. From there the troubleshooter forms hypotheses on potential causes, and devises (or perhaps references a standardized checklist of) tests to eliminate these prospective causes. This approach is often called "divide and conquer".

Two strategies are commonly used by troubleshooters. The first is to check for frequently encountered or easily tested conditions first (for example, checking to ensure that a printer's light is on and that its cable is firmly seated at both ends). This is often referred to as "milking the front panel." [11]

The second is to "bisect" the system (for example, in a network printing system, checking to see if the job reached the server to determine whether a problem exists in the subsystems "towards" the user's end or "towards" the device).

This latter technique can be particularly efficient in systems with long chains of serialized dependencies or interactions among their components. It is simply the application of a binary search across the range of dependencies and is often referred to as "half-splitting". [12] It is similar to the game of "twenty questions": anyone can isolate one option out of a million by dividing the set of alternatives in half 20 times (because 2^10 = 1024 and 2^20 = 1,048,576).
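Half-splitting can be sketched as an ordinary binary search over a chain of stages. The sketch assumes a single fault, that a fault is known to exist somewhere in the chain, and that everything downstream of the faulty stage looks bad; the function `half_split`, the callback `ok_after`, and the stage names are hypothetical.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def half_split(stages: Sequence[T],
               ok_after: Callable[[int], bool]) -> Tuple[T, int]:
    """Locate the first faulty stage in a serial chain by binary search.

    ok_after(i) reports whether the signal is still good at the output
    of stage i. Returns the faulty stage and the number of checks made.
    """
    lo, hi = 0, len(stages) - 1   # the fault lies somewhere in [lo, hi]
    checks = 0
    while lo < hi:
        mid = (lo + hi) // 2
        checks += 1
        if ok_after(mid):
            lo = mid + 1          # good so far: the fault is downstream
        else:
            hi = mid              # already bad: the fault is here or upstream
    return stages[lo], checks


# Hypothetical network printing chain with a fault at the print server.
chain = ["application", "spooler", "network", "print server", "driver", "printer"]
faulty, checks = half_split(chain, lambda i: i < chain.index("print server"))
print(faulty, checks)  # print server 3

# One option out of a million takes at most 20 checks, since 2^20 > 1,000,000.
found, many_checks = half_split(range(1_000_000), lambda i: i < 765_432)
print(found, many_checks)
```

If the chain contains feedback loops or more than one fault, the "good upstream, bad downstream" assumption no longer holds and the search can converge on the wrong stage.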

Reproducing symptoms

One of the core principles of troubleshooting is that reproducible problems can be reliably isolated and resolved. Often considerable effort and emphasis in troubleshooting is placed on reproducibility, that is, on finding a procedure to reliably induce the symptom to occur.

Intermittent symptoms

Some of the most difficult troubleshooting issues relate to symptoms which occur intermittently. In electronics this often is the result of components that are thermally sensitive (since resistance of a circuit varies with the temperature of the conductors in it). Compressed air can be used to cool specific spots on a circuit board and a heat gun can be used to raise the temperatures; thus troubleshooting of electronics systems frequently entails applying these tools in order to reproduce a problem.

In computer programming, race conditions often lead to intermittent symptoms which are extremely difficult to reproduce. Various techniques can be used to force the particular function or module to be called more rapidly than it would be in normal operation (analogous to "heating up" a component in a hardware circuit), while other techniques can be used to introduce greater delays in, or force synchronization among, other modules or interacting processes.
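As a minimal sketch of forcing synchronization to make a race reproducible, the following uses a `threading.Barrier` to hold every thread between its read and its write, so that the lost update happens on every run instead of occasionally. The names `Shared`, `racy_increment`, `locked_increment` and `run` are invented for this illustration.

```python
import threading

N_THREADS = 4


class Shared:
    """A trivially small piece of shared state."""

    def __init__(self) -> None:
        self.value = 0


def racy_increment(shared: Shared, barrier: threading.Barrier) -> None:
    seen = shared.value        # read
    barrier.wait()             # forced synchronization: all threads have read
    shared.value = seen + 1    # write based on a stale read


def locked_increment(shared: Shared, lock: threading.Lock) -> None:
    with lock:                 # read-modify-write as one protected step
        shared.value += 1


def run(worker, helper) -> int:
    """Run N_THREADS copies of worker against fresh shared state."""
    shared = Shared()
    threads = [threading.Thread(target=worker, args=(shared, helper))
               for _ in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared.value


racy_total = run(racy_increment, threading.Barrier(N_THREADS))
locked_total = run(locked_increment, threading.Lock())
print(racy_total, locked_total)  # 1 4
```

Without the barrier the racy version would usually produce the correct total, which is exactly what makes such faults intermittent. Once the symptom can be produced on demand, a candidate fix such as the lock can be checked against the same harness.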

Intermittent issues can be thus defined:

An intermittent is a problem for which there is no known procedure to consistently reproduce its symptom.

Steven Litt [13]

In particular, he asserts that there is a distinction between the frequency of occurrence and a "known procedure to consistently reproduce" an issue. For example, knowing that an intermittent problem occurs "within" an hour of a particular stimulus or event, but that sometimes it happens in five minutes and other times it takes almost an hour, does not constitute a "known procedure", even if the stimulus does increase the frequency of observable exhibitions of the symptom.

Nevertheless, sometimes troubleshooters must resort to statistical methods, and can only find procedures to increase the symptom's occurrence to a point at which serial substitution or some other technique is feasible. In such cases, even when the symptom seems to disappear for significantly longer periods, there is low confidence that the root cause has been found and that the problem is truly solved.
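The low confidence mentioned above can be made concrete with a small calculation. Assume, purely for illustration, that before the attempted fix the symptom appeared independently in a fraction p of trials. If the fix did nothing, the chance of still seeing n clean trials in a row is (1 - p)^n, so pushing that chance below a threshold alpha requires n >= ln(alpha) / ln(1 - p). The function names and the 5% threshold are assumptions of this sketch, not part of any cited method.

```python
import math


def chance_of_clean_run(p_fail: float, n_trials: int) -> float:
    """Probability of n_trials symptom-free trials if nothing was fixed.

    Assumes each trial fails independently with probability p_fail.
    """
    return (1.0 - p_fail) ** n_trials


def trials_needed(p_fail: float, alpha: float = 0.05) -> int:
    """Smallest number of clean trials whose chance probability is <= alpha."""
    if not 0.0 < p_fail < 1.0 or not 0.0 < alpha < 1.0:
        raise ValueError("p_fail and alpha must lie strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1.0 - p_fail))


# A symptom that used to show up in 1 of every 10 trials:
print(round(chance_of_clean_run(0.1, 10), 3))  # 0.349
print(trials_needed(0.1))                      # 29
```

Ten clean trials after the change would therefore happen by luck roughly a third of the time, which is why a symptom that merely "seems to disappear" is weak evidence of a real fix.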

Also, tests may be run to stress certain components to determine if those components have failed. [14]

Multiple problems

Isolating single component failures that cause reproducible symptoms is relatively straightforward.

However, many problems only occur as a result of multiple failures or errors. This is particularly true of fault-tolerant systems, or those with built-in redundancy. Features that add redundancy, fault detection and failover to a system may also be subject to failure, and enough different component failures in any system will "take it down."

Even in simple systems, the troubleshooter must always consider the possibility that there is more than one fault. (Replacing each component, using serial substitution, and then swapping each new component back out for the old one when the symptom is found to persist, can fail to resolve such cases. More importantly, the replacement of any component with a defective one can actually increase the number of problems rather than eliminating them).
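A small simulation shows why swapping each old component back in can fail when two faults are present. It assumes a toy system that works only if every component is good and that every spare is known to be good; the function names and the component names are hypothetical.

```python
from typing import Dict, Tuple


def works(system: Dict[str, bool]) -> bool:
    """Toy model: the system works only if every component is good (True)."""
    return all(system.values())


def serial_substitution(installed: Dict[str, bool],
                        keep_replacements: bool) -> Tuple[bool, Dict[str, bool]]:
    """Replace components one at a time with known-good spares.

    With keep_replacements=False the old part is swapped back in whenever
    the symptom persists, as in the procedure described above.
    """
    system = dict(installed)
    for name in list(system):
        original = system[name]
        system[name] = True               # fit a known-good spare
        if works(system):
            return True, system
        if not keep_replacements:
            system[name] = original       # symptom persists: swap back
    return False, system


# Two independent faults: a bad cable and a bad controller.
installed = {"cable": False, "power supply": True, "controller": False}

fixed_swap_back, _ = serial_substitution(installed, keep_replacements=False)
fixed_keep, _ = serial_substitution(installed, keep_replacements=True)
print(fixed_swap_back, fixed_keep)  # False True
```

With swap-back, no single substitution ever clears the symptom, so each faulty part is put back and the search ends without a fix. Keeping the spares in place resolves the problem, at the cost of not knowing which of the replaced parts were actually bad.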

Note that, while we talk about "replacing components," the resolution of many problems involves adjustments or tuning rather than "replacement." For example, intermittent breaks in conductors, or "dirty or loose contacts," might simply need to be cleaned and/or tightened. All discussion of "replacement" should be taken to mean "replacement or adjustment or other modification."

References

  1. Venkatasubramanian, Venkat, Raghunathan Rengaswamy, and Surya N. Kavuri. "A review of process fault detection and diagnosis: Part II: Qualitative models and search strategies." Computers & chemical engineering 27.3 (2003): 313-326.
  2. Rasmussen, Jens. Information processing and human-machine interaction. An approach to cognitive engineering. North-Holland, 1987.
  3. Lesgold, Alan, and Susanne Lajoie. "Complex problem solving in electronics." Complex problem solving: Principles and mechanisms (1991): 287-316.
  4. Gilhooly, Kenneth J. "Cognitive psychology and medical diagnosis." Applied cognitive psychology 4.4 (1990): 261-272.
  5. American Heritage Dictionary.
  6. Davis, Randall. "Reasoning from first principles in electronic troubleshooting." International Journal of Man-Machine Studies 19.5 (1983): 403-423.
  7. Milne, Robert. "Strategies for diagnosis." IEEE transactions on systems, man, and cybernetics 17.3 (1987): 333-339.
  8. Hoc, Jean-Michel. "A method to describe human diagnostic strategies in relation to the design of human-machine cooperation." International Journal of Cognitive Ergonomics 4.4 (2000): 297-309.
  9. Persson, Nils Conrad. "Troubleshooting at your fingertips." Electronics Servicing and Technology (June 1982).
  10. Patton, Ron J., Paul M. Frank, and Robert N. Clark. "Issues of Fault Diagnosis for Dynamic Systems."
  11. "Hewlett Packard Bench Briefs" (PDF). Hewlett Packard. Retrieved 14 October 2011.
  12. Sullivan, Mike (Nov 15, 2000). "Secrets of a super geek: Use half splitting to solve difficult problems". TechRepublic. Archived from the original on 8 July 2012. Retrieved 22 October 2010.
  13. "December 98 Troubleshooting Professional Magazine: Intermittents". www.troubleshooters.com. Retrieved 2020-10-14.
  14. "How to Troubleshoot a Computer Problem – joyojc.com". www.joyojc.com. Archived from the original on 2013-02-24. Retrieved 9 April 2018.