Troubleshooting is a form of
problem solving
, often applied to repair failed products or processes on a machine or a system. It is a logical, systematic search for the source of a problem in order to solve it, and make the product or process operational again. Troubleshooting is needed to identify the symptoms. Determining the most likely cause is a
process of elimination
—eliminating potential causes of a problem. Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.
In general, troubleshooting is the identification or
diagnosis
of "trouble" in the management flow of a system caused by a failure of some kind. The problem is initially described as symptoms of malfunction, and troubleshooting is the process of determining and remedying the causes of these symptoms.
A system can be described in terms of its expected, desired or intended behavior (usually, for artificial systems, its purpose). Events or inputs to the system are expected to generate specific results or outputs. (For example, selecting the "print" option from various computer applications is intended to result in a
hardcopy
emerging from some specific device). Any unexpected or undesirable behavior is a symptom. Troubleshooting is the process of isolating the specific cause or causes of the symptom. Frequently the symptom is a failure of the product or process to produce any results. (Nothing was printed, for example). Corrective action can then be taken to prevent further failures of a similar kind.
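The notion of a symptom as a gap between expected and observed behavior can be sketched in a few lines of Python. This is a minimal illustration only; the function name `find_symptoms` and the sample inputs are hypothetical and do not come from any particular tool.

```python
def find_symptoms(expected, observed):
    """Return the inputs whose observed output differs from the expected one.

    Both arguments map an input (event) to an output (result). A missing
    entry in `observed` models a system that produced no result at all.
    """
    symptoms = {}
    for event, want in expected.items():
        got = observed.get(event)  # None means "nothing happened"
        if got != want:
            symptoms[event] = {"expected": want, "observed": got}
    return symptoms


# Hypothetical example: "print" should yield a hardcopy but yields nothing.
expected = {"print": "hardcopy", "save": "file written"}
observed = {"save": "file written"}
symptoms = find_symptoms(expected, observed)
```

Here `symptoms` contains only the "print" entry, which is the starting point for isolating a cause.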
The methods of
forensic engineering
are useful in tracing problems in products or processes, and a wide range of analytical techniques are available to determine the cause or causes of specific
failure
s. Corrective action can then be taken to prevent further failure of a similar kind. Preventive action is possible using
failure mode and effects analysis (FMEA) and
fault tree analysis (FTA) before full-scale production, and these methods can also be used for
failure analysis
.
Aspects
Usually troubleshooting is applied to something that has suddenly stopped working, since its previously working state forms the expectations about its continued behavior. So the initial focus is often on recent changes to the system or to the environment in which it exists. (For example, a printer that "was working when it was plugged in over there"). However, there is a well-known principle that
correlation
does not imply
causality
. (For example, the failure of a device shortly after it has been plugged into a different outlet doesn't necessarily mean that the events were related. The failure could have been a matter of
coincidence
.) Therefore, troubleshooting demands
critical thinking rather than
magical thinking
.
It is useful to consider the common experiences we have with light bulbs. Light bulbs "burn out" more or less at random; eventually the repeated heating and cooling of the filament, and fluctuations in the power supplied to it, cause the filament to crack or vaporize. The same principle applies to most other electronic devices, and similar principles apply to mechanical devices. Some failures are part of the normal wear and tear of components in a system.
The first basic principle in troubleshooting is to be able to reproduce the problem at will.
The second basic principle is to reduce the "system" to its simplest form that still shows the problem.
The third basic principle is to "know what you are looking for": in other words, to understand fully how the system is supposed to work, so that you can spot the error when it happens.
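The second principle, reducing the system to the simplest form that still shows the problem, can be sketched as a greedy loop. This is an illustrative assumption about one way to do it, not a standard procedure; `minimize`, `still_fails` and the component names are hypothetical, and the sketch assumes component names are unique and that the problem can be reproduced at will (the first principle).

```python
def minimize(components, still_fails):
    """Greedily drop components while the problem remains reproducible.

    `still_fails(subset)` is a caller-supplied check that returns True
    when the reduced system still shows the symptom.
    """
    current = list(components)
    changed = True
    while changed:
        changed = False
        for item in list(current):
            candidate = [c for c in current if c != item]
            if still_fails(candidate):
                current = candidate  # item is not needed to reproduce
                changed = True
    return current


# Hypothetical system: the symptom appears whenever both "usb_hub" and
# "driver_v2" are present, regardless of the other parts.
def still_fails(subset):
    return "usb_hub" in subset and "driver_v2" in subset


parts = ["printer", "usb_hub", "scanner", "driver_v2", "screensaver"]
minimal = minimize(parts, still_fails)
```

The result keeps only the two parts that are needed to reproduce the symptom, which gives the troubleshooter a much smaller system to reason about.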
A troubleshooter could check each component in a
system
one by one, substituting known good components for each potentially suspect one. However, this process of "serial substitution" can be considered degenerate when components are substituted without regard to a hypothesis concerning how their failure could result in the symptoms being diagnosed.
Simple and intermediate systems are characterized by lists or trees of dependencies among their components or subsystems. More complex systems contain cyclical dependencies or interactions (
feedback loop
s). Such systems are less amenable to "bisection" troubleshooting techniques.
It also helps to start from a known good state, the best example being a computer
reboot
. A
cognitive walkthrough
is also a good thing to try. Comprehensive
documentation
produced by proficient
technical writer
s is very helpful, especially if it provides a
theory of operation
for the subject device or system.
A common cause of problems is bad
design
, for example bad
human factors
design, where a device could be inserted backward or upside down due to the lack of an appropriate forcing function (
behavior-shaping constraint
), or a lack of
error-tolerant
design. This is especially bad if accompanied by
habituation
, where the user just doesn't notice the incorrect usage, for instance if two parts have different functions but share a common case so that it is not apparent on a casual inspection which part is being used.
Troubleshooting can also take the form of a systematic
checklist
, troubleshooting procedure,
flowchart
or table that is made before a problem occurs. Developing troubleshooting procedures in advance allows sufficient thought about the steps to take and about organizing them into the most efficient process. Troubleshooting tables can be computerized to make them more efficient for users.
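A computerized troubleshooting procedure can be as simple as a table of questions and branches that is written before any problem occurs. The sketch below is one possible representation, offered as an assumption; the `FLOWCHART` contents, the node names and the `walk` function are hypothetical.

```python
# Each node is either a question with "yes"/"no" branches or a final action.
# The content is a made-up printer procedure, for illustration only.
FLOWCHART = {
    "start": {"ask": "Is the power light on?", "yes": "cable", "no": "fix_power"},
    "cable": {"ask": "Is the cable firmly seated at both ends?", "yes": "server", "no": "fix_cable"},
    "server": {"ask": "Did the job reach the print server?", "yes": "fix_device", "no": "fix_client"},
    "fix_power": {"action": "Restore power to the printer."},
    "fix_cable": {"action": "Reseat the cable."},
    "fix_device": {"action": "Inspect the printer side of the chain."},
    "fix_client": {"action": "Inspect the user's side of the chain."},
}


def walk(flowchart, answer, node="start"):
    """Follow the flowchart, calling `answer(question)` at each decision."""
    asked = []
    while "action" not in flowchart[node]:
        question = flowchart[node]["ask"]
        asked.append(question)
        node = flowchart[node]["yes" if answer(question) else "no"]
    return asked, flowchart[node]["action"]


# Hypothetical observations made by a technician.
observations = {
    "Is the power light on?": True,
    "Is the cable firmly seated at both ends?": False,
}
asked, action = walk(FLOWCHART, lambda q: observations[q])
```

Keeping the procedure as data rather than code means it can be reviewed and reordered in advance without touching the logic that walks it.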
Some computerized troubleshooting services (such as Primefax, later renamed MaxServ) immediately show the top 10 solutions with the highest probability of fixing the underlying problem. The technician can either answer additional questions to advance through the troubleshooting procedure, each step narrowing the list of solutions, or immediately implement the solution they feel will fix the problem. These services give a rebate if the technician takes an additional step after the problem is solved: reporting back the solution that actually fixed the problem. The computer uses these reports to update its estimates of which solutions have the highest probability of fixing that particular set of symptoms.
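How such a service computes its estimates is not described here. One simple possibility, shown purely as an assumption, is to count the reported fixes per set of symptoms and rank solutions by their observed share. The class `SolutionRanker` and its methods are hypothetical names.

```python
from collections import Counter


class SolutionRanker:
    """Rank solutions by how often technicians reported that they worked."""

    def __init__(self):
        self._fixes = {}  # frozenset of symptoms -> Counter of solutions

    def report_fix(self, symptoms, solution):
        """Record that `solution` actually fixed this set of symptoms."""
        key = frozenset(symptoms)
        self._fixes.setdefault(key, Counter())[solution] += 1

    def top_solutions(self, symptoms, limit=10):
        """Return up to `limit` (solution, estimated probability) pairs."""
        counts = self._fixes.get(frozenset(symptoms), Counter())
        total = sum(counts.values())
        if total == 0:
            return []
        return [(sol, n / total) for sol, n in counts.most_common(limit)]


# Hypothetical reports for one set of symptoms.
ranker = SolutionRanker()
for _ in range(3):
    ranker.report_fix(["no output", "light blinking"], "replace toner")
ranker.report_fix(["no output", "light blinking"], "reseat cable")
ranking = ranker.top_solutions(["light blinking", "no output"])
```

Each new report shifts the ranking, which is why the feedback step described above matters to the quality of the suggestions.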
Half-splitting
Efficient methodical troubleshooting starts with a clear understanding of the expected behavior of the system and the symptoms being observed. From there the troubleshooter forms hypotheses on potential causes, and devises (or perhaps references a standardized checklist of) tests to eliminate these prospective causes. This approach is often called "
divide and conquer".
Troubleshooters commonly use two strategies. The first is to check for frequently encountered or easily tested conditions (for example, checking that a printer's light is on and that its cable is firmly seated at both ends). This is often referred to as "milking the front panel."
The second is to "bisect" the system (for example, in a network printing system, checking whether the job reached the server to determine whether the problem lies in the subsystems "towards" the user's end or "towards" the device).
This latter technique can be particularly efficient in systems with long chains of serialized dependencies or interactions among their components. It is simply the application of a
binary search
across the range of dependencies and is often referred to as "half-splitting". It is similar to the game of "
twenty questions": Anyone can isolate one option out of a million by dividing the set of alternatives in half 20 times (because 2^10 = 1024 and 2^20 = 1,048,576).
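Half-splitting over a serial chain can be sketched as a binary search for the first stage whose output is bad. The sketch assumes a single fault, so that every stage upstream of it checks out and every stage from it onward does not; `half_split` and `output_ok` are hypothetical names.

```python
def half_split(stages, output_ok):
    """Return (index of the first failing stage, number of checks made).

    Assumes the symptom is visible at the end of the chain and that
    `output_ok(i)` is True for every stage before the fault and False
    from the faulty stage onward.
    """
    lo, hi = 0, len(stages) - 1  # the fault lies somewhere in [lo, hi]
    checks = 0
    while lo < hi:
        mid = (lo + hi) // 2
        checks += 1
        if output_ok(mid):
            lo = mid + 1  # signal still good here: fault is downstream
        else:
            hi = mid      # signal already bad: fault is here or upstream
    return lo, checks


# Hypothetical chain of 2**20 stages with a fault at stage 777,777.
fault = 777_777
index, checks = half_split(range(2**20), lambda i: i < fault)
```

With 1,048,576 stages the fault is found after 20 checks, matching the "twenty questions" arithmetic. When components interact through feedback loops the monotonic assumption no longer holds, which is why such systems are less amenable to bisection.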
Reproducing symptoms
One of the core principles of troubleshooting is that reproducible problems can be reliably isolated and resolved. Considerable effort and emphasis in troubleshooting is therefore often placed on reproducibility: on finding a procedure that reliably induces the symptom.
Intermittent symptoms
Some of the most difficult troubleshooting issues relate to
symptoms which occur intermittently. In electronics this often is the result of components that are thermally sensitive (since resistance of a circuit varies with the temperature of the conductors in it). Compressed air can be used to cool specific spots on a circuit board and a heat gun can be used to raise the temperatures; thus troubleshooting of electronics systems frequently entails applying these tools in order to reproduce a problem.
In computer programming,
race condition
s often lead to intermittent symptoms which are extremely difficult to reproduce; various techniques can be used to force the particular function or module to be called more rapidly than it would be in normal operation (analogous to "heating up" a component in a hardware circuit) while other techniques can be used to introduce greater delays in, or force synchronization among, other modules or interacting processes.
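Forcing synchronization to make a race reproducible can be illustrated with Python's `threading.Barrier`. In this sketch, which is an illustration rather than a general debugging recipe, two threads increment a shared counter without a lock; the barrier holds each thread between its read and its write, so the lost update happens on every run instead of rarely. The function name `run_increments` is hypothetical.

```python
import threading


def run_increments(force_overlap):
    """Two threads each add 1 to a shared counter without a lock."""
    state = {"count": 0}
    barrier = threading.Barrier(2)

    def worker():
        seen = state["count"]      # read the shared value
        if force_overlap:
            barrier.wait()         # hold until both threads have read
        state["count"] = seen + 1  # write back a possibly stale value

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]


# With the barrier, both threads read 0 before either writes, so one
# increment is always lost and the symptom appears on every run.
forced = run_increments(force_overlap=True)
```

Without the barrier the same code usually returns 2, and the bug shows up only when the scheduler happens to interleave the threads in the same way.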
An intermittent issue can be defined as one for which there is no "known procedure to consistently reproduce" its symptom.
This definition draws a distinction between the frequency of occurrence and a "known procedure to consistently reproduce" an issue. For example, knowing that an intermittent problem occurs "within" an hour of a particular stimulus or event, but that sometimes it happens in five minutes and other times it takes almost an hour, does not constitute a "known procedure" even if the stimulus does increase the frequency of observable exhibitions of the symptom.
Nevertheless, sometimes troubleshooters must resort to statistical methods and can only find procedures that increase the symptom's occurrence to a point at which serial substitution or some other technique is feasible. In such cases, even when the symptom seems to disappear for significantly longer periods, there is low confidence that the
root cause has been found and that the problem is truly solved.
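The low confidence can be put in rough numbers. Assuming, as a simplification, that each trial is independent and that an unfixed fault shows its symptom with a constant probability p per trial, the chance of seeing n clean trials in a row by luck alone is (1 - p)^n. The function names below are hypothetical.

```python
import math


def chance_of_clean_run(failure_rate, trials):
    """Probability that an unfixed fault stays silent for `trials` in a row."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return (1.0 - failure_rate) ** trials


def trials_needed(failure_rate, max_chance=0.05):
    """Smallest number of clean trials that makes a lucky streak unlikely."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return math.ceil(math.log(max_chance) / math.log(1.0 - failure_rate))


# Hypothetical case: the symptom used to appear in about 1 trial out of 10.
needed = trials_needed(0.1)
```

The rarer the symptom, the more clean trials are needed before a quiet period says much about whether the fix worked.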
Also, tests may be run to stress certain components to determine if those components have failed.
Multiple problems
Isolating single component failures that cause reproducible symptoms is relatively straightforward.
However, many problems only occur as a result of multiple failures or errors. This is particularly true of
fault tolerant
systems, or those with built-in redundancy. Features that add redundancy, fault detection and
failover to a system may also be subject to failure, and enough different component failures in any system will "take it down."
Even in simple systems, the troubleshooter must always consider the possibility that there is more than one fault. (Replacing each component, using serial substitution, and then swapping each new component back out for the old one when the symptom is found to persist, can fail to resolve such cases. More importantly, the replacement of any component with a defective one can actually increase the number of problems rather than eliminating them).
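A small simulation shows why swapping each part and then putting the old one back can fail when two parts are faulty. The slot and part names are hypothetical, and the sketch assumes the spares are known to be good, which the parenthetical above warns is not always true.

```python
def symptom_present(installed, faulty):
    """The symptom shows whenever any installed part is faulty."""
    return any(part in faulty for part in installed.values())


def swap_and_revert(installed, spares, faulty):
    """Serial substitution that restores the old part if the symptom persists."""
    for slot, old_part in list(installed.items()):
        installed[slot] = spares[slot]
        if not symptom_present(installed, faulty):
            return slot              # this single swap cleared the symptom
        installed[slot] = old_part   # put the original part back
    return None                      # no single swap helped


def swap_and_keep(installed, spares, faulty):
    """Serial substitution that leaves each known-good spare in place."""
    replaced = []
    for slot in list(installed):
        installed[slot] = spares[slot]
        replaced.append(slot)
        if not symptom_present(installed, faulty):
            return replaced
    return None


original = {"psu": "psu-old", "cable": "cable-old", "board": "board-old"}
spares = {"psu": "psu-new", "cable": "cable-new", "board": "board-new"}
faulty = {"psu-old", "board-old"}  # two independent faults

reverted = swap_and_revert(dict(original), spares, faulty)
kept = swap_and_keep(dict(original), spares, faulty)
```

Reverting never clears the symptom here because one faulty part is always back in the system. Keeping the spares does clear it, at the cost of also replacing the healthy cable.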
Note that, while we talk about "replacing components", the resolution of many problems involves adjustments or tuning rather than "replacement." For example, intermittent breaks in conductors, or "dirty or loose contacts", might simply need to be cleaned and/or tightened. All discussion of "replacement" should be taken to mean "replacement or adjustment or other modification."
See also
* 5 Whys
* Bathtub curve
* Cause and effect
* Debugging
* Forensic engineering
* No Trouble Found
* Problem solving
* Root cause analysis
* RPR Problem Diagnosis