The Science of Network Troubleshooting
By stretch | Wednesday, March 10, 2010 at 4:19 a.m. UTC
A number of people have written asking me what happened to a paper I wrote back
in 2008 entitled 'The Science of Network
Troubleshooting.' Unfortunately, I neglected to republish the paper after
revamping packetlife.net in late 2009, so here it is again
as a blog article.
Troubleshooting is not an art. Along with many other IT methodologies, it is
often referred to as an art, but it's not. It's a science, if
ever there was one. Granted, someone with great skill in troubleshooting can
make it seem like a natural talent, the same way a
professional ball player makes hitting a home run look easy, when in fact it is
a learned skill. Another common misconception holds that troubleshooting is a skill derived entirely from experience with the technologies involved. While experience is certainly beneficial, the ability to troubleshoot effectively arises primarily from the embrace of a systematic process, a science.
It's said that troubleshooting can't be taught, but I disagree. More
accurately, I would argue that troubleshooting can't be taught
easily, or in great detail. This is because traditional education encompasses
how a technology functions; troubleshooting
encompasses all the ways in which it can cease to function. Given that it's
virtually impossible to identify and memorize all the
potential points of failure a system or network might hold, engineers must
instead learn a process for identifying and resolving
malfunctions as they occur. To borrow a cliché analogy, teach a man to
identify why a fish is broken, rather than expecting him to
memorize all the ways a fish might break.
Troubleshooting as a Process
Essentially, troubleshooting is the correlation
between cause and effect. Your proxy server experiences a hard disk failure,
and you
can no longer access web pages. A backhoe digs up a fiber, and you can't call a
branch office. Cause, and effect. Moving forward,
the correlation is obvious; the difficulty lies in transitioning from effect to
cause, and this is troubleshooting at its core.
Consider walking into a dark room. The light is off, but you don't know why.
This is the observed effect for which we need to
identify a cause. Instinctively, you'll reach for the light switch. If the
light switch is on, you'll search for another cause. Maybe the
power is out. Maybe the breaker has been tripped. Maybe someone stole all the
light bulbs (it happens). Without much thought,
you investigate each of these possible causes in order of convenience or
likelihood. Subconsciously, you're applying a process to
resolve the problem.
Even though our light bulb analogy is admittedly simplistic, it serves to
illustrate the fundamentals of troubleshooting. The same
concepts are scalable to exponentially more complex scenarios. From a
high-level view, the troubleshooting process can be
reduced to a few core steps:
• Identify the effect(s)
• Eliminate suspect causes
• Devise a solution
• Test and repeat
• Mitigate
Step 1: Identify the Effect(s)
If you've been a network engineer for more than a few hours, you've been told
at least once that the Internet is down. Yes, the
global information infrastructure some forty years in the making has fallen to
its knees and is in a state of complete chaos. All this
is, of course, confirmed by Mary in accounting. Last time it was discovered her
Ethernet cable had come unplugged, but this time
she's certain it's a global catastrophe.
Correctly identifying the effects of an outage or change is the most critical
step in troubleshooting. A poor judgment at this first step
will likely start you down the wrong path, wasting time and resources.
Identifying an effect is not to be confused with deducing a
probable cause; in this step we are focused solely on listing the ways in which
network operation has deviated from the norm.
Identifying effects is best done without assumption or emotion. While your mind
will naturally leap to possible causes at the first
report of an outage, you must force yourself to adopt an objective stance and
investigate the noted symptoms without bias. In the
case of Mary's doomsday forecast, you would likely want to confirm the condition
yourself before alerting the authorities.
Some key points to consider:
What was working and has stopped?
An important consideration is whether an absent service was ever present to
begin with. A user may report an inability to reach
FTP sites as an outage, not realizing FTP connections have always been blocked
by the firewall as a matter of policy.
What wasn't working and has started?
This can be a much less obvious change, but no less important. One example
would be the easing of restrictions on traffic types
or bandwidth, perhaps due to routing through an alternate path, or the deletion
of an access control mechanism.
What has continued to work?
Has all network access been severed, or merely certain
types of traffic? Or only certain destinations? Has a
contingency system
assumed control from a failed production system?
When was the change observed?
This critical point is very often neglected. Timing is imperative for
correlation with other events, as we'll soon see. Also remember
that we are often limited to noting the time a change was observed, rather than
when it occurred. For example, an outage observed
Monday morning could have easily occurred at any time during the preceding
weekend.
Who is affected? Who isn't?
Is the change limited to a certain group of users or devices? Is it constrained
to a geographical or logical area? Is any person or service immune?
Is the condition intermittent?
Does the condition disappear and reappear? Does this happen at predictable
intervals, or does it appear to be random?
Has this happened before?
Is this a recurring problem? How long ago did it happen last? What was the
resolution? (You do keep logs of this sort of thing,
right?)
Correlation with planned maintenance and configuration changes
Was something else being changed at this time? Was a
device added, removed, or replaced? Did the outage occur during a
scheduled maintenance window, either locally or at another site or provider?
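None of these questions requires special tooling to answer, but the answers are worth writing down the moment you collect them. As a minimal Python sketch of how the initial report might be captured, here is one possibility; the field names and example values are illustrative only, not part of any particular tool:

```python
from dataclasses import dataclass, field
from datetime import datetime

# A hypothetical record of the questions above; field names are illustrative.
@dataclass
class EffectReport:
    observed_at: datetime              # when the change was observed, not necessarily when it occurred
    stopped_working: list = field(default_factory=list)
    started_working: list = field(default_factory=list)
    still_working: list = field(default_factory=list)
    affected: str = ""                 # users, devices, or sites affected (and who is immune)
    intermittent: bool = False
    prior_occurrences: str = ""        # pointer to earlier incident notes, if any
    concurrent_changes: str = ""       # maintenance windows, recent configuration changes

# Example values are made up for illustration.
report = EffectReport(
    observed_at=datetime(2010, 3, 8, 8, 15),
    stopped_working=["Web access from building 10"],
    still_working=["E-mail", "VoIP to branch offices"],
    affected="Users in building 10 only",
    concurrent_changes="Firewall ACL cleanup during Sunday's maintenance window",
)
print(report)
```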
Step 2: Eliminate Suspect Causes
Once we have a reliable account of the effect or
effects, we can attempt to deduce probable causes. I say probable because
deducing all possible causes is impractical, if not impossible. One possible
cause is a power failure. Another possible cause is
spontaneous combustion. Only one of these possible causes is probable.
There is a popular mantra of 'always start with layer one,'
suggesting that the physical connectivity of a network should be verified
before working on the higher layers. I disagree, as this is misleading and
often impractical. You're not going to drive out to a
remote site to verify everything is plugged in if a simple ping verifies
end-to-end connectivity. Similarly, it's unlikely that any cables
were disturbed if you can verify with relative certainty no one has gone near
the suspect devices. Perhaps this is an oversimplified argument, but verifying physical connectivity is often needlessly time-consuming and superseded by alternative methods.
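To illustrate the kind of quick alternative I mean, the ping mentioned above can be scripted in a few lines. This is a minimal sketch, with a placeholder hostname and ping flags that assume a Linux or macOS system:

```python
import subprocess

# Quick end-to-end reachability check; hostname is a placeholder.
# The -c flag (packet count) is the Linux/macOS form.
def ping(host, count=3):
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    print("reachable" if ping("remote-site-router.example.net") else "unreachable")
```

A check like this rules out a whole class of physical causes in seconds, without anyone leaving their desk.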
Instead, I suggest narrowing causes in order of combined probability and
convenience. For example, there might be nothing to
indicate DNS is returning an invalid response, but performing a manual name
resolution takes roughly two seconds, so this is
easily justified. Conversely, comparing a current device configuration to its
week-old backup and accounting for any differences
may take a considerable amount of time, but this presents a high probability of
exposing a cause, so it too is justified.
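For instance, the two-second manual name resolution mentioned above could be as simple as the following sketch; the hostname is a placeholder, and the result would be compared against the answer you expect:

```python
import socket

# A quick manual name resolution check; hostname is a placeholder.
def resolve(name):
    try:
        return socket.gethostbyname(name)
    except socket.gaierror as exc:
        return f"resolution failed: {exc}"

print(resolve("www.example.com"))
```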
The order in which you decide to eliminate suspect causes is ultimately
dependent on your experience, your familiarity with the
infrastructure, and your allowance for time. Regardless of priority, each
suspect cause should undergo the same process of
elimination:
Define a working condition
You can't test for a condition unless you know what condition to expect. Before
performing a test, you should have in mind what
outcome should be produced in the absence of an outage. For example, performing
a traceroute to a distant node is meaningless if
you can't compare it against a traceroute to the same destination under normal
conditions.
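As a rough Python sketch of that idea, the comparison below lines up a current traceroute against a baseline recorded under normal conditions; the destination and baseline hops are made-up illustrative values:

```python
import subprocess

# Run a numeric traceroute and return its output lines.
def traceroute(dest):
    result = subprocess.run(["traceroute", "-n", dest],
                            capture_output=True, text=True)
    return result.stdout.splitlines()

# Baseline recorded earlier, under normal conditions (illustrative addresses).
baseline = [
    " 1  10.0.0.1       0.4 ms",
    " 2  203.0.113.1    4.2 ms",
    " 3  192.0.2.10     9.8 ms",
]

current = traceroute("192.0.2.10")[1:]   # drop the header line
for i, hop in enumerate(current):
    expected = baseline[i] if i < len(baseline) else "(no corresponding baseline hop)"
    print(f"expected: {expected}")
    print(f"observed: {hop}")
```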
Define a test for that condition
Ensure that the test you perform is in fact evaluating the suspect cause. For instance, pinging an E-mail server doesn't explicitly
guarantee that mail services are available, only the server itself
(technically, only that server's address). To verify the presence of
mail services, a connection to the relevant daemon(s) must be established.
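A minimal Python sketch of that distinction might look like the following; the hostname and port are placeholders, and a successful NOOP exchange is taken here as evidence that the daemon is answering:

```python
import smtplib

# Test the mail service itself, not just the host's address.
def smtp_service_up(host, port=25):
    try:
        with smtplib.SMTP(host, port, timeout=5) as smtp:
            code, _ = smtp.noop()     # harmless command; a healthy server returns 250
            return code == 250
    except (smtplib.SMTPException, OSError):
        return False

print(smtp_service_up("mail.example.com"))
```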
Apply the test and record the result
Once you've applied the test, record its success or
failure in your notes. Even if you've eliminated the cause under suspicion, you
have a reference to remind you of this and avoid wasting time repeating the same test unnecessarily.
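A minimal sketch of such note-keeping in Python, with an illustrative file name and fields, might be:

```python
import json
from datetime import datetime, timezone

# Append each test, its expected outcome, and its observed result to a dated log.
def record_test(name, expected, observed, passed,
                logfile="troubleshooting-notes.jsonl"):
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "test": name,
        "expected": expected,
        "observed": observed,
        "passed": passed,
    }
    with open(logfile, "a") as fh:
        fh.write(json.dumps(entry) + "\n")

record_test("DNS lookup of www.example.com", "an A record is returned",
            "resolution failed", passed=False)
```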
It is common to uncover multiple failures in the course of troubleshooting.
When this happens, it is important to recognize any
dependencies. For example, if you discover that E-mail, web access, and a trunk
link are all down, the E-mail and web failures can
likely be ignored if they depend on the trunk link to function. However, always
remember to verify these supposed secondary
outages after the primary outage has been resolved.
Step 3: Devise a Solution
Once we have identified a point of failure, we want to continue our systematic
approach. Just as with testing for failures, we can
apply a simple methodology to testing for solutions. In fact, the process very
closely mirrors the steps performed to eliminate
suspect causes.
Define the failure
At this point you should have a comfortable idea of
the failure. Form a detailed description so you have something to test against
after applying a solution. For example, you would want to refine 'The
Internet is down' to 'Users in building 10 cannot access the
Internet because their subnet was removed from the outbound ACL on the
firewall.'
Define the proposed solution
Describe exactly what changes are to be made, and exactly what the expected
outcome is. Blanket solutions such as arbitrarily
rebooting a device or rebuilding a configuration from scratch might fix the
problem, but they prevent any further diagnosis and
consequently impede mitigation.
Apply the solution and record the result
Once we have a defined failure and a proposed
solution, it's time to pit the two against each other. Be observant in applying
the
solution, and record its result. Does the outcome match what you expected? Has
the failure been resolved?
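As a minimal Python sketch, the defined failure, the proposed solution, and the recorded result can be kept together in a single record; the field names are illustrative, and the example reuses the building 10 scenario above:

```python
from dataclasses import dataclass

# One record per failure: definition, proposed change, expectation, and result.
@dataclass
class ChangeRecord:
    failure: str
    proposed_change: str
    expected_outcome: str
    observed_outcome: str = "not yet applied"

record = ChangeRecord(
    failure=("Users in building 10 cannot access the Internet because their "
             "subnet was removed from the outbound ACL on the firewall"),
    proposed_change="Restore the building 10 subnet to the outbound ACL",
    expected_outcome="Web requests from building 10 succeed again",
)
# ...after applying the change and observing the result:
record.observed_outcome = "Web requests from building 10 succeed"
print(record)
```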
In addition to our defined process, some guidelines are well worth mentioning.
Maintain focus
Far too often I encounter a technician who, upon becoming frustrated with a
failure or failures, opts to recklessly reboot,
reconfigure, or replace a device instead of troubleshooting systematically.
This is the high-tech equivalent of pounding something
with a hammer until it works. Focus on one failure at a time,
and one solution at a time per failure.
Watch out for hazardous changes
When developing a solution, remember to evaluate what
effect it might have on systems unrelated to those being troubleshot. It's a
horrible feeling to realize you've fixed one problem at the expense of causing
a much larger one. The best course of action when
this happens is typically to immediately reverse the change which was made.
Note that this is only possible with a systematic
approach.
Step 4: Test and Repeat
Upon implementing a solution and observing a positive effect, we can begin to
retrace our steps back toward the original
symptoms. If any conditions were overlooked because they were decided to be
dependent on the recently resolved failure, test for
them again. Refer to your notes from the initial report and verify that each
symptom has been resolved. Ensure that the same tests
which were used to identify a failure are used to confirm functionality.
If you notice that a failure or failures remain, pick up where you left off in
the testing cycle, annotate it, and press forward.
Step 5: Mitigate
The troubleshooting process does not end when the
problem has been resolved and everyone is happy. All of your hard work up
until this point amounts to very little if the same problem occurs again
tomorrow. In IT, problems are commonly fixed without ever
being resolved. Band-aids and hasty installations are not acceptable
substitutes for implementing a permanent and reliable
solution. So to speak, many people will go on mopping the floor day after day
without ever plugging the leak.
A permanent solution may be as complex as redesigning the routed backbone, or
as simple as moving a power strip somewhere it
won't be tripped on anymore. A permanent solution also doesn't have to be 100%
effective, but it should be as effective as is
practical. At the absolute minimum, ensure that you record the observed failure
and the applied solution, so that if the condition
does recur you have an accurate and dated reference.
A Final Word
Everyone has his or her own preference in
troubleshooting, and by no means do I consider this paper conclusive. However,
if
there's only one concept you take away, make it this: above all else, remain
calm. You're no good to anyone in a panic. One's
ability to manage stress and maintain a professional demeanor even in the face
of utter chaos is what makes a good engineer
great.
Most network outages, despite what we're led to believe by senior management,
are not the end of the world. There are instances
where downtime can lead to loss of life; fortunately, this isn't the case with
the vast majority of networks. Money may be lost, time
may be wasted, and feelings may be hurt, but when the problem is finally
resolved, odds are you've learned something valuable.
Posted in Education