Teaching/developing troubleshooting skills

Pete Kruckenberg <pete@kruckenberg.com> 6/24/04 5:09:19 PM >>>

It's been so long since I learned network troubleshooting
techniques I can't remember how I learned them or even how I
used to do it (so poorly).

Does anyone have experience with developing a
skills-improvement program on this topic?

I find that it's helpful to teach troubleshooting in two stages: 1)
Define the problem. 2) Isolate the problem.

For stage one, teach them the basic skillset needed to define the issue
in a general way based on available information. Is a circuit obviously
down? Are certain destinations unreachable? Are *all* destinations
unreachable? Is network access slow? You get the picture.
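
As a rough illustration of stage one, here is a minimal Python sketch that
turns a handful of reachability observations into a general problem
statement. The function name and the wording of the categories are
hypothetical, made up for this example; how you gather the observations
(ping, monitoring, user reports) is up to you.

```python
from typing import Dict


def define_problem(reachable: Dict[str, bool]) -> str:
    """Summarize an outage from a map of destination -> reachable?

    Hypothetical helper: the categories mirror the stage-one questions
    (are *all* destinations unreachable, or only certain ones?).
    """
    if not reachable:
        return "no data"
    failures = [dest for dest, ok in reachable.items() if not ok]
    if not failures:
        return "all destinations reachable"
    if len(failures) == len(reachable):
        return "all destinations unreachable"
    return "some destinations unreachable: " + ", ".join(sorted(failures))
```

A result of "all destinations unreachable" points toward a local link or
default route, while a partial failure suggests looking further out.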

Once the nature of the problem is determined, I find a layered
approach to troubleshooting helpful, and that is what I teach to
others. The exact order of steps might change based on information
learned in stage one, but generally I work my way up the OSI model.
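
The layered approach can be sketched in Python as an ordered list of checks
that stops at the first layer that looks unhealthy. This is only a sketch:
the check functions are placeholders you would supply yourself, and the
names used here are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A check returns True when its layer looks healthy.
Check = Callable[[], bool]


def first_suspect_layer(checks: List[Tuple[str, Check]]) -> Optional[str]:
    """Run checks from the lowest layer upward; return the first that fails.

    Returns None when every layer passes.
    """
    for layer, check in checks:
        if not check():
            return layer
    return None


# Example with stand-in checks (real ones would inspect devices).
checks = [
    ("physical", lambda: True),
    ("datalink", lambda: True),
    ("network", lambda: False),   # pretend a routing problem was found
    ("transport", lambda: True),
]
suspect = first_suspect_layer(checks)  # "network"
```

Keeping the checks in a list makes it easy to reorder them when stage one
already points at a particular layer.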

If the problem could possibly be caused by a physical layer issue, try
to confirm or rule that out. Check the circuits for errors, bouncing links,
indications of mismatched clocking configurations, faulty CSU/DSUs,
faulty router interfaces, or bad cabling. If all of that appears to be
okay, then I consider the datalink layer.
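
One way to check a circuit for errors is to compare two snapshots of the
interface counters taken a few minutes apart and see which ones are still
climbing. The Python sketch below assumes you have already collected the
counters into dictionaries; the counter names are hypothetical and will
differ by platform.

```python
from typing import Dict, List

# Hypothetical counter names; substitute whatever your equipment reports.
ERROR_COUNTERS = ("input_errors", "crc", "frame", "carrier_transitions")


def rising_counters(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """Return the error counters that increased between two snapshots."""
    return [
        name
        for name in ERROR_COUNTERS
        if after.get(name, 0) > before.get(name, 0)
    ]
```

A counter that is large but not increasing may only reflect an old event,
which is why the comparison uses two snapshots rather than one reading.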

Could the problem defined in stage one be caused by a datalink layer
issue? Was the encapsulation changed on a router interface? If frame
relay, is the router seeing LMI from the frame relay switch? Is there
evidence of dropped frames completely within the cloud? (Granted, that's
not necessarily datalink layer, but it is a separate 'administrative'
layer if it's out of your control.) I'm sure you can think of a number
of other examples.

Could the problem defined in stage one be the result of a network layer
issue? Is there evidence of a routing loop? Do the devices involved have
routing tables that appear to be correct? What do traceroutes and pings
show? Teach them to go hop-by-hop and verify that everything appears as
it should, starting with the device closest to the problem if it's
possible to narrow it down that far.
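
The hop-by-hop walk can be sketched in Python as follows. The path is
assumed to be ordered starting from the device closest to the problem, and
hop_ok is a placeholder for whatever verification you do at each hop
(routing table lookup, ping, and so on). Both names are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def locate_fault(
    path: Sequence[str],
    hop_ok: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Walk the path in order and return (last good hop, first bad hop).

    Either element is None when there is no such hop.
    """
    last_good: Optional[str] = None
    for hop in path:
        if not hop_ok(hop):
            return last_good, hop
        last_good = hop
    return last_good, None
```

Returning the last good hop along with the first bad one narrows the fault
to a single device or the link between the two.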

If routing is determined to be correct, could this be a transport layer
issue? Is it possible that an access list or firewall somewhere is
blocking only certain types of traffic? Does the problem only involve
HTTP? SMTP? Is there policy routing involved that might be redirecting
certain types of traffic to the wrong destination? Were there *any*
recent configuration changes? If so, what were they? Find out, because
they might be the cause of the problem.
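
To see whether only certain types of traffic are affected, it can help to
probe several TCP ports on the same host and compare the results. Below is
a minimal Python sketch using the standard socket module. The helper names
are hypothetical. A successful connect only shows that the TCP handshake
completed, not that the service behind the port is healthy.

```python
import socket
from typing import Callable, Dict, Iterable


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe_services(
    host: str,
    ports: Iterable[int],
    probe: Callable[[str, int], bool] = tcp_port_open,
) -> Dict[int, bool]:
    """Probe each port on host and return a map of port -> reachable?"""
    return {port: probe(host, port) for port in ports}
```

For example, probe_services("mail.example.com", [25, 80]) (the host name is
a placeholder) returning {25: False, 80: True} would suggest something is
filtering SMTP while HTTP still passes.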

This is the general framework I use for troubleshooting and that's how
I've taught the people that work with me. It's constantly evolving and,
of course, the specific steps taken depend on the nature of the issue,
but I find that it helps to have a good foundational troubleshooting