Troubleshooting and Triage


When troubleshooting any major issue, people tend to feel a bit lost at first.  There is the crowd that wants to fix the immediate problem.  Then there is the group that wants to look at everything going on and address the root problem no matter how long it takes.  The key to troubleshooting is to realize how each of these approaches has their place and how they are both right and wrong at the same time.

The first approach is triage.  Think of it like a medical emergency room.  Their purpose is to fix the immediate symptoms and stabilize the patient.  Especially critical is the stabilization part.  You can’t fix a network that has bouncing routes or intermittent bridging loops.  Often the true root cause of the problem is buried beneath a pile of other symptoms.  Only when the immediate issues are resolved does the real problem surface.  Learning how to triage problems is a very important troubleshooting skill.  It gives a quick response while allowing the worst of the issue to be dealt with.

It’s important to remember that triage is just a quick fix.  Emergency rooms would never triage a patient without following up with a more in-depth consult or return visit.  Triage fails when engineers leave the patch in place and consider it the final solution.  Most times that I’ve seen this approach have been due to time constraints.  Rather than spending the time to research and test to find the true problem people are content to make the majority of the symptoms go away no matter how briefly.  It happens all the time.

“Just make it work for now.  We’ll fix it later.”

“If we configure it like this, will it stay up until the end of the quarter?”

“We don’t have time to debate this.  The CEO wants things up NOW!”

True in-depth troubleshooting is what happens when we have time and a clear way to solve the deeper root issues.  Deep troubleshooting figures out that the cause of a route flap is actually a bad Ethernet cable.  That’s not something you can easily determine from a quick analysis.  It takes time and effort to figure out.  When I worked on an inbound desktop help desk, we tested for CD-ROM failures by flipping the IDE cables back and forth on the IDE ports on the motherboard.  In part, this was to test to ensure the drive failure followed the switch of cables and ports.  In addition, it also tested the cable and port to make sure the dead drive wasn’t masking a bigger failure.  It took more time to do it properly but we never ran into an issue where a good CD-ROM drive was returned and the problem persisted.

In-depth troubleshooting can fail when there are so many problems masking the real issue that you start trying to fix the wrong problem.  Tunnel vision is easy to get when working on a problem.  If you tunnel in on an ancillary symptom and fail to fix the root cause you aren’t really doing much better than simple triage.  Just like a doctor, you need to ensure that you are treating the real problem under all the symptoms.  Remember not to be sidetracked by each small issue you uncover.  Fix them and keep digging for the real issue underneath it all.


Tom’s Take

I’ve had a lot of people comment that I was able to figure out problems quickly.  They also liked how I was able to “fix” things quickly.  That’s because I was very good at triage.  In my job as a VAR engineer, I didn’t really have time to dig deeper into the issue to uncover root cause.  Thankfully, a couple of the guys that I worked with were the exact opposite of me.  They loved digging into problems and pulling everything apart until they found the real issue.  They were labeled “slow” or “methodical” by some.  I loved working with them because the complemented my style perfectly.  I fix the big issues and make people happy.  They fix the underlying cause and keep them that way.  Just like ER doctors and specialists.  We both have our place.  It’s important to realize which is more important at a given time.

Advertisements

2 thoughts on “Troubleshooting and Triage

  1. A great tip I received from Brad Maltz was to open a support ticket first thing, even if you fix the problem before they respond they might help in root cause analysis.

  2. Pingback: SDN Myths Revisited | The Networking Nerd

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s