How Should We Handle Failure?

I had an interesting conversation this week with Greg Ferro about the network and how we’re constantly proving whether a problem is or is not the fault of the network. I postulated that the network gets blamed when old software has a hiccup. Greg’s response was:

Which led me to think about why we have such a hard time proving the innocence of the network. And I think it’s because we have a problem with applications.

Snappy Apps

Writing applications is hard. I base this on the fact that I am a smart person and I can’t do it. Therefore it must be hard, like quantum mechanics and figuring out how to load the dishwasher. The few people I know that do write applications are very good at turning gibberish into usable radio buttons. But they have a world of issues they have to deal with.

Error handling in applications is a mess at best. When I took C Programming in college, my professor was an actual coder during the day. She told us during the error handling portion of the class that most modern programs are a bunch of statements wrapped in if clauses that jump to some error condition if they fail. All of the power of your phone apps is likely a series of if this, then that, otherwise this failure condition statements.
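
To make that concrete, here’s a toy Python sketch of the pattern. The function names and error strings are invented for illustration rather than pulled from any real app, but the shape is the point: a happy path buried under a stack of if-this-failed branches.

```python
# A toy sketch of the "if this, then that, otherwise fail" pattern.
# fetch_user and fetch_preferences are hypothetical stand-ins for real lookups.

def fetch_user(user_id):
    return {"id": user_id} if user_id > 0 else None      # pretend lookup

def fetch_preferences(user):
    return {"theme": "dark"}                              # pretend lookup

def load_profile_screen(user_id):
    user = fetch_user(user_id)
    if user is None:
        return "Error: could not load user"               # generic failure branch
    prefs = fetch_preferences(user)
    if prefs is None:
        return "Error: could not load preferences"        # another catch-all branch
    return f"Rendering buttons for user {user['id']} ({prefs['theme']})"  # happy path

print(load_profile_screen(42))
print(load_profile_screen(-1))
```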

Some of these errors are pretty specific. If an application calls a database, it will respond with a database error. If it needs an SSL certificate, there’s probably a certificate specific error message. But what about those errors that aren’t quite as cut-and-dried? What if the error could actually be caused by a lot of different things?
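
Here’s a hedged little Python sketch of that difference. The hostnames are placeholders, but notice how a name-resolution failure (socket.gaierror) is a far more specific clue than the generic connection error, even though both usually get reported to the user as a “network problem”.

```python
# Distinguishing a specific failure (DNS) from a generic "network" failure.
# The hostnames below are placeholders for illustration.
import socket

def check_server(host, port=443):
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror as e:
        return f"DNS problem resolving {host}: {e}"          # specific: name resolution failed
    try:
        with socket.create_connection((host, port), timeout=3):
            return f"{host}:{port} is reachable"
    except OSError as e:
        return f"Connection problem to {host}:{port}: {e}"   # generic: could be many things

print(check_server("does-not-exist.example.invalid"))
print(check_server("example.com"))
```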

My first real troubleshooting on a computer came courtesy of LucasArts X-Wing. I loved the game and played the training missions constantly until I was able to beat them perfectly just a few days after installing it. However, when I started playing the campaign, the program would crash to DOS when I started the second mission with an out of memory error message. In the days before the Internet, it took a lot of research to figure out what was going on. I had more than enough RAM. I met all the specs on the side of the box. What I didn’t know and had to learn is that X-Wing required the use of Expanded Memory (EMS) to run properly. Once I decoded enough of the message to find that out, I was able to change things and make the game run properly. But I had to know that the memory X-Wing was complaining about was EMS, not the XMS RAM that I needed for Windows 3.1.

Generic error messages are the bane of the IT professional’s existence. If you don’t believe me, ask yourself how many times you’ve spent troubleshooting a network connectivity issue for a server only to find that DNS was misconfigured or down. In fact, Dr. House has a diagnosis for you:

Unintuitive Networking

If the error messages are so vague when it comes to network resources and applications, how are we supposed to figure out what went wrong and how to fix it? I mean, humans have been doing that for years. We take the little amount of information that we get from various sources. Then we compile, cross-reference, and correlate it all until we have enough to make a wild guess about what might be wrong. And when that doesn’t work, we keep trying even more outlandish things until something sticks. That’s what humans do. We fill in the gaps and come up with crazy ideas that occasionally work.

But computers can’t do that. Even the best machine learning algorithms can’t extrapolate beyond the data set they’re given. They need precise inputs to find a solution. Think of it like a Google search for a resolution. If you don’t have a specific enough error message to find the problem, you aren’t going to have enough data to provide a specific fix. You will need to do some work on your own to find the real answer.

Intent-based networking does little to fix this right now. All intent-based networking products are designed to create a steady state from a starting point. No matter how cool it looks during the demo or how powerful it claims to be, it will always fall over when it fails. And how easily it falls over is up to the people that programmed it. If the system is a black box with no error reporting capabilities, it’s going to fail spectacularly with no hope of repair beyond an expensive support contract.

It could be argued that intent-based networking is about making networking easier. It’s about setting things up right the first time and making them work in the future. Yet no system in a steady state works forever. With the possible exception of the pitch drop experiment, everything will fail eventually. And if the control system in place doesn’t have the capability to handle the errors being seen and propose a solution, then your fancy provisioning system is worth about as much as a car with no seat belts.

Fixing The Issues

So, how do we fix this mess? Well, the first thing that we need to do is scold the application developers. They’re easy targets and usually the cause behind all of these issues. Instead of giving us a vague error message about network connectivity, we need more messages like lp0 is on fire. We need to know what was going on when the function call failed. We need a trace. We need context around the process to be able to figure out why something happened.
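
As a rough sketch of what “give us a trace and some context” could look like from the application side, consider something like the Python below. The service name, port, and exception class are made up; the idea is simply to chain the original exception and log what the app was doing, and against which endpoint, when the call failed.

```python
# A sketch of raising errors with context instead of a bare "network error".
# The endpoint and order number are invented; the failure is simulated.
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("app")

class BackendUnreachable(Exception):
    """Carries enough detail to tell where and why a backend call failed."""

def call_backend(host, port):
    # Stand-in for a real API call; always fails to demonstrate the error path.
    raise ConnectionRefusedError(f"connect to {host}:{port} refused")

def place_order(order_id):
    host, port = "orders.internal.example", 8443
    try:
        call_backend(host, port)
    except OSError as e:
        # Log the context (what we were doing, against which endpoint) and chain
        # the original exception so the traceback shows the real cause.
        log.error("order %s failed talking to %s:%s (%s)", order_id, host, port, e)
        raise BackendUnreachable(f"order {order_id}: backend {host}:{port} unreachable") from e

try:
    place_order("A-1234")
except BackendUnreachable:
    log.exception("order processing failed")
```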

Now, once we get all these messages, we need a way to correlate them. Is that an NMS or some other kind of platform? Is that an incident response system? I don’t know the answer there, but you and your staff need to know what’s going on. They need to be able to see the problems as they occur and deal with the errors. If it’s DNS (see above), you need to know that fast so you can resolve the problem. If it’s a BGP error or a path problem upstream, you need to know that as soon as possible so you can assess the impact.
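
Whatever platform ends up doing the correlating, the core idea is simple enough to sketch. The sample events below are invented, but the point stands: group messages from different sources into time buckets so the DNS failure shows up right next to the “network” complaints it caused.

```python
# A minimal correlation sketch: bucket events from different sources by time window.
# The events are made-up sample data, not output from any real NMS.
from collections import defaultdict
from datetime import datetime

events = [
    ("2024-05-01 09:00:02", "dns01",  "SERVFAIL for orders.internal.example"),
    ("2024-05-01 09:00:05", "app07",  "network connectivity error to orders service"),
    ("2024-05-01 09:00:07", "app09",  "network connectivity error to orders service"),
    ("2024-05-01 09:12:41", "edge01", "BGP neighbor 203.0.113.1 down"),
]

def bucket(timestamp, window_seconds=60):
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return int(t.timestamp()) // window_seconds

correlated = defaultdict(list)
for ts, source, message in events:
    correlated[bucket(ts)].append((source, message))

for _, group in sorted(correlated.items()):
    print("Possibly related events:")
    for source, message in group:
        print(f"  [{source}] {message}")
```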

And lastly, we’ve got to do a better job of documenting these things. We need to take charge of keeping track of what works and what doesn’t. And keeping track of error messages and their meanings as we work on getting app developers to give us better ones. Because no one wants to be in the same boat as DenverCoder9.
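
Even something as low-tech as a shared lookup table of “what this vague message has actually meant for us” is a start. A minimal sketch, with made-up example entries:

```python
# A toy "tribal knowledge" table mapping vague error text to what it usually meant here.
# The entries are examples only; the real value is in keeping yours up to date.
KNOWN_ERRORS = {
    "network connectivity error": "Check DNS first; historically a resolver outage, not a link failure.",
    "certificate verify failed": "Usually an expired cert or a missing intermediate on the server side.",
    "out of memory": "For legacy apps, confirm which kind of memory the app actually means.",
}

def lookup(message):
    for fragment, note in KNOWN_ERRORS.items():
        if fragment in message.lower():
            return note
    return "Not documented yet. Add it once you figure it out."

print(lookup("ERROR: Network Connectivity Error to db01"))
print(lookup("TLS handshake: certificate verify failed"))
```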


Tom’s Take

I made a career out of taking vague error messages and making things work again. And I hated every second of it. Because as soon as I became the Montgomery Scott of the Network, people started coming to me with ever more bizarre messages. Becoming the Rosetta Stone for error messages typed into Google is not how you want to spend your professional career. Yet you need to understand that even the best things fail and that you need to be prepared for it. Don’t spend money building a very pretty facade for your house if it’s built on quicksand. Demand that your orchestration systems have the capabilities to help you diagnose errors when everything goes wrong.

Is It Really Always The Network?

Keep Calm and Blame the Network (image from Thomas LaRock)

I had a great time over the last month writing a series of posts with my friend John Herbert (@MrTugs) over on the SolarWinds Geek Speak Blog. You can find the first post here. John and I explored the idea that people are always blaming the network for a variety of issues that are often completely unrelated to the actual operation of the network. It was fun writing some narrative prose for once, and the feedback we got was actually pretty awesome. But I wanted to take some time to explain the rationale behind my madness. Why is it that we are always blaming the network?!?

Visibility Is Vital

Think about all the times you’ve been working on an application and things start slowing down. What’s the first thing you think of? If it’s a standalone app, it’s probably some kind of processing lag or memory issues. But if that app connects to any other thing, whether it be a local network or a remote network via the Internet, the first culprit is the connection between systems.

It’s not a large logical leap to make. We have to start by assuming that the people that made the application knew what they were doing. If hundreds of other people aren’t having this problem, it must not be the application, right? We’ve already started eliminating the application as the source of the issues even before we start figuring out what went wrong.

People will blame the most visible part of the system for issues. If that’s a standalone system sealed off from the rest of the world, it obviously must be the application. However, we don’t usually build these kinds of walled-off systems any longer. Almost every application in existence today requires a network connection of some kind. Whether it’s to get updates or to interact with other data or people, the application needs to talk to someone or maybe even everyone.

I’ve talked before about the need to make the network more of a “utility”. Part of the reason for this is that it lowers the visibility of the network to the rest of the IT organization. Lower visibility means fewer issues being incorrectly blamed on the network. It also means that the network is going to be doing more to move packets and less to fix broken application issues.

Blame What You Know

If your network isn’t stable to begin with, it will soon become the source of all issues in IT even if the network has nothing to do with the app. That’s because people tend to identify problem sources based on their own experience. If you are unsure of that, work on a consumer helpdesk sometime and try to keep track of the number of calls you get that were caused by “viruses” even when there’s no indication that the issue is virus-related. It’s staggering.

The same thing happens in networking and other enterprise IT. People only start troubleshooting problems from areas of known expertise. This usually breaks down by people shouting out solutions like, “I saw this once so it must be that! I mean, the symptoms aren’t similar and the output is totally different, but it must be that because I know how to fix it!”

People get uncomfortable when they are faced with troubleshooting something unknown to them. That’s why they fall back on familiar things. And if they constantly hear how the network is the source of all issues, guess what the first thing to get blamed is going to be?

Network admins and engineers have to fight a constant battle to disprove the network as the source of issues. And for every win they get, it can all come crashing down when the network actually is the problem. Validating the fears of the users is the fastest way to be seen as the issue every time.

Mean Time To Innocence

As John and I wrote the pieces for SolarWinds, what we wanted to show is that a variety of issues can look like network things up front but have vastly different causes behind the scenes. What I felt was very important for the piece was to have the main character, Amanda, go beyond the infamous Mean Time To Innocence (MTTI) metric. In networking, we all too often find ourselves going so far as to prove that it’s not the network and then leave it there. As soon as we’re innocent, it’s done.

Cross-functional teams and other DevOps organizations don’t believe in that kind of boundary. Lines between silos blur or are totally absent. That means that instead of giving up once you prove it’s not your problem, you need to work toward fixing what’s at issue. Fix the problem, not the blame. If you concentrate on fixing the problems, it will soon become noticeable that the networking team may not always be the problem. Even if the network is at fault, the network team will work to fix it and any other issues that you see.


Tom’s Take

I love the feedback that John and I have gotten so far on the series we wrote. Some said it feels like a situation they’ve been in before. Others have said that they applaud the way things were handled. I know that the narrative allows us to bypass some of the unsavory things that often happen, like argument and political posturing inside an organization to make other department heads look bad when a problem strikes. But what we really wanted to show is that the network is usually the first to get blamed and the last to keep its innocence in a situation like this.

We wanted to show that it’s not always the network. And the best way for you to prove that in your own organization is to make sure the network isn’t just innocent, but helpful in solving as many problems as possible.

Repeat After Me


Thanks to Tech Field Day, I fly a lot. As Southwest is my airline of choice and I have status, I tend to find myself sitting in the slightly more comfortable exit row seating. One of the things that any air passenger sitting in the exit row knows by heart is the exit row briefing. You must listen to the flight attendant brief you on the exit door operation and the evacuation plan. You are also required to answer with a verbal acknowledgment.

I know that verbal acknowledgment is a federal law. I’ve also seen some people blatantly disregard the need to verbally accept responsibility for their seating choice, leading to some hilarious stories. But it also made me think about why making people talk to you is the best way to make them understand what you’re saying.

Sotto Voce

Today’s society, full of distractions from auto-play videos on Facebook to Pokemon Go parks at midnight, is designed to capture a human’s attention for a few fleeting seconds. Even playing a mini-trailer before a movie trailer is designed to capture someone’s attention for a moment. That’s fine in a world where distraction is assumed and people try to multitask several different streams of information at once.

People are competing for audible attention as well. Pleasant voices are everywhere trying to sell us things. High volume voices are trying to cut through the din to sell us even more things. People walk around with headphones in most of the day. Those that don’t pollute what’s left with cute videos that sound like they were recorded in an echo chamber.

This has also led to a larger amount of non-verbal behavior being misinterpreted. I’ve experienced this myself on many occasions. People distracted by a song on their phone or thinking about lyrics in their mind may nod or shake their head in rhythm. If you ask them a question just before the “good part” and they don’t hear you clearly, they may appear to agree or disagree even though they don’t know what they just heard.

Even worse is when you ask someone to do something for you and they agree, only to turn around and ask, “What was it you wanted again?” or “Sorry, I didn’t catch that.” It’s become acceptable in society to agree to things without understanding their meaning. This leads to breakdowns in communication and pieces of the job left undone, because you assumed someone was going to do something when they agreed, even though they never understood what they were supposed to do.

Fortississimo!

I’ve found that the most effective way to get someone to understand what you’ve told them is to ask them to repeat it back in their own words. It may sound a bit silly to hear what you just told them, but think about the steps that they must go through:

  • They have to stop for a moment and think about what you said.
  • They then have to internalize the concepts so they understand them.
  • They then must repeat back to you those concepts in their own phrasing.

Those three steps mean that the ideas behind what you are stating or asking must be considered for a period of time. It means that the ideas will register and be remembered because they were considered when repeating them back to the speaker.

Think about this in a troubleshooting example. A junior admin is supposed to go down the hall and tell you when a link light comes on for port 38. If the admin just nods and doesn’t pay attention, ask them to repeat those instructions back. The admin will need to remember that port 38 is the right port and that they need to wait until the link light is on before saying something. It’s only two pieces of information, but it does require thought and timing. By making the admin repeat the instructions, you make sure they have them down right.


Tom’s Take

Think about all the times recently when someone has repeated something back to you. A food order or an amount of money given to you to pay for something. Perhaps it was a long list of items to accomplish for an event or a task. Repetition is important to internalize things. It builds neural pathways that force the information into longer-term memory. That’s why a couple of seconds of repetition are time well invested.

Automating Change With Help From Fibonacci


A few recent conversations that I’ve seen and had with professionals about automation have been very enlightening. It all started with a post on StackExchange about an unsuspecting user that tried to automate a cleanup process with Ansible and accidentally erased the entire server farm at a service provider. The post was later determined to be a viral marketing hoax but was quite believable to the community because of the power of automation to make bad ideas spread very quickly.

Better The Devil You Know

Everyone in networking has been in a place where they’ve typed in something they shouldn’t have. Maybe you removed the management network you were using to access the switch or created an access list that locked you out of something. Or perhaps you typed an errant debug command that forced you to drive an hour to reboot a switch that was no longer responding. All of these things seem to happen to people as part of the learning process.

But how many times have we typed something in to create a change and found that it broke more than we expected? Like changing a native VLAN on a trunk and bringing down a link we didn’t intend to affect? These unforeseen accidents are the kinds of problems that can easily be magnified by scripts or automation.

I wrote a post about people blaming tools for SolarWinds a couple of months ago. In it, there was a story about a person that uploaded the wrong switch firmware to a server and used an automated tool to kick off an upgrade of his entire network. Only after the first switch failed to return to normal did he realize that he had downloaded an incorrect firmware to the server. And the command he used to kick off the upgrade was not the safe version of the command that checks for compatibility. Instead, it was the quick version of the command that copied the firmware directly into flash and rebooted the switch without confirmation.

While people are quick to blame tools for making mistakes race through the network quickly, it should also be recognized that those mistakes would have happened no matter what. Just because a system is capable of being automated doesn’t mean that your commands are exempt from being checked and rechecked. Too often a typo or an added word somewhere in the mix causes unintended chaos because we didn’t take the time to make sure there were no problems ahead of time.
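
As one hedged example of that checking and rechecking, a pre-flight script can refuse to schedule an upgrade if the image name doesn’t match the target platform or the checksum doesn’t match the vendor’s published hash. The filenames, model string, and hash below are placeholders, not real values.

```python
# A pre-flight sanity check before any automated firmware push.
# Model name, filename, and expected hash are placeholders for illustration.
import hashlib
import sys

EXPECTED_MODEL = "ws-3850"
EXPECTED_SHA256 = "d2c7..."   # placeholder: use the vendor's published hash here

def preflight(image_path, target_model=EXPECTED_MODEL):
    if target_model not in image_path:
        return f"ABORT: image '{image_path}' does not look like a {target_model} image"
    digest = hashlib.sha256(open(image_path, "rb").read()).hexdigest()
    if digest != EXPECTED_SHA256:
        return f"ABORT: checksum mismatch ({digest[:12]}...)"
    return "OK: safe to schedule the upgrade on the first test device"

if __name__ == "__main__":
    image = sys.argv[1] if len(sys.argv) > 1 else "ws-3650-image.bin"
    print(preflight(image))
```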

Fibonacci Style

I’ve always tried to do testing and regression in a controlled manner. For places that have simulators and test networks to try out changes, that controlled method still works. But for those of us that tend to fly by the seat of our pants on production devices, it’s best to artificially limit the damage before it becomes too great. Note that this method works with automation systems too, provided you are controlling the logic behind them.

When you go out to test a network-wide change or perform an upgrade, pick one device as your guinea pig. It shouldn’t be something pushing massive production traffic like a core switch. Something isolated in the corner of the network usually works just fine. Test your change there outside production hours and with a fully documented backout plan. Once you’ve implemented it, don’t touch anything else for at least a day. This gives the routing tables time to settle down completely and all of the aging timers a chance to expire and tables to recompile. Only then can you be sure that you’re not dealing with cached information.

Now that you’ve done it once, you’re ready to make it live to 10,000 devices, right? Absolutely not. Now that you’ve proven that the change doesn’t cause your system to implode and take the network with it, you pick another single device and do it again. This time, you either pick a neighbor device to the first one or something on the other side of the network. The other side of the network ensures that changes don’t ripple across between devices over the 24-hour watch period. On the other hand, if the change involves direct connectivity changes between two devices, you should test them together to be sure that the links stay up. Much easier to recover one failed device than 40 or 400.

Once you follow the same procedure with the second upgrade and get clean results, it’s time to move to doing two devices at once. If you have a fancy automation system like Ansible or Puppet this is where you will be determining that the system is capable of handling two devices at once. If you’re still using scripts, this is where you make sure you’re pasting the right info into each window. Some networks don’t like two devices changing information at the same time. Your routing table shouldn’t be so unstable that a change like this would cause problems but you never know. You will know when you’re done with this.

Now that you’ve proven that you can make changes without cratering a switch, a link, or the entire network all at once, you can continue. Move on to three devices, then five, then eight. You’ll notice that these rollout plans are following the Fibonacci Sequence. This is no accident. Just like the appearance of these numbers in nature and math, having a Fibonacci rollout plan helps you evaluate changes and rollback problems before they grow out of hand. Just because you have the power to change the entire network at once doesn’t mean that you should.
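
For the curious, here’s a small Python sketch of that rollout plan: batch sizes follow the Fibonacci sequence, with a soak period between batches. The device names and the 24-hour soak are placeholders for whatever your own change process requires.

```python
# Fibonacci rollout: 1, 1, 2, 3, 5, 8, ... devices per batch, with a soak period between.
# Device names and soak length are placeholders.

def fibonacci_batches(devices, soak_hours=24):
    """Yield (batch, soak_hours) tuples covering every device exactly once."""
    a, b = 1, 1
    remaining = list(devices)
    while remaining:
        batch, remaining = remaining[:a], remaining[a:]
        yield batch, soak_hours
        a, b = b, a + b

devices = [f"switch{n:02d}" for n in range(1, 21)]
for batch, soak in fibonacci_batches(devices):
    print(f"Upgrade {batch}, then wait {soak}h and verify before continuing")
```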


Tom’s Take

Automation isn’t the bad guy here. We are. We are fallible creatures that make mistakes. Before, those mistakes were limited to how fast we could log into a switch. Today, computers allow us to make those mistakes much faster and on a larger scale. The long-term solution to the problem is to test every change ahead of time completely and hope that your test catches all the problems. If it doesn’t, then you had better hope your backout plan works. But by introducing a rolling system similar to the Fibonacci sequence above, I think you’ll find that mistakes will be caught sooner and problems will be rectified before they are amplified. And if nothing else, you’ll have lots of time to explain to your junior admin all about the wonders of Fibonacci in nature.

Network Firefighters or Fire Marshals?


Throughout my career as a network engineer, I’ve heard lots of comparisons to emergency responders thrown around to describe what the networking team does. Sometimes we’re the network police that bust offenders of bandwidth policies. Other times there is the Network SWAT Team that fixes things that get broken when no one else can get the job done. But over and over again I hear network admins and engineers called “fire fighters”. I think it’s time to change how we look at the job of fighting fires on the network.

Fight The Network

The president of my old company used to try to motivate us to think beyond our current job roles by saying, “We need to stop being firefighters.” It was absolutely true. However, the sentiment lacked some of the important details of what exactly a modern network professional actually does.

Think about your job. You spend most of your time implementing change requests and trying to fix things that don’t go according to plan. Or figuring out why a change six months ago suddenly decided today to create a routing loop. And every problem you encounter is a huge one that requires an “all hands on deck” mentality to fix it immediately.

People look at networks as a closed system that should never require maintenance of any kind, right up until the day it breaks. There’s a reason why a popular job description (and Twitter handle) involves describing your networking job as a plumber or janitor. We’re the response personnel that get called out on holidays to fix a problem that creates a mess for other people that don’t know how to fix it.

And so we come to the fire fighter analogy. The idea of a group of highly trained individuals sitting around the NOC (or firehouse) waiting for the alarm to go off. We scramble to get the problem fixed and prevent disaster. Then go back to our NOC and clean up to do it all over again. People think about networking professionals this way because they only really talk to us when there’s a crisis that we need to deal with.

The catch with networking is that we’re very rarely sitting around playing ping pong or cleaning our gear. In the networking world we find ourselves being pushed from one crisis to the next. A change here creates a problem there. A new application needs changes to run. An old device created a catastrophe when a new one was brought online. Someone did something without approval that prevented the CEO from watching Youtube. The list is long and distinguished. But it all requires us to fix things at the end of the day.

The problem isn’t with the fire fighting mentality. Our society wouldn’t exist without firefighters. We still have to have protection against disaster. So how is it that we can have an ordered society with relatively few firefighters versus the kinds of chaos we see in networking?

Marshal, Marshal, Marshal

The key missing piece is the fire marshal. Firefighters put out fires. Fire marshals prevent them from happening in the first place. They do this by enforcing the local fire codes and investigating what caused a fire in the first place.

The most visible reminder of the job of a fire marshal is found in any bar or meeting room. There is almost always a sign by the door stating the maximum occupancy according to the local fire code. Occupancy limits prevent larger disasters should a fire occur. Yes, it’s a bit of a bummer when you are told that no one else can enter the bar until some people leave. But the fire marshal would rather deal with unhappy patrons than with injuries or deaths in case of fire.

Other duties of the fire marshal include investigating the root cause of a fire and trying to find out how to prevent it in the future. Was it deliberate? Did it involve a faulty piece of equipment? These are all important questions to ask in order to find out how to keep things from happening all over again.

Let’s apply these ideas to the image of network professionals as firefighters. Instead of spending the majority of time dealing with the fallout from bad decisions there should be someone designated to enforce network policy and explain why those policies exist. Rather than caving to the whims of application developers that believe their program needs elevated QoS treatment, the network fire marshal can explain why QoS is tiered the way that it is and why including this new app in the wrong tier will create chaos down the road.

How about designating a network fire marshal to figure out why there was a routing table meltdown? Too often we find ourselves fixing the problem and not caring about the root cause so long as we can reassure ourselves that the problem won’t come back. A network fire marshal can do a post mortem investigation and find out which change caused the issue and how to prevent it in the future with proper documentation or policy changes.

It sounds simple and straightforward to many folks. Yet these are the things that tend to be neglected when the fires flare up and time becomes a premium resource. By having someone to fill the role of investigator and educator, we can only hope that network fires can be prevented before they start.


Tom’s Take

Networking can’t evolve if we spend the majority of our time dealing with the disasters we could have prevented with even a few minutes of education or caution. SDN seeks to prevent us from shooting off our own foot. But we can also help out by looking to the fire marshal as an example of how to prevent these network fires before they start. With some education and a bit of foresight we can reduce our workload of disaster response with disaster prevention. Imagine how much more time we would have to sit around the NOC and play ping pong.

The Splinter In Your Mind


You don’t know what it is, but it’s there, like a splinter in your mind, driving you mad. – Morpheus

We’ve all had a moment when we’re troubleshooting an issue and something just doesn’t feel right.  Or we’ve put together a solution and it works but there’s a little voice in the back of your mind telling you that something is missing.  You can’t quite put your finger on it but you know you can’t rest until you’ve figured it out.

The quote above comes from the first Matrix movie, when Morpheus is trying to explain to Thomas Anderson exactly why he feels so out of place in the world.  The term actually predates that movie, having been the title of an excellent Star Wars novel.  Splinter of the Mind’s Eye was released in 1978, so it’s almost as old as I am!  The term describes that feeling you get when something is nagging away at you and won’t let go.  I get it quite often.  Sometimes I recognize a person but can’t remember who they are.  Other times I can’t remember a critical step to a project.  But it comes very often when I’m troubleshooting a problem and the solution is just agonizingly out of reach.

How do you combat the splinter?  What can you do to overcome that feeling that will drive you crazy in short order?  Here are a few things I do.  Sometimes they help, sometimes they don’t.  But the idea is to try and dislodge the splinter and get your thought process rolling again.

Think It Through

This is probably my favorite solution.  When I’m faced with a tough problem and an elusive solution, my first step is to walk through the problem step by step.  If it’s a routing loop, I talk my way through the installation of the route into the routing table.  If the problem is a layer 2 issue, I think through the packet as it goes through the network.  The key is that you envision every step along the way.  Often our minds get distracted by an unimportant step and leave it out.  By going back through and thinking of every piece you often force an overlooked concern to the surface.  This can cause the splinter to move and create a new line of thinking.  Perhaps that routing loop is being caused by a redistribution?  Maybe you didn’t know the network used to run RIP and now is running OSPF.  By imagining the packet moving through the network, you can understand where the problem can occur.

You can also speak out loud when thinking things through.  I find it very useful to actually speak the words as I’m thinking.  That’s because my brain runs much faster than my mouth.  By forcing myself to put the thoughts into words, I can usually slow things down long enough to figure out the missing steps.

Draw It Out

If the problem is a little more nebulous, you might need a piece of paper or a whiteboard to draw things.  I’m not the best artist in the world, but I know that a crude diagram of what I’m thinking about will help me visualize things in a new way.  Maybe I forgot to add a piece to the drawing that fixes the issue.  Other times it just helps me think about the problem.  By filtering the splinter in your mind through another creative process like drawing, you can force it out in a different way.  I was very fond of this at my old job when I had a wall-sized whiteboard.  There’s no reason you can’t do it with a regular piece of paper though.  Colored pencils or markers can also help peel apart the layers of the issue.

Forget About It

Yes, it is strange advice to just forget about a problem.  But, ask yourself how many times you’ve stumbled onto the solution when you’re taking a shower or just about to fall asleep?  The brain is a miraculous computer, but sometimes it has a focus problem.  If you think about something for too long, you can get fatigued and lose your ability to apply critical reasoning.  It’s a “forest for the trees” kind of issue.

I always made it a point when I was troubleshooting a really hard problem to walk away for a few minutes.  Whether it was stepping out to get something to eat or just walking into a conference room for five minutes, I always tried to find some time to clear my mind and refocus on the situation.  By thinking about a shopping list or an order form or even the batting order of the 1962 Yankees you can jar the splinter loose and create new connections.  I always joked with my coworkers that the most efficient way to solve high severity issues was to install a shower in my office.  They didn’t find it nearly as funny as I did.


Tom’s Take

I can’t promise that these solutions are going to fix that nagging feeling in the back of your mind.  Some problems are just that tough.  But when you’ve applied every bit of critical reasoning you can to an issue and you’ve reached the point where you’re stuck but just can’t let go, sometimes it helps to apply one of the above methods.

If you let the splinter fester in the back of your mind, you’ll constantly be asking yourself what you can do or what you need to look at to fix things.  It will eventually consume you if you let it.  Instead, you should look at a way to move the splinter.  If you can do that you’ll sleep better at night.

Troubleshooting and Triage

When troubleshooting any major issue, people tend to feel a bit lost at first.  There is the crowd that wants to fix the immediate problem.  Then there is the group that wants to look at everything going on and address the root problem no matter how long it takes.  The key to troubleshooting is to realize how each of these approaches has its place and how they are both right and wrong at the same time.

The first approach is triage.  Think of it like a medical emergency room.  Their purpose is to fix the immediate symptoms and stabilize the patient.  Especially critical is the stabilization part.  You can’t fix a network that has bouncing routes or intermittent bridging loops.  Often the true root cause of the problem is buried beneath a pile of other symptoms.  Only when the immediate issues are resolved does the real problem surface.  Learning how to triage problems is a very important troubleshooting skill.  It gives a quick response while allowing the worst of the issue to be dealt with.

It’s important to remember that triage is just a quick fix.  Emergency rooms would never triage a patient without following up with a more in-depth consult or return visit.  Triage fails when engineers leave the patch in place and consider it the final solution.  Most times that I’ve seen this approach, it has been due to time constraints.  Rather than spending the time to research and test to find the true problem, people are content to make the majority of the symptoms go away, no matter how briefly.  It happens all the time.

“Just make it work for now.  We’ll fix it later.”

“If we configure it like this, will it stay up until the end of the quarter?”

“We don’t have time to debate this.  The CEO wants things up NOW!”

True in-depth troubleshooting is what happens when we have time and a clear way to solve the deeper root issues.  Deep troubleshooting figures out that the cause of a route flap is actually a bad Ethernet cable.  That’s not something you can easily determine from a quick analysis.  It takes time and effort to figure out.  When I worked on an inbound desktop help desk, we tested for CD-ROM failures by flipping the IDE cables back and forth on the IDE ports on the motherboard.  In part, this was to ensure that the drive failure followed the switch of cables and ports.  In addition, it also tested the cable and port to make sure the dead drive wasn’t masking a bigger failure.  It took more time to do it properly but we never ran into an issue where a good CD-ROM drive was returned and the problem persisted.

In-depth troubleshooting can fail when there are so many problems masking the real issue that you start trying to fix the wrong problem.  Tunnel vision is easy to get when working on a problem.  If you tunnel in on an ancillary symptom and fail to fix the root cause you aren’t really doing much better than simple triage.  Just like a doctor, you need to ensure that you are treating the real problem under all the symptoms.  Remember not to be sidetracked by each small issue you uncover.  Fix them and keep digging for the real issue underneath it all.


Tom’s Take

I’ve had a lot of people comment that I was able to figure out problems quickly.  They also liked how I was able to “fix” things quickly.  That’s because I was very good at triage.  In my job as a VAR engineer, I didn’t really have time to dig deeper into the issue to uncover root cause.  Thankfully, a couple of the guys that I worked with were the exact opposite of me.  They loved digging into problems and pulling everything apart until they found the real issue.  They were labeled “slow” or “methodical” by some.  I loved working with them because they complemented my style perfectly.  I fixed the big issues and made people happy.  They fixed the underlying cause and kept it fixed.  Just like ER doctors and specialists.  We both have our place.  It’s important to realize which is more important at a given time.