In our last episode, we found out that we had a problem. Hopefully, a specific one. Now we can move toward fixing it. But first we have to determine, in specific terms, what it is we’re dealing with. This post dovetails with the end of my previous post and leads into the solutions section, but it’s critical that we first narrow our focus to the specific problem at hand.
If you’ve ever had to troubleshoot an issue for someone who lacks specific knowledge of the subject matter, you’ve no doubt encountered the Chicken Little Problem Determination Method. For those not familiar with the story, Chicken Little believes that the sky is falling around him. Consequently, CLPDM involves the uninformed party making an overly large or non-specific problem determination. For instance, a user who can’t get to a specific website might determine “The Internet is broken.” or maybe “The network is down.” These diagnoses do nothing to help us as troubleshooting professionals fix the issue. In fact, it is usually at this point that we have to start applying our critical thinking skills to narrow the problem down to something specific.
I usually take the overly-broad problem determination and start narrowing my focus in steps. Most people will see this as the heart of the troubleshooting process. The way I envision this process is by drawing “boxes” around the problem. If you’ve ever played the game Jezzball (http://en.wikipedia.org/wiki/JezzBall), you know the key to winning is to draw smaller and smaller boxes, eliminating unnecessary area until there is nothing left. In much the same way, we start troubleshooting by drawing boxes around our problem and eliminating irrelevant causes or solutions.
Now, you might say to yourself “That sounds pretty simple. There’s got to be more to it than that.” And you would be right. But in truth, for most non-logical, non-step-oriented people, this is the hardest part to get. Too often, we fall back on past experience too early in the process without investigating all the issues. I myself am guilty of this on many an occasion. When we start boxing in our problem, we use our skills and knowledge to eliminate impossible or even highly improbable causes or solutions until all we are left with is the real problem.
In order to illustrate this, we can step through a specific problem to highlight the process of narrowing our focus. Take the aforementioned user that can’t get to a specific website. As the network engineer/admin, we start by verifying the most basic of things.
1. Is the system powered on and plugged into the network? Yes, feel free to scoff at troubleshooting layer 1, but in truth there have been occasions where a “network outage” was caused by someone carelessly unplugging a network cable or power cord.
Assuming that there is layer 1 connectivity, we have already drawn our first box around the problem and eliminated several possible causes. We know the system is getting power. We know the network cable is plugged into the system. However, we still don’t know if it’s working or not.
2. Next, we pull up the system and check the network address. In this step, we verify several things. If the system has an address, the network cable is most likely connected end-to-end, since most mainstream OSes report an adapter without link as disconnected and won’t show a DHCP address on it. In most user-facing cases, an address signifies that the computer has connectivity to some sort of addressing server, whether it be DHCP or stateless auto-configuration of some kind. The one caveat is a self-assigned address (169.254.x.x on IPv4), which usually means the system asked for an address and nobody answered.
We’ve now drawn our next box. The cabling must be good, since we have end-to-end connectivity. We must also be reaching at least one server on the network if we have a DHCP or stateless address, so the networking stack on the system is working properly.
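The address check in step 2 can be sketched in a few lines of Python. This is only an illustration, not a full diagnostic: the function name `classify_address` and its labels are my own, and it assumes you have already read the address off the system. The point is that a link-local address should not be counted as proof that an addressing server answered.

```python
import ipaddress


def classify_address(addr: str) -> str:
    """Give a rough label to an interface address (labels are illustrative)."""
    ip = ipaddress.ip_address(addr)
    if ip.is_loopback:
        # 127.0.0.0/8 or ::1: says nothing about the network at all
        return "loopback"
    if ip.is_link_local:
        # 169.254.0.0/16 or fe80::/10: the host configured this itself
        return "link-local"
    # Anything else was handed out or configured on purpose
    return "assigned"
```

For example, `classify_address("169.254.10.20")` returns `"link-local"`, which tells us to go look at DHCP before drawing the box any smaller.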
3. If we have an address, we should try using ICMP pings to check connectivity to other systems. In this particular case, the user is trying to reach a server off the network, so the best target is the user’s default gateway, as it is responsible for routing packets off the local network. In this particular example, the pings all succeed with no issues.
We’ve now drawn an even smaller box around this problem. If the user is saying that the problem is getting to a website not on our network, we needn’t bother testing connectivity to the printer down the hall. Narrowing your focus to the parts and pieces necessary to the problem is a skill that sometimes takes experience or specific knowledge. But after a while you start to get a feel for what’s needed and what’s not.
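If you find yourself repeating the gateway ping often, it can be wrapped in a small helper. The sketch below assumes the system `ping` command is on the path and uses its exit status as a coarse pass/fail signal; the names `build_ping_command` and `ping` are hypothetical, and the only platform difference handled is the count flag (`-n` on Windows, `-c` elsewhere).

```python
import platform
import subprocess
from typing import List, Optional


def build_ping_command(host: str, count: int = 4,
                       system: Optional[str] = None) -> List[str]:
    """Build the ping command line for the given (or current) platform."""
    system = system or platform.system()
    # Windows uses -n for the number of echo requests; Linux and macOS use -c
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def ping(host: str, count: int = 4) -> bool:
    """Return True if the ping command exits with status 0 (a coarse signal)."""
    result = subprocess.run(
        build_ping_command(host, count),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling `ping("192.168.1.1")` would test a gateway at that address, which is a made-up example here; substitute whatever the system actually lists as its default gateway.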
4. If we have connectivity to the gateway, the next step is to try to ping the site the user is going to. We know the network connection is good all the way to the gateway, so this step will tell us about the state of things beyond our reach. Note that ping takes a host name, not a URL, so leave off the http:// prefix. In this example, pinging the name www.fakedomainname.com failed.
Now we’ve got a nice narrow box around our problem. And our first glimpse of a failure in our troubleshooting method. Here’s where the rubber meets the road. We have to figure out from our steps which direction to take the troubleshooting based on what we see. We know that the network connection is good to the gateway and that ICMP pings work to that point. What failed in the test? In this example, the system returned the message
“Ping request could not find host www.fakedomainname.com. Please check the name and try again.”
Anyone familiar with troubleshooting will realize that this message means that the name could not be resolved using DNS. The next step is to try a test that is similar to the problem in order to verify that this possible root cause is indeed the issue.
5. Try pinging www.google.com. I tend to use google.com as a static example as its servers are typically up and the name resolves on most systems. In this particular example, the message returned is
“Ping request could not find host www.google.com. Please check the name and try again.”
Ah ha! We’re on to something here! We appear to be having issues resolving DNS…
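The two name tests can also be run without ping at all, which separates “the name won’t resolve” from “the host won’t answer.” Here is a minimal sketch using Python’s standard `socket.getaddrinfo`, which asks the system resolver for addresses. The helper names `can_resolve` and `unresolved` are my own, and the `resolver` parameter exists only so the logic can be exercised without a live network. Pass bare host names, not URLs.

```python
import socket
from typing import Callable, Iterable, List


def can_resolve(name: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if the system resolver returns at least one address."""
    try:
        return len(resolver(name, None)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved
        return False


def unresolved(names: Iterable[str],
               resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the names from the list that fail to resolve."""
    return [n for n in names if not can_resolve(n, resolver)]
```

If `unresolved(["www.fakedomainname.com", "www.google.com"])` hands back both names, the problem is unlikely to be the one site and far more likely to be name resolution itself.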
6. In this case, check to see what is assigned as the system’s DNS resolver. For this example, it’s a local server on the network.
We’ve now got a nice small box around the problem and we’re investigating issues at a very granular level. We’ve already come very far in the troubleshooting process and we’re almost there.
7. Can we ping the DNS server that is supposed to be resolving the DNS requests? In this case, the answer is ‘no’. The server appears to be offline at this time.
Bingo! We’ve successfully narrowed the problem from “The Internet is broken.” to “The DNS server is offline.”
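Looking back, the boxes we drew can be written down as an ordered list of checks, nearest to the user first, where the first failing check marks the edge of the smallest box. The sketch below is a restatement after the fact, not the order in which we discovered things, and every name in it (`first_failure`, `example_checks`) is hypothetical. The lambdas simply hard-code the results from this walkthrough; in practice each would call a real test.

```python
from typing import Callable, List, Optional, Tuple

# A check is a label plus a function that returns True when the check passes
Check = Tuple[str, Callable[[], bool]]


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the label of the first one that fails."""
    for label, check in checks:
        if not check():
            return label
    # Every check passed, so the problem lies outside these boxes
    return None


# Results hard-coded from the walkthrough above, nearest dependency first
example_checks: List[Check] = [
    ("System is powered on and plugged in", lambda: True),
    ("System has a network address", lambda: True),
    ("Default gateway answers ping", lambda: True),
    ("DNS server answers ping", lambda: False),
    ("Site name resolves", lambda: False),
]
```

With these stubbed results, `first_failure(example_checks)` returns `"DNS server answers ping"`. Everything above that line in the list has been eliminated, which is exactly what drawing the boxes was for.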
Now, you may be looking at this example and saying to yourself, “Duh!” If you are, congratulations. You’ve figured out how to troubleshoot. But, did you jump to this conclusion at the beginning without taking all the steps? If you assumed from the start that DNS was the root cause, you missed out on many other possible problems. And if your guess didn’t pan out, where would you be left? That’s right, back at the beginning again. The key to structured troubleshooting is to take things one step at a time and narrow the problem down. Jumping to conclusions usually leads you to a cliff sooner or later. If you continually narrow the box in a structured manner, you’ll always find the problem. Now what to do about it?
In our next post, we’ll examine methods for fixing problems. And why it’s always a bad idea not to clean up after yourself.