Boxing with Problems

In our last episode, we found out that we had a problem.  Hopefully, a specific one.  Now, we can move toward fixing the problem.  But first we have to determine with some specifics what it is we’re dealing with.  This particular post dovetails with the end of my previous post and leads to the solutions section.  But it’s critical that we narrow our focus to the specific problem at hand first.

If anyone has ever dealt with a situation where you are troubleshooting an issue for someone without specific knowledge of the subject matter, you’ve no doubt encountered the Chicken Little Problem Determination Method.  For those not familiar with the story, Chicken Little believes that the sky is falling around him.  Consequently, CLPDM involves the non-educated party making an overly large or non-specific problem determination.   For instance, a user that can’t get to a specific website might determine “The Internet is broken.” or maybe “The network is down.”  These diagnoses do nothing to help us as troubleshooting professionals fix the issue.  In fact, it is usually at this point where we have to start applying our critical thinking skills to narrow the problem down to something specific.

I usually take the overly-broad problem determination and start narrowing my focus in steps.  Most people will see this as the heart of the troubleshooting process.  The way I envision this process is by drawing “boxes” around the problem.  If you’ve ever played the game Jezzball (http://en.wikipedia.org/wiki/JezzBall), you know the key to winning is to draw smaller and smaller boxes, eliminating unnecessary area until there is nothing left.  In much the same way, we start troubleshooting by drawing boxes around our problem and eliminating irrelevant causes or solutions.

Now, you might say to yourself “That sounds pretty simple.  There’s got to be more to it than that.”  And you would be right.  But in truth, for most non-logical, non-step-oriented people, this is the hardest part to get.  Too often, we fall back on past experience too early in the process without investigating all the issues.  I myself am guilty of this on many an occasion.  When we start boxing in our problem, we use our skills and knowledge to eliminate impossible or even highly improbable causes or solutions until all we are left with is the real problem.

In order to illustrate this, we can step through a specific problem to highlight the process of narrowing our focus.  Take the aforementioned user that can’t get to a specific website.  As the network engineer/admin, we start by verifying the most basic of things.

1.  Is the system powered on and plugged into the network?  Yes, feel free to scoff at troubleshooting layer 1, but in truth there have been occasions where a “network outage” was caused by someone carelessly unplugging a network cable or power cord.

Assuming that there is layer 1 connectivity, we have already drawn our first box around the problem and eliminated several possible causes.  We know the system is getting power.  We know the network cable is plugged into the system.  However, we still don’t know if it’s working or not.

2.  Next, we pull up the system and check the network address.  In this step, we verify several things.  If the system has an address, that must mean the network cable is connected end-to-end, since adapters won’t show up in most mainstream OSes unless they are connected end to end.  In most user-facing cases, an address signifies that the computer has connectivity to some sort of addressing server, whether it be DHCP or stateless auto-configuration of some kind.

We’ve now drawn our next box.  The cabling must be good, since we have end-to-end connectivity.  We must also be reaching at least one server on the network if we have a DHCP or stateless address, so the networking stack on the system is working properly.

3.  If we have an address, we should try using ICMP pings to check connectivity to other systems.  In this particular case, the user is trying to reach a server off the network, so the best candidate is to ping the user’s default gateway, as it is responsible for routing packets off the local network.  In this particular example, the pings all succeed with no issues.

We’ve now drawn an even smaller box around this problem.  If the user is saying that the problem is getting to a website not on our network, we needn’t bother testing connectivity to the printer down the hall.  Narrowing your focus to the parts and pieces necessary to the problem is a skill that sometimes takes experience or specific knowledge.  But after a while you start to get a feel for what’s needed and what’s not.

4.  If we have connectivity to the gateway, the next step is to try and ping the site the user is going to.  We know the network connection is good all the way to the gateway, so this step will tell us about the state of things beyond our reach.  In this example, pinging www.fakedomainname.com failed.

Now we’ve got a nice narrow box around our problem.  And our first glimpse of an actual failure in our testing.  Here’s where the rubber meets the road.  We have to figure out from our steps which direction to take the troubleshooting based on what we see.  We know that the network connection is good to the gateway and that ICMP pings work to that point.  What failed in the test?  In this example, the system returned the message

“Ping request could not find host www.fakedomainname.com. Please check the name and try again.”

Anyone familiar with troubleshooting will realize that this message means that the name could not be resolved using DNS.  The next step is to try a test that is similar to the problem in order to verify that this possible root cause is indeed the issue.

5.  Try pinging www.google.com.  I tend to use google.com as a static example as its servers are typically up and the name resolves on most systems.  In this particular example, the message returned is

“Ping request could not find host www.google.com. Please check the name and try again.”

Ah ha!  We’re on to something here!  We appear to be having issues resolving DNS…

6.  In this case, check to see what is assigned as the system’s DNS resolver.  For this example, it’s a local server on the network.

We’ve now got a nice small box around the problem and we’re investigating issues at a very granular level.  We’ve already come very far in the troubleshooting process and we’re almost there.

7.  Can we ping the DNS server that is supposed to be resolving the DNS requests?  In this case, the answer is ‘no’.  The server appears to be offline at this time.

Bingo!  We’ve successfully narrowed the problem from “The Internet is down.” to “The DNS server is offline.”
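For anyone who thinks better in code, the boxes drawn above can be sketched as a simple decision ladder.  This Python snippet is only an illustration under stated assumptions: the `Observations` fields and the `diagnose` function are made-up names, and the results are typed in by hand rather than probed from a live network.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """What we saw at each step; every field name here is hypothetical."""
    link_up: bool            # step 1: powered on and plugged in
    has_address: bool        # step 2: DHCP or stateless address present
    gateway_ping: bool       # step 3: default gateway answers ICMP
    target_resolves: bool    # step 4: the user's site name resolves
    control_resolves: bool   # step 5: a well-known name resolves
    dns_server_ping: bool    # step 7: configured resolver answers ICMP


def diagnose(obs: Observations) -> str:
    """Walk the boxes from largest to smallest and return the narrowest one."""
    if not obs.link_up:
        return "Layer 1: check power and cabling"
    if not obs.has_address:
        return "No address: check cabling end to end and the addressing server"
    if not obs.gateway_ping:
        return "Cannot reach the default gateway"
    if obs.target_resolves:
        return "Name resolves: look past DNS, beyond the gateway"
    if obs.control_resolves:
        return "Only this one name fails to resolve"
    if not obs.dns_server_ping:
        return "The DNS server is offline"
    return "DNS server answers pings but is not resolving names"


# The scenario from this post: everything local works, nothing resolves.
example = Observations(True, True, True, False, False, False)
print(diagnose(example))
```

Fed the results from the example above, the ladder lands on “The DNS server is offline.”  The order of the checks is the whole point: each one only makes sense once the box before it has been drawn.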

Now, you may be looking at this example and saying to yourself, “Duh!”  If you are, congratulations.  You’ve figured out how to troubleshoot.  But, did you jump to this conclusion at the beginning without taking all the steps?  If you assumed from the start that DNS was the root cause, you missed out on many other possible problems.  And if your guess didn’t pan out, where would you be left?  That’s right, back at the beginning again.  The key to structured troubleshooting is to take things one step at a time and narrow the problem down.  Jumping to conclusions usually leads you to a cliff sooner or later.  If you continually narrow the box in a structured manner, you’ll always find the problem.  Now what to do about it?

In our next post, we’ll examine methods for fixing problems.  And why it’s always a bad idea to not clean up after yourself.

CSCsb42763, or Why I Hate Hunt Groups Sometimes

Ah, CSCsb42763.  My old nemesis.  Those of you in the voice realm may never have heard of this little jewel.  However, you may have heard of its more common name: “Can you make it so that all the members of a hunt group can pick up a call when it’s ringing (call pickup groups)?”  Fine, you say.  You enable the call pickup group for each phone.  Except being logged into a hunt group negates that particular behavior for calls ringing in the hunt group.  Okay, so I’ll just enable the call pickup group on the hunt pilot…ARGGGGGGGGGGGGGGG!!!!!  IT’S NOT THERE!!!!!!!!!!!!!

My friends, say hello to CSCsb42763.  This particular bug has been around since CallManager 4.0(2a)SR1.  According to the TAC Case Collection identifier (http://www.ciscotaccc.com/kaidara-advisor/voice/showcase?case=K11612191), this behavior caused a memory leak in 4.x versions of CallManager and was disabled by default to keep the calls to TAC down to a minimum.  But wait!  This is 2010?  Why hasn’t this been fixed/patched/enabled?  We’ve moved off of Windows and we’re now on our fourth major generation of appliance *coughcoughLinuxcough* support.  I honestly can’t say why this is still an issue.  If I knew exactly how TAC worked I wouldn’t be doing the job I am doing now.  Or I’d be getting paid a LOT more for it.

How do I fix this little behavior without going crazy?  Well, it’s time to get acquainted with the mysterious “Cisco Support Use 1” parameter:

1.  Log into CallManager and click on System -> Enterprise Parameters.

2.  Scroll down until you find the Cisco Support use section (It’s toward the bottom).  Under there, you’ll find the field “Cisco Support Use 1” (Yeah, real descriptive there, guys.  Why didn’t you just call it “DON’T TOUCH THIS UNLESS WE TELL YOU!!!!!!!!!!!”)

3.  Enter “CSCsb42763” (without quotes) exactly as it appears. Yes, case and spelling are very important.  The system is looking for this exact string to enable the features that you seek.

4.  Go to the hunt group in question and you should now see the “Call Pickup Group” field.

BTW, if you are still on CallManager 4.2 this field has a different name, naturally.  You’ll find it listed under the more descriptive “Unsupported Pickup” parameter under the CCMAdmin parameters page.  But, really, if you’re still on 4.2 you’re probably using tin cans and string to talk to each other and don’t really need hunt groups, now do you?

What a RIP-off!

A few days ago, there were a couple of tweets in my stream about RIP. Yes, the much-maligned, older-than-the-hills Routing Information Protocol. These particular tweets came from a couple of people that are in the study process for their CCIE lab exam. Having had a couple of shots at the lab myself, I found it prudent to mention that while RIP was indeed on the exam, don’t believe for one second that it’s going to be easy. While I can’t and won’t discuss specifics about what I’ve seen, I think I can speak in general terms about why RIP is still very much a topic on what is now version 4 of the venerable CCIE exam.

Every entry-level networking student learns about RIP. It was one of the very first routing protocols, and as such serves as a prototype for beginners to learn about topics like hop counts, routing tables, and other more esoteric subjects. RIP was designed to do one thing and do it well. Because of that, it doesn’t take long before the complexity of RIP is exhausted and students move on to bigger and better routing topics. In fact, the only purpose that RIP serves past the very early point is as a yardstick, a way of measuring how much better something is at its job than RIP.  Students quickly learn why EIGRP is much better as an advanced distance vector routing protocol.  Or why link-state is much better for larger networks.  Because of this, CCNAs and JNCIAs all over quickly develop the idea that “RIP sucks”.
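As a refresher on what those hop counts actually do, here is a deliberately simplified Python sketch of how a distance-vector router might fold one neighbor’s advertisement into its own table.  It is not how any real router implements RIP: the function and variable names are made up for illustration, and timers, split horizon, and triggered updates are all ignored.  The one RIP-specific fact it leans on is that a metric of 16 means “unreachable.”

```python
INFINITY = 16  # RIP treats a metric of 16 as unreachable


def merge_update(table, neighbor, advertised):
    """Apply one neighbor's advertisement to our routing table.

    table maps prefix -> (hop_count, next_hop).
    advertised maps prefix -> hop_count as seen by the neighbor.
    """
    for prefix, hops in advertised.items():
        new_hops = min(hops + 1, INFINITY)  # one more hop to go through the neighbor
        current = table.get(prefix)
        if current is None:
            # Only learn a brand-new route if it is actually reachable.
            if new_hops < INFINITY:
                table[prefix] = (new_hops, neighbor)
        elif current[1] == neighbor or new_hops < current[0]:
            # Always believe our current next hop, even when the news is worse;
            # believe anyone else only when they offer a shorter path.
            table[prefix] = (new_hops, neighbor)
    return table


# Hypothetical example: we currently reach 10.0.0.0/8 in 3 hops via R2.
table = {"10.0.0.0/8": (3, "R2")}
merge_update(table, "R3", {
    "10.0.0.0/8": 1,        # shorter path, so we switch to R3
    "172.16.0.0/16": 4,     # new route, learned at 5 hops
    "192.168.1.0/24": 15,   # would be 16 hops, so it is never installed
})
print(table)
```

After the update, the table points 10.0.0.0/8 at R3 with 2 hops and has learned 172.16.0.0/16 at 5 hops, while 192.168.1.0/24 never makes it in.  That is more or less the full extent of the logic, which is why the protocol stops being interesting so quickly.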

I’m not a RIP apologist by any stretch of the imagination.  But I also have a healthy respect for the fact that while RIP may not be the most impressive routing protocol by today’s lofty standards, its place in history is secured by its longevity and by the fact that we wouldn’t have EIGRP and OSPF today if RIP hadn’t paved the way for them.  So, why then would a 22-year-old protocol that has been eclipsed in almost every conceivable way still show up in the blueprint for the granddaddy of all routing and switching exams? In short, to screw with your head.

Anyone with a driver’s license can probably cite traffic laws.  And any of them can most likely describe the procedure to parallel park.  Or park on a hill.  But the second half of the US driver’s test doesn’t quiz you on your knowledge of how to parallel park.  They make you go out and do it.  And often, you don’t get to parallel park in perfect conditions like you practiced for weeks and months before your test.  No, if the driving instructor is particularly insidious, they might make you parallel park on a hill in a school zone.  In much the same way, RIP is on the test to mess with you.  You know RIP inside and out.  If you’ve read Jeff Doyle’s Routing TCP/IP volume 1 you can likely diagram a RIP update packet with some toothpicks and a pencil.  But, the devious-minded proctors and test writers aren’t going to ask you CCNA-level RIP questions.  No, much like the driving instructor, they are going to make you apply your RIP knowledge to situations you might not have encountered in practice.  Like establishing RIP neighbor relationships across 4 routers with no tunnels allowed.  Or, more likely, they will give you a very simple setup followed by the most infamous of all CCIE candidate 4-letter words: redistribution.  Most people know how RIP ticks, but when you start injecting RIP into a perfectly stable OSPF topology, that’s when the rubber hits the road and most things start falling apart.  Knowing how to pick up the pieces after RIP trashes your orderly link-state protocol is one of the things that shows you are a RIP genius and aren’t scared to get your hands dirty.

You might ask yourself, “Well, anyone can write evil RIP questions all day long.  Why is it still on the exam?  Who even still uses RIP?”  Good question.  Who still uses frame relay?  Or dial-up modems?  Or bridges?  Being a CCIE means that you aren’t fazed when you run into something old and (relatively) complicated.  It means you understand why RIP and frame relay and bridges paid their dues so that today we could have OSPF and MPLS and TRILL.  And as long as there is still a mainframe out there that only speaks routed or a network that has two routers that were made during the Carter administration, you are still going to encounter RIP.  And if you think outside the box on that stressful day in a dark lab somewhere, you don’t have to worry about being RIPed off by your grandfather’s routing protocol.

The Five Most Important Clients You’ll Meet

I ran across this little config snip while switching a router from one ISP to another.  I’ve changed the IPs in the access-list to protect the guilty, but I’ve left enough to get the general idea across.  Anyone want to take a wild guess as to what’s wrong with the config? I’ll give you a little hint: The first five callers get an unexpected bonus!


aaa authentication login vty local
!
ip access-list extended VTY
permit ip 172.17.0.0 0.0.255.255 any
permit ip 192.168.10.0 0.0.0.255 any
!
line vty 0 4
login authentication vty
transport input telnet ssh
!
line vty 5 15
access-class VTY in
login authentication vty
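
Spoiler ahead for anyone still guessing.  If you would rather let a script do the squinting, here is a rough Python sketch that walks a config like the one above and lists any vty ranges with no inbound access-class applied.  The function name and the parsing rules are my own assumptions for illustration; it only understands the handful of lines shown here and is nowhere near a real IOS parser.

```python
import re

CONFIG = """\
aaa authentication login vty local
!
ip access-list extended VTY
permit ip 172.17.0.0 0.0.255.255 any
permit ip 192.168.10.0 0.0.0.255 any
!
line vty 0 4
login authentication vty
transport input telnet ssh
!
line vty 5 15
access-class VTY in
login authentication vty
"""


def vty_without_acl(config: str):
    """Return (first, last) vty ranges that have no inbound access-class."""
    missing = []
    current = None      # vty range being read, or None outside a vty block
    protected = False   # True once an inbound access-class is seen in the block
    for raw in config.splitlines():
        line = raw.strip()
        match = re.match(r"line vty (\d+)(?: (\d+))?$", line)
        if match or line == "!" or line.startswith("line "):
            # A separator or a new line block closes whatever block was open.
            if current is not None and not protected:
                missing.append(current)
            current, protected = None, False
            if match:
                first = int(match.group(1))
                last = int(match.group(2) or first)
                current = (first, last)
        elif current is not None and re.match(r"access-class \S+ in\b", line):
            protected = True
    # The config may end while a vty block is still open.
    if current is not None and not protected:
        missing.append(current)
    return missing


print(vty_without_acl(CONFIG))
```

Run against the snippet above, it reports the range 0 through 4, which happens to be exactly where the first five callers land.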

Houston, It’s Making a Funny Noise

I’ve always said that when I finally get tired of doing networking stuff, I’m going to write a book about teaching people structured troubleshooting.  This would be useful for not only computer work, but a large variety of things like figuring out when your sink is backing up or how to change the serpentine belt on your car.  The key to structured troubleshooting is thinking logically and taking it step by step.  For the record, I’m going to call my book “Did You Plug the Damn Thing In?”

So, I figured I’d share some of my troubleshooting steps just to give people an idea about how I go about figuring out what’s wrong with something.  Starting with how to know you’ve got a problem.

Step 1:  Find out (specifically) what’s wrong.

Yes, that sounds simplistic and a little condescending.  And you’d be surprised at the number of people that can’t tell you what is wrong with something.  Want to piss off your mechanic?  Tell him your car is “making a funny noise”.  Want to raise the ire of your network admin?  Tell her the problem must be because “the network is slow”.  Problems are specific.  It’s not making a funny noise, it sounds like dragging a drawer full of silverware across a washboard.  The network isn’t slow, it takes forever for me to open up http://www.google.com.  Giving a problem a specific diagnosis means you can narrow your focus on what to start investigating.  As the troubleshooter, you have to remember to ask probing questions.  Don’t settle for ‘yes’ or ‘no’.  Make the other person describe, in excruciating detail if necessary, what exactly is wrong.  And take lots of notes.  As the title of this post suggests, the Apollo 13 astronauts didn’t tell Houston that their problem was a funny noise.  It was a very specific diagnosis.

Step 2: Repeat the problem, if possible

Most problems are repeatable.  There’s nothing worse than hearing that you’ve got a problem, only to be told “Well, it’s not happening right now.  It’s pretty random.”  No problem should be totally random, especially in the computer/technology realm.  You should always know what steps it takes to make the problem happen.  Granted, lots of catastrophic failures are sudden, such as power outages, blown fuses/power supplies, or outright parts failures.  But, for the most part, things like routing loops or broadcast storms can be traced down to a specific event.  Remember the old joke about the patient that tells the doctor, “Every time I move my arm like this it hurts”?  All joking aside, that’s a great problem diagnosis.  Because you can make it happen over and over again.

Step 3:  Take LOTS of notes

I hate writing things down.  My desk is filled with half-scribbled sticky notes with information that will be useless the next time I look at it.  But when you start troubleshooting something, be sure to write down every detail.  Why?  Because you may often find yourself referring to your notes and find out that some minor detail at the start is the cause/solution to your problem.  Such as, “We were cleaning the other day in the server room.”  Not really important at the beginning, but after you determine there is a network outage to the e-mail server and find out someone unplugged the cord to make it look pretty under the desk, that little nugget of information could be very handy.

At this point, you should know exactly what you are facing and how to repeat it as necessary.  You should also have a head start on documenting everything so you know when you’ve fixed the problem.  Next time, we’ll explore how to get info about how to fix what ails you.

The End is the Beginning is the End

If you’re seeing this post, it means that I’ve finally gotten started on a blog.  Or, months from now, it means that you’ve reached the end of my posting in reverse chronological order.  Either way, I hope it will be (has been) a fun read.

This is a place for me to condense thoughts and post things I might need to find later that would end up getting lost in the dark recesses of my mind.  Be warned that there are LOTS of things that pass through there on a regular basis, so you never know what’s going to end up here.

For the most part, this is going to be a technical place.  I’m a Cisco partner engineer, and I’m a CCNA, CCDA, CCNP, CCDP, CCVP, CCSP, and a whole host of partner specializations.  I’m also a CISSP, MCSE: Messaging, MCSE: Security, Novell Master CNE, VMware VCP, and a litany of other things.  I say this not to toot my own horn, but to show that I’m pretty much all over the map when it comes to technical nerdage.  I work on a lot of things in my job, and if they call that ‘wearing many hats’ I should open a haberdashery.

I’m also on Twitter, so feel free to follow me.  I’m @networkingnerd.  This place will probably end up being a repository for my thoughts that run longer than 140 characters, but rest assured that the same snark that accompanies my tweets will probably leak over here.

So, look for more content to come.  Unless you’ve already waded through the content and found this last (first) entry.  In which case, start over again and make sure I’m keeping up with things.