A Handy Acronym for Troubleshooting

While I may be getting further from my days as an active IT troubleshooter, that doesn’t mean I can’t keep refining my technique. As I look back on my formative years of troubleshooting, whether from a desktop perspective or in a larger enterprise role, I find that there were always a few things that were critical to understand about the issues I was facing.

Sadly, getting that information out of people in the middle of a crisis wasn’t always easy. I often ran into people who were very hard to communicate with during an outage or a big problem. Sometimes they were complicit because they had made the mistake that caused it. They also bristled at the idea of someone else coming to fix something they couldn’t or wouldn’t. Just as often, I ran into people who loved to give me lots of information that wasn’t relevant to the issue. Whether they were nervous talkers or just had a bad grasp of the situation, I ended up having to sift through all that data to tease out the information I needed.

The Method

Today, as I look back on my career, I would like to propose a method for collecting the information you need to troubleshoot an issue effectively.

  • Scope: How big is this problem? Is it just a single system or is it an entire building? Is it every site? If you don’t know how widespread the problem is you can’t really begin to figure out how to fix it. You need to properly understand the scope. That also includes understanding how important the system is to the business. Taking down a reservation system for an airline is a bigger deal than guest Wi-Fi being down at a restaurant.
  • Timeline: When did this start happening? What occurred right before? Were there any issues that you think might have contributed here? It’s important to make the people you’re working with understand that a proper timeline is critical because it allows you to eliminate issues. You don’t want to spend hours trying to find the root cause in one system only to learn it wasn’t even powered on at the time and the real cause is in a switch that was just plugged in.
  • Frequency: Is this the first time this has happened? Does it happen randomly or seemingly on a schedule? This one helps you figure out if it’s systemic and regular or just cosmic rays. It also forces your team or customers to think about when it’s occurring and how far back the issue goes. If you come in thinking it’s a one-off that happened yesterday only to find out it’s actually been happening for weeks or months you’ll take a much different approach.
  • Urgency: Is this an emergency? Are we talking about a hospital ER being down or a typo in a documentation page? Do I need to roll out to spend the whole night fixing this or is it something that I can look at on a scheduled visit? Be sure to note the reasoning behind why they chose to make it a priority too. Some customers love to make everything a dire emergency just to ensure they get someone out right away. At least until it’s time to pay the emergency call rate.
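If you like to keep notes in a structured form, the four questions above could be captured as a small intake record. This is only a rough sketch: the `Intake` class, the `QUESTIONS` prompts, and the `unanswered` helper are hypothetical names made up for illustration, and the prompts are paraphrased from the list above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

# Prompts paraphrased from the four bullets above (illustrative wording).
QUESTIONS = {
    "scope": "How big is this problem? One system, a building, every site?",
    "timeline": "When did this start, and what happened right before?",
    "frequency": "Is this the first time? Is it random or on a schedule?",
    "urgency": "Is this an emergency, and why is it a priority?",
}


@dataclass
class Intake:
    """Hypothetical intake record: one free-text answer per question."""
    scope: Optional[str] = None
    timeline: Optional[str] = None
    frequency: Optional[str] = None
    urgency: Optional[str] = None


def unanswered(intake: Intake) -> List[str]:
    """Return the prompts that still have no usable answer, in field order."""
    missing = []
    for f in fields(intake):
        answer = getattr(intake, f.name)
        # Treat None and whitespace-only strings as "not answered yet".
        if answer is None or not answer.strip():
            missing.append(QUESTIONS[f.name])
    return missing


if __name__ == "__main__":
    ticket = Intake(scope="Guest Wi-Fi at one restaurant",
                    urgency="Can wait for the scheduled visit")
    for prompt in unanswered(ticket):
        print("Still need to ask:", prompt)
```

The point of the helper is simply to show which questions to keep asking before you start digging into any one system.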

A four-step plan that’s easy to remember. Scope, Timeline, Frequency, Urgency. STFU.

Memory Aids

Okay, you can stop giggling now. I did that on purpose. In part to help you remember what the acronym was. In part to help you take a bit of a relaxed approach to troubleshooting. And, in some ways, to help you learn to get those chatterboxes and pushy stakeholders off your back. If your methodology includes STFU they might figure out quickly that you need to be the one asking the questions and they need to be the one giving the answers, not the other way around.

And yes, each of these little steps would have saved me so much time in my old role. For example:

  • Scope – Was the whole network down? Or did one of the kids just unplug your Ethernet cable?
  • Frequency – Has this server seriously been beeping every 30 seconds for the last two years? Did you bother to look at the error message?
  • Timeline – Yes, I would assume that when you put that lab switch into your network was when the problem with VTP started.
  • Urgency – Do you really need me to drive three hours to press the F1 key on a keyboard?

I seriously have dozens of examples but these are four of the stories I tell all of the time to show just how some basic understanding can help people do more than they think.


Tom’s Take

People love mnemonic devices to remember things, whether it’s My Very Eager Mother Just Served Us Nine (Pizzas) to remember the 8 planets and that one weird one or All People Seem To Need Data Processing to remember the seven layers of the OSI Model. I remember thinking through the important need-to-know information for doing some basic initial troubleshooting and noticing how easily it fit into an acronym that could be handy for other things too when you’re in a stressful situation. Feel free to use it.