What Can You Learn From Facebook’s Meltdown?

I wanted to wait to put out a hot take on the Facebook issues from earlier this week because failures of this magnitude always have details that come out well after the initial excitement is done. A company like Facebook isn’t going to publish the kind of in-depth post-mortem that we might like to see, but the information coming out from other sources does point to some interesting circumstances behind this outage.

Let me start off the whole thing by reiterating something important: Your network looks absolutely nothing like Facebook. The scale of what goes on there is unimaginable to the normal person. The average person has no conception of what one billion looks like. Likewise, the scale of the networking that goes on at Facebook is beyond the ken of most networking professionals. I’m not saying this to make your network feel inferior. More that I’m trying to help you understand that your network operations resemble those at Facebook in the same way that a model airplane resembles a space shuttle. They’re alike on the surface only.

Facebook has unique challenges that they have to face in their own way. Network automation there isn’t a bonus. It’s a necessity. The way they deploy changes and analyze results doesn’t look anything like any software we’ve ever used. I remember moderating a panel that had a Facebook networking person talking about some of the challenges they faced all the way back in 2013:

That technology that Najam Ahmad is talking about is two or three generations removed from what is being used today. They don’t manage switches. They manage racks and rows. They don’t buy off-the-shelf software to do things. They write their own tools to scale the way they need them to scale. It’s not unlike a blacksmith making a tool for a very specific use case that would never be useful to any non-blacksmith.

Ludicrous Case Scenarios

One of the things that compounded the problems at Facebook was the inability to see what the worst case scenario could bring. The little clever things that Facebook has done to make their lives easier and improve reaction times ended up harming them in the end. I’ve talked before about how Facebook writes things from a standpoint of unlimited resources. They build their data centers as if the network will always be available and bandwidth is an unlimited resource that never has contention. The average Facebook programmer likely never lived in a world where a dial-up modem was high-speed Internet connectivity.

To that end, the way they build the rest of their architecture around those presumptions creates the possibility of absurd failure conditions. Take the door entry system. According to reports, part of the reason why things were slow to come back up was that the door entry system for the Facebook data centers wouldn’t allow access to the people who knew how to revert the changes that caused the issue. Usually, card readers will retain their last good configuration in the event of a power outage to ensure that people with badges can still get in. It could be that the ones at Facebook work differently or just went down with the rest of the network. Whatever the case, the card readers weren’t allowing people into the data center. Another report says that the doors couldn’t even be opened with a key. That’s the kind of planning you do when you’ve never had to break open a locked door.

Likewise, I find the situation with the DNS servers to be equally crazy. Per other reports the DNS servers at Facebook are constantly monitoring connectivity to the internal network. If that goes down for some reason the DNS servers withdraw the BGP routes being advertised for the Facebook AS until the issue is resolved. That’s what caused the outage from the outside world. Why would you do this? Sure, it’s clever to basically have your infrastructure withdraw the routing info in case you’re offline to ensure that users aren’t hammering your system with massive amounts of retries. But why put that decision in the hands of your DNS servers? Why not have some other more reliable system do it instead?
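As a sketch of that logic (the names, thresholds, and structure here are my own invention, not Facebook's actual code), the decision each DNS node makes might look something like this:

```python
# Hypothetical sketch of the health-check withdrawal described above.
# All names and thresholds are illustrative, not Facebook's actual code.

def decide_advertisement(probe_results, failure_threshold=3):
    """Given recent backbone probe results (True = reachable, newest last),
    decide whether this DNS node should keep advertising its anycast
    prefixes via BGP.

    Withdrawing only after several consecutive failures keeps a single
    dropped probe from taking the node off the Internet.
    """
    consecutive_failures = 0
    for ok in reversed(probe_results):
        if ok:
            break
        consecutive_failures += 1
    return consecutive_failures < failure_threshold
```

The danger is that every node runs the same local check against the same backbone, so one bad backbone change makes all of them withdraw at once. A more conservative design might require confirmation from a system outside the failure domain before pulling the routes.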

I get that the mantra at Facebook has always been “fail fast” and that their architecture is built in such a way as to encourage individual systems to go down independently of others. That’s why Messenger can be down but the feed stays up, or why WhatsApp can have issues but you can still use Instagram. However, why was there no test of “what happens when it all goes down?” It could be that the idea of the entire network going offline is unthinkable to the average engineer. It could also be that the response to the whole network going down all at once was to just shut everything down anyway. But what about the plan for getting back online? Or, worse yet, what about all the things that impacted the ability to get back online?

Fruits of the Poisoned Tree

That’s where the other part of my rant comes into play. It’s not enough that Facebook didn’t think ahead to plan on a failure of this magnitude. It’s also that their teams didn’t think of what would be impacted when it happened. The door entry system. The remote tools used to maintain the networking equipment. The ability for anyone inside the building to do anything. There was no plan for what could happen when every system went down all at once. Whether that was because no one knew how interdependent those services were or because no one could think of a time when everything would go down all at once is immaterial. You need to plan for the worst and figure out what dependencies look like.

Amazon learned this the hard way a few years ago when US-East-1 went offline. No one believed it at the time because the status dashboard still showed green lights. The problem? The board was hosted on the zone that went down and the lights couldn’t change! That problem was remedied soon afterwards but it was a chuckle-worthy issue for sure.

Perhaps it’s because I work in an area where disasters are a bit more common, but I’ve always tried to think ahead to where the issues could crop up and how to combat them. What if you lose power completely? What if your network connection is offline for an extended period? What if the same tornado that takes out your main data center also wipes out your backup tapes? It might seem a bit crazy to consider these things, but the alternative is not having an answer in the off chance one of them happens.

In the case of Facebook, the question should have been “what happens if a rogue configuration deployment takes us down?” The answer better not be “roll it back” because you’re not thinking far enough ahead. With the scale of their systems it isn’t hard to create a change to knock a bunch of it offline quickly. Most of the controls that are put in place are designed to prevent that from happening but you need to have a plan for what to do if it does. No one expects a disaster. But you still need to know what to do if one happens.
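One such control is a staged rollout that widens only while health checks keep passing. This is a hypothetical sketch of the pattern, not any vendor's tooling; the function names and doubling strategy are assumptions:

```python
def staged_rollout(devices, apply_change, is_healthy, batch_size=1):
    """Push a change in growing batches, halting at the first unhealthy
    batch so a rogue configuration only reaches a fraction of the fleet.

    Returns (devices_changed, succeeded). On failure, the caller rolls
    back only `devices_changed` instead of the whole network.
    """
    changed = []
    i = 0
    while i < len(devices):
        batch = devices[i:i + batch_size]
        for device in batch:
            apply_change(device)
            changed.append(device)
        if not all(is_healthy(device) for device in batch):
            return changed, False  # stop the blast radius here
        i += len(batch)
        batch_size *= 2  # expand cautiously after each healthy batch
    return changed, True
```

The guardrail limits the damage, but the original point stands: you still need a written plan for what to do when a change slips past it.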

Thus Endeth The Lesson

What we need to take away from this is that our best intentions can’t defeat the unexpected. Most major providers resisted any schadenfreude over the situation because they know they could have been the ones to suffer from it. You may not have a network like Facebook’s, but you can absolutely take away some lessons from this situation.

You need to have a plan. You need to have a printed copy of that plan. It needs to be stored in a place where people can find it. It needs to be written in a way that the people who find it can implement it step-by-step. You need to keep it updated to reflect changes. You need to practice for disaster and quit assuming that everything will keep working correctly 100% of the time. And you need to have a backup plan for everything in your environment. What if the doors seal shut? What if the person with the keys to unlock the racks is missing? How do we ensure the systems don’t come back up in a degraded state before they’re ready? The list is endless, but that’s only because you haven’t started writing it yet.


Tom’s Take

There is going to be a ton of digital ink spilled on this outage. People are going to ask questions that don’t have answers and pontificate about how it could have been avoided. Hell, I’m doing it right now. However, I think the issues that compounded the problems are ones that can be addressed no matter what technology you’re using. Backup plans are important for everything you do, from campouts to dishwasher installations to social media websites. You need to plan for the worst and make sure that the people you work with know where to find the answers when everything fails. This is the best kind of learning experience because so many eyes are on it. Take what you can from this and apply it where needed in your enterprise. Your network may not look anything like Facebook, but with some planning now you don’t have to worry about it crashing like theirs did either.

Sharing Failure as a Learning Model

Earlier this week there was a great tweet from my friends over at Juniper Networks about mistakes we’ve made in networking:

It got some interactions with the community, which is always nice, but it got me to thinking about how we solve problems and learn from our mistakes. I feel that we’ve reached a point where we’re learning from the things we’ve screwed up but we’re not passing it along like we used to.

Write It Down For the Future

Part of the reason why I started my blog was to capture ideas that had been floating in my head for a while. Troubleshooting steps or perhaps even ideas that I wanted to make sure I didn’t forget down the line. All of it was important to capture for the sake of posterity. After all, if you didn’t write it down did it even happen?

Along the way I found that the posts that got significant traction on my site were the ones that involved mistakes. Something I’d done that caused an issue, or something I needed to look up through a lot of sources that I distilled down into an easy reference. These kinds of posts are the ones that fly right up to the top of the Google search results. They are how people know you. It could be a terminology post like defining trunks. Or perhaps it’s a question about why your SFP modules aren’t working in a switch.

Once I realized that people loved finding posts that solved problems I made sure to write more of them down. If I found a weird error message I made sure to figure out what it was and then put it up for everyone to find. When I documented weird behaviors of BPDUGuard and BPDUFilter that didn’t match the documentation I wrote it all down, including how I’d made a mistake in the way that I interpreted things. It was just part of the experience for me. Documenting my failures and my learning process could help someone in the future. My hope was that someone in the future would find my post and learn from it like I had.

Chit Chat Channels

It used to be that when you Googled error messages you got lots of results from forum sites or Reddit or other blogs detailing what went wrong and how you fixed it. I assume that is because, just like me, people were doing their research and figuring out what went wrong and then documenting the process. Today I feel like a lot of that type of conversation is missing. I know it can’t have gone away permanently, because all networking engineers make mistakes and solve problems, and that knowledge has to end up somewhere, right?

The answer came to me when I read a Reddit post about networking message boards. The suggestions in the comments weren’t about places to go to learn more. Instead, they linked to Slack channels or Discord servers where people talk about networking. That answer made me realize why the discourse around problem solving and learning from mistakes seems to have vanished.

Slack and Discord are great tools for communication. They’re also very private. I’m not talking about gatekeeping or restrictions on joining. I’m talking about the fact that the conversations that happen there don’t get posted anywhere else. You can join, ask about a problem, get advice, try it, see it fail, try something else, and succeed all without ever documenting a thing. Once you solve the problem you don’t have a paper trail of all the things you tried that didn’t work. You just have the best solution that you did and that’s that.

You know what you can’t do with Slack and Discord? Search them through Google. The logs are private. The free tiers hide messages past a certain limit. All that knowledge disappears into thin air. Unlike the Wisdom of the Ancients, the issues we solve in Slack are gone as soon as you hit your message limit. No one learns from the mistakes because it looks like no one has made them before.

Going the Extra Mile

I’m not advocating for removing Slack and Discord from our daily conversations. Instead, I’m proposing that when we do solve a hard problem or we make a mistake that others might learn from we say something about it somewhere that people can find it. It could be a blog post or a Reddit thread or some kind of indexable site somewhere.

Even the process of taking what you’ve done and consolidating it down into something that makes sense can be helpful. I saw X, tried Y and Z, and ended up doing B because it worked the best of all. Just the process of how you got to B through the other things that didn’t work will go a long way toward helping others. Yes, it can be a bit humbling and embarrassing to publish something that admits you made a mistake. But it’s also part of the way that we learn as humans. If others can see where we went and understand why that path doesn’t lead to a solution, then we’ve effectively taught others too.


Tom’s Take

It may be a bit self-serving for me to say that more people need to be blogging about solutions and problems and such, but I feel that we don’t really learn from something unless we internalize it. That means figuring it out and writing it down. Whether it’s a discussion on a podcast or a back-and-forth conversation in Discord, we need to find ways to get the words out into the world so that others can build on what we’ve accomplished. Google can’t search archives that aren’t on the web. If we want to leave a legacy for the DenverCoder10s of the future, that means we do the work now of sharing our failures as well as our successes and letting the next generation learn from us.

The Mystery of Known Issues

I’ve spent the better part of the last month fighting a transient issue with my home ISP. I thought I had it figured out after a hardware failure at the connection point, but it crept back up after I got back from my Philmont trip. I spent a lot of energy upgrading my home equipment firmware and charting the seemingly random timing of the issue. I also called the technical support line and carefully explained what I was seeing and what had been done to work on the problem already.

The responses usually ranged from confused reactions to attempts to reset my cable modem, which never worked. It took several phone calls and lots of repeated explanations before I finally got a different answer from a technician. It turns out there was a known issue with the modem hardware! It’s something they’ve been working on for a few weeks and they’re not entirely sure what the ultimate fix is going to be. So for now I’m going to have to endure the daily resets. But at least I know I’m not going crazy!

Issues for Days

Known issues are a way of life in technology. If you’ve worked with any system for any length of time you’ve seen the list of things that aren’t working or have weird interactions with other things. Given the ever-growing number of interdependent systems we rely on, it’s no wonder those known issue lists are miles long by now.

Whether it’s a bug or an advisory or a listing of an incompatibility on a site, the nature of all known issues is the same. They are things that don’t work that we can’t fix yet. They could be on a list of issues to resolve or something that may never be able to be fixed. The key is that we know all about them so we can plan around them. Maybe it’s something like a bug in a floating point unit that causes long division calculations to be inaccurate to a certain number of decimal places. If you know what the issue is you know how to either plan around it or use something different. Maybe you don’t calculate to that level of precision. Maybe you do that on a different system with another chip. Whatever the case, you need to know about the issue before you can work around it.

Not all known issues are publicly known. They could involve sensitive information about a system. Perhaps the issue itself is a potential security risk. Most advisories about remote exploits are known issues internally at companies before they are patched. While they aren’t immediately disclosed, they are eventually found out when the patch is released or when someone outside the company discovers the same issue. Putting these kinds of things under an embargo of sorts isn’t always bad if it protects against wider exploitation. However, the truth must eventually come out or things can’t get resolved.

Knowing the Unknown

What happens when the reasons for not disclosing known problems are less than noble? What if the reasoning behind hiding an issue has more to do with covering up bad decision making or saving face or even keeping investors or customers from fleeing? Welcome to the dark side of disclosure.

When I worked for Gateway 2000 back in the early part of the millennium, we had a particularly nasty known issue in the system. The ultimate root cause was that the capacitors on a series of motherboards were made with poor quality controls or bad components and would swell and eventually burst, causing the system to come to a halt. The symptoms manifested themselves in all manner of strange ways, like race conditions or random software errors. We would sometimes spend hours troubleshooting an unrelated issue only to find out the motherboard was affected with “bad caps”.

The issue was well documented in the tech support database for the affected boards. Once we could determine that it was a capacitor issue it was very easy to get the parts replaced. Getting to that point was the trick, though. Because at the top of the article describing the problem was a big, bold statement:

Do Not Tell The Customer This Is A Known Issue!!!

What? I can’t tell them that their system has an issue that we need to correct before everything pops and shuts it down for good? I can’t even tell them what to look for specifically when we open the case? Have you ever tried to tell a 75-year-old grandmother to look for something “strange” in a computer case? You get all kinds of fun answers!

We ended up getting creative in finding ways to look for those issues and getting them replaced where we could. When I moved on to my next job working for a VAR, I found out some of those same machines had been sold to a customer. I opened the case and found bad capacitors right away. I told my manager and explained the issue, and we started getting the boards replaced under warranty as soon as the first sign of problems appeared. After the warranty expired we kept ordering good boards from suppliers until we were able to retire all of those bad machines. If I hadn’t known about the bad cap issue from my help desk time I never would have known what to look for.

Known issues like these are exactly the kind of thing you need to tell your customers about. It’s something that impacts their computer. It needs to be fixed. Maybe the company didn’t want to have to replace thousands of boards at once. Maybe they didn’t want to have to admit they cut corners when they were buying the parts and now the money they saved is going to haunt them in increased support costs. Whatever the reason it’s not the fault of the customer that the issue is present. They should have the option to get things fixed properly. Hiding what has happened is only going to create stress for the relations between consumer and provider.

Which brings me back to my issue from above. Maybe it wasn’t “known” when I called the first time. But by the third or fourth time I called about the same thing they should have been able to tell me it’s a known problem with this specific behavior and that a fix is coming soon. The solution wasn’t to keep using the first-tier support fixes of resets or transfers to another department. I would have appreciated knowing it was an issue so I didn’t have to spend as much time upgrading and isolating and documenting the hell out of everything just to exclude other issues. After all, my troubleshooting skills haven’t disappeared completely!

Vendors and providers, if you have a known issue you should admit it. Be up front. Honesty will get you far in this world. Tell everyone there’s a problem and that you’re working on a fix you don’t have just yet. It may not make the customer happy at first, but they’ll understand a lot more than if you hide it for days or weeks while you scramble to fix it without telling anyone. If that customer has more than a basic level of knowledge about systems they’ll probably be able to figure it out anyway, and then you’re going to be the one with egg on your face when they tell you all about the problem you don’t want to admit you have.


Tom’s Take

I’ve been on both sides of this fence in a number of situations. Do we admit we have a known problem and try to get it fixed? Or do we get creative and try to hide it so we don’t have to own up to the uncomfortable questions that get asked about bad development or cutting corners? The answer should always be to own up to things. Make everyone aware of what’s going on and make it right. I’d rather deal with an honest company working hard to make things better than a dishonest vendor that miraculously manages to fix things out of nowhere. An ounce of honesty prevents a pound of bad reputation.

Should We Embrace Points of Failure?


There was a tweet making the rounds this last week that gave me pause. Max Clark said that we should embrace single points of failure in the network. His post was an impassioned plea to networking rock stars out there to drop redundancy out of their networks and instead embrace these Single Points of Failure (SPoF). The main points Mr. Clark made boil down to a couple of major statements:

  1. Single-device networks are less complex and easier to manage and troubleshoot. Don’t have multiple devices when an all-in-one works better.
  2. Consumer-grade hardware is cheaper and easier to understand, therefore it’s better. Plus, if you need a backup you can just buy a second one and keep it on the shelf.

I’m sure most networking pros out there are practically bristling at these suggestions. Others may read through the original tweet and think it was a tongue-in-cheek post. Let’s look at the argument logically and understand why it has some merit but is ultimately flawed.

Missing Minutes Matter

I’m going to tackle the second point first. The idea that you can use cheaper gear and have cold standby equipment just sitting on the shelf is one that I’ve heard of many times in the past. Why pay more money to have a hot spare or a redundant device when you can just stock a spare part and swap it out when necessary? If your network design decisions are driven completely by cost then this is the most appealing thing you could probably do.

I once worked with a technology director who insisted that we forgo our usual configuration of RAID-5 with a hot spare drive in the servers we deployed. His logic was that the hot spare drive was spinning without actually doing anything. If something did go wrong, it was just as easy to take the spare drive off the shelf and slot it in then, instead of running the risk that the spare might fail while spinning idle in the server. His logic seemed reasonable enough, but there was one variable he wasn’t thinking about.

Time is always the deciding factor in redundancy planning. In the world of backup and disaster recovery they use the acronym RTO, which stands for Recovery Time Objective. Essentially, how long do you want your systems to be offline before the data is restored? Can you go days without getting your data back? Or do you need it back up and running within hours or even minutes? For some organizations the RTO could even be measured in mere seconds. Every RTO measurement adds additional complexity and cost.

If you can go days without your data then a less expensive tape solution is best because it is the cheapest per byte stored and lasts forever. If your RTO is minutes or less you need to add hardware that replicates changes or mirrors the data between sites to ensure there is always an available copy somewhere out there. Time is the deciding factor here, just as it is in the redundancy example above.
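In other words, picking a recovery technology is a lookup from the RTO you can tolerate to the cheapest tier that meets it. Here is a toy sketch of that mapping; the tier names and cutoffs are illustrative, not any vendor's actual matrix:

```python
def pick_recovery_tier(rto_seconds):
    """Map a Recovery Time Objective to the cheapest recovery tier that
    can plausibly meet it. Thresholds are illustrative only."""
    if rto_seconds >= 86_400:   # days: offline tape is cheapest per byte
        return "tape"
    if rto_seconds >= 3_600:    # hours: disk snapshots or cold spares
        return "disk-snapshot"
    if rto_seconds >= 60:       # minutes: asynchronous replication off-site
        return "async-replication"
    return "sync-mirror"        # seconds: synchronous mirroring between sites
```

The same exercise applies to network hardware: a cold spare on a shelf is just the "tape" tier of redundancy.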

Can your network tolerate hours of downtime while you swap in a part from off the shelf? Remember that you’re going to need to copy the configuration over to it and ensure it’s back up and running. If it is a consumer-grade device there probably isn’t an easy way to console in and paste the config. Maybe you can upload a file from the web GUI but the odds are pretty good that you’re looking at downtime at least in the half-hour range if not more. If your office can deal with that then Max’s suggestions should work just fine.

For organizations that need to be back up and running in less than hours, you need to have fault tolerance in your network. Redundant paths for traffic or multiple devices to eliminate single points of failure are the only way to ensure that traffic keeps flowing in the event of a hardware failure. Sure, it’s more complicated to troubleshoot. But the time you spend making it work correctly is not time you’re going to spend copying configurations to a cold device while users and stakeholders are yelling at you to get things back online.

All-In-Wonder

Let’s look at the first point here. Single box solutions are better because they are simple to manage and give you everything you could need. Why buy a separate switch, firewall, and access point when you can get them all in one package? This is the small office / branch office model. SD-WAN has even started moving down this path for smaller deployments by pushing all the devices you need into one footprint.

It’s not unlike the TVs you can buy in the big box stores that have DVD players, VHS players, and even some streaming services built in. They’re easy to use because there are no extra wires to plug in and no additional remote controls to lose. Everything works from one central location and it’s simple to manage. The package is a great solution when you need to watch old VHS tapes or DVDs from your collection infrequently.

Of course, most people understand the drawbacks of this model. Those devices can break. They are much harder to repair when they’re all combined. Worse yet, if the DVD player breaks and you need to get it repaired you lose the TV completely during the process instead of just the DVD player. You also can’t upgrade the components individually. Want to trade out that DVD player for a Blu-Ray player? You can’t unless you install one on its own. Want to keep those streaming apps up-to-date? Better hope the TV has enough memory to stay current. Even state-of-the-art streaming boxes will eventually be incapable of running the latest version of popular software.

All-in-one devices are best left to the edges of the network. They function well in offices with a dozen or so workers. If something goes bad on the device it’s easier to just swap the whole thing instead of trying to repair the individual parts. That same kind of mentality doesn’t work quite so well in a larger data center. The fact that most of these unified devices don’t take rack mounting ears or fit into a standard data center rack should be a big hint that they aren’t designed for use in a place that keeps the networking pieces off of someone’s desk.


Tom’s Take

I smiled a bit when I read the tweet that started this whole post. I’m sure that the networks that Max has worked on work much better with consumer all-in-one devices. Simple configurations and cold spares are a perfectly acceptable solution for law offices or tag agencies or other places that don’t measure their downtime in thousands of dollars per second. I’m not saying he’s wrong. I’m saying that his solution doesn’t work everywhere. You can’t run the core of an ISP with some SMB switches. Nor should you run your three-person law office on a Cat6500. You need to decide what factors are the most important for you. Don’t embrace failure without thought. Figure out how tolerant you or your customers are of failure and design around it as best you can. Once you can do that you’ll have a much better idea of how to build your network with the fewest points of failure.

Document The First Time, Every Time


Imagine you’re deep into a massive issue. You’ve been troubleshooting for hours trying to figure out why something isn’t working. You’ve pulled in resources to help and you’re on the line with TAC to try and get a resolution. You know this has to be related to something recent because you just got notified about it yesterday. You’re working through logs and configuration settings trying to gain insight into what went wrong. That’s when the TAC engineer hits you with an armor-piercing question:

When did this start happening?

Now you’re sunk. When did you first start seeing it? Was it happening before and no one noticed? Did a tree fall in the forest and no one was around to hear the sound? What is the meaning of life now?

It’s not too hard to imagine the above scenario because we’ve found ourselves in it more times than we can count. We’ve started working on a problem and traced it back to a root cause only to find out that the actual inciting incident goes back even further than that. Maybe the symptoms just took a while to show up. Perhaps someone unknowingly “fixed” the issue with a reboot or a process reload over and over again until it couldn’t work any longer. How do we find ourselves in this mess? And how do we keep it from happening?

Quirky Worky

Glitches happen. Weird bugs crop up temporarily. It happens every day. I had to reboot my laptop the other day after being up for about two months because of a series of weird errors that I couldn’t resolve. Little things that weren’t super important that eventually snowballed into every service shutting down and forcing me to restart. But what was the cause? I can’t say for sure and I can’t tell you when it started. Because I just ignored the little glitches until the major ones forced me to do something.

Unless you work in security you probably ignore little strange things that happen. Maybe an application takes twice as long to load one morning. Perhaps you click on a link and it pops up a weird page before going through to the actual website. You could even see a storage notification in the logs for a router that randomly rebooted itself for no reason in the middle of the night. Occasional weirdness gets dismissed by us because we’ve come to expect that things are just going to act strangely. I once saw a quote from a developer that said, “If you think building a steady-state machine is easy, just look at how many issues are solved with a reboot.”

We tend to ignore weirdness unless it presents itself as a more frequent issue. If a router reboots itself once we don’t think much about it. The fourth time it reboots itself in a day we know we’ve got a problem we need to fix. Could we have solved it when the first reboot happened? Maybe. But we also didn’t know we were looking at a pattern of behavior either. Human brains are wired to look for patterns of things and pick them out. It’s likely an old survival trait of years gone by that we apply to current technology.

Notice that I said “unless you work in security”. That’s because the security folks have learned over many years and countless incidents that nothing is truly random or strange. They look for odd remote access requests or strange configuration changes on devices. They wonder why random IP addresses from outside the company are trying to access protected systems. Security professionals treat every random thing as a potential problem. However, that kind of behavior also demonstrates the downside of analyzing every little piece of information for potential threats. You quickly become paranoid about everything and spend a lot of time and energy trying to make sense out of potential nonsense. Is it any wonder that many security pros find themselves jumping at every little shadow in case it’s hiding a beast?

Middle Ground of Mentions

On the one hand, we have systems people that dismiss weirdness until it's a pattern. On the other we have security pros trying to make patterns out of the noise. You're probably wondering if there's some kind of middle ground that lets us keep track of issues without driving ourselves insane.

There is, and it's a habit you need to develop: write it all down somewhere. Yes, I'm talking about the dreaded documentation monster. The kind of thing that no one outside of developers likes to do. The mean, nasty, boring process of taking the stuff in your brain and putting it down somewhere so someone can figure out what you're thinking without the need to read your mind.

You have to write it down because you need to have a record to work from if something goes wrong later. One of the greatest features I've ever worked with that seems to be ignored by just about everyone is the Shutdown Event Tracker dialog box in Windows Server 2003 and above. Rebooting a box? You need to write in why and give a good justification. That way if someone wants to know why the server was shut off at 11:15am on a Tuesday they can examine the logs. Unfortunately in my experience the usual reason for these shutdowns was either “a;lkjsdfl;kajsdf” or “because I am doing it”. Neither of which is a great justification for later.

You don’t have to be overly specific with your documentation but you need to give enough detail so that later you can figure out if this is part of a larger issue. Did an application stop responding and need to be restarted? Jot that down. Did you need to kill a process to get another thing running again? Write down that exact sentence. If you needed to restart a router and you ended up needing to restore a configuration you need to jot that part down too. Because you may not even realize you have an issue until you have documentation to point it out.

I can remember doing a support call years ago with a customer and in conversation he asked me if I knew much about Cisco routers. I chuckled and said I knew a bit. He said that he had one that he kept having to copy the configuration files to every time it restarted because it came up blank. He even kept a console cable plugged into it for just that reason. Any CCNA out there knows that’s probably a config register issue so I asked when it started happening. The person told me at least a year ago. I asked if anyone had to get into the router because of a forgotten password or some other lockout. He said that he did have someone come out a year and a half prior to reset a messed up password. Ever since then he had to keep putting the configuration back in. Sure enough, the previous tech hadn’t reset the config register. One quick fix and the customer was very happy to not have to worry about power outages any longer.

Documenting when things happen means you can build a timeline per device or per service to understand when things are acting up. You don't even need to figure it out yourself. The magic of modern systems lies in machine learning. You may think to yourself that machine learning is just fancy linear regression at this point and you would be right more often than not. But one thing linear regression is great at doing is surfacing patterns of behavior for specific data points. If your router reboots on the third Wednesday of every month precisely at 3:33am the ML algorithms will pick up on that and tell you about it. But that's only if your system catches the reboot through logs or some other record keeping. That's why you have to document all your weirdness. Because the ML systems can't analyze what they don't know about.
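You don't even need full-blown ML to get started. Once the timeline exists, a few lines of scripting can surface a suspiciously regular reboot. A minimal sketch, where the event list and the recurrence threshold are my own invention:

```python
from collections import Counter
from datetime import datetime

def find_reboot_patterns(timestamps, threshold=3):
    """Bucket reboot events by (weekday, hour, minute) and flag any
    bucket that recurs often enough to look like a schedule."""
    buckets = Counter(
        (ts.strftime("%A"), ts.hour, ts.minute) for ts in timestamps
    )
    return {bucket: n for bucket, n in buckets.items() if n >= threshold}

# Hypothetical reboot log: the same device restarting at 3:33am on
# several Wednesdays, plus one unrelated crash.
events = [
    datetime(2021, 10, 6, 3, 33),
    datetime(2021, 10, 13, 3, 33),
    datetime(2021, 10, 20, 3, 33),
    datetime(2021, 10, 15, 14, 2),
]
print(find_reboot_patterns(events))  # {('Wednesday', 3, 33): 3}
```

A real system would pull these timestamps from syslog or a monitoring platform, but the principle is the same: the pattern is only findable if the reboots were recorded in the first place.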


Tom’s Take

I love writing. And I hate documentation. Documentation is boring, stuffy, and super direct. It’s like writing book reports over and over again. I’d rather write a fun blog post or imagine an exciting short story. However, documentation of issues is critical to modern organizations because these issues can spiral out of hand before you know it. If you don’t write it down it didn’t happen. And you need to know when it happened if you hope to prevent it in the future.

Basics First and Basics Last

This week I found my tech life colliding with my normal life in an unintended and somewhat enlightening way. I went to a store to pick up something that was out of stock and while I was there making small talk the person behind the counter asked me what I did for a living. I mentioned technology and he said that he was going to college for a degree in MIS, which just happens to be the thing I have my degree in. We chatted about that for a few more minutes before he asked me something I get asked all the time.

“What is the one thing I need to make sure I pay attention to in my courses?”

It’s simple enough, right? You’ve done this before and you have the benefit of hindsight. What is the one thing that is most important to know and not screw up? The possible answers floating through my head were all about programming or analytical methods or even the dreaded infrastructure class I slept through and then made a career out of. But what I said was the most boring and most critical answer one could give.

“You need to know the basics backwards and forwards.”

Basics Training

Why do we teach the basics? Why do we even call them that? And why are people so keen on skipping over all of them so fast to get to the cool stuff? You have to understand the basics before you can move on, and yet so many want to get the “easy” stuff out of the way because memorizing the OSI model or learning how an array works in programming is mind-numbing.

The basics exist because we all need to know how things work at their most atomic level. We memorize the OSI model in networking because it tells us how things should behave. Sure, TCP/IP won out over it in practice. However, if you know how packets are supposed to work with that model it informs you how you need to approach troubleshooting and software design and even data center layouts.

I’ll admit that I really didn’t pay much attention when I took my Infrastructure class twenty years ago. I was hell-bent on being a consultant or a database admin and who needed to know how a CPU register worked? What was this stupid OSI model they wanted me to know? I’ll just memorize it for the test and be done with it. Needless to say that the intervening years have shown me the folly of not paying attention in that class. If I went back today I’d ace that OSI test with my eyes closed.

The basics seem useless because we can't do much with them right now. They're just like Lego bricks. We need uniform pieces with predictable characteristics to help us understand how things are supposed to work together. Without that knowledge of how things work you can't build on it. If you don't understand the difference between RAM and a hard disk you won't be able to build systems that rely on both. Better yet, when technology changes to incorporate solid state disks and persistent memory storage you need the basics to understand how they are different and where you need to apply that knowledge.

I once picked up a Cisco Press CCIE study guide for the written exam to brush up on my knowledge before retaking the written. The knowledge in the book seemed easy to me. It was all about spanning tree configurations and OSPF area types and what BGP keepalives were. I felt like it was a remedial text that didn't give me any new knowledge. That's when I realized that the knowledge in the book wasn't supposed to be new. It was supposed to be a reminder of what I already learned in my CCNA and CCNP courses. If anything in the text was truly new, was it something I should have already known?

It’s also part of the reason the CCIE is such a fun exam in the lab. You should already know the basics of how things like RIP and OSPF work. So let’s test those basics in new ways. Any of the training lab you can take from companies like INE or Micronics are filled with tricky little scenarios that make you take the basics and apply them outside the box. That’s because the instructors don’t need to spend time teaching you how RIP forms neighbor relationships or adjacencies. They want to see if you remember how that happens so you can apply it to a question designed to stretch your knowledge. You can only do that when you know the basics.

Graduation Day

Basics aren’t just for learning at the beginning. You should also brush up on them when you’re at the top of your game. Why? Because it will answer questions you might not know you had or explain strange things that rely on the architecture we long-ago forgot about because it seemed basic.

A fun example was years ago in the online game City of Heroes. The players can earn in-game currency to buy and sell things. Eventually the game economy got to the point where players were at the maximum amount of currency for a player. What was that number? It was just over two billion. Pretty odd place to stop, right? What made them think that was a good stopping point? Random chance? Desire to keep the amount of currency in circulation low? Or was there a different reason?

That’s when I asked a simple question: How would you store the currency value in the game’s code? The answer for every programmer out there is an integer. And what’s the maximum value for an integer? For a 32-bit value it’s around four billion. But what if you use a signed integer for some reason? The maximum value is just over two billion in each direction. So the developers used a 32-bit signed integer and that’s why the currency value was capped where it was.

Over and over again in my career I find myself turning back to the basics to answer questions about things I need to understand or solve. We want the problems to be complex and hard to unravel because solving them shows off our critical thinking skills. However, if you start with the basics you'll find that the solution, or the root cause, is often something very simple with far-reaching consequences. And if you forget how those basics work you're going to spend a lot of time chasing your tail looking for a complex solution to a simple problem.


Tom’s Take

I don’t think my conversation partner was hoping for the answer I gave him. I’m sure he wanted me to say that this high level course was super important because it taught all the secrets you needed to know in order to succeed in life. Everyone wants to hear that the most important things are exciting and advanced. Finding out that the real key to everything is the basics you learn at the beginning of your journey is disappointing. However, for those that master the basics and remember them at every step of their journey, the end of the road is just as advanced and exciting as it was when you stepped on it in the first place. And you get there with a better understanding of how everything works.

Building Snowflakes On Purpose

We all know that building snowflake networks is bad, right? If it's not a repeatable process it's going to end up being a problem down the road. If we can't refer back to documentation that shows why we did something we're going to end up causing issues and reducing reliability. But what happens when a snowflake process is required to fix a bigger problem? It's a fun story that highlights where process can break down sometimes.

Reloaded

I’ve mentioned before that I spent about six months doing telephone tech support for Gateway computers. This was back in 2003 so Windows XP was the hottest operating system out there. The nature of support means that you’re going to be spending more time working on older things. In my case this was Windows 95 and 98. Windows 98 was a pain but it was easy to work on.

One of the most common processes we had for Windows 98 was a system reload. It was the last line of defense to fix massive issues or remove viruses. It was something that was second nature to any of the technicians on the help desk:

  1. Boot from the Gateway tools CD and use GWSCAN to write zeros to the hard drive.
  2. Reboot from the CD and use FDISK to partition the hard disk.
  3. Format the drive.
  4. Insert the Windows 98 OS CD and copy the CAB installation files to a folder on the hard drive.
  5. Run setup from the hard drive and let it complete.
  6. When Windows comes back up, insert the driver CD and let it install all the drivers.

The whole process took two or three phone calls to complete. Any time you got to something that would take more than fifteen minutes to complete you logged the steps in the customer trouble ticket and had them call back when it was completed. The process was so standard that it had its own acronym in the documentation – FFR, which stood for “FDISK, Format, Reload”. If you told someone where you were in the process they could finish it no problem.

Me, Me, ME

The whole process was manual with lots of steps and could intimidate customers. At some point in the development process the Gateway folks came up with a solution for Windows ME that they thought worked better. Instead of the manual steps of copying files and drivers and such, Gateway loaded the OS CD with a copy of the image they wanted for their specific model type. The image was installed using ImageCast, an imaging program that just dropped the image down on the drive without the need to do all the other steps. In theory, it was simple and reduced call times for the help desk.

In practice, Windows ME was a disaster to work on. The ImageCast program worked about half the time. If you didn’t pick the right options in the reload process it would partition the hard drive and add a second clean copy of WinME without removing the first one. It would change the MBR to have two installations to choose from with the same identifiers so users would get confused as to which was which. And the image itself seemed to be missing programs and drivers. The fact that there was a Driver CD that shipped with the system made us all wonder what the real idea behind this “improved” process was.

Because Windows ME was such a nightmare to reload, our call center got creative. We had a process that worked for Windows 98. We had all the files we needed on the disks. Why not do it the “right” way? So we did. Informally, the reload process for Windows ME was the same as Windows 98. We would FFR Windows ME boxes and make sure they looked right when they came back. No crazy ImageCasting programs or broken software loads.

The only issue? It was an informal snowflake process. It worked much better but if someone called the main help desk number and got another call center they would be in the middle of an unsupported process. The other call center tech would simply start the regular process and screw up the work we’d done already. To counter that, we would tell the customer to call our special callback voicemail box and not do anything until we called them back. That meant the reload process took many more hours for an already unhappy customer. The end result was better but could lead to frustrations.

Let It Snow

Was our informal snowflake process good or bad? It’s tough to say. It led to happier customers. It meant the likelihood of future support calls was lower because the system was properly reloaded instead of relying on a broken image. However, it was stressful for the tech that worked on the ticket because they had to own the process the whole way through. It also meant that you had to ensure the customer wouldn’t call back and disrupt the process with someone else on the phone.

The process was broken and it needed to be fixed. However, the way to fix the broken process wasn’t easy to figure out. The national line had their process and they were going to stick to it. We came up with an alternative but it wasn’t going to be adopted. Yet, we still kept using our process as often as possible because we felt we were right.

In your enterprise, you need to understand process. Small companies need processes that are repeatable for the ease of their employees. Large companies need processes to keep things consistent. Many companies that face regulatory oversight need processes followed exactly to ensure compliance. The ability to go rogue and just do it the way you want isn’t always desired.

If you have users that are going around the process you need to find out why. We see it all the time. Security rules that get ignored. Documentation requirements that are given the bare minimum effort. Remember the shutdown dialog box in Windows Server 2003? Most people did a token entry before rebooting. And having eighteen entries of “;lkj;kljl;k” doesn’t help figure out what’s going on.

Get your users to tell you why they need a snowflake process. Especially if there's more than one person relying on it. The odds are very good that they've found hiccups that you need to address. Maybe it's data entry for old systems. Perhaps it's a problem with the requirements or the order of the steps. Whatever it is you have to understand it and fix it. The snowflake may be a better way to do things. You have to investigate and figure it out otherwise your processes will be forever broken.


Tom’s Take

Inbound tech support is full of stories like these. We find a broken process and we go around it. It's not always the best solution but it's the one that works for us. However, building a toolbox full of snowflake processes and solutions that no one else knows about is a path to headaches. You need to model your process around what people need to accomplish and how they do it instead of just assuming they're going to shoehorn their workflow into your checklist. If your process doesn't work for your people, you'd better get a handle on the situation before you're buried in a blizzard of snowflakes.

Friction Finders

Do you have a door that sticks in your house? If it's made out of wood the odds are good that you do. The kind that doesn't shut properly or sticks out just a touch too far and doesn't glide open like it used to. I've dealt with these kinds of things for years and YouTube is full of useful tricks to fix them. But all those videos start with the same tip: you have to find the place where the door is rubbing before you can fix it.

Enterprise IT is no different. We have to find the source of friction before we can hope to repair it. Whether it's friction between people and hardware, users and software, or teams going at each other we have to know what's causing the commotion before we can repair it. Just like with the sticking door, adding more force without understanding the friction points isn't a long-term solution.

Sticky Wickets

Friction comes from a variety of sources. People don't understand how to use a device or a program. Perhaps it's a struggle to understand who is supposed to be in charge of a change control or a provisioning process. It could even be as simple as someone carrying outside stress into their regular day and causing problems because their interactions with previously stable systems are strained.

Whatever the reasons for the friction, we need to understand what’s going on before we can start fixing. If we don’t know what we’re trying to solve we’re just going to throw solutions at the wall until something sticks. Then we hope that was the fix and we move on. Shotgun troubleshooting is never a permanent solution to anything. Instead, we need to take repeatable, documented steps to resolve the points of friction.

Ironically enough, the easiest issue to solve is the interpersonal kind. When teams argue about roles or permissions or even who should have the desks at the front of the data center it's almost always a problem of a person against another person. You can't patch people. You can't upgrade people. You can't even remove people and replace them with a new model (unless it's a much bigger issue and HR needs to solve it). Instead, it's a chance to get people talking. Make sure everyone knows the goal is resolving the problem, not name calling or posturing. Lay out what the friction point is and make people talk about that. If you can keep them focused on the problem at hand and not at each others' throats you should be able to get everyone to recognize what needs to change.

Mankind Versus Machines

People fighting people is a people problem. But people against the enterprise system isn't always a cut-and-dried situation. That's because machines are predictable in ways people aren't. You can be sure you're going to get the same answer every time, but you may not be able to ask the right questions to get the answer you need.

Think back to how many times you’ve diagnosed a problem only to hit a wall. Maybe it’s that the CPU utilization on a device is higher than it should be. What next? Is it some software application? A bug in the system? Maybe a piece of malware running rampant? If it’s a networking device is it because of a packet flow causing issues? Or a failed table lookup causing a process switch to happen? There are a multitude of things that need to be investigated before you can decide on a course of action.

This is why shotgun troubleshooting is so detrimental to reducing friction. How do we know that the solution we tried isn’t making the problem worse? I wrote over ten years ago about removing fixes that don’t address the problem and it’s still very true today. If you don’t back out the things that didn’t address the issue you’re not only leaving bad fixes in place but you are potentially causing problems down the line when those changes impact other things.

Finding the sources of friction in your systems takes in-depth troubleshooting. Your users just want the problem to go away. They don’t want to answer forty questions about when it started or what’s been installed. They don’t want the perfect solution that ensures the problem never comes back. They just want to be able to work right now and then keep moving forward. That means you need a two-step process of triage and investigation. Get the problem under control and then investigate the friction points. Figure out how it happened and fix it after you get the users back up and running.

Lastly, document the issue and what resolved it. Write it all down somewhere, even if it’s just in your own notes. But if you do that, make sure you have a way of indexing everything so you can refer back to it at some point in the future. Part of the reason why I started this blog was to write down solutions to problems I discovered or changed along the way to ensure that I could always look them up. If you can’t publish them on the Internet or don’t feel comfortable writing it all up, at least use a personal database system like Notion to keep it all in one searchable place. That way you don’t forget how clever you are and go back to reinventing the wheel over and over again every time the problem comes up.


Tom’s Take

Friction exists everywhere. Sometimes it's necessary to keep us from sliding all over the road or keeping a rug in place in the living room. In enterprise IT friction can create issues with systems and teams. Reducing it as much as possible keeps the teams working together and being as productive as possible. You can't eliminate it completely and you shouldn't remove it just for the sake of getting rid of something. Instead, analyze what you need to improve or fix and document how you did it. Like any door repair, the results should be immediate and satisfying.

Opening Up Remote Access with Opengear

Opengear OM2200

The Opengear OM2200

If you had told me last year at this time that remote management of devices would be a huge thing in 2020 I might have agreed but laughed quietly. We were traveling down the path of simultaneously removing hardware from our organizations and deploying IoT devices that could be managed easily from the cloud. We didn’t need to access stuff like we did in the past. Even if we did, it was easy to just SSH or console into the system from a jump box inside the corporate firewall. After all, who wants to work on something when you’re not in the office?

Um, yeah. Surprise, surprise.

Turns out 2020 is the Year of Having Our Hair Lit On Fire. Which is a catchy song someone should record. But it’s also the year where we have learned how to stand up 100% Work From Home VPN setups within a week, deploy architecture to the cloud and refactor on the fly to help employees stay productive, and institute massive change freezes in the corporate data center because no one can drive in to do a reboot if someone forgets to do commit confirmed or reload in 5.

Remote management has always been something that was nice to have. Now it’s something that you can’t live without. If you didn’t have a strategy for doing it before or you’re still working with technology that requires octal cables to work, it’s time you jumped into the 21st Century.

High Gear

Opengear is a company that has presented a lot at Tech Field Day. I remember seeing them for the first time when I was a delegate many, many years ago. As I have grown with Tech Field Day, so too have they. I’ve seen them embrace new technologies like cloud management and 4G/LTE connectivity. I’ve heard the crazy stories about fish farms and Australian emergency call boxes and even some stuff that’s too crazy to repeat. But the theme remains the same throughout it all. Opengear is trying to help admins keep their boxes running even if they can’t be there to touch them.

Flash forward to the Hair On Fire year, and Opengear is still coming right along. During the recent Tech Field Day Virtual Cisco Live Experience in June, they showed off their latest offerings for sweet, sweet hardware. Rob Waldie did a great job talking about their new line of NetOps Console servers here in this video:

Now, I know what you're thinking. NetOps? Really? Trying to cash in on the marketing hype? I would have gone down that road if they hadn't shown off some of the cool things these new devices can do.

How about support for Docker containerized apps? Pretty sure that qualifies as NetOps, doesn't it? Now, your remote console appliance is capable of doing things like running automation scripts and triggering complex logic when something happens. And, because containers are the way the cloud runs now, you can deploy any number of applications to the console server with ease. It's about as close to an App Store model as you're going to find, with some nerd knobs for good measure.

That’s not all though. The new line of console appliances also comes with an embedded trusted platform module (TPM) chip. You’ve probably seen these on laptops or other mobile devices. They do a great job of securing the integrity of the device. It’s super important to have if you’re going to deploy console servers into insecure locations. That way, no one can grab your device and do things they shouldn’t like tapping traffic or trying to do other nefarious things to compromise security.

Last but not least, there's an option for 64GB of flash storage on the device. I like this because it means I can do creative things like back up configurations to the storage device on a regular basis just in case of an outage. If and when something happens I can just remote to the Opengear server, console to the device, and put the config back where it needs to be. Pretty handy if you have a device with a dying flash card or something that is subject to power issues on a regular basis. And with an LTE-A global cellular modem, you don't have to worry about shipping the box to a country where it won't work.
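That backup idea is simple enough to script yourself. Here's a sketch of the storage-and-rotation side, assuming you already have some way to pull the running config over the console or SSH (the directory layout, device name, and retention count are all my own invention, not anything Opengear ships):

```python
from datetime import datetime, timezone
from pathlib import Path

def save_config(backup_dir, device, config_text, keep=30):
    """Write a timestamped copy of a device config and prune old copies,
    keeping only the newest `keep` files for that device."""
    root = Path(backup_dir)
    root.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    path = root / f"{device}-{stamp}.cfg"
    path.write_text(config_text)
    # File names sort chronologically because of the timestamp suffix,
    # so everything before the last `keep` entries is safe to delete.
    for old in sorted(root.glob(f"{device}-*.cfg"))[:-keep]:
        old.unlink()
    return path

# Hypothetical usage: config_text would come from however you already
# collect the running config from the attached device.
backup = save_config("cfg-backups", "rtr-edge-01", "hostname rtr-edge-01")
print(backup)
```

Run something like this on a schedule and the flash card always holds a recent known-good config, which is exactly what you want when you're remoting in after an outage.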


Tom’s Take

I realize that we’re not going to be quarantined forever. But this is a chance for us to see how much we can get done without being in the office. Remember all those budgets for fancy office chairs and the coffee service? They could go to buying Opengear console servers so we can manage devices without truck rolls. Money well spent on reducing the need for human intervention also means a healthier workforce. I trust my family to stay safe with our interactions. But if I have to show up at a customer site to reboot a box? Taking chances even under the best of circumstances. And the fewer chances we take in the short term, the healthier the long-term outlook becomes.

We may never get back to the world we had before. And we may never even find ourselves in a 100% Remote Work environment. But Opengear gives us options that we need in order to find a compromise somewhere in the middle.

If you’d like more information about Opengear’s remote access solutions, make sure you check out their website at http://Opengear.com

Disclaimer: As a staff member of Tech Field Day, I was present during Opengear’s virtual presentation. This post represents my own thoughts and opinions of their presentation. Opengear did not provide any compensation for this post, nor did they request any special consideration when writing it. The conclusions contained herein are mine alone and do not represent the views of my employer.

Failure Is Fine, Learning Is Mandatory

“Failure is a harsh teacher because it gives the test first and the lesson afterward.” — Vernon Law

I’m seeing a thread going around on Twitter today that is encouraging people to share their stories of failure in their career. Maybe it was a time they created a security hole in a huge application. Perhaps it was creating a routing loop in a global corporation. Or maybe it was something as simple as getting confused about two mailboxes and deleting the wrong one and realizing your mail platform doesn’t have undelete functionality.

We fail all the time. We try our hardest and whatever happens isn’t what we want. Some of those that fail just give up and assume that juggling isn’t for them or that they can never do a handstand. Others keep persevering through the pain and challenge and eventually succeed because they learn what they need to know in order to complete their tasks. Failure is common.

What is different is how we process the learning. Some people repeat the same mistakes over and over again because they never learn from them. In a professional setting, toggling the wrong switch when you create someone’s new account has a very low learning potential because it doesn’t affect you down the road. If you accidentally check a box that requires them to change their password every week you’re not going to care because it’s not on your account. However, if the person you do that to has some kind of power to make you switch it back or if the option puts your job in jeopardy you’re going to learn very quickly to change your behavior.

Object Failure

Here’s a quick one that illustrates how the motivation to learn from failure sometimes needs to be more than just “oops, I screwed up”. I’ll make it a bullet point list to save time:

  • Installed new phone system for school district
  • Used MGCP as the control protocol
  • Need to solve a PRI caller ID issue at the middle school
  • Gateway is at the high school
  • Need to see all the calls in the system
  • Type debug mgcp packet detail in a telnet session
  • A. Telnet. Session.
  • Router locks up tight and crashes
  • Hear receptionist from the other room say, “Did you just hang up on me?”
  • Panic
  • Panic some more
  • Jump in my car and break a couple of laws getting across town to restart router that I’m locked out of
  • Panic a lot in the five minutes it takes to reboot and reassociate with CallManager
  • Swear I will never do that again

Yes, I did the noob CCIE thing of running packet debugs on a production call-processing device because I underestimated both the volume of phone calls and my own stupidity. I got better!

But I promise that if doing this had only dropped one phone call or caused an issue for one small remote site, I wouldn’t have learned a lesson. I might even still be doing that today to look at issues. The key here is that I shut down call processing for the entire school district for 20 minutes at the end of the school day. You have no idea how many elementary school parents call the front office at the end of the day. I know now.
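A safer pattern for that kind of troubleshooting — sketched here with standard Cisco IOS logging commands, not the exact steps from the original incident — is to keep debug output out of the interactive session entirely and read it from the memory buffer afterward:

```
! Keep debug output off the console and out of the telnet session
no logging console
! Send it to the local memory buffer instead (size in bytes)
logging buffered 64000 debugging
! Now the risky command floods the buffer, not your session
debug mgcp packets
! Review the results at your leisure, then turn everything off
show logging
undebug all
```

Even buffered, a full packet debug can still punish a busy gateway’s CPU, so the real lesson is to scope the debug as narrowly as your platform allows — or reach for show commands first.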

Lessons have more impact with stress. It’s something we see in a lot of situations where we train people how to behave in high-pressure situations. I once witnessed a car accident right in front of me on a busy highway and it took my brain almost ten seconds to process that I needed to call emergency services (911 in the US), even though I had spent the previous four years programming that dial peer into phone systems and dialing it for Calling Line ID verification. I’d practiced calling 911 for years, and when I had to do it for real I almost forgot what to do. We have to know how people are going to react under stress. Or at least anticipate how people are going to behave. Which is why I always configured 9.911 as a dial peer.
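On a Cisco voice gateway, that 9.911 habit looks something like the sketch below — the voice port number is hypothetical and the exact digit manipulation varies by platform and carrier, so treat this as an illustration rather than a drop-in config:

```
! Callers who dial 911 directly
dial-peer voice 911 pots
 destination-pattern 911
 port 0/0/0:23
 forward-digits all
!
! Callers who grab an outside line first and dial 9911
dial-peer voice 9911 pots
 destination-pattern 9911
 port 0/0/0:23
 forward-digits 3
```

The second peer forwards only the last three digits, so the panicked caller who dials the outside-line access code out of habit still reaches 911.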

Lessons Learned

The other important thing about failure is that you have to take stock of what you learn in the post-mortem, even if it’s just an exercise you do for yourself. As soon as you realize you made a mistake, you need to figure out how to learn from it and prevent that problem from recurring. And don’t just say to yourself, “I’m never doing that again!” You need to think about what caused the issue and how you can ingrain the learning process into your brain.

Maybe it’s something simple like creating a command alias that keeps you from making the same typo again and deleting a file system. Maybe it’s forcing yourself to read popup dialog boxes as you click through the system to make sure you’re deleting the right file or formatting the right disk. Someone I used to work with would write down the name of the thing he was deleting and hold it up to the screen before he committed the command to be sure they matched. I never asked what brought that about, but I’m sure it was a ton of stress.
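That alias idea can be a one-liner. Here’s a small shell sketch of the concept — the `trash` function and its holding directory are my own illustration, not a specific tool anyone mentioned:

```shell
# Move files to a holding area instead of deleting them outright.
# The function name "trash" and the ~/.trash directory are hypothetical.
trash() {
    local hold="${HOME}/.trash"
    mkdir -p "$hold"
    mv -i -- "$@" "$hold"/
}

# Force the real rm to prompt, so a stray glob asks before it destroys
alias rm='rm -i'
```

Undeleting then becomes a simple move back out of the holding directory, which is a much gentler lesson than a restore from backup.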


Tom’s Take

I screw up. More often than even I realize. I try to learn as much as I can when I’m sifting through the ashes. Maybe it’s figuring out how I went wrong. Perhaps it’s learning why the thing I wanted to do didn’t work the way I wanted it to. It could even be as simple as writing down the steps I took to know where I went wrong and sharing that info with a bunch of strangers on the Internet to keep me from making the same mistake again. As long as you learn something you haven’t failed completely. And if you manage to avoid making the exact same mistake again then you haven’t failed at all.