There was a tweet the other day that posited that we don’t “need” to replicate problems to solve them. Ultimately the reason for the tweet was that a helpdesk refused to troubleshoot the problem until they could replicate the issue and the tweeter thought that wasn’t right. It made me start thinking about why troubleshooters are so bent on trying to make something happen again before we actually start trying to fix an issue.
The Definition of Insanity
Everyone by now has heard that the definition of insanity is doing the same thing over and over again and expecting a different result. While funny and a bit oversimplified the reality of troubleshooting is that you are trying to make it do something different with the same inputs. Because if you can make it do the same thing over and over again you’re closer to the root cause of the issue.
Root cause is the key to problem solving. If you don’t fix what’s actually wrong you are only dealing with symptoms and not issues. However, you can’t know what’s actually wrong until you can make it happen more than once. That’s because you have to narrow the actual issue down from all of the symptoms. If you do something and get the same output every time then you’re on the right track. However, if you change the inputs and get the same output you haven’t isolated the issue yet. It’s far too common in troubleshooting to change a bunch of things all at once and not realize what actually fixed the problem.
Without proper isolation you’re just throwing darts. As the above comic illustrates sometimes the victory is causing a different error message. That means you’re manipulating the right variables or inputs. It also means you know what knobs need to be turned in order to get closer to the root cause and the solution. So repeatability in this case is key because it means you’re not trying to fix things that aren’t really broken. You’re also narrowing the scope of the fixes by eliminating things that don’t need to be monitored.
How long does it take to write code? A couple of hours? A day? Does it take longer if you’re trying to do tests and make sure your changes don’t break anything else? What about shipping that code as part of the DevOps workflow? Or an out-of-band patch? Do you need to wait until the next maintenance window to implement the changes? These are all extremely valuable questions to ask to figure out the impact of changes on your environment.
Now multiply all of those factors by the number of things you tried that didn’t work. The number of times you thought you had the issue solved only to get the same error output. You can see how wasting hours of a programming team working on things can add up to the company’s bottom line quickly. Nothing is truly free or a sunk cost. You’re still paying people to work on something, whether it’s fixing buggy code or writing a new feature for your application. If they’re spending more of their time tracking down bugs that happen infrequently with no clear root cause are they being used to their highest potential?
What happens when that team can’t fix the issue because it’s too hard to cause it again? Do they bring in another team? Pay for support calls? Do you throw up your hands and just live with it? I’ve been involved in all of these situations in the past. I’ve tried to replicate issues to no avail only to be told that they’ll just live with the bug because they can’t afford to have me work on it any longer. I’ve also spent hours trying to replicate issues only to find out later that the message has only ever appeared once and no one knows what was going on when it happened. I might have been working for a VAR at the time but my time was something the company tracked closely. Knowing up front the issue had only ever happened once might have changed the way I worked on the issue.
I can totally understand why people would be upset that a help desk closed a ticket because they couldn’t replicate an issue. However, I also think that prudence should be involved in any structured troubleshooting practice. Having been on the other side of the Severity One all-hands-on-deck emergencies only to find a simple issue or a non-repeatable problem in the past I can say that having that many resources committed to a non-issue is also maddening as a tech and a manager of technical people.
Do you need to replicate a problem to troubleshoot it? No. But you also don’t need to follow a recipe to bake a cake. However, it does help immensely if you do because you’re going to get a consistent output that helps you figure out issues if they arise. The purpose of replicating issues when troubleshooting isn’t nefarious or an attempt to close tickets faster. Instead it’s designed to help speed problem isolation and ensure that resources are tasked with the right fixes to the issues instead of just spraying things and hoping one of them gets the thing taken care of.