Imagine you’re deep into a massive issue. You’ve been troubleshooting for hours trying to figure out why something isn’t working. You’ve pulled in resources to help and you’re on the line with the TAC to try and get a resolution. You know this has to be related to something recent because you just got notified about it yesterday. You’re working through logs and configuration settings trying to gain insights into what went wrong. That’s when the TAC engineer hits you with an armor-piercing question:
When did this start happening?
Now you’re sunk. When did you first start seeing it? Was it happening before and no one noticed? Did a tree fall in the forest and no one was around to hear the sound? What is the meaning of life now?
It’s not too hard to imagine the above scenario because we’ve found ourselves in it more times than we can count. We’ve started working on a problem and traced it back to a root cause only to find out that the actual inciting incident goes back even further than that. Maybe the symptoms just took a while to show up. Perhaps someone unknowingly “fixed” the issue with a reboot or a process reload over and over again until it couldn’t work any longer. How do we find ourselves in this mess? And how do we keep it from happening?
Quirky Worky
Glitches happen. Weird bugs crop up temporarily. It happens every day. I had to reboot my laptop the other day after being up for about two months because of a series of weird errors that I couldn’t resolve. Little things that weren’t super important that eventually snowballed into every service shutting down and forcing me to restart. But what was the cause? I can’t say for sure and I can’t tell you when it started. Because I just ignored the little glitches until the major ones forced me to do something.
Unless you work in security you probably ignore little strange things that happen. Maybe an application takes twice as long to load one morning. Perhaps you click on a link and it pops up a weird page before going through to the actual website. You could even see a storage notification in the logs for a router that randomly rebooted itself for no reason in the middle of the night. Occasional weirdness gets dismissed by us because we’ve come to expect that things are just going to act strangely. I once saw a quote from a developer that said, “If you think building a steady-state machine is easy, just look at how many issues are solved with a reboot.”
We tend to ignore weirdness unless it presents itself as a more frequent issue. If a router reboots itself once we don’t think much about it. The fourth time it reboots itself in a day we know we’ve got a problem we need to fix. Could we have solved it when the first reboot happened? Maybe. But we also didn’t know we were looking at a pattern of behavior. Human brains are wired to look for patterns and pick them out. It’s likely an old survival trait of years gone by that we apply to current technology.
Notice that I said “unless you work in security”. That’s because the security folks have learned over many years and countless incidents that nothing is truly random or strange. They look for odd remote access requests or strange configuration changes on devices. They wonder why random IP addresses from outside the company are trying to access protected systems. Security professionals treat every random thing as a potential problem. However, that kind of behavior also demonstrates the downside of analyzing every little piece of information for potential threats. You quickly become paranoid about everything and spend a lot of time and energy trying to make sense out of potential nonsense. Is it any wonder that many security pros find themselves jumping at every little shadow in case it’s hiding a beast?
Middle Ground of Mentions
On the one hand, we have systems people that dismiss weirdness until it’s a pattern. On the other we have security pros that are trying to make patterns out of the noise. I’m sure you’re probably wondering if there has to be some kind of middle ground to ensure we’re keeping track of issues without driving ourselves insane.
In fact, there is a good policy that you need to get into the habit of following. You need to write it all down somewhere. Yes, I’m talking about the dreaded documentation monster. The kind of thing that no one outside of developers likes to do. The mean, nasty, boring process of taking the stuff in your brain and putting it down somewhere so someone can figure out what you’re thinking without the need to read your mind.
You have to write it down because you need to have a record to work from if something goes wrong later. One of the greatest features I’ve ever worked with that seems to be ignored by just about everyone is the Windows Shutdown Reason dialog box in Windows Server 2003 and above. Rebooting a box? You need to write in why and give a good justification. That way if someone wants to know why the server was shut off at 11:15am on a Tuesday they can examine the logs. Unfortunately in my experience the usual reason for these shutdowns was either “a;lkjsdfl;kajsdf” or “because I am doing it”. Which aren’t great justifications for later.
You don’t have to be overly specific with your documentation but you need to give enough detail so that later you can figure out if this is part of a larger issue. Did an application stop responding and need to be restarted? Jot that down. Did you need to kill a process to get another thing running again? Write down that exact sentence. If you needed to restart a router and you ended up needing to restore a configuration you need to jot that part down too. Because you may not even realize you have an issue until you have documentation to point it out.
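If you want something slightly more structured than a sticky note, here is a minimal sketch of what "jot that down" could look like. It assumes a plain text file with one JSON entry per line. The field names, the device name, and the helper functions are all made up for illustration, not taken from any particular tool.

```python
import json
from datetime import datetime, timezone

def make_entry(device, action, reason, when=None):
    """Build one log entry as a plain dict; `when` defaults to the current UTC time."""
    when = when or datetime.now(timezone.utc)
    return {
        "when": when.isoformat(timespec="seconds"),
        "device": device,   # hypothetical device name
        "action": action,   # what you did: reload, process kill, config restore
        "reason": reason,   # the sentence you would otherwise forget
    }

def append_entry(path, entry):
    """Append the entry as one JSON line so the file stays easy to grep and parse."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")

entry = make_entry(
    "core-rtr-01",
    "reload",
    "Came up with a blank config after a power blip; restored from backup.",
    when=datetime(2024, 3, 20, 3, 33, tzinfo=timezone.utc),
)
print(json.dumps(entry))
```

The format matters far less than the habit. A timestamp, a device, what you did, and why is enough for someone to spot a pattern later.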
I can remember doing a support call years ago with a customer and in conversation he asked me if I knew much about Cisco routers. I chuckled and said I knew a bit. He said that he had one that he kept having to copy the configuration files to every time it restarted because it came up blank. He even kept a console cable plugged into it for just that reason. Any CCNA out there knows that’s probably a config register issue so I asked when it started happening. The person told me at least a year ago. I asked if anyone had to get into the router because of a forgotten password or some other lockout. He said that he did have someone come out a year and a half prior to reset a messed up password. Ever since then he had to keep putting the configuration back in. Sure enough, the previous tech hadn’t reset the config register. One quick fix and the customer was very happy to not have to worry about power outages any longer.
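For anyone who hasn’t run into this one, here is a rough sketch of why that register matters. On Cisco IOS routers, bit 6 of the configuration register (0x0040) tells the router to ignore the saved startup configuration at boot. Password recovery commonly sets the register to 0x2142, and the usual default is 0x2102. The function name below is my own invention for illustration.

```python
IGNORE_NVRAM_BIT = 0x0040  # bit 6 of the configuration register

def ignores_startup_config(config_register):
    """Return True if the register value tells the router to skip its saved config."""
    return bool(config_register & IGNORE_NVRAM_BIT)

# Value commonly left behind after a password recovery
print(hex(0x2142), ignores_startup_config(0x2142))
# Common default value
print(hex(0x2102), ignores_startup_config(0x2102))
```

A single note saying the register was changed for a password reset would have saved that customer eighteen months of pasting configs.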
Documenting when things happen means you can build a timeline per device or per service to understand when things are acting up. You don’t even need to figure it out yourself. The magic of modern systems lies in machine learning. You may think to yourself that machine learning is just fancy linear regression at this point and you would be right more often than not. But one thing linear regression is great at doing is surfacing patterns of behavior for specific data points. If your router reboots on the third Wednesday of every month precisely at 3:33am the ML algorithms will pick up on that and tell you about it. But that’s only if your system catches the reboot through logs or some other record keeping. That’s why you have to document all your weirdness. Because the ML systems can’t analyze what they don’t know about.
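You don’t need a full ML pipeline to see the idea. The sketch below is plain counting, not machine learning. It groups recorded reboots by device and flags any slot (weekday, which occurrence in the month, and time of day) that repeats at least three times. The device names, the sample events, and the threshold are assumptions I made up for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime

DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

def recurring_slots(events, min_repeats=3):
    """Group events per device and return slots seen at least `min_repeats` times.

    A slot is (weekday name, nth occurrence of that weekday in the month, "HH:MM").
    """
    timeline = defaultdict(Counter)
    for device, when in events:
        nth = (when.day - 1) // 7 + 1  # 1 = first such weekday of the month
        slot = (DAYS[when.weekday()], nth, when.strftime("%H:%M"))
        timeline[device][slot] += 1

    flagged = {}
    for device, counter in timeline.items():
        repeats = [slot for slot, count in counter.items() if count >= min_repeats]
        if repeats:
            flagged[device] = repeats
    return flagged

# Hypothetical reboot records pulled from logs or notes
events = [
    ("edge-rtr-02", datetime(2024, 1, 17, 3, 33)),
    ("edge-rtr-02", datetime(2024, 2, 21, 3, 33)),
    ("edge-rtr-02", datetime(2024, 3, 20, 3, 33)),
    ("edge-rtr-02", datetime(2024, 3, 5, 14, 2)),
    ("core-sw-01", datetime(2024, 2, 9, 22, 15)),
]
print(recurring_slots(events))
# {'edge-rtr-02': [('Wednesday', 3, '03:33')]}
```

Notice that the one-off reboots drop out and the third-Wednesday pattern stands out. Delete any one of those three records and the pattern disappears with it.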
Tom’s Take
I love writing. And I hate documentation. Documentation is boring, stuffy, and super direct. It’s like writing book reports over and over again. I’d rather write a fun blog post or imagine an exciting short story. However, documentation of issues is critical to modern organizations because these issues can spiral out of control before you know it. If you don’t write it down it didn’t happen. And you need to know when it happened if you hope to prevent it in the future.