Monitoring Other People’s Problems

“It’s always the network” is a refrain that makes operations teams shudder. No matter what your flavor of networking might be, it’s always your fault, even if the actual problem is DNS, a global BGP outage, or an issue with the SaaS provider. Why do we always get blamed? And how can you prevent this from happening to you?

User Utopia

Users don’t know about the world outside of their devices. As soon as they click on something in a browser window they expect it to work. It’s a lot like ordering a package and having it delivered. It’s expected that the package arrives. You don’t concern yourself with the details of how it needs to be shipped, what routes it will take, and how factors that exist half a world away could cause disruptions to your schedule at home.

The network is the same to users. If something doesn’t work with a website or a remote application, it must be the “network” that is at fault, because your users believe that everything outside of their computer is the network. Networking is the way that stuff happens everywhere else. As professionals we know the differences between the LAN, the WAN, the cloud, and all points in between. However, your users do not.

Have you heard about MTTI? It’s a tongue-in-cheek metric we like to quote these days that stands for “mean time to innocence”. Essentially we’re racing to prove that it’s not our problem. We check the connections, look at the services in our network, and then confidently explain to the user that someone else is at fault and we can’t fix it. Sounds great in theory, right?

Would you proudly proclaim that it’s someone else’s fault to the CEO? The Board of Directors? How about to the President of the US? Depending on how important the person you’re speaking to might be, you might change the way you present the information. Regular users might get “sorry, not my fault” but the CEO could get “here’s the issue and we need to open a ticket to fix it”. Why the varying service levels? Is it because regular users aren’t as important as the head of the company? Or is it because we don’t want to put in the extra effort for a knowledge worker that we would gladly put in for the person that is ultimately going to approve a raise for us if we do extra work? Do you see the fallacy here?

Keeping Your Eyes Peeled

The answer, of course, is that every user at least deserves an explanation of where the problem is. You might be focused on proving that it’s not in your network so you don’t have to do any more work, but you’ve already done enough work to place the blame outside of your area of control. Why not go one step further? You could check a cloud provider’s status dashboard to see if their services are degraded. Or do a quick scan of news sites or social media to see if it’s a more widespread issue. Even checking a service that tells you whether the site is down for more than just you is more information than “not my problem”.

The habit of monitoring other services allows you to do two things very quickly. First, it lowers the all-important MTTI metric. If you can quickly scan the likely culprits for outages you can say with certainty that it’s out of your control. However, the biggest benefit of monitoring the outside services you rely on is that you’ll have forewarning of the issues your own users are about to report. If your users come to you and say that they can’t get to Microsoft 365 and you already know it’s because the provider messed up a WAN router configuration, you’ll head off needless blame. You’ll also know where to look the next time you see similar symptoms.

If you’re already monitoring the infrastructure in your organization, it only makes sense to monitor the infrastructure your users depend on outside of it. Maybe you can’t get AWS to hand over the SNMP community strings to watch all of their internal switches, but you can easily pull information from their publicly available status sources to help determine where the issues might lie. That helps you in the same way that monitoring switch ports in your organization helps you today: you can respond to issues before they get reported and have a clear picture of what the impact will be.
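To make that concrete, here’s a minimal sketch of what polling public status sources might look like. It assumes the providers publish a Statuspage-style JSON endpoint; the URLs are examples, so verify the actual status page for each provider you care about before relying on this.

```python
# Minimal sketch: poll public status pages before (or instead of) saying "not my problem".
# The URLs below are examples of Statuspage-style endpoints; check each provider's
# documentation for the real status source before depending on this.
import json
import urllib.request

STATUS_PAGES = {
    # provider name -> assumed Statuspage-style /api/v2/status.json endpoint
    "GitHub": "https://www.githubstatus.com/api/v2/status.json",
    "Cloudflare": "https://www.cloudflarestatus.com/api/v2/status.json",
}

def check_provider(name: str, url: str) -> str:
    """Return a one-line summary of a provider's self-reported status."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            data = json.load(resp)
        status = data.get("status", {})
        return f"{name}: {status.get('indicator', 'unknown')} - {status.get('description', 'no description')}"
    except Exception as exc:  # failing to reach the status page is useful information too
        return f"{name}: status page unreachable ({exc})"

if __name__ == "__main__":
    for provider, endpoint in STATUS_PAGES.items():
        print(check_provider(provider, endpoint))
```

Run on a schedule or wired into whatever monitoring you already have, even something this small turns “not my problem” into “here’s where the problem actually is.”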

Tom’s Take

When I worked for a VAR one of my biggest pet peeves was “not my problem” thinking. Yes, it may not be YOUR problem but it is still A problem for the user. Rather than trying to shift the blame let’s put our heads together to investigate where the problem lies and create a fix or a workaround for it. Because the user doesn’t care who is to blame. They just want the issue resolved. And resolution doesn’t come from passing the buck.

Problem Replication or Why Do We Need to Break It Again?

There was a tweet the other day that posited that we don’t “need” to replicate problems to solve them. Ultimately the reason for the tweet was that a helpdesk refused to troubleshoot the problem until they could replicate the issue and the tweeter thought that wasn’t right. It made me start thinking about why troubleshooters are so bent on trying to make something happen again before we actually start trying to fix an issue.

The Definition of Insanity

Everyone by now has heard that the definition of insanity is doing the same thing over and over again and expecting a different result. It’s funny and a bit oversimplified, but troubleshooting flips it around: you ultimately want the system to do something different with the same inputs, and the first step is getting it to fail the same way every time. Because if you can make it do the same thing over and over again, you’re closer to the root cause of the issue.

Root cause is the key to problem solving. If you don’t fix what’s actually wrong you are only dealing with symptoms and not issues. However, you can’t know what’s actually wrong until you can make it happen more than once. That’s because you have to narrow the actual issue down from all of the symptoms. If you do something and get the same output every time then you’re on the right track. However, if you change the inputs and get the same output you haven’t isolated the issue yet. It’s far too common in troubleshooting to change a bunch of things all at once and not realize what actually fixed the problem.

Without proper isolation you’re just throwing darts. As the comic above illustrates, sometimes the victory is causing a different error message. That means you’re manipulating the right variables or inputs. It also means you know what knobs need to be turned in order to get closer to the root cause and the solution. Repeatability is key here because it means you’re not trying to fix things that aren’t really broken. You’re also narrowing the scope of the fixes by eliminating things that don’t need to be touched.

Resource Utilization

How long does it take to write code? A couple of hours? A day? Does it take longer if you’re trying to do tests and make sure your changes don’t break anything else? What about shipping that code as part of the DevOps workflow? Or an out-of-band patch? Do you need to wait until the next maintenance window to implement the changes? These are all extremely valuable questions to ask to figure out the impact of changes on your environment.

Now multiply all of those factors by the number of things you tried that didn’t work. The number of times you thought you had the issue solved only to get the same error output. You can see how hours of a programming team’s wasted work add up against the company’s bottom line quickly. Nothing is truly free, and this isn’t a sunk cost you can ignore. You’re still paying people to work on something, whether it’s fixing buggy code or writing a new feature for your application. If they’re spending more of their time tracking down bugs that happen infrequently with no clear root cause, are they being used to their highest potential?

What happens when that team can’t fix the issue because it’s too hard to cause it again? Do they bring in another team? Pay for support calls? Do you throw up your hands and just live with it? I’ve been involved in all of these situations in the past. I’ve tried to replicate issues to no avail only to be told that they’ll just live with the bug because they can’t afford to have me work on it any longer. I’ve also spent hours trying to replicate issues only to find out later that the message had only ever appeared once and no one knows what was going on when it happened. I might have been working for a VAR at the time but my time was something the company tracked closely. Knowing up front that the issue had only ever happened once might have changed the way I worked on it.

I can totally understand why people would be upset that a help desk closed a ticket because they couldn’t replicate an issue. However, I also think that prudence should be involved in any structured troubleshooting practice. Having been on the other side of Severity One, all-hands-on-deck emergencies that turned out to be a simple issue or a non-repeatable problem, I can say that having that many resources committed to a non-issue is maddening both as a tech and as a manager of technical people.


Tom’s Take

Do you need to replicate a problem to troubleshoot it? No. But you also don’t need to follow a recipe to bake a cake. However, it does help immensely if you do because you’re going to get a consistent output that helps you figure out issues if they arise. The purpose of replicating issues when troubleshooting isn’t nefarious or an attempt to close tickets faster. Instead it’s designed to help speed problem isolation and ensure that resources are tasked with the right fixes to the issues instead of just spraying things and hoping one of them gets the thing taken care of.

Who Wants to Be Supported Forever?

I saw an interesting thread today on Reddit talking about using networking equipment past its End of Life. It’s a fun read that talks about why someone would want to do something like this and how you might find yourself in some trouble depending on your company policies and such. But I wanted to touch on something that I think we skip over when we talk about this. What does the life of the equipment really mean?

It’s a Kind of Magic

As someone who uses equipment of all kinds, I think about its lifetime differently than vendors do. When I think of how long something lasts, I think of it in terms of how long I can use it until it can’t be repaired any further. A great example of this is a car. All of my life I have driven older used cars that I continue to fix over and over until they reach very high mileage or my needs change and I must buy something different.

My vehicles don’t have a warranty or any kind of support, necessarily. If I need something fixed I either fix it myself or I take it to a mechanic to have it fixed and pay the costs associated with that. I’m not the kind of person that will get rid of a car just because the warranty has run out and it’s no longer under support.

The market of replacement parts and mechanics is made for people like me. Why buy something new when you can buy the parts to repair what you have? Sadly, that market exists for automobiles but not for IT gear. It’s hard to pull a switch apart and resolder capacitors when you feel like it. Most of the time, when you want to replace something that has failed and is end-of-life, you need to have a spare of the same kind of equipment sitting on the shelf. You can put the configuration files you need on it, put it in place, and just keep going. The old equipment is tossed aside or recycled in some way.

Infrastructure gear will work well past the End of Support (EoS) or End of Life (EoL) dates. It will keep passing packets until something breaks, either the hardware in the ports or the CPU that drives it. Maybe it’s even a power supply. Usually it’s not the end of the hardware’s life that causes the issues. Instead, it’s software concerns.

Supported Rhapsody

I’d venture a guess that most people reading this blog have some kind of mobile device sitting in their junk drawer or equipment box that works perfectly fine yet isn’t used because the current version of software doesn’t support that model. Software is a bigger determining factor in the end of life of IT equipment than anything else.

Consumer gear needs to be fast to keep users happy. Phones, laptops, and other end user devices have to keep up with the desires of the people running applications and playing games and recording video. These devices iterate quickly and fall out of support just as fast. Your old iPhone or Galaxy probably won’t run the latest code which has a feature you really want to use so you move on to something newer.

Infrastructure gear doesn’t see its software go out of fashion quite as quickly as a mobile phone, but it does have another issue that causes grief: security patches. You may not be chasing a new feature in a software release, but you’re either looking to patch a bug that is affecting performance or close a security hole that has the potential to cause issues for your organization.

Security patches for old hardware are a constant game of cat-and-mouse. Whether it’s an old operating system running important hardware terminals or an old terminal that has to keep running because the application software doesn’t work on anything newer, IT departments are often saddled with equipment that’s out of date and insecure. Do you keep running it, hoping that there will still be patches for it? Or do you try to move to something newer and hope that you can make things work? Anyone that has tried to use software that requires an old version of Microsoft Internet Explorer knows that pain can last for years.

The real issue here doesn’t have anything to do with support of the equipment or the operating system. Instead, it has everything to do with the support that you can provide for it. A mechanic isn’t magical. They learn how to get parts for cars and fix the problems as they arise. They become the support structure instead of the dealership that offers the warranty. Likewise, as an IT pro you become the support structure for an old switch or old operating system. If you can’t download patches for it you either need to provide a workaround for issues or find a way to create solutions to address it. It’s not end of support as much as it’s end of someone else’s support.


Tom’s Take

I use things until they don’t work. Then I fix them and use them more. It’s the way I was raised. I have a hard time getting new things simply because they’re supported or a little bit faster. I might upgrade my phone to get a new feature but you can bet I’m going to find a way to use my old one longer, either by giving it to one of my kids or using it in a novel way. I hate getting rid of technology that has some use left in it. End of Life for me really does mean the moment when it stops sending bits or storing data. As long as there is no other conflict in the system you can count on me extending the life of a device long past the date that someone says they don’t want to deal with it any more.

The Process Will Save You

I had the opportunity to chat with my friend Chris Marget (@ChrisMarget) this week for the first time in a long while. It was good to catch up with all the things that have been going on and reminisce about the good old days. One of the topics that came up during our conversation was around working inside big organizations and the way that change processes are built.

I worked at IBM as an intern 20 years ago and the process to change things even back then was arduous. My experience with it came from the deployment procedure for setting up a new laptop. When I arrived the task took an hour and required something like five reboots. By the time I left we had changed that process and gotten it down to half an hour and only two reboots. However, before we could get the new directions approved as the official procedure, I had to test them and make sure they were faster and produced the same result. I was frustrated but ultimately learned a lot about the glacial pace of improvements in big organizations.

Slow and Steady Finishes the Race

Change processes work to slow down the chaos that comes from having so many things conspiring to cause disaster. Probably the most famous change management framework is the Information Technology Infrastructure Library (ITIL). That little four-letter word has caused a massive amount of headaches in the IT space. Stage 3 of ITIL, Service Transition, is the stage that deals with changes in the infrastructure. There’s more to ITIL overall, including asset management and continual improvement, but usually anyone that takes ITIL’s name in vain is talking about the framework for change management.

This isn’t going to be a post about ITIL specifically but about process in general. What is your current change management process? If you’re in a medium to large sized shop you probably have a system that requires you to submit changes, get them evaluated and approved, and then implement them on a schedule during a downtime window. If you’re in a small shop you probably just make changes on the fly and hope for the best. If you work in DevOps you probably call them “deployments” and they happen whenever someone pushes code. Whatever the actual name for the process is, you have one whether you realize it or not.

The true purpose of change management is to make sure what you’re doing to the infrastructure isn’t going to break anything. As frustrating as it is to go through the process every time, that friction is the point. You justify your changes and evaluate them for impact before scheduling them, as opposed to the kind of methodology best described as “change it and find out”.

Process is ugly and painful and keeps you from making simple mistakes. If every part of a change form needs to be filled out you’re going to complete it to make sure you have all the information that is needed. If the change requires you to validate things in a lab before implementation then it’s forcing you to confirm that it’s not going to break anything along the way. There’s even a process exception for emergency changes and such that are more focused on getting the system running as opposed to other concerns. But whatever the process is it is designed to save you.

ITIL isn’t a pain in the ass by accident. It’s purposely built to force you to justify and document at every step of the process. It’s built to keep you from creating disaster by helping you create the paper trail that will save you when everything goes wrong.

Saving Your Time But Not Your Sanity

I used to work with a great engineer named John Pross. John wrote up all the documentation for our migrations between versions of software, including Novell NetWare and Novell GroupWise. When it came time to upgrade our office GroupWise server there was some hesitation on the part of the executive suite because they were worried we were going to run into an error and lock them out of their email. The COO asked John if he had a process he followed for the migration. John’s response was perfect in my mind:

“Yes, and I treat every migration like the first one.”

What John meant is that he wasn’t going to skip steps or take shortcuts to make things go faster. Every part of the procedure was going to be followed to the letter. And if something came up that didn’t match what he thought the output should have been it was going to stop until he solved that issue. John was methodical like that.

People like to take shortcuts. It’s in our nature to save time and energy however we can. But shortcuts are where the change process starts falling apart. If you do something different this time compared to the last ten times you’ve done it, because you’re in a hurry or you think it might be more efficient, and you haven’t tested it, you’re opening yourself up to a world of trouble. Maybe not this time, but certainly down the road when you try to build on your shortcut even more. Because that’s the nature of what we do.

As soon as you start cutting corners and ignoring process you’re going to increase the risk of creating massive issues rapidly. Think about something as simple as the Windows Server 2003 shutdown dialog box. People used to reboot a server on a whim. In Windows Server 2003, the server required you to type in a reason why you were manually shutting it down from the console. Most people that rebooted the server fell into two camps: those that followed their process and typed in the reason for the reboot, and those that just typed “;Lea;lksjfa;ldkjfadfk” as the reason and then found themselves confused six months later, doing the post-mortem on an issue and cursing their snarky attitude toward reboot documentation.

Saving the Day

Change process saves you in two ways. The first is really apparent: it keeps you from making mistakes. By forcing you to figure out what needs to happen along the way and document the whole process from start to finish you have all the info you need to make things successful. If there’s an opportunity to catch mistakes along the way you’re going to have every opportunity to do that.

The second way change process saves you is when it fails. Yes, no process is perfect and there are more than a few times when the best intentions coupled with a flaw in the process created a beautiful disaster that gave everyone lots of opportunity to learn. The question always comes back to what was learned in that process.

Bad change failures usually lead to a sewer pipe of blame being pointed in your direction. People use process failures as a chance to deflect blame and avoid repercussions for doing something they shouldn’t have, or to increase their stock in the company. The truly honest failure analysis doesn’t blame anyone but the failed process and tries to find a way to fix it.

Chris told me in our conversation that he loved ITIL at one of his former jobs because every time it failed it led to a meaningful change in the process to avoid that failure in the future. These are the reasons why blameless post-mortem discussions are so important. If the people followed the process and the process still failed, the people aren’t at fault. The process is incorrect or flawed and needs to be adjusted.

It’s like a recipe. If the instructions tell you to cook something for a specific amount of time and it’s not right, who is to blame? Is it you, because you did what you were told? Is it the recipe? Is it the instructions? If you start with the idea that you did the process right and start trying to figure out where the process is wrong, you can fix the process for next time. Maybe you used a different kind of ingredient that needs more time. Or you made it thinner than normal and the standard time cooked it too long. Whatever the result, you end up documenting the process and changing things for the future to prevent that mistake from happening again.

Of course, just like all good frameworks, change processes shouldn’t be changed without analysis. Because changing something just to save time or take a shortcut defeats the whole purpose! You need to justify why changes are necessary and prove they provide the same benefit with no additional exposure or potential loss. Otherwise you’re back to making changes and hoping you don’t get burned this time.


Tom’s Take

ITIL didn’t really become a popular thing until after I left IBM but I’m sure if I were still there I’d be up to my eyeballs in it right now. Because ITIL was designed to keep keyboard cowboys like me from doing things we really shouldn’t be doing. Change management processes are designed to save us at every step of the way and make us catch our errors before they become outages. The process doesn’t exist to make our lives problematic. That’s like saying a seat belt in a car only exists to get in my way. It may be a pain when you’re dealing with it regularly but when you need it you’re going to wish you’d been using it the whole time. Trust in the process and you will be saved.

Sharing Failure as a Learning Model

Earlier this week there was a great tweet from my friends over at Juniper Networks about mistakes we’ve made in networking.

It got some interactions with the community, which is always nice, but it got me thinking about how we solve problems and learn from our mistakes. I feel that we’ve reached a point where we’re learning from the things we’ve screwed up but we’re not passing those lessons along like we used to.

Write It Down For the Future

Part of the reason why I started my blog was to capture ideas that had been floating in my head for a while. Troubleshooting steps or perhaps even ideas that I wanted to make sure I didn’t forget down the line. All of it was important to capture for the sake of posterity. After all, if you didn’t write it down did it even happen?

Along the way I found that the posts that got significant traction on my site were the ones that involved mistakes. Something I’d done that caused an issue or something I needed to look up through a lot of sources that I distilled down into an easy reference. These kinds of posts are the ones that fly right up to the top of the Google search results. They are how people know you. It could be a terminology post like defining trunks. Or perhaps it’s a question about why your SFP modules aren’t working in a switch.

Once I realized that people loved finding posts that solved problems I made sure to write more of them down. If I found a weird error message I made sure to figure out what it was and then put it up for everyone to find. When I documented weird behaviors of BPDUGuard and BPDUFilter that didn’t match the documentation I wrote it all down, including how I’d made a mistake in the way that I interpreted things. It was just part of the experience for me. Documenting my failures and my learning process could help someone down the line. My hope was that a future reader would find my post and learn from it like I had.

Chit Chat Channels

It used to be that when you Googled error messages you got lots of results from forum sites or Reddit or other blogs detailing what went wrong and how you fixed it. I assume that is because, just like me, people were doing their research and figuring out what went wrong and then documenting the process. Today I feel like a lot of that type of conversation is missing. I know it can’t have gone away permanently because all networking engineers make mistakes and solve problems, and someone has to know where that knowledge went, right?

The answer came to me when I read a Reddit post about networking message boards. The suggestions in the comments weren’t about places to go to learn more. Instead, they linked to Slack channels or Discord servers where people talk about networking. That answer made me realize why the discourse around problem solving and learning from mistakes seems to have vanished.

Slack and Discord are great tools for communication. They’re also very private. I’m not talking about gatekeeping or restrictions on joining. I’m talking about the fact that the conversations that happen there don’t get posted anywhere else. You can join, ask about a problem, get advice, try it, see it fail, try something else, and succeed all without ever documenting a thing. Once you solve the problem you don’t have a paper trail of all the things you tried that didn’t work. You just have the solution that finally worked and that’s that.

You know what you can’t do with Slack and Discord? Search them through Google. The logs are private. The free tiers remove messages after a certain point. All that knowledge disappears into thin air. Unlike the Wisdom of the Ancients, the issues we solve in Slack are gone as soon as you hit your message limit. No one learns from the mistakes because it looks like no one has made them before.

Going the Extra Mile

I’m not advocating for removing Slack and Discord from our daily conversations. Instead, I’m proposing that when we do solve a hard problem or we make a mistake that others might learn from we say something about it somewhere that people can find it. It could be a blog post or a Reddit thread or some kind of indexable site somewhere.

Even the process of taking what you’ve done and consolidating it down into something that makes sense can be helpful. I saw X, tried Y and Z, and ended up doing B because it worked the best of all. Just the process of how you got to B through the other things that didn’t work will go a long way to help others. Yes, it can be a bit humbling and embarrassing to publish something that admits you made a mistake. But it’s also part of the way that we learn as humans. If others can see where we went and understand why that path doesn’t lead to a solution, then we’ve effectively taught them too.


Tom’s Take

It may be a bit self-serving for me to say that more people need to be blogging about solutions and problems and such, but I feel that we don’t really learn from something unless we internalize it. That means figuring it out and writing it down. Whether it’s a discussion on a podcast or a back-and-forth conversation in Discord, we need to find ways to get the words out into the world so that others can build on what we’ve accomplished. Google can’t search archives that aren’t on the web. If we want to leave a legacy for the DenverCoder10s of the future, that means we do the work now of sharing our failures as well as our successes and letting the next generation learn from us.

The Mystery of Known Issues

I’ve spent the better part of the last month fighting a transient issue with my home ISP. I thought I had it figured out after a hardware failure at the connection point, but it crept back up after I got back from my Philmont trip. I spent a lot of energy upgrading my home equipment firmware and charting the seemingly random timing of the issue. I also called the technical support line and carefully explained what I was seeing and what had been done to work on the problem already.

The responses usually ranged from confused reactions to attempts to reset my cable modem, which never worked. It took several phone calls and lots of repeated explanations before I finally got a different answer from a technician. It turns out there was a known issue with the modem hardware! It’s something they’ve been working on for a few weeks and they’re not entirely sure what the ultimate fix is going to be. So for now I’m going to have to endure the daily resets. But at least I know I’m not going crazy!

Issues for Days

Known issues are a way of life in technology. If you’ve worked with any system for any length of time you’ve seen the list of things that aren’t working or have weird interactions with other things. Given how many interactions there are between systems that are more and more dependent on each other, it’s no wonder those known issue lists are miles long by now.

Whether it’s a bug or an advisory or a listing of an incompatibility on a site, the nature of all known issues is the same. They are things that don’t work that we can’t fix yet. They could be on a list of issues to resolve or something that may never be able to be fixed. The key is that we know all about them so we can plan around them. Maybe it’s something like a bug in a floating point unit that causes long division calculations to be inaccurate to a certain number of decimal places. If you know what the issue is you know how to either plan around it or use something different. Maybe you don’t calculate to that level of precision. Maybe you do that on a different system with another chip. Whatever the case, you need to know about the issue before you can work around it.

Not all known issues are publicly known. They could involve sensitive information about a system. Perhaps the issue itself is a potential security risk. Most advisories about remote exploits are known issues internally at companies before they are patched. While they aren’t immediately disclosed, they are eventually found out when the patch is released or when someone outside the company discovers the same issue. Putting these kinds of things under an embargo of sorts isn’t always bad if it protects against a wider potential to exploit them. However, the truth must eventually come out or things can’t get resolved.

Knowing the Unknown

What happens when the reasons for not disclosing known problems are less than noble? What if the reasoning behind hiding an issue has more to do with covering up bad decision making or saving face or even keeping investors or customers from fleeing? Welcome to the dark side of disclosure.

When I worked for Gateway 2000 back in the early part of the millennium, we had a particularly nasty known issue in the system. The ultimate root cause was that the capacitors on a series of motherboards were made with poor quality controls or bad components and would swell and eventually explode, causing the system to come to a halt. The symptoms manifested themselves in all manner of strange ways, like race conditions or random software errors. We would sometimes spend hours troubleshooting an unrelated issue only to find out the motherboard was affected with “bad caps”.

The issue was well documented in the tech support database for the affected boards. Once we could determine that it was a capacitor issue it was very easy to get the parts replaced. Getting to that point was the trick, though. Because at the top of the article describing the problem was a big, bold statement:

Do Not Tell The Customer This Is A Known Issue!!!

What? I can’t tell them that their system has an issue that we need to correct before everything pops and shuts it down for good? I can’t even tell them what to look for specifically when we open the case? Have you ever tried to tell a 75-year-old grandmother to look for something “strange” in a computer case? You get all kinds of fun answers!

We ended up getting creative in finding ways to look for those issues and getting them replaced where we could. When I moved on to my next job working for a VAR, I found out some of those same machines had been sold to a customer. I opened the case and found bad capacitors right away. I told my manager and explained the issue, and we started getting boards replaced under warranty at the first sign of problems. After the warranty expired we kept ordering good boards from suppliers until we were able to retire all of those bad machines. If I hadn’t known about the bad cap issue from my help desk time I never would have known what to look for.

Known issues like these are exactly the kind of thing you need to tell your customers about. It’s something that impacts their computer. It needs to be fixed. Maybe the company didn’t want to have to replace thousands of boards at once. Maybe they didn’t want to admit they cut corners when they were buying the parts and now the money they saved is going to haunt them in increased support costs. Whatever the reason, it’s not the fault of the customer that the issue is present. They should have the option to get things fixed properly. Hiding what has happened is only going to strain the relationship between customer and provider.

Which brings me back to my issue from above. Maybe it wasn’t “known” when I called the first time. But by the third or fourth time I called about the same thing they should have been able to tell me it’s a known problem with this specific behavior and that a fix is coming soon. The solution wasn’t to keep using the first-tier support fixes of resets or transfers to another department. I would have appreciated knowing it was an issue so I didn’t have to spend as much time upgrading and isolating and documenting the hell out of everything just to exclude other issues. After all, my troubleshooting skills haven’t disappeared completely!

Vendors and providers, if you have a known issue you should admit it. Be up front. Honesty will get you far in this world. Tell everyone there’s a problem and that you’re working on a fix you don’t have just yet. It may not make the customer happy at first, but they’ll understand a lot more than if you hide it for days or weeks while you scramble to fix it without telling anyone. If that customer has more than a basic level of knowledge about systems they’ll probably figure it out anyway, and then you’re going to be the one with egg on your face when they tell you all about the problem you don’t want to admit you have.


Tom’s Take

I’ve been on both sides of this fence before in a number of situations. Do we admit we have a known problem and try to get it fixed? Or do we get creative and try to hide it so we don’t have to own up to the uncomfortable questions that get asked about bad development or cutting corners? The answer should always be to own up to things. Make everyone aware of what’s going on and make it right. I’d rather deal with an honest company working hard to make things better than a dishonest vendor that miraculously manages to fix things out of nowhere. An ounce of honesty prevents a pound of bad reputation.

Document The First Time, Every Time


Imagine you’re deep into a massive issue. You’ve been troubleshooting for hours trying to figure out why something isn’t working. You’ve pulled in resources to help and you’re on the line with the TAC to try and get a resolution. You know this has to be related to something recent because you just got notified about it yesterday. You’re working through logs and configuration settings trying to gain insights into what went wrong. That’s when the TAC engineer hits you with an armor-piercing question:

When did this start happening?

Now you’re sunk. When did you first start seeing it? Was it happening before and no one noticed? Did a tree fall in the forest and no one was around to hear the sound? What is the meaning of life now?

It’s not too hard to imagine the above scenario because we’ve found ourselves in it more times than we can count. We’ve started working on a problem and traced it back to a root cause only to find out that the actual inciting incident goes back even further than that. Maybe the symptoms just took a while to show up. Perhaps someone unknowingly “fixed” the issue with a reboot or a process reload over and over again until it couldn’t work any longer. How do we find ourselves in this mess? And how do we keep it from happening?

Quirky Worky

Glitches happen. Weird bugs crop up temporarily. It happens every day. I had to reboot my laptop the other day, after it had been up for about two months, because of a series of weird errors that I couldn’t resolve. Little things that weren’t super important eventually snowballed into every service shutting down and forcing me to restart. But what was the cause? I can’t say for sure and I can’t tell you when it started. Because I just ignored the little glitches until the major ones forced me to do something.

Unless you work in security you probably ignore little strange things that happen. Maybe an application takes twice as long to load one morning. Perhaps you click on a link and it pops up a weird page before going through to the actual website. You could even see a storage notification in the logs for a router that randomly rebooted itself for no reason in the middle of the night. Occasional weirdness gets dismissed by us because we’ve come to expect that things are just going to act strangely. I once saw a quote from a developer that said, “If you think building a steady-state machine is easy, just look at how many issues are solved with a reboot.”

We tend to ignore weirdness unless it presents itself as a more frequent issue. If a router reboots itself once we don’t think much about it. The fourth time it reboots itself in a day we know we’ve got a problem we need to fix. Could we have solved it when the first reboot happened? Maybe. But we also didn’t know we were looking at a pattern of behavior either. Human brains are wired to look for patterns of things and pick them out. It’s likely an old survival trait of years gone by that we apply to current technology.

Notice that I said “unless you work in security”. That’s because the security folks have learned over many years and countless incidents that nothing is truly random or strange. They look for odd remote access requests or strange configuration changes on devices. They wonder why random IP addresses from outside the company are trying to access protected systems. Security professionals treat every random thing as a potential problem. However, that kind of behavior also demonstrates the downside of analyzing every little piece of information for potential threats. You quickly become paranoid about everything and spend a lot of time and energy trying to make sense out of potential nonsense. Is it any wonder that many security pros find themselves jumping at every little shadow in case it’s hiding a beast?

Middle Ground of Mentions

On the one hand, we have systems people that dismiss weirdness until it’s a pattern. On the other we have security pros that are trying to make patterns out of the noise. I’m sure you’re probably wondering if there has to be some kind of middle ground to ensure we’re keeping track of issues without driving ourselves insane.

In fact, there is a good habit you need to get into: write it all down somewhere. Yes, I’m talking about the dreaded documentation monster. The kind of thing that no one outside of developers likes to do. The mean, nasty, boring process of taking the stuff in your brain and putting it down somewhere so someone can figure out what you’re thinking without needing to read your mind.

You have to write it down because you need to have a record to work from if something goes wrong later. One of the greatest features I’ve ever worked with that seems to be ignored by just about everyone is the Windows Shutdown Reason dialog box in Windows Server 2003 and above. Rebooting a box? You need to write in why and give a good justification. That way if someone wants to know why the server was shut off at 11:15am on a Tuesday they can examine the logs. Unfortunately in my experience the usual reason for these shutdowns was either “a;lkjsdfl;kajsdf” or “because I am doing it”. Which aren’t great justifications for later.

You don’t have to be overly specific with your documentation but you need to give enough detail so that later you can figure out if this is part of a larger issue. Did an application stop responding and need to be restarted? Jot that down. Did you need to kill a process to get another thing running again? Write down that exact sentence. If you needed to restart a router and you ended up needing to restore a configuration you need to jot that part down too. Because you may not even realize you have an issue until you have documentation to point it out.
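If you want a nudge toward actually doing this, even something as small as an append-only journal works. Here’s a minimal sketch; the filename and field names are arbitrary choices for illustration, not any kind of standard.

```python
# Minimal sketch of an append-only "weirdness journal": one JSON line per event,
# so a later script (or a human with grep) can reconstruct a timeline per device.
# The filename and field names here are arbitrary choices for illustration.
import json
import sys
from datetime import datetime, timezone

JOURNAL = "ops-journal.jsonl"

def note(device: str, event: str, detail: str = "") -> None:
    """Append a timestamped note about a device or service to the journal."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "device": device,
        "event": event,
        "detail": detail,
    }
    with open(JOURNAL, "a") as fh:
        fh.write(json.dumps(entry) + "\n")

if __name__ == "__main__":
    # e.g. python journal.py edge-rtr-01 reboot "came up with blank config, restored from backup"
    note(sys.argv[1], sys.argv[2], sys.argv[3] if len(sys.argv) > 3 else "")
```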

I can remember doing a support call years ago with a customer, and in conversation he asked me if I knew much about Cisco routers. I chuckled and said I knew a bit. He said that he had one that he kept having to copy the configuration file back to every time it restarted because it came up blank. He even kept a console cable plugged into it for just that reason. Any CCNA out there knows that’s probably a config register issue, so I asked when it started happening. The person told me at least a year ago. I asked if anyone had needed to get into the router because of a forgotten password or some other lockout. He said that he did have someone come out a year and a half prior to reset a messed up password. Ever since then he had to keep putting the configuration back in. Sure enough, the previous tech had left the config register set to skip the startup configuration, the setting you use for password recovery, and never set it back. One quick fix and the customer was very happy to not have to worry about power outages any longer.

Documenting when things happen means you can build a timeline per device or per service to understand when things are acting up. You don’t even need to figure it out yourself. The magic of modern systems lies in machine learning. You may think to yourself that machine learning is just fancy linear regression at this point and you would be right more often than not. But one thing linear regression is great at doing is surfacing patterns of behavior for specific data points. If your router reboots on the third Wednesday of every month precisely at 3:33am, the ML algorithms will pick up on that and tell you about it. But that’s only if your system catches the reboot through logs or some other record keeping. That’s why you have to document all your weirdness. Because the ML systems can’t analyze what they don’t know about.
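You don’t even need an ML platform to get started. As a toy example, assuming you’ve been keeping a journal like the sketch above, a simple counter over reboot events can surface a suspicious schedule:

```python
# Toy pattern check over the journal above: count reboot events by weekday and hour.
# If one (weekday, hour) bucket dominates, you probably have a scheduled culprit,
# no machine learning required. Field names match the earlier journal sketch.
import json
from collections import Counter
from datetime import datetime

def reboot_pattern(journal_path: str = "ops-journal.jsonl") -> Counter:
    buckets = Counter()
    with open(journal_path) as fh:
        for line in fh:
            entry = json.loads(line)
            if entry.get("event") != "reboot":
                continue
            ts = datetime.fromisoformat(entry["time"])
            buckets[(ts.strftime("%A"), ts.hour)] += 1
    return buckets

if __name__ == "__main__":
    for (weekday, hour), count in reboot_pattern().most_common(5):
        print(f"{count:3d} reboots around {weekday} {hour:02d}:00")
```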


Tom’s Take

I love writing. And I hate documentation. Documentation is boring, stuffy, and super direct. It’s like writing book reports over and over again. I’d rather write a fun blog post or imagine an exciting short story. However, documentation of issues is critical to modern organizations because these issues can spiral out of hand before you know it. If you don’t write it down it didn’t happen. And you need to know when it happened if you hope to prevent it in the future.

When Redundancy Strikes

Networking and systems professionals preach the value of redundancy. When we tell people to buy something, we really mean “buy two”. And when we say to buy two, we really mean buy four of them. We try to create backup routes, redundant failover paths, and we keep things from being used in a way that creates a single point of disaster. But, what happens when something we’ve worked hard to set up causes us grief?

Built To Survive

The first problem I ran into was one I knew how to solve. I was installing a new Ubiquiti Security Gateway. I knew that as soon as I pulled my old edge router out I was going to need to reset my cable modem in order to clear the ARP cache. That’s always a thing that needs to happen when you’re installing new equipment. Having done this many times, I knew the shortcut method was to unplug my cable modem for a minute and plug it back in.

What I didn’t know this time was that the little redundant gremlin living in my cable modem was going to give me fits. After fifteen minutes of not getting the system to come back up the way that I wanted, I decided to unplug my modem from the wall instead of the back of the unit. That meant the lights on the front were visible to me. And that’s when I saw that the lights never went out when the modem was unplugged.

Turns out that my modem has a battery pack installed since it’s a VoIP router for my home phone system as well. That battery pack was designed to run the phones in the house for a few minutes in a failover scenario. But it also meant that the modem wasn’t letting go of the cached ARP entries either. So, all my efforts to make my modem take the new firewall were being stymied by the battery designed to keep my phone system redundant in case of a power outage.

The second issue came when I went to turn up a new Ubiquiti access point. I disconnected the old Meraki AP in my office and started mounting the bracket for the new AP. I had already warned my daughter that the Internet was going to go down. I also thought I might have to reprogram her device to use the new SSID I was creating. Imagine my surprise when both my laptop and her iPad were working just fine while I was hooking the new AP up.

Turns out, both devices did exactly what they were supposed to do. They connected to the other Meraki AP in the house and used it while the old one was offline. Once the new Ubiquiti AP came up, I had to go upstairs and unplug the Meraki to fail everything back to the new AP. It took some more programming to get everything running the way that I wanted, but my wireless card had done the job it was supposed to do. It failed to the SSID it could see and kept on running until that SSID failed as well.

Finding Failure Fast

When you’re trying to troubleshoot around a problem, you need to make sure that you’re taking redundancy into account as well. I’ve faced a few problems in my life when trying to induce failure or remove a configuration issue was met with difficulty because of some other part of the network or system “replacing” my hard work with a backup copy. Or, I was trying to figure out why packets were flowing around a trouble spot or not being inspected by a security device only to find out that the path they were taking was through a redundant device somewhere else in the network.

Redundancy is a good thing. Until it causes issues. Or until it makes your network behave in such a way as to be unpredictable. Most of the time, this can all be mitigated by good documentation practices. Being able to figure out quickly where the redundant paths in a network are going is critical to diagnosing intermittent failures.

It’s not always as easy as pulling up a routing table either. If the entire core is down you could be seeing traffic routing happening at the edge with no way of knowing the redundant supervisors in the chassis are doing their job. You need to write everything down and know what hardware you’re dealing with. You need to document redundant power supplies, redundant management modules, and redundant switches so you can isolate problems and fix them without pulling your hair out.


Tom’s Take

I rarely got to work with redundant equipment when I was installing it through E-Rate. The government doesn’t believe in buying two things to do the job of one. So, when I did get the opportunity to work with redundant configurations, I usually found myself trying to figure out why things weren’t failing the way I expected. After a while, I realized that I needed to start making my own notes and doing some investigation before I actually started troubleshooting. And even then, like my cable modem’s battery, I ran into issues. Redundancy keeps you from shooting yourself in the foot. But it can also make you stab yourself in the eye in frustration.

How Should We Handle Failure?

I had an interesting conversation this week with Greg Ferro about the network and how we’re constantly proving whether a problem is or is not the fault of the network. I postulated that the network gets blamed when old software has a hiccup. Greg’s response led me to think about why we have such a hard time proving the innocence of the network. And I think it’s because we have a problem with applications.

Snappy Apps

Writing applications is hard. I base this on the fact that I am a smart person and I can’t do it. Therefore it must be hard, like quantum mechanics and figuring out how to load the dishwasher. The few people I know that do write applications are very good at turning gibberish into usable radio buttons. But they have a world of issues they have to deal with.

Error handling in applications is a mess at best. When I took C Programming in college, my professor was an actual coder during the day. She told us during the error handling portion of the class that most modern programs are a bunch of statements wrapped in if clauses that jump to some error condition if they fail. All of the power of your phone apps is likely a series of “if this, then that, otherwise this failure condition” statements.

Some of these errors are pretty specific. If an application’s call to a database fails, it will respond with a database error. If it needs an SSL certificate, there’s probably a certificate-specific error message. But what about those errors that aren’t quite as cut-and-dried? What if the error could actually be caused by a lot of different things?
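As a rough illustration of the difference, here’s a hypothetical fetch routine: the specific handlers preserve what actually went wrong, while the catch-all at the bottom is the one that produces the dreaded generic “network error”. The function and host names are made up for the example.

```python
# Sketch of the difference between a catch-all "network error" and errors that
# keep their specifics. The function name and hosts are made up for illustration.
import socket
import ssl

def fetch_report(host: str, port: int = 443) -> bytes:
    try:
        addr = socket.gethostbyname(host)                    # name resolution
        raw = socket.create_connection((addr, port), timeout=5)
        conn = ssl.create_default_context().wrap_socket(raw, server_hostname=host)
    except socket.gaierror as exc:
        raise RuntimeError(f"DNS lookup failed for {host!r}: {exc}") from exc
    except ssl.SSLCertVerificationError as exc:
        raise RuntimeError(f"Certificate problem talking to {host}: {exc.verify_message}") from exc
    except OSError as exc:
        # The lazy version collapses everything above into this one line,
        # and the helpdesk ticket just says "network error".
        raise RuntimeError(f"Network error talking to {host}: {exc}") from exc
    with conn:
        conn.sendall(b"GET / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
        return conn.recv(4096)
```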

My first real troubleshooting on a computer came courtesy of LucasArts’ X-Wing. I loved the game and played the training missions constantly until I was able to beat them perfectly just a few days after installing it. However, when I started playing the campaign, the program would crash to DOS with an out-of-memory error message when I started the second mission. In the days before the Internet, it took a lot of research to figure out what was going on. I had more than enough RAM. I met all the specs on the side of the box. What I didn’t know, and had to learn, is that X-Wing required the use of Expanded Memory (EMS) to run properly. Once I decoded enough of the message to find that out, I was able to change things and make the game run properly. But I had to know that the memory X-Wing was complaining about was EMS, not the XMS RAM that I needed for Windows 3.1.

Generic error messages are the bane of the IT professional’s existence. If you don’t believe me, ask yourself how many times you’ve spent troubleshooting a network connectivity issue for a server only to find that DNS was misconfigured or down. Dr. House would have a diagnosis ready for that one.

Unintuitive Networking

If the error messages are so vague when it comes to network resources and applications, how are we supposed to figure out what went wrong and how to fix it? I mean, humans have been doing that for years. We take the little amount of information that we get from various sources. Then we compile, cross-reference, and correlate it all until we have enough to make a wild guess about what might be wrong. And when that doesn’t work, we keep trying even more outlandish things until something sticks. That’s what humans do. We fill in the gaps and come up with crazy ideas that occasionally work.

But computers can’t do that. Even the best machine learning algorithms can’t extrapolate data from a given set. They need precise inputs to find a solution. Think of it like a Google search for a resolution. If you don’t have a specific enough error message to find the problem, you aren’t going to have enough data to provide a specific fix. You will need to do some work on your own to find the real answer.

Intent-based networking does little to fix this right now. All intent-based networking products are designed to create a steady state from a starting point. No matter how cool it looks during the demo or how powerful it claims to be, it will always fall over when it fails. And how easily it falls over is up to the people that programmed it. If the system is a black box with no error reporting capabilities, it’s going to fail spectacularly with no hope of repair beyond an expensive support contract.

It could be argued that intent-based networking is about making networking easier. It’s about setting things up right the first time and making them work in the future. Yet no system in a steady state works forever. With the possible exception of the pitch drop experiment, everything will fail eventually. And if the control system in place doesn’t have the capability to handle the errors being seen and propose a solution, then your fancy provisioning system is worth about as much as a car with no seat belts.

Fixing The Issues

So, how do we fix this mess? Well, the first thing that we need to do is scold the application developers. They’re easy targets and usually the cause behind all of these issues. Instead of a vague error message about network connectivity, we need something more like “lp0 on fire”. We need to know what was going on when the function call failed. We need a trace. We need context around the process to be able to figure out why something happened.
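Here’s a small sketch of what “context around the process” could look like in practice: one structured log line that records what the code was doing, against which name, and the full traceback. The logger setup and field names are illustrative rather than a prescription for any particular platform.

```python
# Sketch of logging a failure with enough context to correlate later: what we
# were doing, against what, and the full traceback. Logger setup and field
# choices are illustrative, not a prescription for any particular NMS.
import json
import logging
import socket

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("app.errors")

def resolve_or_report(hostname, operation):
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # One structured line instead of "network error": the on-call engineer
        # (or the correlation engine) can see it was DNS, for which name, and why.
        log.exception(json.dumps({
            "operation": operation,
            "failure": "dns_resolution",
            "hostname": hostname,
            "resolver_hint": "check /etc/resolv.conf or the DHCP-assigned DNS servers",
        }))
        return None

if __name__ == "__main__":
    resolve_or_report("intranet.example.invalid", "nightly-report-upload")
```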

Now, once we get all these messages, we need a way to correlate them. Is that an NMS or some other kind of platform? Is that an incident response system? I don’t know the answer there, but you and your staff need to know what’s going on. They need to be able to see the problems as they occur and deal with the errors. If it’s DNS (see above), you need to know that fast so you can resolve the problem. If it’s a BGP error or a path problem upstream, you need to know that as soon as possible so you can assess the impact.

And lastly, we’ve got to do a better job of documenting these things. We need to take charge of keeping track of what works and what doesn’t. And keeping track of error messages and their meanings as we work on getting app developers to give us better ones. Because no one wants to be in the same boat as DenverCoder9.


Tom’s Take

I made a career out of taking vague error messages and making things work again. And I hated every second of it. Because as soon as I became the Montgomery Scott of the Network, people started coming to me with ever more bizarre messages. Becoming the Rosetta Stone for error messages typed into Google is not how you want to spend your professional career. Yet, you need to understand that even the best things fail and that you need to be prepared for it. Don’t spend money building a very pretty facade for your house if it’s built on quicksand. Demand that your orchestration systems have the capabilities to help you diagnose errors when everything goes wrong.

 

Is It Really Always The Network?

Image: “Keep Calm and Blame the Network” (from Thomas LaRock)

I had a great time over the last month writing a series of posts with my friend John Herbert (@MrTugs) over on the SolarWinds Geek Speak Blog. You can find the first post here. John and I explored the idea that people are always blaming the network for a variety of issues that are often completely unrelated to the actual operation of the network. It was fun writing some narrative prose for once, and the feedback we got was actually pretty awesome. But I wanted to take some time to explain the rationale behind my madness. Why is it that we are always blaming the network?!?

Visibility Is Vital

Think about all the times you’ve been working on an application and things start slowing down. What’s the first thing you think of? If it’s a standalone app, it’s probably some kind of processing lag or memory issues. But if that app connects to any other thing, whether it be a local network or a remote network via the Internet, the first culprit is the connection between systems.

It’s not a large logical leap to make. We have to start by assuming that the people that made the application knew what they were doing. If hundreds of other people aren’t having this problem, it must not be with the application, right? We’ve already started eliminating the application as the source of the issues even before we start figuring out what went wrong.

People will blame the most visible part of the system for issues. If that’s a standalone system sealed off from the rest of the world, it obviously must be the application. However, we don’t usually build these kinds of walled-off systems any longer. Almost every application in existence today requires a network connection of some kind. Whether it’s to get updates or to interact with other data or people, the application needs to talk to someone or maybe even everyone.

I’ve talked before about the need to make the network more of a “utility”. Part of the reason for this is that it lowers the visibility of the network to the rest of the IT organization. Lower visibility means fewer issues being incorrectly blamed on the network. It also means that the network is going to be doing more to move packets and less to fix broken application issues.

Blame What You Know

If your network isn’t stable to begin with, it will soon get blamed for every issue in IT even if the network has nothing to do with the app. That’s because people tend to identify problem sources based on their own experience. If you’re unsure of that, work on a consumer systems helpdesk sometime and try to keep track of the number of calls you get that are blamed on “viruses” even when there’s no indication that a virus is involved. It’s staggering.

The same thing happens in networking and other enterprise IT. People only start troubleshooting problems from areas of known expertise. This usually breaks down by people shouting out solutions like, “I saw this once so it must be that! I mean, the symptoms aren’t similar and the output is totally different, but it must be that because I know how to fix it!”

People get uncomfortable when they are faced with troubleshooting something unknown to them. That’s why they fall back on familiar things. And if they constantly hear how the network is the source of all issues, guess what the first thing to get blamed is going to be?

Network admins and engineers have to fight a constant battle to disprove the network as the source of issues. And for every win they get it can come crashing down when the network is actually the problem. Validating the fears of the users is the fastest way to be seen as the issue every time.

Mean Time To Innocence

As John and I wrote the pieces for SolarWinds, what we wanted to show is that a variety of issues can look like network problems up front but have vastly different causes behind the scenes. What I felt was very important for the piece was having the main character, Amanda, go beyond the infamous Mean Time To Innocence (MTTI) metric. In networking, we all too often find ourselves going just far enough to prove that it’s not the network and then leaving it there. As soon as we’re innocent, it’s done.

Cross-functional teams and other DevOps organizations don’t believe in that kind of boundary. Lines between silos blur or are totally absent. That means that instead of giving up once you prove it’s not your problem, you need to work toward fixing what’s at issue. Fix the problem, not the blame. If you concentrate on fixing the problems, it will soon become noticeable that the networking team isn’t always the problem. Even if the network is at fault, the network team will work to fix it and any other issues that you see.


Tom’s Take

I love the feedback that John and I have gotten so far on the series we wrote. Some said it feels like a situation they’ve been in before. Others have said that they applaud the way things were handled. I know that the narrative allows us to bypass some of the unsavory things that often happen, like arguments and political posturing inside an organization to make other department heads look bad when a problem strikes. But what we really wanted to show is that the network is usually the first to get blamed and the last to be proven innocent in a situation like this.

We wanted to show that it’s not always the network. And the best way for you to prove that in your own organization is to make sure the network isn’t just innocent, but helpful in solving as many problems as possible.