Monitoring Other People’s Problems

“It’s always the network” is a refrain that causes operations teams to shudder. No matter what your flavor of networking might be, it’s always your fault, even if the actual problem is DNS, a global BGP outage, or some issue with a SaaS provider. Why do we always get blamed? And how can you prevent this from happening to you?

User Utopia

Users don’t know about the world outside of their devices. As soon as they click on something in a browser window they expect it to work. It’s a lot like ordering a package and having it delivered. It’s expected that the package arrives. You don’t concern yourself with the details of how it needs to be shipped, what routes it will take, and how factors that exist half a world away could cause disruptions to your schedule at home.

The network is the same to the users. If something doesn’t work with a website or a remote application it must be the “network” that is at fault. Because your users believe that everything not inside of their computer is the network. Networking is the way that stuff happens everywhere else. As professionals we know the differences between the LAN, the WAN, the cloud, and all points in between. However your users do not.

Have you heard of MTTI? It’s a tongue-in-cheek metric that stands for “mean time to innocence.” Essentially we’re racing to find out why it’s not our problem. We check the connections, look at the services in our network, and then confidently explain to the user that someone else is at fault for what happened and we can’t fix it. Sounds great in theory, right?

Would you proudly proclaim that it’s someone else’s fault to the CEO? The Board of Directors? How about to the President of the United States? Depending on how important the person you’re speaking to might be, you might change the way you present the information. Regular users might get “sorry, not my fault” but the CEO could get “here’s the issue and we need to open a ticket to fix it.” Why the varying service levels? Is it because regular users aren’t as important as the head of the company? Or is it because we don’t want to put in the extra effort for a knowledge worker that we would gladly put in for the person who is ultimately going to approve a raise for us if we do extra work? Do you see the fallacy here?

Keeping Your Eyes Peeled

The answer, of course, is that every user deserves at least an explanation of where the problem is. You might be bent on proving that it’s not in your network so you don’t have to do any more work, but you already did enough work to place the blame outside of your area of control. Why not go one step further? You could check a cloud provider’s status dashboard to see if their services are degraded. Or do a quick scan of news sites or social media to see if it’s a more widespread issue. Even checking one of the “is it down for everyone or just me” services is more information than just “not my problem.”

The habit of monitoring other services lets you do two things very quickly. First, it lowers the all-important MTTI metric. If you can quickly scan the likely culprits for outages, you can say with certainty that it’s out of your control. The bigger benefit to monitoring the outside services your organization relies on, though, is forewarning of issues inside your own organization. If your users come to you saying they can’t get to Microsoft 365 and you already know it’s because the provider messed up a WAN router configuration, you’ll head off needless blame. You’ll also know where to look the next time a similar challenge shows up.

If you’re already monitoring the infrastructure in your organization, it only makes sense to monitor the infrastructure your users need outside of it. Maybe you can’t get AWS to hand over the SNMP community strings for all their internal switches, but you can easily pull information from their publicly available sources to help determine where the issues might lie. That helps you in the same way that monitoring switch ports in your organization helps you today. You can respond to issues before they get reported and have a clear picture of what the impact will be.
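
To make that outside-in monitoring concrete, here’s a minimal sketch of a script that polls the public status pages of the services your users depend on. The URLs are placeholders and the JSON layout assumes the common Statuspage-style “status.indicator” field, so adjust both to whatever your providers actually publish.

```python
#!/usr/bin/env python3
"""Poll public provider status pages before the helpdesk calls start."""
import json
import urllib.request

# Hypothetical endpoints for the services your users depend on.
STATUS_PAGES = {
    "SaaS email": "https://status.example-saas.com/api/v2/status.json",
    "Cloud provider": "https://status.example-cloud.com/api/v2/status.json",
}

def check_provider(name: str, url: str) -> str:
    """Return a one-line summary of the provider's self-reported health."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            data = json.load(resp)
        indicator = data.get("status", {}).get("indicator", "unknown")
        # "none" is the usual Statuspage value for "all systems operational".
        state = "OK" if indicator == "none" else f"DEGRADED ({indicator})"
    except Exception as exc:  # an unreachable status page is itself a data point
        state = f"UNREACHABLE ({exc})"
    return f"{name}: {state}"

if __name__ == "__main__":
    for name, url in STATUS_PAGES.items():
        print(check_provider(name, url))
```

Run something like this from cron and feed it into whatever alerting you already use, and “not my problem” turns into “here’s whose problem it is and here’s the evidence.”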

Tom’s Take

When I worked for a VAR one of my biggest pet peeves was “not my problem” thinking. Yes, it may not be YOUR problem but it is still A problem for the user. Rather than trying to shift the blame let’s put our heads together to investigate where the problem lies and create a fix or a workaround for it. Because the user doesn’t care who is to blame. They just want the issue resolved. And resolution doesn’t come from passing the buck.

Problem Replication or Why Do We Need to Break It Again?

There was a tweet the other day positing that we don’t “need” to replicate problems in order to solve them. The impetus was a helpdesk that refused to troubleshoot an issue until they could replicate it, which the tweeter thought wasn’t right. It made me start thinking about why troubleshooters are so bent on making something happen again before they actually start trying to fix it.

The Definition of Insanity

Everyone by now has heard that the definition of insanity is doing the same thing over and over again and expecting a different result. While funny and a bit oversimplified, it maps to troubleshooting in a useful way: the end goal is to make the system do something different with the same inputs, but first you want it to fail the same way every time. If you can make it do the same thing over and over again, you’re closer to the root cause of the issue.

Root cause is the key to problem solving. If you don’t fix what’s actually wrong, you’re only dealing with symptoms and not issues. However, you can’t know what’s actually wrong until you can make it happen more than once, because you have to narrow the actual issue down from all of the symptoms. If you do something and get the same output every time, you’re on the right track. If you change the inputs and still get the same output, you haven’t isolated the issue yet. It’s far too common in troubleshooting to change a bunch of things all at once and never learn what actually fixed the problem.

Without proper isolation you’re just throwing darts. Sometimes the victory is just producing a different error message. That means you’re manipulating the right variables or inputs. It also means you know which knobs need to be turned to get closer to the root cause and the solution. Repeatability is key here because it means you’re not trying to fix things that aren’t really broken. You’re also narrowing the scope of the fixes by eliminating things that don’t need to be changed.

Resource Utilization

How long does it take to write code? A couple of hours? A day? Does it take longer if you’re trying to do tests and make sure your changes don’t break anything else? What about shipping that code as part of the DevOps workflow? Or an out-of-band patch? Do you need to wait until the next maintenance window to implement the changes? These are all extremely valuable questions to ask to figure out the impact of changes on your environment.

Now multiply all of those factors by the number of things you tried that didn’t work, and by the number of times you thought you had the issue solved only to get the same error output. You can see how wasted hours from a programming team add up against the company’s bottom line quickly. Nothing is truly free or a sunk cost. You’re still paying people to work on something, whether it’s fixing buggy code or writing a new feature for your application. If they’re spending more of their time tracking down bugs that happen infrequently with no clear root cause, are they being used to their highest potential?

What happens when that team can’t fix the issue because it’s too hard to cause it again? Do they bring in another team? Pay for support calls? Do you throw up your hands and just live with it? I’ve been involved in all of these situations in the past. I’ve tried to replicate issues to no avail, only to be told the customer would just live with the bug because they couldn’t afford to have me work on it any longer. I’ve also spent hours trying to replicate issues only to find out later that the message had only ever appeared once and no one knew what was going on when it happened. I might have been working for a VAR at the time, but my time was something the company tracked closely. Knowing up front that the issue had only ever happened once might have changed the way I worked on it.

I can totally understand why people would be upset that a helpdesk closed a ticket because they couldn’t replicate an issue. However, I also think that prudence belongs in any structured troubleshooting practice. Having been on the other side of Severity One, all-hands-on-deck emergencies that turned out to be a simple issue or a non-repeatable problem, I can say that having that many resources committed to a non-issue is also maddening, both as a tech and as a manager of technical people.


Tom’s Take

Do you need to replicate a problem to troubleshoot it? No. But you also don’t need to follow a recipe to bake a cake. However, it does help immensely if you do because you’re going to get a consistent output that helps you figure out issues if they arise. The purpose of replicating issues when troubleshooting isn’t nefarious or an attempt to close tickets faster. Instead it’s designed to help speed problem isolation and ensure that resources are tasked with the right fixes to the issues instead of just spraying things and hoping one of them gets the thing taken care of.

Intelligence and Wisdom

I spent the last week at the Philmont Leadership Challenge in beautiful Cimarron, NM. I had the chance to learn a bit more about servant leadership and work on my outdoor skills a little. I also had some time to reflect on an interesting question posed to me by one of the members of my crew.

He asked me, “You seem wise. How did you get so wise?” This caught me flat-footed for a moment because I’d never really considered myself to be a very wise person. Experienced perhaps, but not wise like Yoda or Gandalf. So I answered him as I thought more about it.

Intelligence is knowing what to do. Wisdom is knowing what not to do.

The more I thought about that quote the more I realized the importance of the distinction.

Basic Botany

There’s another saying that people tweeted back at me when I shared the above quote. It’s used in the context of describing Intelligence and Wisdom for Dungeons and Dragons roleplaying:

Intelligence is knowing that a tomato is a fruit. Wisdom is not putting tomatoes in a fruit salad.

It’s silly and funny but it gets right to the point and is a different version of my other observation. Intelligence is all about the acquisition of knowledge. Think about your certification journey. You spend all your time learning the correct commands for displaying routing tables or how to debug a device and figure out what’s going on. You memorize arguments so you can pass the exam without the use of the question mark.

Intelligence is focused on making sure you have all the knowledge you can ever use. Whether it’s an arcane spell book or Routing TCP/IP Volume 1 you’re working with the kinds of information that you need to ingest in order to get things done. Think of it like a kind of race to amass a fortune in facts.

However, as pointed out above, intelligence often falls short in the application of that knowledge. Assembling a storehouse full of facts doesn’t do much for you when it comes to applying that knowledge to produce outcomes. You can be a very intelligent person and still not know what to do with what you know. You may have heard someone say that a person is “book smart” or is lacking in “common sense.” These are both ways of saying that someone is intelligent but maybe not wise.

Applied Science

If intelligence is all about acquisition of knowledge, then wisdom is focused on application. Just because you know which commands are used to debug a router doesn’t mean you need to use them all the time. There are apocryphal stories of freshly minted CCIEs walking into an ISP’s data center and entering debug ip packet detail on the CLI, only to watch the router completely exhaust itself and crash in the middle of the day. The command was correct for what they wanted to accomplish. What was missing was the applied knowledge that a busy router wouldn’t be able to handle the additional processing load of that much data being streamed to the console.

Wisdom isn’t gained from reading a book. It’s gained from applying knowledge to situations. No application of that knowledge is going to be perfect every time. You’re going to make mistakes. You’re going to do things that cause problems. You’re going to need to fix mistakes and learn as you go. Along the way you’re going to find a lot of things that don’t work for a given situation. That’s where wisdom is gained. You’re not failing. You’re learning what doesn’t work so you don’t apply it incorrectly.

A perfect example of this came just a couple of days ago. The power in my office was out, which meant the Internet was down for everyone. A major crisis for sure! I knew I needed to figure out what was going on, so I started the troubleshooting process. I knew how electricity worked and what needed to be checked, and I kept narrowing down where the problem could be. The wisdom I’d gained from working with series circuits and receptacles helped me trace things to one wall socket that had become worn out and needed to be replaced. More wisdom told me to make sure the power was turned off before I started working on the replacement.

I succeeded not because I knew what to do as much as knowing what not to do when applying the knowledge. I didn’t have to check plugs I knew weren’t working. I knew things could be on different circuits. I knew I didn’t have to mess with working sockets either. All the knowledge of resistance and current would only serve me correctly if I knew where to put it and how to work around the issues I saw in the application of that knowledge.

Not every piece of wisdom comes from unexpected outcomes. It’s often just as important to do something that works and see the result so you can remember it for the next time. The wisdom comes in knowing how to apply that knowledge and why it only works in certain situations. If you’ve ever worked with someone that troubleshoots really complex problems with statements like “I tried this crazy thing once and it worked,” you know exactly how this can be done.


Tom’s Take

Intelligence has always been my strong point. I read a lot and retain knowledge. I’m at home when I’m recalling trivia or absorbing new facts. However, I’ve always worried that I wasn’t very wise. I make simple mistakes and often forget how to use the information I have on hand. When I shared the quote above, I finally realized that all those mistakes were just me learning how to apply the knowledge I’d gained over time. Wisdom isn’t a passage in a book. It’s not a fact. It’s knowing when to use what you know and when not to. It’s learning in a different way, and it matters just as much as all the libraries in the world.

Redundancy Is Not Resiliency

Most people carry a spare tire in their car. It’s there in case you get a flat and need to change the tire before you can be on your way again. In my old VAR job I drove a lot, often far from home and into the middle of nowhere, so I didn’t want to rely on roadside assistance. Instead I just grabbed the spare out of the back when I needed it and went on my way. However, the process wasn’t entirely hitless. Even the pit crew for a racing team needs time to change tires. I could probably get it done in 20 minutes with the appropriate amount of cursing, but those were 20 minutes when I wasn’t doing anything else beyond fixing a tire.

Spare tires are redundant. You have an extra thing to replace something that isn’t working. IT operations teams are familiar with redundant systems. Maybe you have a cold spare on the shelf for a switch that might go down. You might have a cold or warm data center location for a disaster. You could even have redundant devices in your enterprise to help you get back into your equipment if something causes it to go offline. Well, I assume that you do. If you’re Meta/Facebook, you didn’t have them this time last year.

Don’t mistake redundancy for resilience though. Like the tire analogy above you’re not going to be able to fix a flat while you’re driving. Yes, I’ve seen the crazy video online of people doing that but aside from stunt driving you’re going to have to take some downtime from your travel to fix the tire. Likewise, a redundant setup that includes cold spares or out-of-band devices that are connected directly to your network could incur downtime if they go offline and lock you out of your management system. Facebook probably thought their out-of-band control system worked just fine. Until it didn’t.

The Right Gear for Resilience

At Networking Field Day 29 last week we were fortunate to see Opengear present for the second time. I’m familiar with them from all the way back at Networking Field Day 2 in 2011, so their journey through the changes in networking over the past decade has been great to see. They make out-of-band devices and they make them well. They’re one of the companies that immediately springs to mind when you think about getting access to devices outside the normal network access method.

As a VAR engineer there were times that I needed to make service calls to customer sites in order to reboot devices or get console access to fix an issue. Whether it was driving three hours to press F1 to clear a failed power supply message or racing across town to restore phone service after locking myself out of an SSH session, there are numerous reasons why having actual physical access to the console is important. Until we perfect quantum teleportation we’re going to have to solve that problem with technology. Here’s a video from the Networking Field Day session that highlights some of the challenges and solutions that Opengear has available:

Ryan Hogg brings up a great argument here for redundancy versus resiliency. Are you managing your devices in-band? Or do you have a dedicated management network? And what’s your backup for that dedicated network if something goes offline? VLAN separation isn’t good enough. In a failure mode such as a bridging loop, or an attack that takes a switch offline, you won’t be able to reach the management network if you can’t send packets through it. If the tire goes flat you’re stopped until it’s fixed.

Opengear solves this problem in a number of ways. The first, of course, is providing a secondary access method to your network. Opengear console devices have a cellular backup function that allows you to reach them in the event of an outage, whether the internal network or the Internet connection is down. I can think of a couple of times in my career when I would have loved to connect to a cellular interface to undo a change that had just gone out with unintended consequences. Sometimes reload in 5 doesn’t quite do the job. Having a reliable way to connect to your core equipment makes life easier for network operators, who can’t always keep themselves from making mistakes.

However, as mentioned, redundancy is not resiliency. It’s not enough to have access to fix the problem while everything is down and the world is on fire. We may be able to get back in and fix the issue without needing to drive to the site, but the users in that location are still down while we’re working. SD-WAN devices have offered us diverse connectivity options for a number of years now. If the main broadband line goes down, just fail over to the cellular connection for critical traffic until it comes back up. That’s easy to do now that we have the proper equipment to create circuit diversity.

As outlined in the video above, Opengear has the same capability as well. If you don’t have a fancy SD-WAN edge device, you can still configure Opengear console devices to act as a secondary egress point. It’s as simple as configuring the network with an IP SLA to track the state of the WAN link and installing the cellular route in the routing table if that link goes down. Once configured, your users can get out on the backup link while you’re coming in to fix whatever caused the issue. If it’s the ISP having the issue, you can log a ticket and confirm things are working on-site without having to jump in a car to see what your users see.
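
To make that failover recipe concrete, here’s a minimal sketch of what the router side could look like. It assumes a Cisco IOS-style edge router managed with the Netmiko library, and every address, credential, and interface name below is a placeholder. It isn’t Opengear’s documented procedure, just the classic IP SLA tracking plus floating static route pattern described above.

```python
"""Push a WAN-failover config (IP SLA track + floating static) to an edge router."""
from netmiko import ConnectHandler

# Placeholder device details; swap in your own management address and credentials.
EDGE_ROUTER = {
    "device_type": "cisco_ios",
    "host": "192.0.2.1",
    "username": "admin",
    "password": "changeme",
}

FAILOVER_CONFIG = [
    # Probe a target that is only reachable through the primary WAN link.
    "ip sla 1",
    " icmp-echo 198.51.100.1 source-interface GigabitEthernet0/0",
    " frequency 10",
    "ip sla schedule 1 life forever start-time now",
    # Track the probe and tie the primary default route to it.
    "track 1 ip sla 1 reachability",
    "ip route 0.0.0.0 0.0.0.0 198.51.100.1 track 1",
    # Floating static route toward the console server's cellular path;
    # it only gets installed when the tracked primary route is withdrawn.
    "ip route 0.0.0.0 0.0.0.0 192.0.2.254 250",
]

if __name__ == "__main__":
    with ConnectHandler(**EDGE_ROUTER) as conn:
        print(conn.send_config_set(FAILOVER_CONFIG))
        conn.save_config()
```

The router doesn’t care whether the backup next hop is an SD-WAN box or a console server with a cellular modem; it just needs a reason to stop trusting the primary route.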

Resilience Really Matters

One of the things that Opengear has always impressed me with is their litany of use cases for their devices. I can already think of a ton of ways that I could implement something like this for customers that need monitoring and resilient connectivity options. Remote offices are an obvious choice but so too are locations with terrible connectivity options.

If you’re working in a location with spotty connectivity you can easily deploy an Opengear device to keep an eye on the network and/or servers as well as provide an extra way for the site to get back online in the event of an issue. If the WAN circuit goes down you can just hop over to the cellular link until you get it fixed. Opengear will tell you something happened, and you can log into the Lighthouse central management system to collect data. If everything is configured correctly your users may not even realize they’re offline! We’re almost at the point of changing the tire while we’re driving.


Tom’s Take

I am often asked if I miss working on networking equipment since I rarely touch it these days. As soon as I’m compelled to answer that question I remember all the times I had to drive somewhere to fix an issue. Wasted time can never be recovered, and resources cost money, whether it’s money for a device or time spent going to fix one. I look at the capabilities that a company like Opengear has today and wish I’d had them fifteen years ago to deploy in the places I knew needed them. In my former line of work redundant things were hard to come by. Resilient options were much more appealing because they offered more than just a backup plan in case of failure. Pick resiliency every time, because otherwise you’re going to be losing time replacing that tire when you could be rolling along fixing it instead.


Disclaimer

Opengear was a presenter at Networking Field Day 29 on September 7, 2022. I am an employee of Tech Field Day, which is the company that managed the event. This blog post represents my own personal thoughts about the presentation and is not the opinion of Tech Field Day. Opengear did not provide compensation for this post or ask for editorial approval. This post is my perspective alone.

Helpdesk Skills Fit the Bill


I saw a great tweet yesterday from Swift on Security that talked about helpdesk work and how it’s nothing to be ashamed of:

I thought it was especially important to call this out for my readers. I’ve made no secret that my first “real” job in IT was on the national helpdesk for Gateway Computers through a contractor. I was there for about six months before I got a job doing more enterprise-type support work. And while my skills today are far beyond what I was doing when I started out, having customers search for red floppy disks and removing helper apps in MSCONFIG, the basics that I learned there are things I still carry with me today.

Rocket Science

Most people have a negative outlook on helpdesk work. They see it as entry-level and not worth admitting to. They also don’t quite understand just how hard it is to do this kind of work. It may be different today compared to when I did it twenty years ago thanks to advances in technology, but there is one group of people that knows just how hard it is to work remotely on a system you can’t physically touch:

NASA Engineers

If you think it’s easy to just jump in front of a keyboard and fix an issue with Outlook or Chrome, try doing it without being able to see what you’re looking at. Now try and describe what you want to do without technical jargon and without using the term “thingie”. Now do it with the patience that it takes to tell someone more than once what they need to do and how to get them to read you the error message verbatim so you can figure out what’s actually going on.

Inbound phone support is a lot like fixing a space probe. You have a lag time between commands being sent and confirmation. You can’t make it too complicated because the memory of the thing you’re working with isn’t unlimited. And you have to hope what you did actually fixes the issue because you may not get a second chance at it. It might sound a touch hyperbolic, but I’d argue that phone support teaches the skills needed to communicate effectively and concisely with people less technical than you. You know, executives.

Joking aside, being able to distill the issue from incomplete information and formulate a response while simultaneously describing it in non-technical terms to ease implementation is the kind of skill you don’t read out of a book. Don’t believe me? Call your parents and help them update their DNS settings. Unless your parents are in IT I doubt it’s going to be a fun call. Tech is obtuse on purpose to keep people from playing with it. The people that can cut through the jargon and keep users going are magicians.

Visualize the Result

I have a gift for analogies. They aren’t always 100% accurate but I can usually relate some kind of a technology to some real-world non-technical thing to help users understand what’s going on. That gift was developed in the six months I spent trying to explain bad capacitors, startup TSRs, and exactly what an FDISK, Format, Reload process did to the computer. It’s a super valuable skill to have outside of tech too.

Go ask your doctor to explain how the endocrine system works in non-medical terms. I bet they can because they have a way to describe most of the body functions in terms that someone who has never taken human anatomy can understand. You can’t explain technical subjects with technical terms to just anyone. And even the technical people that know what you’re saying often have a hard time visualizing what you’re talking about. That’s how we ended up with the “Internet as a series of tubes” meme.

Where does this skill come in handy? How about teaching advanced tech to your junior staff? Or educating the executives in a board meeting? Or even just trying to tell the users in your organization why it’s not the network this time? If you can only talk in technical jargon, without analogies or anything to help them visualize what you’re saying, you’ll end up with confused, angry users who think you’re being patronizing.

The helpdesk forces you to get better at being descriptive with shorter words and more visual descriptions. I’d argue that my helpdesk time made my writing better because it reminds me not to be loquacious when I should be economical in my word choices and sentence lengths. If you can’t help your user figure out what you’re doing they’ll never learn how to be better at telling people what they need support for.

Documenting All the Things

The last skill that the helpdesk instilled in me is documentation. I’m a bad note taker. I get distracted (thanks ADHD) and often forget to write down important things. Helpdesk work is a place where you absolutely need to write everything down. Ask questions and record info. Dig into error messages and find out when they started happening. Listen to what people are telling you and don’t jump to conclusions until you’ve written it all down.

I once had a call where the user told me that they couldn’t get into their computer. I thought this was an easy fix. I spent ten minutes walking through the Windows XP login process. Nothing worked. I was getting frustrated and was about to reboot to safe mode and start erasing password hashes. By chance the user mentioned that they saw the pop up for the buzzer and it still wasn’t working. After I questioned them on this I realized that they could log in to WinXP just fine. The “buzzer popup” was AOL’s login screen. They had an issue with the modem, not the operating system. If I’d asked more questions about when the problem started and what they saw instead of just jumping to the wrong conclusion I could have saved a ton of pain and wasted effort.

Likewise, when you’re doing IT work you have to write it all down. Troubleshooting is the obvious case, but so is implementation or other kinds of design work. If you’re doing a wireless survey on site and forget to write down what the walls are made of, you may find yourself with an ineffective design or, worse, having to spend extra time and money fixing it on the fly because those reinforced concrete walls are way better at signal attenuation than you recalled from a spotty memory.

Get in the habit of taking good notes from the start and summarizing them when needed to weed out the working things from the non-working things. Honestly, having a record of all the steps you took to arrive at your conclusion helps in the future if the issue happens again or you find yourself at a brick wall and need to retrace your steps to figure out where you took the wrong turn. If you record it all you won’t need to spend too much effort scratching your head to figure out what you were thinking.


Tom’s Take

Every job teaches you skills you can carry forward. Even the worst job in the world teaches you something you can take with you. Some jobs have a negative stigma and really shouldn’t. Like Swift said above you should accentuate the positive aspects of your journey. Don’t look at the helpdesk as a slog through the trenches. See yourself as a NASA rocket scientist that can talk to normal people and document like a champion. That’s a career anyone would be proud to have.

A Gift Guide for Sanity In Your Home IT Life

If you’re reading my blog you’re probably the designated IT person for your family or immediate friend group. Just like doctors that get called for every little scrape or plumbers that get the nod when something isn’t draining over the holidays, you are the one that gets an email or a text message when something pops up that isn’t “right” or has a weird error message. These kinds of engagements are hard because you can’t just walk away from them and you’re likely not getting paid. So how can you be the Designated Computer Friend and still keep your sanity this holiday season?

The answer, dear reader, is gifts. If you’re struggling to find something to give your friends that says “I like you but I also want to reduce the number of times that you call me about your computer problems” then you should definitely read on for more info! Note that I’m not going to fill this post with affiliate links or plug products that have sponsored anything. Instead, I’m going to share the classes or types of devices that I think are the best way to get control of things.

Step 1: Infrastructure Upgrades

When you go visit your parents for Thanksgiving or some other holiday check-in, are they still running the same wireless gear they got when they signed up for high-speed Internet? Is their Wi-Fi SSID still the default with the password printed on the side of the router/modem combo? Then you’re going to want to upgrade their experience to keep your sanity for the next few holidays.

The first thing you need to do is get control of their wireless setup. You need to get some form of wireless access point that wasn’t manufactured in the early part of the century. Most of the models on the market have Wi-Fi 6 support now. You don’t need to go crazy with a Wi-Fi 6E model for your loved ones right now because none of their devices will support it. You just need something more modern with a user interface that wasn’t written to look like Windows 3.1.

You also need to look for an access point that is controlled via a cloud console. If you’re the IT person in the group, you probably already use some form of cloud control for your home equipment. You don’t need a full Meraki or Juniper Mist setup to lighten your load, unless you already have one of those dashboards set up and have spare capacity. Otherwise you could look at something like Ubiquiti as a middle ground.

Why a cloud-controlled AP? Because then you can log in and fix things or diagnose issues without needing to spend time talking to less technical users. You can find out if they have an unstable Internet connection or change SSID passwords at the drop of a hat. You can even set up notifications for those remote devices to let you know when a problem happens so you can be ready and waiting for the call. And you can keep tabs on necessary upgrades so you have an answer ready when the next major exploit comes out and your parents call asking if they’re going to get infected by this virus. You can just tell them they’re up-to-date and good to go. The other advantage of this method is that when you upgrade your own equipment at home you can waterfall the old functional gear down to them and give them a “new to you” upgrade that they’ll appreciate.

Step 2: Device Upgrades

My dad was notorious for using everything long past the point of needing to be retired. It’s the way he was raised. If there’s a hole you patch it. If it breaks you fix it. If that fix doesn’t work you wrap it in duct tape and use it until it crumbles to dust. While that works for the majority of things out there it does cause issues with technology far too often.

He had an iPad that he loved. He didn’t use it all day, every day, but he did use it frequently enough to say that it was his primary computing device. It was a fourth-generation model, so it fell out of fashion a few years ago. When he would call me and ask why it was behaving a certain way or why he couldn’t download some new app from the App Store, I would always remind him that he had an older device that wasn’t fast enough or new enough to run the latest programs or even the latest operating system. This would usually elicit a grumble or two and then we would move on.

If you’re the Designated IT Person and you spend half your time trying to figure out what versions of OS and software are running on a device, do yourself a favor and invest in a new device for your users just to ease the headaches. If they use a tablet as their primary computing device, which many people today do, then just buy a new one and help them migrate all the data across to the new one while you’re eating turkey or opening presents.

Being on later hardware ensures that the operating system is the latest version with all the security patches needed to keep your users safe. It also means you’re not trying to figure out the last supported software version that still works with everything else. I’ve played this game trying to get an Apple Watch to connect to an older phone with mismatched software, as well as trying to get newer wireless security working on older laptops with very little capability beyond WPA1. The hours I burned trying to make the old junk work with the new stuff would have been better spent just buying a new version of the same old thing and moving all their software over. Problems seem to just disappear when you’re running on something that was manufactured within the last five years.

Step 3: Help Them Remember

This is probably my biggest request: Forgotten passwords. Either it’s the forgotten Apple ID or maybe the wireless network password. My parents and in-laws forget the passwords they need to log into things all the time. I finally broke down and taught them how to use a password management tool a few years ago and it made all the difference in the world. Now, instead of them having to remember what their password was for a shopping site they can just set it to automatically fill everything in. And since they only need to remember the master password for their app they don’t have to change it.

Better yet, most of these apps have a secure section for notes. So all those other important non-password things that seem to come up all the time are great to put in here. Social Security Numbers, bank account numbers, and so much more can be put in one central location and made easy to access. The best part? If you make it a shared vault you can request access to help them out when they forget how to get in. Or you can be designated as a trusted party that can access the account in the event of a tragedy. Getting your loved ones used to using password vaults now makes it much easier to have them storing important info there in case something happens down the road that requires you to jump in without their interaction. Trust me on this.


Tom’s Take

Your loved ones don’t need knick-knacks and useless junk. If you want to show them you love them, give them the gift of not having to call you every couple of days because they can’t remember the wireless password or because they keep getting an error that says their app isn’t supported on this device. Invest in your sanity and their happiness by giving them something that works and that you have the ability to manage from the background. If you can make it stable and useful and magically working before they call you with a problem, you’re going to find yourself a happier person in the years to come.

The Process Will Save You

I had the opportunity to chat with my friend Chris Marget (@ChrisMarget) this week for the first time in a long while. It was good to catch up with all the things that have been going on and reminisce about the good old days. One of the topics that came up during our conversation was around working inside big organizations and the way that change processes are built.

I worked at IBM as an intern 20 years ago and the process to change things even back then was arduous. My experience with it was the deployment procedures to set up a new laptop. When I arrived the task took an hour and required something like five reboots. By the time I left we had changed that process and gotten it down to half an hour and only two reboots. However, before we could get the new directions approved as the procedure I had to test it and make sure that it was faster and produced the same result. I was frustrated but ultimately learned a lot about the glacial pace of improvements in big organizations.

Slow and Steady Finishes the Race

Change processes work to slow down the chaos that comes from having so many things conspiring to cause disaster. Probably the most famous change management framework is the Information Technology Infrastructure Library (ITIL). That little four-letter word has caused a massive amount of headaches in the IT space. Stage 3 of ITIL is the one that deals with changes in the infrastructure. There’s more to ITIL overall, including asset management and continual improvement, but usually anyone that takes ITIL’s name in vain is talking about the framework for change management.

This isn’t going to be a post about ITIL specifically but about process in general. What is your current change management process? If you’re in a medium to large shop you probably have a system that requires you to submit changes, get them evaluated and approved, and then implement them on a schedule during a downtime window. If you’re in a small shop you probably just make changes on the fly and hope for the best. If you work in DevOps you probably call them “deployments” and they happen whenever someone pushes code. Whatever the actual name for the process is, you have one whether you realize it or not.

The true purpose of change management is to make sure what you’re doing to the infrastructure isn’t going to break anything. As frustrating as it is to have to go through the process every time, that friction is the point. You justify your changes and evaluate them for impact before scheduling them, as opposed to a “change it and find out” kind of methodology.

Process is ugly and painful and keeps you from making simple mistakes. If every part of a change form needs to be filled out you’re going to complete it to make sure you have all the information that is needed. If the change requires you to validate things in a lab before implementation then it’s forcing you to confirm that it’s not going to break anything along the way. There’s even a process exception for emergency changes and such that are more focused on getting the system running as opposed to other concerns. But whatever the process is it is designed to save you.

ITIL isn’t a pain in the ass by accident. It’s purposely built to force you to justify and document at every step of the process. It’s built to keep you from creating disaster by helping you create the paper trail that will save you when everything goes wrong.

Saving Your Time But Not Your Sanity

I used to work with a great engineer named John Pross. John wrote up all the documentation for our migrations between versions of software, including Novell NetWare and Novell GroupWise. When it came time to upgrade our office GroupWise server there was some hesitation on the part of the executive suite because they were worried we were going to run into an error and lock them out of their email. The COO asked John if he had a process he followed for the migration. John’s response was perfect in my mind:

“Yes, and I treat every migration like the first one.”

What John meant is that he wasn’t going to skip steps or take shortcuts to make things go faster. Every part of the procedure was going to be followed to the letter. And if something came up that didn’t match what he thought the output should have been it was going to stop until he solved that issue. John was methodical like that.

People like to take shortcuts. It’s in our nature to save time and energy however we can. But shortcuts are where the change process starts falling apart. If you do something different this time compared to the last ten times you’ve done it, because you’re in a hurry or you think this might be more efficient, without testing it first, you’re opening yourself up to a world of trouble. Maybe not this time, but certainly down the road when you try to build on that shortcut even more. Because that’s the nature of what we do.

As soon as you start cutting corners and ignoring process you’re going to increase the risk of creating massive issues rapidly. Think about something as simple as the Windows Server 2003 shutdown dialog box. People used to reboot a server on a whim. In Windows Server 2003, the console required you to type in a reason why you were manually shutting the server down. Most people that rebooted servers fell into two camps: those that followed their process and typed in the reason for the reboot, and those that just typed “;Lea;lksjfa;ldkjfadfk” as the reason and then, six months later during the post-mortem on an issue, were left cursing their snarky attitude toward reboot documentation.

Saving the Day

Change process saves you in two ways. The first is really apparent: it keeps you from making mistakes. By forcing you to figure out what needs to happen along the way and document the whole process from start to finish, you have all the info you need to make things successful. If there’s a chance to catch mistakes along the way, you’re going to have every opportunity to do so.

The second way change process saves you is when it fails. Yes, no process is perfect and there are more than a few times when the best intentions coupled with a flaw in the process created a beautiful disaster that gave everyone lots of opportunity to learn. The question always comes back to what was learned in that process.

Bad change failures usually lead to a sewer pipe of blame being pointed in your direction. People use process failures as a chance to deflect blame, avoid repercussions for doing something they shouldn’t have, or try to increase their stock in the company. A truly honest failure analysis doesn’t blame anyone; it blames the failed process and tries to find a way to fix it.

Chris told me in our conversation that he loved ITIL at one of his former jobs because every time it failed it led to a meaningful change in the process to avoid the same failure in the future. This is why blameless post-mortem discussions are so important. If the people followed the process and the process still failed, the people aren’t at fault. The process is incorrect or flawed and needs to be adjusted.

It’s like a recipe. If the instructions tell you to cook something for a specific amount of time and it doesn’t come out right, who is to blame? Is it you, because you did what you were told? Is it the recipe? Is it the way the instructions were written? If you start with the idea that you followed the process correctly and then try to figure out where the process is wrong, you can fix it for next time. Maybe you used a different kind of ingredient that needs more time. Or you made it thinner than normal and it needed less time than the recipe called for. Whatever the result, you document the process and change it for the future to prevent that mistake from happening again.

Of course, just like all good frameworks, change processes shouldn’t be changed without analysis. Because changing something just to save time or take a shortcut defeats the whole purpose! You need to justify why changes are necessary and prove they provide the same benefit with no additional exposure or potential loss. Otherwise you’re back to making changes and hoping you don’t get burned this time.


Tom’s Take

ITIL didn’t really become a popular thing until after I left IBM, but I’m sure if I were still there I’d be up to my eyeballs in it right now. ITIL was designed to keep keyboard cowboys like me from doing things we really shouldn’t be doing. Change management processes are designed to save us at every step of the way and make us catch our errors before they become outages. The process doesn’t exist to make our lives problematic. That’s like saying a seat belt in a car only exists to get in my way. It may be a pain when you’re dealing with it regularly, but when you need it you’re going to wish you’d been using it the whole time. Trust in the process and you will be saved.

What Can You Learn From Facebook’s Meltdown?

I wanted to wait to put out a hot take on the Facebook issues from earlier this week because failures of this magnitude always have details that come out well after the actual excitement is done. A company like Facebook isn’t going to do the kind of in-depth post-mortem that we might like to see but the amount of information coming out from other areas does point to some interesting circumstances causing this situation.

Let me start off the whole thing by reiterating something important: Your network looks absolutely nothing like Facebook. The scale of what goes on there is unimaginable to the normal person. The average person has no conception of what one billion looks like. Likewise, the scale of the networking that goes on at Facebook is beyond the ken of most networking professionals. I’m not saying this to make your network feel inferior. More that I’m trying to help you understand that your network operations resemble those at Facebook in the same way that a model airplane resembles a space shuttle. They’re alike on the surface only.

Facebook has unique challenges that they have to face in their own way. Network automation there isn’t a bonus. It’s a necessity. The way they deploy changes and analyze results doesn’t look anything like any software we’ve ever used. I remember moderating a panel that had a Facebook networking person talking about some of the challenges they faced all the way back in 2013:

The technology Najam Ahmad is talking about there is two or three generations removed from what is being used today. They don’t manage switches. They manage racks and rows. They don’t buy off-the-shelf software to do things. They write their own tools to scale the way they need them to scale. It’s not unlike a blacksmith making a tool for a very specific use case that would never be useful to any non-blacksmith.

Ludicrous Case Scenarios

One of the things that compounded the problems at Facebook was the inability to see what the worst case scenario could bring. The little clever things that Facebook has done to make their lives easier and improve reaction times ended up harming them in the end. I’ve talked before about how Facebook writes things from a standpoint of unlimited resources. They build their data centers as if the network will always be available and bandwidth is an unlimited resource that never has contention. The average Facebook programmer likely never lived in a world where a dial-up modem was high-speed Internet connectivity.

To that end, the way they build the rest of their architecture around those presumptions creates the possibility of absurd failure conditions. Take the door entry system. According to reports, part of the reason why things were slow to come back up was that the badge readers at the Facebook data centers wouldn’t allow access to the people that knew how to revert the changes that caused the issue. Usually card readers will retain their last good configuration in the event of a power outage to ensure that people with badges can still get in. It could be that the ones at Facebook work differently or just went down with the rest of the network. Whatever the case, the card readers weren’t letting people into the data center. Another report says that the doors didn’t even have the ability to be opened by a key. That’s the kind of planning you do when you’ve never had to break open a locked door.

Likewise, I find the situation with the DNS servers to be equally crazy. Per other reports, the DNS servers at Facebook constantly monitor connectivity to the internal network. If that connectivity goes down for some reason, the DNS servers withdraw the BGP routes being advertised for the Facebook AS until the issue is resolved. That’s what caused the outage from the outside world. Why would you do this? Sure, it’s clever to have your infrastructure withdraw the routing info when you’re offline so that users aren’t hammering your systems with massive amounts of retries. But why put that decision in the hands of your DNS servers? Why not have some other, more reliable system do it instead?
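
The pattern itself isn’t unique to Facebook. Here’s a rough sketch of how a health-check-driven advertisement can work, written as an ExaBGP-style helper process that announces or withdraws a service prefix by printing commands to stdout. The prefix, next hop, and health test are placeholders, and Facebook’s real tooling certainly looks nothing like this; the point is only to show where the risk lives.

```python
#!/usr/bin/env python3
"""Announce an anycast prefix only while an internal dependency is reachable."""
import socket
import sys
import time

PREFIX = "192.0.2.0/24"        # placeholder anycast service prefix
NEXT_HOP = "203.0.113.10"      # placeholder next hop for the announcement
BACKEND = ("10.0.0.10", 53)    # placeholder internal dependency to test

def backend_healthy() -> bool:
    """Can we still reach the internal service this prefix depends on?"""
    try:
        with socket.create_connection(BACKEND, timeout=2):
            return True
    except OSError:
        return False

def main() -> None:
    announced = False
    while True:
        healthy = backend_healthy()
        if healthy and not announced:
            print(f"announce route {PREFIX} next-hop {NEXT_HOP}")
            announced = True
        elif not healthy and announced:
            # If the health check itself is wrong, the withdrawal still happens.
            # That is the sharp edge this design hands to a single process.
            print(f"withdraw route {PREFIX} next-hop {NEXT_HOP}")
            announced = False
        sys.stdout.flush()
        time.sleep(10)

if __name__ == "__main__":
    main()
```

Seen this way, the design question becomes obvious: the process that decides to pull your routes is only as trustworthy as its health check, which is exactly what bit Facebook.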

I get that the mantra at Facebook has always been “fail fast” and that their architecture is built in such a way as to encourage individual systems to go down independently of others. That’s why Messenger can be down but the feed stays up, or why WhatsApp can have issues but you can still use Instagram. However, why was there no test of “what happens when it all goes down?” It could be that the idea of the entire network going offline is unthinkable to the average engineer. It could also be that the response to the whole network going down at once was to just shut everything down anyway. But what about the plan for getting back online? Or, worse yet, what about all the things that impacted the ability to get back online?

Fruits of the Poisoned Tree

That’s where the other part of my rant comes into play. It’s not enough that Facebook didn’t think ahead to plan on a failure of this magnitude. It’s also that their teams didn’t think of what would be impacted when it happened. The door entry system. The remote tools used to maintain the networking equipment. The ability for anyone inside the building to do anything. There was no plan for what could happen when every system went down all at once. Whether that was because no one knew how interdependent those services were or because no one could think of a time when everything would go down all at once is immaterial. You need to plan for the worst and figure out what dependencies look like.

Amazon learned this the hard way a few years ago when US-East-1 went offline. No one believed it at the time because the status dashboard still showed green lights. The problem? The dashboard was hosted in the region that went down and the lights couldn’t change! That problem was remedied soon afterwards, but it was a chuckle-worthy issue for sure.

Perhaps it’s because I work in an area where natural disasters are a bit more common, but I’ve always tried to think ahead to where the issues could crop up and how to combat them. What if you lose power completely? What if your network connection is offline for an extended period? What if the same tornado that takes out your main data center also wipes out your backup tapes? It might seem a bit crazy to consider these things, but the alternative is not having an answer in the off chance they happen.

In the case of Facebook, the question should have been “what happens if a rogue configuration deployment takes us down?” The answer had better not be “roll it back,” because that’s not thinking far enough ahead. At their scale it isn’t hard for a single change to knock a big chunk of the network offline quickly. Most of the controls that are put in place are designed to prevent that from happening, but you still need a plan for what to do if it does. No one expects a disaster. But you still need to know what to do if one happens.

Thus Endeth The Lesson

What we need to take away from this is that our best intentions can’t defeat the unexpected. Most major providers were silent on the schadenfreude of the situation because they know they could have been the one to suffer from it. You may not have a network like Facebook but you can absolutely take away some lessons from this situation.

You need to have a plan. You need to have a printed copy of that plan. It needs to be stored in a place where people can find it. It needs to be written in a way that the people who find it can implement it step-by-step. You need to keep it updated to reflect changes. You need to practice for disaster and quit assuming that everything will keep working correctly 100% of the time. And you need to have a backup plan for everything in your environment. What if the doors seal shut? What if the person with the keys to unlock the racks is missing? How do you ensure the systems don’t come back up in a degraded state before they’re ready? The list is endless, but that’s only because you haven’t started writing it yet.


Tom’s Take

There is going to be a ton of digital ink spilled on this outage. People are going to ask questions that don’t have answers and pontificate about how it could have been avoided. Hell, I’m doing it right now. However, I think the issues that compounded the problems are ones that can be addressed no matter what technology you’re using. Backup plans are important for everything you do, from campouts to dishwasher installations to social media websites. You need to plan for the worst and make sure that the people you work with know where to find the answers when everything fails. This is the best kind of learning experience because so many eyes are on it. Take what you can from this and apply it where needed in your enterprise. Your network may not look anything like Facebook, but with some planning now you don’t have to worry about it crashing like theirs did either.

Sharing Failure as a Learning Model

Earlier this week there was a great tweet from my friends over at Juniper Networks about mistakes we’ve made in networking:

It got some interactions with the community, which is always nice, but it got me to thinking about how we solve problems and learn from our mistakes. I feel that we’ve reached a point where we’re learning from the things we’ve screwed up but we’re not passing it along like we used to.

Write It Down For the Future

Part of the reason why I started my blog was to capture ideas that had been floating in my head for a while. Troubleshooting steps or perhaps even ideas that I wanted to make sure I didn’t forget down the line. All of it was important to capture for the sake of posterity. After all, if you didn’t write it down did it even happen?

Along the way I found that the posts that got significant traction on my site were the ones that involved mistakes. Something I’d done that caused an issue, or something I needed to look up through a lot of sources and then distilled down into an easy reference. These kinds of posts are the ones that fly right up to the top of the Google search results. They are how people know you. It could be a terminology post like defining trunks. Or perhaps it’s a question about why your SFP modules aren’t working in a switch.

Once I realized that people loved finding posts that solved problems I made sure to write more of them down. If I found a weird error message I made sure to figure out what it was and then put it up for everyone to find. When I documented weird behaviors of BPDUGuard and BPDUFilter that didn't match the documentation I wrote it all down, including how I'd made a mistake in the way I interpreted things. It was just part of the experience for me: documenting my failures and my learning process so that someone down the road could find the post and learn from it like I had.

Chit Chat Channels

It used to be that when you Googled error messages you got lots of results from forum sites or Reddit or other blogs detailing what went wrong and how you fixed it. I assume that is because, just like me, people were doing their research, figuring out what went wrong, and then documenting the process. Today I feel like a lot of that type of conversation is missing. I know it can't have gone away permanently, because all networking engineers still make mistakes and solve problems, and someone has to know where that knowledge went, right?

The answer came to me when I read a Reddit post about networking message boards. The suggestions in the comments weren’t about places to go to learn more. Instead, they linked to Slack channels or Discord servers where people talk about networking. That answer made me realize why the discourse around problem solving and learning from mistakes seems to have vanished.

Slack and Discord are great tools for communication. They're also very private. I'm not talking about gatekeeping or restrictions on joining. I'm talking about the fact that the conversations that happen there don't get posted anywhere else. You can join, ask about a problem, get advice, try it, see it fail, try something else, and succeed, all without ever documenting a thing. Once you solve the problem you don't have a paper trail of all the things you tried that didn't work. You just have the solution that finally worked, and that's that.

You know what you can't do with Slack and Discord? Search them through Google. The logs are private. The free tiers age out messages after a certain point. All that knowledge disappears into thin air. Unlike the Wisdom of the Ancients, the issues we solve in Slack are gone as soon as you hit your message limit. No one learns from the mistakes because it looks like no one has made them before.

Going the Extra Mile

I’m not advocating for removing Slack and Discord from our daily conversations. Instead, I’m proposing that when we do solve a hard problem or we make a mistake that others might learn from we say something about it somewhere that people can find it. It could be a blog post or a Reddit thread or some kind of indexable site somewhere.

Even the process of taking what you've done and consolidating it down into something that makes sense can be helpful. I saw X, tried Y and Z, and ended up doing B because it worked the best of all. Just walking through how you got to B by way of the things that didn't work will go a long way toward helping others. Yes, it can be a bit humbling and embarrassing to publish something that admits you made a mistake. But it's also part of the way that we learn as humans. If others can see where we went and understand why that path doesn't lead to a solution, then we've effectively taught them too.


Tom’s Take

It may be a bit self-serving for me to say that more people need to be blogging about solutions and problems and such, but I feel that we don't really learn from something unless we internalize it. That means figuring it out and writing it down. Whether it starts as a discussion on a podcast or a back-and-forth conversation in Discord, we need to find ways to get the words out into the world so that others can build on what we've accomplished. Google can't search archives that aren't on the web. If we want to leave a legacy for the DenverCoder10s of the future, that means we do the work now of sharing our failures as well as our successes and letting the next generation learn from us.

The Mystery of Known Issues

I've spent the better part of the last month fighting a transient issue with my home ISP. I thought I had it figured out after a hardware failure at the connection point, but it crept back up after I got back from my Philmont trip. I spent a lot of energy upgrading my home equipment firmware and charting the seemingly random timing of the issue. I also called the technical support line and carefully explained what I was seeing and what had been done to work on the problem already.
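
If you're curious what charting the timing looked like in practice, the sketch below is roughly the kind of thing I mean: a little script that probes an outside host and logs when connectivity drops and when it comes back. The probe target, interval, and log file name are arbitrary choices for illustration, not anything specific to my ISP or modem.

    # Minimal outage logger: probe an external host and record when
    # connectivity drops and when it returns. All values are arbitrary.
    import socket
    import time
    from datetime import datetime

    PROBE_TARGET = ("8.8.8.8", 53)   # any reliably reachable host and port
    INTERVAL = 30                    # seconds between probes
    LOGFILE = "outages.csv"

    def is_up(timeout=3):
        """Return True if a TCP connection to the probe target succeeds."""
        try:
            with socket.create_connection(PROBE_TARGET, timeout=timeout):
                return True
        except OSError:
            return False

    down_since = None
    while True:
        now = datetime.now().isoformat(timespec="seconds")
        up = is_up()
        if not up and down_since is None:
            down_since = now                       # outage just started
        elif up and down_since is not None:
            with open(LOGFILE, "a") as f:
                f.write(f"{down_since},{now}\n")   # record start and end
            down_since = None
        time.sleep(INTERVAL)

Even a crude log like that makes it much easier to show support that the drops follow a pattern instead of arguing from memory.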

The responses usually ranged from confused reactions to attempts to reset my cable modem, which never worked. It took several phone calls and lots of repeated explanations before I finally got a different answer from a technician. It turns out there was a known issue with the modem hardware! It’s something they’ve been working on for a few weeks and they’re not entirely sure what the ultimate fix is going to be. So for now I’m going to have to endure the daily resets. But at least I know I’m not going crazy!

Issues for Days

Known issues are a way of life in technology. If you've worked with any system for any length of time you've seen the list of things that aren't working or have weird interactions with other things. Given how many systems we interact with, and how dependent those systems are on one another, it's no wonder those known issue lists are miles long by now.

Whether it's a bug or an advisory or a listing of an incompatibility on a site, the nature of all known issues is the same. They are things that don't work that we can't fix yet. They could be on a list of issues to resolve or something that may never be fixable. The key is that we know all about them so we can plan around them. Maybe it's something like a bug in a floating point unit that causes long division calculations to be inaccurate past a certain number of decimal places. If you know what the issue is you know how to either plan around it or use something different. Maybe you don't calculate to that level of precision. Maybe you do that math on a different system with another chip. Whatever the case, you need to know about the issue before you can work around it.
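
As a toy illustration of "plan around it," here's a sketch of routing precision-critical division through software instead of the hardware float path you don't trust. The buggy FPU is hypothetical here (the operands just happen to be the classic Pentium FDIV test values), and the precision setting is an arbitrary choice.

    # Sketch: do precision-critical division with arbitrary-precision
    # decimal math instead of the hardware FPU.
    from decimal import Decimal, getcontext

    def divide_precisely(a: str, b: str, digits: int = 28) -> Decimal:
        """Divide two numbers using software decimal arithmetic."""
        getcontext().prec = digits       # arbitrary precision setting
        return Decimal(a) / Decimal(b)

    # Hardware float division -- fine on a healthy chip, suspect on a buggy one.
    hardware_result = 4195835.0 / 3145727.0

    # Software decimal division -- slower, but independent of the FPU.
    software_result = divide_precisely("4195835", "3145727")

    print(hardware_result)   # ~1.3338204491...
    print(software_result)

The specifics are beside the point; what matters is that once an issue is known, the workaround can be deliberate instead of accidental.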

Not all known issues are publicly known. They could involve sensitive information about a system. Perhaps the issue itself is a potential security risk. Most advisories about remote exploits are known issues internally at companies before they are patched. While they aren't immediately disclosed, they are eventually found out when the patch is released or when someone outside the company discovers the same issue independently. Putting these kinds of things under an embargo of sorts isn't always bad if it reduces the window for wider exploitation. However, the truth must eventually come out or things can't get resolved.

Knowing the Unknown

What happens when the reasons for not disclosing known problems are less than noble? What if the reasoning behind hiding an issue has more to do with covering up bad decision making or saving face or even keeping investors or customers from fleeing? Welcome to the dark side of disclosure.

When I worked for Gateway 2000 back in the early part of the millennium, we had a particularly nasty known issue in the system. The ultimate root cause was that the capacitors on a series of motherboards were made with poor quality controls or bad components and would swell and eventually burst, causing the system to come to a halt. The symptoms manifested themselves in all manner of strange ways, like race conditions or random software errors. We would sometimes spend hours troubleshooting an unrelated issue only to find out the motherboard was affected with "bad caps".

The issue was well documented in the tech support database for the affected boards. Once we could determine that it was a capacitor issue it was very easy to get the parts replaced. Getting to that point was the trick, though. Because at the top of the article describing the problem was a big, bold statement:

Do Not Tell The Customer This Is A Known Issue!!!

What? I can’t tell them that their system has an issue that we need to correct before everything pops and shuts it down for good? I can’t even tell them what to look for specifically when we open the case? Have you ever tried to tell a 75-year-old grandmother to look for something “strange” in a computer case? You get all kinds of fun answers!

We ended up getting creative in finding ways to look for those issues and getting the boards replaced where we could. When I moved on to my next job working for a VAR, I found out some of those same machines had been sold to a customer. I opened the case and found bad capacitors right away. I told my manager, explained the issue, and we started getting boards replaced under warranty at the first sign of problems. After the warranty expired we kept ordering good boards from suppliers until we were able to retire all of those bad machines. If I hadn't known about the bad cap issue from my help desk days I never would have known what to look for.

Known issues like these are exactly the kind of thing you need to tell your customers about. It's something that impacts their computer. It needs to be fixed. Maybe the company didn't want to have to replace thousands of boards at once. Maybe they didn't want to admit they cut corners when buying the parts and that the money they saved was going to haunt them in increased support costs. Whatever the reason, it's not the fault of the customer that the issue is present. They should have the option to get things fixed properly. Hiding what has happened only strains the relationship between customer and provider.

Which brings me back to my issue from above. Maybe it wasn’t “known” when I called the first time. But by the third or fourth time I called about the same thing they should have been able to tell me it’s a known problem with this specific behavior and that a fix is coming soon. The solution wasn’t to keep using the first-tier support fixes of resets or transfers to another department. I would have appreciated knowing it was an issue so I didn’t have to spend as much time upgrading and isolating and documenting the hell out of everything just to exclude other issues. After all, my troubleshooting skills haven’t disappeared completely!

Vendors and providers, if you have a known issue you should admit it. Be up front. Honesty will get you far in this world. Tell everyone there's a problem and that you're working on a fix you don't have just yet. It may not make the customer happy at first, but they'll appreciate it a lot more than finding out you hid it for days or weeks while you scrambled to fix it without telling anyone. If that customer has more than a basic level of knowledge about systems they'll probably be able to figure it out anyway, and then you're going to be the one with egg on your face when they tell you all about the problem you don't want to admit you have.


Tom’s Take

I've been on both sides of this fence in a number of situations. Do we admit we have a known problem and try to get it fixed? Or do we get creative and try to hide it so we don't have to own up to the uncomfortable questions that get asked about bad development or cutting corners? The answer should always be to own up to things. Make everyone aware of what's going on and make it right. I'd rather deal with an honest company working hard to make things better than a dishonest vendor that miraculously manages to fix things out of nowhere. An ounce of honesty prevents a pound of bad reputation.