What Can You Learn From Facebook’s Meltdown?

I wanted to wait to put out a hot take on the Facebook issues from earlier this week because failures of this magnitude always have details that come out well after the initial excitement is done. A company like Facebook isn’t going to publish the kind of in-depth post-mortem that we might like to see, but the information coming out from other sources does point to some interesting circumstances behind this outage.

Let me start off the whole thing by reiterating something important: Your network looks absolutely nothing like Facebook. The scale of what goes on there is unimaginable to the normal person. The average person has no conception of what one billion looks like. Likewise, the scale of the networking that goes on at Facebook is beyond the ken of most networking professionals. I’m not saying this to make your network feel inferior. More that I’m trying to help you understand that your network operations resemble those at Facebook in the same way that a model airplane resembles a space shuttle. They’re alike on the surface only.

Facebook has unique challenges that they have to face in their own way. Network automation there isn’t a bonus. It’s a necessity. The way they deploy changes and analyze results doesn’t look anything like any software we’ve ever used. I remember moderating a panel that had a Facebook networking person talking about some of the challenges they faced all the way back in 2013:

That technology that Najam Ahmad is talking about is two or three generations removed from what is being used today. They don’t manage switches. They manage racks and rows. They don’t buy off-the-shelf software to do things. They write their own tools to scale the way they need them to scale. It’s not unlike a blacksmith making a tool for a very specific use case that would never be useful to any non-blacksmith.

Ludicrous Case Scenarios

One of the things that compounded the problems at Facebook was the inability to see what the worst case scenario could bring. The little clever things that Facebook has done to make their lives easier and improve reaction times ended up harming them in the end. I’ve talked before about how Facebook writes things from a standpoint of unlimited resources. They build their data centers as if the network will always be available and bandwidth is an unlimited resource that never has contention. The average Facebook programmer likely never lived in a world where a dial-up modem was high-speed Internet connectivity.

To that end, the way they build the rest of their architecture around those presumptions creates the possibility of absurd failure conditions. Take the door entry system. According to reports, part of the reason why things were slow to come back up was that the door entry system for the Facebook data centers wouldn’t allow access to the people who knew how to revert the changes that caused the issue. Usually, card readers will retain their last good configuration in the event of a power outage to ensure that people with badges can still get in. It could be that the ones at Facebook work differently or just went down with the rest of the network. Whatever the case, the card readers weren’t allowing people into the data center. Another report says that the doors couldn’t even be opened with a key. That’s the kind of planning you do when you’ve never had to break open a locked door.

Likewise, I find the situation with the DNS servers to be equally crazy. Per other reports the DNS servers at Facebook are constantly monitoring connectivity to the internal network. If that goes down for some reason the DNS servers withdraw the BGP routes being advertised for the Facebook AS until the issue is resolved. That’s what caused the outage from the outside world. Why would you do this? Sure, it’s clever to basically have your infrastructure withdraw the routing info in case you’re offline to ensure that users aren’t hammering your system with massive amounts of retries. But why put that decision in the hands of your DNS servers? Why not have some other more reliable system do it instead?
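As a sketch of that logic (the names, thresholds, and structure here are my own invention, not Facebook's actual code), the decision each DNS node makes might look something like this:

```python
# Hypothetical sketch of the health-check withdrawal described above.
# All names and thresholds are illustrative, not Facebook's actual code.

def decide_advertisement(probe_results, failure_threshold=3):
    """Given recent backbone probe results (True = reachable, newest last),
    decide whether this DNS node should keep advertising its anycast
    prefixes via BGP.

    Withdrawing only after several consecutive failures keeps a single
    dropped probe from taking the node off the Internet.
    """
    consecutive_failures = 0
    for ok in reversed(probe_results):
        if ok:
            break
        consecutive_failures += 1
    return consecutive_failures < failure_threshold
```

The danger is that every node runs the same local check against the same backbone, so one bad backbone change makes all of them withdraw at once. A more conservative design might require confirmation from a system outside the failure domain before pulling the routes.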

I get that the mantra at Facebook has always been “fail fast” and that their architecture is built in such a way as to encourage individual systems to go down independently of others. That’s why Messenger can be down but the feed stays up, or why WhatsApp can have issues but you can still use Instagram. However, why was there no test of “what happens when it all goes down?” It could be that the idea of the entire network going offline is unthinkable to the average engineer. It could also be that the response to the whole network going down all at once was to just shut everything down anyway. But what about the plan for getting back online? Or, worse yet, what about all the things that impacted the ability to get back online?

Fruits of the Poisoned Tree

That’s where the other part of my rant comes into play. It’s not enough that Facebook didn’t think ahead to plan on a failure of this magnitude. It’s also that their teams didn’t think of what would be impacted when it happened. The door entry system. The remote tools used to maintain the networking equipment. The ability for anyone inside the building to do anything. There was no plan for what could happen when every system went down all at once. Whether that was because no one knew how interdependent those services were or because no one could think of a time when everything would go down all at once is immaterial. You need to plan for the worst and figure out what dependencies look like.

Amazon learned this the hard way a few years ago when US-East-1 went offline. No one believed it at the time because the status dashboard still showed green lights. The problem? The board was hosted on the zone that went down and the lights couldn’t change! That problem was remedied soon afterwards but it was a chuckle-worthy issue for sure.

Perhaps it’s because I work in an area where disasters are a bit more common, but I’ve always tried to think ahead to where the issues could crop up and how to combat them. What if you lose power completely? What if your network connection is offline for an extended period? What if the same tornado that takes out your main data center also wipes out your backup tapes? It might seem a bit crazy to consider these things, but the alternative is not having an answer in the off chance one of them happens.

In the case of Facebook, the question should have been “what happens if a rogue configuration deployment takes us down?” The answer better not be “roll it back” because you’re not thinking far enough ahead. With the scale of their systems it isn’t hard to create a change to knock a bunch of it offline quickly. Most of the controls that are put in place are designed to prevent that from happening but you need to have a plan for what to do if it does. No one expects a disaster. But you still need to know what to do if one happens.
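One such control is a staged rollout that widens only while health checks keep passing. This is a hypothetical sketch of the pattern, not any vendor's tooling; the function names and doubling strategy are assumptions:

```python
def staged_rollout(devices, apply_change, is_healthy, batch_size=1):
    """Push a change in growing batches, halting at the first unhealthy
    batch so a rogue configuration only reaches a fraction of the fleet.

    Returns (devices_changed, succeeded). On failure, the caller rolls
    back only `devices_changed` instead of the whole network.
    """
    changed = []
    i = 0
    while i < len(devices):
        batch = devices[i:i + batch_size]
        for device in batch:
            apply_change(device)
            changed.append(device)
        if not all(is_healthy(device) for device in batch):
            return changed, False  # stop the blast radius here
        i += len(batch)
        batch_size *= 2  # expand cautiously after each healthy batch
    return changed, True
```

The guardrail limits the damage, but the original point stands: you still need a written plan for what to do when a change slips past it.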

Thus Endeth The Lesson

What we need to take away from this is that our best intentions can’t defeat the unexpected. Most major providers resisted any schadenfreude over the situation because they know they could have been the ones to suffer from it. You may not have a network like Facebook’s, but you can absolutely take away some lessons from this situation.

You need to have a plan. You need to have a printed copy of that plan. It needs to be stored in a place where people can find it. It needs to be written in a way that the people who find it can implement it step-by-step. You need to keep it updated to reflect changes. You need to practice for disaster and quit assuming that everything will keep working correctly 100% of the time. And you need to have a backup plan for everything in your environment. What if the doors seal shut? What if the person with the keys to unlock the racks is missing? How do we ensure the systems don’t come back up in a degraded state before they’re ready? The list is endless, but that’s only because you haven’t started writing it yet.


Tom’s Take

There is going to be a ton of digital ink spilled on this outage. People are going to ask questions that don’t have answers and pontificate about how it could have been avoided. Hell, I’m doing it right now. However, I think the issues that compounded the problems are ones that can be addressed no matter what technology you’re using. Backup plans are important for everything you do, from campouts to dishwasher installations to social media websites. You need to plan for the worst and make sure that the people you work with know where to find the answers when everything fails. This is the best kind of learning experience because so many eyes are on it. Take what you can from this and apply it where needed in your enterprise. Your network may not look anything like Facebook, but with some planning now you don’t have to worry about it crashing like theirs did either.

Sharing Failure as a Learning Model

Earlier this week there was a great tweet from my friends over at Juniper Networks about mistakes we’ve made in networking:

It got some interactions with the community, which is always nice, but it got me to thinking about how we solve problems and learn from our mistakes. I feel that we’ve reached a point where we’re learning from the things we’ve screwed up but we’re not passing it along like we used to.

Write It Down For the Future

Part of the reason why I started my blog was to capture ideas that had been floating in my head for a while. Troubleshooting steps or perhaps even ideas that I wanted to make sure I didn’t forget down the line. All of it was important to capture for the sake of posterity. After all, if you didn’t write it down did it even happen?

Along the way I found that the posts that got significant traction on my site were the ones that involved mistakes. Something I’d done that caused an issue, or something I needed to look up through a lot of sources that I distilled down into an easy reference. These kinds of posts are the ones that fly right up to the top of the Google search results. They are how people know you. It could be a terminology post like defining trunks. Or perhaps it’s a question about why your SFP modules aren’t working in a switch.

Once I realized that people loved finding posts that solved problems I made sure to write more of them down. If I found a weird error message I made sure to figure out what it was and then put it up for everyone to find. When I documented weird behaviors of BPDUGuard and BPDUFilter that didn’t match the documentation I wrote it all down, including how I’d made a mistake in the way that I interpreted things. It was just part of the experience for me. Documenting my failures and my learning process could help someone in the future. My hope was that someone in the future would find my post and learn from it like I had.

Chit Chat Channels

It used to be that when you Googled error messages you got lots of results from forum sites or Reddit or other blogs detailing what went wrong and how you fixed it. I assume that is because, just like me, people were doing their research and figuring out what went wrong and then documenting the process. Today I feel like a lot of that type of conversation is missing. I know it can’t have gone away permanently, because all networking engineers make mistakes and solve problems, and that knowledge has to end up somewhere, right?

The answer came to me when I read a Reddit post about networking message boards. The suggestions in the comments weren’t about places to go to learn more. Instead, they linked to Slack channels or Discord servers where people talk about networking. That answer made me realize why the discourse around problem solving and learning from mistakes seems to have vanished.

Slack and Discord are great tools for communication. They’re also very private. I’m not talking about gatekeeping or restrictions on joining. I’m talking about the fact that the conversations that happen there don’t get posted anywhere else. You can join, ask about a problem, get advice, try it, see it fail, try something else, and succeed all without ever documenting a thing. Once you solve the problem you don’t have a paper trail of all the things you tried that didn’t work. You just have the best solution that you did and that’s that.

You know what you can’t do with Slack and Discord? Search them through Google. The logs are private. The free tiers hide messages past a certain limit. All that knowledge disappears into thin air. Unlike the Wisdom of the Ancients, the issues we solve in Slack are gone as soon as you hit your message limit. No one learns from the mistakes because it looks like no one has made them before.

Going the Extra Mile

I’m not advocating for removing Slack and Discord from our daily conversations. Instead, I’m proposing that when we do solve a hard problem or we make a mistake that others might learn from we say something about it somewhere that people can find it. It could be a blog post or a Reddit thread or some kind of indexable site somewhere.

Even the process of taking what you’ve done and consolidating it down into something that makes sense can be helpful. I saw X, tried Y and Z, and ended up doing B because it worked the best of all. Just the process of how you got to B through the other things that didn’t work will go a long way toward helping others. Yes, it can be a bit humbling and embarrassing to publish something that admits you made a mistake. But it’s also part of the way that we learn as humans. If others can see where we went and understand why that path doesn’t lead to a solution, then we’ve effectively taught others too.


Tom’s Take

It may be a bit self-serving for me to say that more people need to be blogging about solutions and problems and such, but I feel that we don’t really learn from something unless we internalize it. That means figuring it out and writing it down. Whether it’s a discussion on a podcast or a back-and-forth conversation in Discord, we need to find ways to get the words out into the world so that others can build on what we’ve accomplished. Google can’t search archives that aren’t on the web. If we want to leave a legacy for the DenverCoder10s of the future, that means we do the work now of sharing our failures as well as our successes and letting the next generation learn from us.

The Mystery of Known Issues

I’ve spent the better part of the last month fighting a transient issue with my home ISP. I thought I had it figured out after a hardware failure at the connection point, but it crept back up after I got back from my Philmont trip. I spent a lot of energy upgrading my home equipment firmware and charting the seemingly random timing of the issue. I also called the technical support line and carefully explained what I was seeing and what had been done to work on the problem already.

The responses usually ranged from confused reactions to attempts to reset my cable modem, which never worked. It took several phone calls and lots of repeated explanations before I finally got a different answer from a technician. It turns out there was a known issue with the modem hardware! It’s something they’ve been working on for a few weeks and they’re not entirely sure what the ultimate fix is going to be. So for now I’m going to have to endure the daily resets. But at least I know I’m not going crazy!

Issues for Days

Known issues are a way of life in technology. If you’ve worked with any system for any length of time you’ve seen the list of things that aren’t working or have weird interactions with other things. Given the ever-growing number of interdependent systems we rely on, it’s no wonder those known issue lists are miles long by now.

Whether it’s a bug or an advisory or a listing of an incompatibility on a site, the nature of all known issues is the same. They are things that don’t work that we can’t fix yet. They could be on a list of issues to resolve or something that may never be able to be fixed. The key is that we know all about them so we can plan around them. Maybe it’s something like a bug in a floating point unit that causes long division calculations to be inaccurate to a certain number of decimal places. If you know what the issue is you know how to either plan around it or use something different. Maybe you don’t calculate to that level of precision. Maybe you do that on a different system with another chip. Whatever the case, you need to know about the issue before you can work around it.

Not all known issues are publicly known. They could involve sensitive information about a system. Perhaps the issue itself is a potential security risk. Most advisories about remote exploits are known issues internally at companies before they are patched. While they aren’t immediately disclosed, they are eventually found out when the patch is released or when someone outside the company discovers the same issue. Putting these kinds of things under an embargo of sorts isn’t always bad if it protects against wider exploitation. However, the truth must eventually come out or things can’t get resolved.

Knowing the Unknown

What happens when the reasons for not disclosing known problems are less than noble? What if the reasoning behind hiding an issue has more to do with covering up bad decision making or saving face or even keeping investors or customers from fleeing? Welcome to the dark side of disclosure.

When I worked for Gateway 2000 back in the early part of the millennium, we had a particularly nasty known issue in the system. The ultimate root cause was that the capacitors on a series of motherboards were made with poor quality controls or bad components and would swell and eventually burst, causing the system to come to a halt. The symptoms manifested themselves in all manner of strange ways, like race conditions or random software errors. We would sometimes spend hours troubleshooting an unrelated issue only to find out the motherboard was affected with “bad caps”.

The issue was well documented in the tech support database for the affected boards. Once we could determine that it was a capacitor issue it was very easy to get the parts replaced. Getting to that point was the trick, though. Because at the top of the article describing the problem was a big, bold statement:

Do Not Tell The Customer This Is A Known Issue!!!

What? I can’t tell them that their system has an issue that we need to correct before everything pops and shuts it down for good? I can’t even tell them what to look for specifically when we open the case? Have you ever tried to tell a 75-year-old grandmother to look for something “strange” in a computer case? You get all kinds of fun answers!

We ended up getting creative in finding ways to look for those issues and getting them replaced where we could. When I moved on to my next job working for a VAR, I found out some of those same machines had been sold to a customer. I opened the case and found bad capacitors right away. I told my manager and explained the issue, and we started getting the boards replaced under warranty as soon as the first sign of problems appeared. After the warranty expired we kept ordering good boards from suppliers until we were able to retire all of those bad machines. If I hadn’t known about the bad cap issue from my help desk time I never would have known what to look for.

Known issues like these are exactly the kind of thing you need to tell your customers about. It’s something that impacts their computer. It needs to be fixed. Maybe the company didn’t want to have to replace thousands of boards at once. Maybe they didn’t want to have to admit they cut corners when they were buying the parts and now the money they saved is going to haunt them in increased support costs. Whatever the reason it’s not the fault of the customer that the issue is present. They should have the option to get things fixed properly. Hiding what has happened is only going to create stress for the relations between consumer and provider.

Which brings me back to my issue from above. Maybe it wasn’t “known” when I called the first time. But by the third or fourth time I called about the same thing they should have been able to tell me it’s a known problem with this specific behavior and that a fix is coming soon. The solution wasn’t to keep using the first-tier support fixes of resets or transfers to another department. I would have appreciated knowing it was an issue so I didn’t have to spend as much time upgrading and isolating and documenting the hell out of everything just to exclude other issues. After all, my troubleshooting skills haven’t disappeared completely!

Vendors and providers, if you have a known issue you should admit it. Be up front. Honesty will get you far in this world. Tell everyone there’s a problem and that you’re working on a fix you don’t have just yet. It may not make the customer happy at first, but they’ll understand a lot more than if you hide it for days or weeks while you scramble to fix it without telling anyone. If that customer has more than a basic level of knowledge about systems they’ll probably be able to figure it out anyway, and then you’re going to be the one with egg on your face when they tell you all about the problem you don’t want to admit you have.


Tom’s Take

I’ve been on both sides of this fence in a number of situations. Do we admit we have a known problem and try to get it fixed? Or do we get creative and try to hide it so we don’t have to own up to the uncomfortable questions that get asked about bad development or cutting corners? The answer should always be to own up to things. Make everyone aware of what’s going on and make it right. I’d rather deal with an honest company working hard to make things better than a dishonest vendor that miraculously manages to fix things out of nowhere. An ounce of honesty prevents a pound of bad reputation.

Should We Embrace Points of Failure?


There was a tweet making the rounds this last week that gave me pause. Max Clark said that we should embrace single points of failure in the network. His post was an impassioned plea to networking rock stars out there to drop redundancy out of their networks and instead embrace these Single Points of Failure (SPoF). The main points Mr. Clark made boil down to a couple of major statements:

  1. Single-device networks are less complex and easier to manage and troubleshoot. Don’t have multiple devices when an all-in-one works better.
  2. Consumer-grade hardware is cheaper and easier to understand, therefore it’s better. Plus, if you need a backup you can just buy a second one and keep it on the shelf.

I’m sure most networking pros out there are practically bristling at these suggestions. Others may read through the original tweet and think it was a tongue-in-cheek post. Let’s look at the argument logically and understand why it has some merit but is ultimately flawed.

Missing Minutes Matter

I’m going to tackle the second point first. The idea that you can use cheaper gear and have cold standby equipment just sitting on the shelf is one that I’ve heard of many times in the past. Why pay more money to have a hot spare or a redundant device when you can just stock a spare part and swap it out when necessary? If your network design decisions are driven completely by cost then this is the most appealing thing you could probably do.

I once worked with a technology director who insisted that we forgo our usual configuration of RAID-5 with a hot spare drive in the servers we deployed. His logic was that the hot spare drive was spinning without actually doing anything. If something did go wrong, it was just as easy to take the spare drive off the shelf and slot it in then, instead of running the risk that the spare might fail while spinning idle in the server. His logic seemed reasonable enough, but there was one variable he wasn’t thinking about.

Time is always the deciding factor in redundancy planning. In the world of backup and disaster recovery they use the acronym RTO, which stands for Recovery Time Objective. Essentially, how long do you want your systems to be offline before the data is restored? Can you go days without getting your data back? Or do you need it back up and running within hours or even minutes? For some organizations the RTO could even be measured in mere seconds. Every RTO measurement adds additional complexity and cost.

If you can go days without your data then a less expensive tape solution is best because it is the cheapest per byte stored and lasts forever. If your RTO is minutes or less you need to add hardware that replicates changes or mirrors the data between sites to ensure there is always an available copy somewhere out there. Time is the deciding factor here, just as it is in the redundancy example above.
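In other words, picking a recovery technology is a lookup from the RTO you can tolerate to the cheapest tier that meets it. Here is a toy sketch of that mapping; the tier names and cutoffs are illustrative, not any vendor's actual matrix:

```python
def pick_recovery_tier(rto_seconds):
    """Map a Recovery Time Objective to the cheapest recovery tier that
    can plausibly meet it. Thresholds are illustrative only."""
    if rto_seconds >= 86_400:   # days: offline tape is cheapest per byte
        return "tape"
    if rto_seconds >= 3_600:    # hours: disk snapshots or cold spares
        return "disk-snapshot"
    if rto_seconds >= 60:       # minutes: asynchronous replication off-site
        return "async-replication"
    return "sync-mirror"        # seconds: synchronous mirroring between sites
```

The same exercise applies to network hardware: a cold spare on a shelf is just the "tape" tier of redundancy.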

Can your network tolerate hours of downtime while you swap in a part from off the shelf? Remember that you’re going to need to copy the configuration over to it and ensure it’s back up and running. If it is a consumer-grade device there probably isn’t an easy way to console in and paste the config. Maybe you can upload a file from the web GUI but the odds are pretty good that you’re looking at downtime at least in the half-hour range if not more. If your office can deal with that then Max’s suggestions should work just fine.

For organizations that need to be back up and running in less than hours, you need to have fault tolerance in your network. Redundant paths for traffic or multiple devices to eliminate single points of failure are the only way to ensure that traffic keeps flowing in the event of a hardware failure. Sure, it’s more complicated to troubleshoot. But the time you spend making it work correctly is not time you’re going to spend copying configurations to a cold device while users and stakeholders are yelling at you to get things back online.

All-In-Wonder

Let’s look at the first point here. Single box solutions are better because they are simple to manage and give you everything you could need. Why buy a separate switch, firewall, and access point when you can get them all in one package? This is the small office / branch office model. SD-WAN has even started moving down this path for smaller deployments by pushing all the devices you need into one footprint.

It’s not unlike the TVs you can buy in the big box stores that have DVD players, VHS players, and even some streaming services built in. They’re easy to use because there are no extra wires to plug in and no additional remote controls to lose. Everything works from one central location and it’s simple to manage. The package is a great solution when you need to watch old VHS tapes or DVDs from your collection infrequently.

Of course, most people understand the drawbacks of this model. Those devices can break. They are much harder to repair when they’re all combined. Worse yet, if the DVD player breaks and you need to get it repaired you lose the TV completely during the process instead of just the DVD player. You also can’t upgrade the components individually. Want to trade out that DVD player for a Blu-Ray player? You can’t unless you install one on its own. Want to keep those streaming apps up-to-date? Better hope the TV has enough memory to stay current. Even state-of-the-art streaming boxes will eventually be incapable of running the latest version of popular software.

All-in-one devices are best left to the edges of the network. They function well in offices with a dozen or so workers. If something goes bad on the device it’s easier to just swap the whole thing instead of trying to repair the individual parts. That same kind of mentality doesn’t work quite so well in a larger data center. The fact that most of these unified devices don’t take rack mounting ears or fit into a standard data center rack should be a big hint that they aren’t designed for use in a place that keeps the networking pieces off of someone’s desk.


Tom’s Take

I smiled a bit when I read the tweet that started this whole post. I’m sure that the networks that Max has worked on work much better with consumer all-in-one devices. Simple configurations and cold spares are a perfectly acceptable solution for law offices or tag agencies or other places that don’t measure their downtime in thousands of dollars per second. I’m not saying he’s wrong. I’m saying that his solution doesn’t work everywhere. You can’t run the core of an ISP with some SMB switches. Nor should you run your three-person law office on a Cat6500. You need to decide what factors are the most important for you. Don’t embrace failure without thought. Figure out how tolerant you or your customers are of failure and design around it as best you can. Once you can do that you’ll have a much better idea of how to build your network with the fewest points of failure.

Document The First Time, Every Time


Imagine you’re deep into a massive issue. You’ve been troubleshooting for hours trying to figure out why something isn’t working. You’ve pulled in resources to help and you’re on the line with TAC to try and get a resolution. You know this has to be related to something recent because you just got notified about it yesterday. You’re working through logs and configuration settings trying to gain insight into what went wrong. That’s when the TAC engineer hits you with an armor-piercing question:

When did this start happening?

Now you’re sunk. When did you first start seeing it? Was it happening before and no one noticed? Did a tree fall in the forest and no one was around to hear the sound? What is the meaning of life now?

It’s not too hard to imagine the above scenario because we’ve found ourselves in it more times than we can count. We’ve started working on a problem and traced it back to a root cause only to find out that the actual inciting incident goes back even further than that. Maybe the symptoms just took a while to show up. Perhaps someone unknowingly “fixed” the issue with a reboot or a process reload over and over again until it couldn’t work any longer. How do we find ourselves in this mess? And how do we keep it from happening?

Quirky Worky

Glitches happen. Weird bugs crop up temporarily. It happens every day. I had to reboot my laptop the other day after being up for about two months because of a series of weird errors that I couldn’t resolve. Little things that weren’t super important that eventually snowballed into every service shutting down and forcing me to restart. But what was the cause? I can’t say for sure and I can’t tell you when it started. Because I just ignored the little glitches until the major ones forced me to do something.

Unless you work in security you probably ignore little strange things that happen. Maybe an application takes twice as long to load one morning. Perhaps you click on a link and it pops up a weird page before going through to the actual website. You could even see a storage notification in the logs for a router that randomly rebooted itself for no reason in the middle of the night. Occasional weirdness gets dismissed by us because we’ve come to expect that things are just going to act strangely. I once saw a quote from a developer that said, “If you think building a steady-state machine is easy, just look at how many issues are solved with a reboot.”

We tend to ignore weirdness unless it presents itself as a more frequent issue. If a router reboots itself once we don’t think much about it. The fourth time it reboots itself in a day we know we’ve got a problem we need to fix. Could we have solved it when the first reboot happened? Maybe. But we also didn’t know we were looking at a pattern of behavior either. Human brains are wired to look for patterns of things and pick them out. It’s likely an old survival trait of years gone by that we apply to current technology.

Notice that I said “unless you work in security”. That’s because the security folks have learned over many years and countless incidents that nothing is truly random or strange. They look for odd remote access requests or strange configuration changes on devices. They wonder why random IP addresses from outside the company are trying to access protected systems. Security professionals treat every random thing as a potential problem. However, that kind of behavior also demonstrates the downside of analyzing every little piece of information for potential threats. You quickly become paranoid about everything and spend a lot of time and energy trying to make sense out of potential nonsense. Is it any wonder that many security pros find themselves jumping at every little shadow in case it’s hiding a beast?

Middle Ground of Mentions

On the one hand, we have systems people that dismiss weirdness until it's a pattern. On the other we have security pros trying to make patterns out of the noise. You're probably wondering if there's some kind of middle ground that lets us keep track of issues without driving ourselves insane.

There is, and it's a habit you need to develop: write it all down somewhere. Yes, I'm talking about the dreaded documentation monster. The kind of thing that no one outside of developers likes to do. The mean, nasty, boring process of taking the stuff in your brain and putting it down somewhere so someone can figure out what you're thinking without the need to read your mind.

You have to write it down because you need to have a record to work from if something goes wrong later. One of the greatest features I've ever worked with that seems to be ignored by just about everyone is the Shutdown Event Tracker dialog box in Windows Server 2003 and above. Rebooting a box? You need to write in why and give a good justification. That way if someone wants to know why the server was shut off at 11:15am on a Tuesday they can examine the logs. Unfortunately in my experience the usual reason for these shutdowns was either “a;lkjsdfl;kajsdf” or “because I am doing it”. Neither of which is a great justification for later.

You don’t have to be overly specific with your documentation but you need to give enough detail so that later you can figure out if this is part of a larger issue. Did an application stop responding and need to be restarted? Jot that down. Did you need to kill a process to get another thing running again? Write down that exact sentence. If you needed to restart a router and you ended up needing to restore a configuration you need to jot that part down too. Because you may not even realize you have an issue until you have documentation to point it out.

I can remember doing a support call years ago with a customer and in conversation he asked me if I knew much about Cisco routers. I chuckled and said I knew a bit. He said that he had one that he kept having to copy the configuration files to every time it restarted because it came up blank. He even kept a console cable plugged into it for just that reason. Any CCNA out there knows that’s probably a config register issue so I asked when it started happening. The person told me at least a year ago. I asked if anyone had to get into the router because of a forgotten password or some other lockout. He said that he did have someone come out a year and a half prior to reset a messed up password. Ever since then he had to keep putting the configuration back in. Sure enough, the previous tech hadn’t reset the config register. One quick fix and the customer was very happy to not have to worry about power outages any longer.

Documenting when things happen means you can build a timeline per device or per service to understand when things are acting up. You don't even need to figure it out yourself. The magic of modern systems lies in machine learning. You may think to yourself that machine learning is just fancy linear regression at this point and you would be right more often than not. But one thing linear regression is great at doing is surfacing patterns of behavior for specific data points. If your router reboots on the third Wednesday of every month precisely at 3:33am the ML algorithms will pick up on that and tell you about it. But that's only if your system catches the reboot through logs or some other record keeping. That's why you have to document all your weirdness. Because the ML systems can't analyze what they don't know about.
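You don't even need full-blown ML to get started. Once the timeline exists, a few lines of scripting can surface a suspiciously regular reboot. A minimal sketch, where the event list and the recurrence threshold are my own invention:

```python
from collections import Counter
from datetime import datetime

def find_reboot_patterns(timestamps, threshold=3):
    """Bucket reboot events by (weekday, hour, minute) and flag any
    bucket that recurs often enough to look like a schedule."""
    buckets = Counter(
        (ts.strftime("%A"), ts.hour, ts.minute) for ts in timestamps
    )
    return {bucket: n for bucket, n in buckets.items() if n >= threshold}

# Hypothetical reboot log: the same device restarting at 3:33am on
# several Wednesdays, plus one unrelated crash.
events = [
    datetime(2021, 10, 6, 3, 33),
    datetime(2021, 10, 13, 3, 33),
    datetime(2021, 10, 20, 3, 33),
    datetime(2021, 10, 15, 14, 2),
]
print(find_reboot_patterns(events))  # {('Wednesday', 3, 33): 3}
```

A real system would pull these timestamps from syslog or a monitoring platform, but the principle is the same: the pattern is only findable if the reboots were recorded in the first place.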


Tom’s Take

I love writing. And I hate documentation. Documentation is boring, stuffy, and super direct. It’s like writing book reports over and over again. I’d rather write a fun blog post or imagine an exciting short story. However, documentation of issues is critical to modern organizations because these issues can spiral out of hand before you know it. If you don’t write it down it didn’t happen. And you need to know when it happened if you hope to prevent it in the future.

Basics First and Basics Last

This week I found my tech life colliding with my normal life in an unintended and somewhat enlightening way. I went to a store to pick up something that was out of stock and while I was there making small talk the person behind the counter asked me what I did for a living. I mentioned technology and he said that he was going to college for a degree in MIS, which just happens to be the thing I have my degree in. We chatted about that for a few more minutes before he asked me something I get asked all the time.

“What is the one thing I need to make sure I pay attention to in my courses?”

It’s simple enough, right? You’ve done this before and you have the benefit of hindsight. What is the one thing that is most important to know and not screw up? The possible answers floating through my head were all about programming or analytical methods or even the dreaded infrastructure class I slept through and then made a career out of. But what I said was the most boring and most critical answer one could give.

“You need to know the basics backwards and forwards.”

Basics Training

Why do we teach the basics? Why do we even call them that? And why are people so keen on skipping over all of them so fast to get to the cool stuff? You have to understand the basics before you can move on, and yet so many want to get the “easy” stuff out of the way because memorizing the OSI model or learning how an array works in programming is mind-numbing.

The basics exist because we all need to know how things work at their most atomic level. We memorize the OSI model in networking because it tells us how things should behave. Sure, TCP/IP won out over it in practice. However, if you know how packets are supposed to work with that model it informs you how you need to approach troubleshooting and software design and even data center layouts.

I’ll admit that I really didn’t pay much attention when I took my Infrastructure class twenty years ago. I was hell-bent on being a consultant or a database admin and who needed to know how a CPU register worked? What was this stupid OSI model they wanted me to know? I’ll just memorize it for the test and be done with it. Needless to say that the intervening years have shown me the folly of not paying attention in that class. If I went back today I’d ace that OSI test with my eyes closed.

The basics seem useless because we can't do much with them right now. They're just like Lego bricks. We need uniform pieces with predictable characteristics to help us understand how things are supposed to work together. Without that knowledge of how things work you can't build on it. If you don't understand the difference between RAM and a hard disk you won't be able to build systems that rely on both. Better yet, when technology changes to incorporate solid state disks and persistent memory storage you need the basics to understand how they are different and where you need to apply that knowledge.

I once picked up a Cisco Press CCIE study guide for the written exam to brush up on my knowledge before retaking the written. The knowledge in the book seemed easy to me. It was all about spanning tree configurations and OSPF area types and what BGP keepalives were. I felt like it was a remedial text that didn't give me any new knowledge. That's when I realized that the knowledge in the book wasn't supposed to be new. It was supposed to be a reminder of what I already learned in my CCNA and CCNP courses. If anything in the text was truly new, was it something I should have already known?

It’s also part of the reason the CCIE is such a fun exam in the lab. You should already know the basics of how things like RIP and OSPF work. So let’s test those basics in new ways. Any of the training lab you can take from companies like INE or Micronics are filled with tricky little scenarios that make you take the basics and apply them outside the box. That’s because the instructors don’t need to spend time teaching you how RIP forms neighbor relationships or adjacencies. They want to see if you remember how that happens so you can apply it to a question designed to stretch your knowledge. You can only do that when you know the basics.

Graduation Day

Basics aren’t just for learning at the beginning. You should also brush up on them when you’re at the top of your game. Why? Because it will answer questions you might not know you had or explain strange things that rely on the architecture we long-ago forgot about because it seemed basic.

A fun example was years ago in the online game City of Heroes. The players can earn in-game currency to buy and sell things. Eventually the game economy got to the point where players were at the maximum amount of currency for a player. What was that number? It was just over two billion. Pretty odd place to stop, right? What made them think that was a good stopping point? Random chance? Desire to keep the amount of currency in circulation low? Or was there a different reason?

That’s when I asked a simple question: How would you store the currency value in the game’s code? The answer for every programmer out there is an integer. And what’s the maximum value for an integer? For a 32-bit value it’s around four billion. But what if you use a signed integer for some reason? The maximum value is just over two billion in each direction. So the developers used a 32-bit signed integer and that’s why the currency value was capped where it was.

Over and over again in my career I find myself turning back to the basics to answer questions about things I need to understand or solve. We want the problems to be complex and hard to unravel because solving them shows off our critical thinking skills. However, if you start with the basics you'll find that the solution, or the root cause, is often something very simple with far-reaching consequences. And if you forget how those basics work you're going to spend a lot of time chasing your tail looking for a complex solution to a simple problem.


Tom’s Take

I don’t think my conversation partner was hoping for the answer I gave him. I’m sure he wanted me to say that this high level course was super important because it taught all the secrets you needed to know in order to succeed in life. Everyone wants to hear that the most important things are exciting and advanced. Finding out that the real key to everything is the basics you learn at the beginning of your journey is disappointing. However, for those that master the basics and remember them at every step of their journey, the end of the road is just as advanced and exciting as it was when you stepped on it in the first place. And you get there with a better understanding of how everything works.

Building Snowflakes On Purpose

We all know that building snowflake networks is bad, right? If it's not a repeatable process it's going to end up being a problem down the road. If we can't refer back to documentation that shows why we did something we're going to end up causing issues and reducing reliability. But what happens when a snowflake process is required to fix a bigger problem? It's a fun story that highlights where process can break down sometimes.

Reloaded

I’ve mentioned before that I spent about six months doing telephone tech support for Gateway computers. This was back in 2003 so Windows XP was the hottest operating system out there. The nature of support means that you’re going to be spending more time working on older things. In my case this was Windows 95 and 98. Windows 98 was a pain but it was easy to work on.

One of the most common processes we had for Windows 98 was a system reload. It was the last line of defense to fix massive issues or remove viruses. It was something that was second nature to any of the technicians on the help desk:

  1. Boot from the Gateway tools CD and use GWSCAN to write zeros to the hard drive.
  2. Reboot from the CD and use FDISK to partition the hard disk.
  3. Format the drive.
  4. Insert the Windows 98 OS CD and copy the CAB installation files to a folder on the hard drive.
  5. Run setup from the hard drive and let it complete.
  6. When Windows comes back up, insert the driver CD and let it install all the drivers.

The whole process took two or three phone calls to complete. Any time you got to something that would take more than fifteen minutes to complete you logged the steps in the customer trouble ticket and had them call back when it was completed. The process was so standard that it had its own acronym in the documentation – FFR, which stood for “FDISK, Format, Reload”. If you told someone where you were in the process they could finish it no problem.

Me, Me, ME

The whole process was manual with lots of steps and could intimidate customers. At some point in the development process the Gateway folks came up with a solution for Windows ME that they thought worked better. Instead of the manual steps of copying files and drivers and such, Gateway loaded the OS CD with a copy of the image they wanted for their specific model type. The image was installed using ImageCast, an imaging program that just dropped the image down on the drive without the need to do all the other steps. In theory, it was simple and reduced call times for the help desk.

In practice, Windows ME was a disaster to work on. The ImageCast program worked about half the time. If you didn’t pick the right options in the reload process it would partition the hard drive and add a second clean copy of WinME without removing the first one. It would change the MBR to have two installations to choose from with the same identifiers so users would get confused as to which was which. And the image itself seemed to be missing programs and drivers. The fact that there was a Driver CD that shipped with the system made us all wonder what the real idea behind this “improved” process was.

Because Windows ME was such a nightmare to reload, our call center got creative. We had a process that worked for Windows 98. We had all the files we needed on the disks. Why not do it the “right” way? So we did. Informally, the reload process for Windows ME was the same as Windows 98. We would FFR Windows ME boxes and make sure they looked right when they came back. No crazy ImageCasting programs or broken software loads.

The only issue? It was an informal snowflake process. It worked much better but if someone called the main help desk number and got another call center they would be in the middle of an unsupported process. The other call center tech would simply start the regular process and screw up the work we’d done already. To counter that, we would tell the customer to call our special callback voicemail box and not do anything until we called them back. That meant the reload process took many more hours for an already unhappy customer. The end result was better but could lead to frustrations.

Let It Snow

Was our informal snowflake process good or bad? It’s tough to say. It led to happier customers. It meant the likelihood of future support calls was lower because the system was properly reloaded instead of relying on a broken image. However, it was stressful for the tech that worked on the ticket because they had to own the process the whole way through. It also meant that you had to ensure the customer wouldn’t call back and disrupt the process with someone else on the phone.

The process was broken and it needed to be fixed. However, the way to fix the broken process wasn’t easy to figure out. The national line had their process and they were going to stick to it. We came up with an alternative but it wasn’t going to be adopted. Yet, we still kept using our process as often as possible because we felt we were right.

In your enterprise, you need to understand process. Small companies need processes that are repeatable for the ease of their employees. Large companies need processes to keep things consistent. Many companies that face regulatory oversight need processes followed exactly to ensure compliance. The ability to go rogue and just do it the way you want isn’t always desired.

If you have users that are going around the process you need to find out why. We see it all the time. Security rules that get ignored. Documentation requirements that are given the bare minimum effort. Remember the shutdown dialog box in Windows Server 2003? Most people did a token entry before rebooting. And having eighteen entries of “;lkj;kljl;k” doesn’t help figure out what’s going on.

Get your users to tell you why they need a snowflake process. Especially if there's more than one person relying on it. The odds are very good that they've found hiccups that you need to address. Maybe it's data entry for old systems. Perhaps it's a problem with the requirements or the order of the steps. Whatever it is you have to understand it and fix it. The snowflake may be a better way to do things. You have to investigate and figure it out otherwise your processes will be forever broken.


Tom’s Take

Inbound tech support is full of stories like these. We find a broken process and we go around it. It's not always the best solution but it's the one that works for us. However, building a toolbox full of snowflake processes and solutions that no one else knows about is a path to headaches. You need to model your process around what people need to accomplish and how they do it instead of just assuming they're going to shoehorn their workflow into your checklist. If your process doesn't work for your people, you'd better get a handle on the situation before you're buried in a blizzard of snowflakes.

Friction Finders

Do you have a door that sticks in your house? If it's made out of wood the odds are good that you do. The kind that doesn't shut properly or sticks out just a touch too far and doesn't glide open like it used to. I've dealt with these kinds of things for years and YouTube is full of useful tricks to fix them. But all those videos start with the same tip: you have to find the place where the door is rubbing before you can fix it.

Enterprise IT is no different. We have to find the source of friction before we can hope to repair it. Whether it's friction between people and hardware, users and software, or teams going at each other we have to know what's causing the commotion before we can repair it. Just like with the sticking door, adding more force without understanding the friction points isn't a long-term solution.

Sticky Wickets

Friction comes from a variety of sources. People don't understand how to use a device or a program. Perhaps it's a struggle to understand who is supposed to be in charge of a change control or a provisioning process. It could even be as simple as someone carrying outside stress into their regular day and causing problems because their interactions with previously stable systems are strained.

Whatever the reasons for the friction, we need to understand what’s going on before we can start fixing. If we don’t know what we’re trying to solve we’re just going to throw solutions at the wall until something sticks. Then we hope that was the fix and we move on. Shotgun troubleshooting is never a permanent solution to anything. Instead, we need to take repeatable, documented steps to resolve the points of friction.

Ironically enough, the easiest issue to solve is the interpersonal kind. When teams argue about roles or permissions or even who should have the desks at the front of the data center it's almost always a problem of a person against another person. You can't patch people. You can't upgrade people. You can't even remove people and replace them with a new model (unless it's a much bigger issue and HR needs to solve it). Instead, it's a chance to get people talking. Make sure everyone knows the goal is resolving the problem, not name calling or posturing. Lay out what the friction point is and make people talk about that. If you can keep them focused on the problem at hand and not at each others' throats you should be able to get everyone to recognize what needs to change.

Mankind Versus Machines

People fighting people is a people problem. But people against the enterprise system isn't always a cut-and-dried situation. That's because machines are predictable in ways people aren't. You can be sure you're going to get the same answer every time, but you may not be able to ask the right questions to get the answer you need.

Think back to how many times you’ve diagnosed a problem only to hit a wall. Maybe it’s that the CPU utilization on a device is higher than it should be. What next? Is it some software application? A bug in the system? Maybe a piece of malware running rampant? If it’s a networking device is it because of a packet flow causing issues? Or a failed table lookup causing a process switch to happen? There are a multitude of things that need to be investigated before you can decide on a course of action.

This is why shotgun troubleshooting is so detrimental to reducing friction. How do we know that the solution we tried isn’t making the problem worse? I wrote over ten years ago about removing fixes that don’t address the problem and it’s still very true today. If you don’t back out the things that didn’t address the issue you’re not only leaving bad fixes in place but you are potentially causing problems down the line when those changes impact other things.

Finding the sources of friction in your systems takes in-depth troubleshooting. Your users just want the problem to go away. They don’t want to answer forty questions about when it started or what’s been installed. They don’t want the perfect solution that ensures the problem never comes back. They just want to be able to work right now and then keep moving forward. That means you need a two-step process of triage and investigation. Get the problem under control and then investigate the friction points. Figure out how it happened and fix it after you get the users back up and running.

Lastly, document the issue and what resolved it. Write it all down somewhere, even if it’s just in your own notes. But if you do that, make sure you have a way of indexing everything so you can refer back to it at some point in the future. Part of the reason why I started this blog was to write down solutions to problems I discovered or changed along the way to ensure that I could always look them up. If you can’t publish them on the Internet or don’t feel comfortable writing it all up, at least use a personal database system like Notion to keep it all in one searchable place. That way you don’t forget how clever you are and go back to reinventing the wheel over and over again every time the problem comes up.


Tom’s Take

Friction exists everywhere. Sometimes it's necessary to keep us from sliding all over the road or keeping a rug in place in the living room. In enterprise IT friction can create issues with systems and teams. Reducing it as much as possible keeps the teams working together and being as productive as possible. You can't eliminate it completely and you shouldn't remove it just for the sake of getting rid of something. Instead, analyze what you need to improve or fix and document how you did it. Like any door repair, the results should be immediate and satisfying.

Opening Up Remote Access with Opengear

Opengear OM2200

The Opengear OM2200

If you had told me last year at this time that remote management of devices would be a huge thing in 2020 I might have agreed but laughed quietly. We were traveling down the path of simultaneously removing hardware from our organizations and deploying IoT devices that could be managed easily from the cloud. We didn’t need to access stuff like we did in the past. Even if we did, it was easy to just SSH or console into the system from a jump box inside the corporate firewall. After all, who wants to work on something when you’re not in the office?

Um, yeah. Surprise, surprise.

Turns out 2020 is the Year of Having Our Hair Lit On Fire. Which is a catchy song someone should record. But it’s also the year where we have learned how to stand up 100% Work From Home VPN setups within a week, deploy architecture to the cloud and refactor on the fly to help employees stay productive, and institute massive change freezes in the corporate data center because no one can drive in to do a reboot if someone forgets to do commit confirmed or reload in 5.

Remote management has always been something that was nice to have. Now it’s something that you can’t live without. If you didn’t have a strategy for doing it before or you’re still working with technology that requires octal cables to work, it’s time you jumped into the 21st Century.

High Gear

Opengear is a company that has presented a lot at Tech Field Day. I remember seeing them for the first time when I was a delegate many, many years ago. As I have grown with Tech Field Day, so too have they. I’ve seen them embrace new technologies like cloud management and 4G/LTE connectivity. I’ve heard the crazy stories about fish farms and Australian emergency call boxes and even some stuff that’s too crazy to repeat. But the theme remains the same throughout it all. Opengear is trying to help admins keep their boxes running even if they can’t be there to touch them.

Flash forward to the Hair On Fire year, and Opengear is still coming right along. During the recent Tech Field Day Virtual Cisco Live Experience in June, they showed off their latest offerings for sweet, sweet hardware. Rob Waldie did a great job talking about their new line of NetOps Console servers here in this video:

Now, I know what you're thinking. NetOps? Really? Trying to cash in on the marketing hype? I would have gone down that road if they hadn't shown off some of the cool things these new devices can do.

How about support for Docker containerized apps? Pretty sure that qualifies as NetOps, doesn't it? Now, your remote console appliance is capable of doing things like running automation scripts and triggering complex logic when something happens. And, because containers are the way the cloud runs now, you can deploy any number of applications to the console server with ease. It's about as close to an App Store model as you're going to find, with some nerd knobs for good measure.

That’s not all though. The new line of console appliances also comes with an embedded trusted platform module (TPM) chip. You’ve probably seen these on laptops or other mobile devices. They do a great job of securing the integrity of the device. It’s super important to have if you’re going to deploy console servers into insecure locations. That way, no one can grab your device and do things they shouldn’t like tapping traffic or trying to do other nefarious things to compromise security.

Last but not least, there's an option for 64GB of flash storage on the device. I like this because it means I can do creative things like back up configurations to the storage device on a regular basis just in case of an outage. If and when something happens I can just remote to the Opengear server, console to the device, and put the config back where it needs to be. Pretty handy if you have a device with a dying flash card or something that is subject to power issues on a regular basis. And with an LTE-A global cellular modem, you don't have to worry about shipping the box to a country where it won't work.
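That backup idea is simple enough to script yourself. Here's a sketch of the storage-and-rotation side, assuming you already have some way to pull the running config over the console or SSH (the directory layout, device name, and retention count are all my own invention, not anything Opengear ships):

```python
from datetime import datetime, timezone
from pathlib import Path

def save_config(backup_dir, device, config_text, keep=30):
    """Write a timestamped copy of a device config and prune old copies,
    keeping only the newest `keep` files for that device."""
    root = Path(backup_dir)
    root.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    path = root / f"{device}-{stamp}.cfg"
    path.write_text(config_text)
    # File names sort chronologically because of the timestamp suffix,
    # so everything before the last `keep` entries is safe to delete.
    for old in sorted(root.glob(f"{device}-*.cfg"))[:-keep]:
        old.unlink()
    return path

# Hypothetical usage: config_text would come from however you already
# collect the running config from the attached device.
backup = save_config("cfg-backups", "rtr-edge-01", "hostname rtr-edge-01")
print(backup)
```

Run something like this on a schedule and the flash card always holds a recent known-good config, which is exactly what you want when you're remoting in after an outage.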


Tom’s Take

I realize that we’re not going to be quarantined forever. But this is a chance for us to see how much we can get done without being in the office. Remember all those budgets for fancy office chairs and the coffee service? They could go to buying Opengear console servers so we can manage devices without truck rolls. Money well spent on reducing the need for human intervention also means a healthier workforce. I trust my family to stay safe with our interactions. But if I have to show up at a customer site to reboot a box? Taking chances even under the best of circumstances. And the fewer chances we take in the short term, the healthier the long-term outlook becomes.

We may never get back to the world we had before. And we may never even find ourselves in a 100% Remote Work environment. But Opengear gives us options that we need in order to find a compromise somewhere in the middle.

If you’d like more information about Opengear’s remote access solutions, make sure you check out their website at http://Opengear.com

Disclaimer: As a staff member of Tech Field Day, I was present during Opengear’s virtual presentation. This post represents my own thoughts and opinions of their presentation. Opengear did not provide any compensation for this post, nor did they request any special consideration when writing it. The conclusions contained herein are mine alone and do not represent the views of my employer.

Failure Is Fine, Learning Is Mandatory

“Failure is a harsh teacher because it gives the test first and the lesson afterward.” — Vernon Law

I’m seeing a thread going around on Twitter today that is encouraging people to share their stories of failure in their career. Maybe it was a time they created a security hole in a huge application. Perhaps it was creating a routing loop in a global corporation. Or maybe it was something as simple as getting confused about two mailboxes and deleting the wrong one and realizing your mail platform doesn’t have undelete functionality.

We fail all the time. We try our hardest and whatever happens isn’t what we want. Some of those that fail just give up and assume that juggling isn’t for them or that they can never do a handstand. Others keep persevering through the pain and challenge and eventually succeed because they learn what they need to know in order to complete their tasks. Failure is common.

What is different is how we process the learning. Some people repeat the same mistakes over and over again because they never learn from them. In a professional setting, toggling the wrong switch when you create someone’s new account has a very low learning potential because it doesn’t affect you down the road. If you accidentally check a box that requires them to change their password every week you’re not going to care because it’s not on your account. However, if the person you do that to has some kind of power to make you switch it back or if the option puts your job in jeopardy you’re going to learn very quickly to change your behavior.

Object Failure

Here’s a quick one that illustrates how the motivation to learn from failure sometimes needs to be more than just “oops, I screwed up”. I’ll make it a bullet point list to save time:

  • Installed new phone system for school district
  • Used MGCP as the control protocol
  • Need to solve a PRI caller ID issue at the middle school
  • Gateway is at the high school
  • Need to see all the calls in the system
  • Type debug mgcp packet detail in a telnet session
  • A. Telnet. Session.
  • Router locks up tight and crashes
  • Hear receptionist from the other room say, “Did you just hang up on me?”
  • Panic
  • Panic some more
  • Jump in my car and break a couple of laws getting across town to restart router that I’m locked out of
  • Panic a lot in the five minutes it takes to reboot and reassociate with CallManager
  • Swear I will never do that again

Yes, I did the noob CCIE thing of running packet debugs on a production call-processing device because I underestimated both the volume of phone calls and my own stupidity. I got better!

But I promise that if doing this had only dropped one phone call or caused an issue for one small remote site, I wouldn’t have learned a lesson. I might even still be doing that today to look at issues. The key here is that I shut down call processing for the entire school district for 20 minutes at the end of the school day. You have no idea how many elementary school parents call the front office at the end of the day. I know now.
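A safer pattern for that kind of troubleshooting — sketched here with standard Cisco IOS logging commands, not the exact steps from the original incident — is to keep debug output out of the interactive session entirely and read it from the memory buffer afterward:

```
! Keep debug output off the console and out of the telnet session
no logging console
! Send it to the local memory buffer instead (size in bytes)
logging buffered 64000 debugging
! Now the risky command floods the buffer, not your session
debug mgcp packets
! Review the results at your leisure, then turn everything off
show logging
undebug all
```

Even buffered, a full packet debug can still punish a busy gateway’s CPU, so the real lesson is to scope the debug as narrowly as your platform allows — or reach for show commands first.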

Lessons have more impact with stress. It’s something we see in a lot of situations where we train people how to behave in high-pressure situations. I once witnessed a car accident right in front of me on a busy highway and it took my brain almost ten seconds to process that I needed to call emergency services (911 in the US), even though I had spent the previous four years programming that dial peer into phone systems and dialing it for Calling Line ID verification. I’d practiced calling 911 for years, and when I had to do it for real I almost forgot what to do. We have to know how people are going to react under stress. Or at least anticipate how people are going to behave. Which is why I always configured 9.911 as a dial peer.
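On a Cisco voice gateway, that 9.911 habit looks something like the sketch below — the voice port number is hypothetical and the exact digit manipulation varies by platform and carrier, so treat this as an illustration rather than a drop-in config:

```
! Callers who dial 911 directly
dial-peer voice 911 pots
 destination-pattern 911
 port 0/0/0:23
 forward-digits all
!
! Callers who grab an outside line first and dial 9911
dial-peer voice 9911 pots
 destination-pattern 9911
 port 0/0/0:23
 forward-digits 3
```

The second peer forwards only the last three digits, so the panicked caller who dials the outside-line access code out of habit still reaches 911.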

Lessons Learned

The other important thing about failure is that you have to take stock of what you learn in the post-mortem, even if it’s just an exercise you do for yourself. As soon as you realize you made a mistake, you need to figure out how to learn from it and prevent that problem from recurring. And don’t just say to yourself, “I’m never doing that again!” You need to think about what caused the issue and how you can ingrain the learning process into your brain.

Maybe it’s something simple like creating a command alias that keeps you from making the same typo again and deleting a file system. Maybe it’s forcing yourself to read popup dialog boxes as you click through the system to make sure you’re deleting the right file or formatting the right disk. Someone I used to work with would write down the name of the thing he was deleting and hold it up to the screen before he committed the command to be sure they matched. I never asked what brought that about, but I’m sure it was a ton of stress.
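That alias idea can be a one-liner. Here’s a small shell sketch of the concept — the `trash` function and its holding directory are my own illustration, not a specific tool anyone mentioned:

```shell
# Move files to a holding area instead of deleting them outright.
# The function name "trash" and the ~/.trash directory are hypothetical.
trash() {
    local hold="${HOME}/.trash"
    mkdir -p "$hold"
    mv -i -- "$@" "$hold"/
}

# Force the real rm to prompt, so a stray glob asks before it destroys
alias rm='rm -i'
```

Undeleting then becomes a simple move back out of the holding directory, which is a much gentler lesson than a restore from backup.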


Tom’s Take

I screw up. More often than even I realize. I try to learn as much as I can when I’m sifting through the ashes. Maybe it’s figuring out how I went wrong. Perhaps it’s learning why the thing I wanted to do didn’t work the way I wanted it to. It could even be as simple as writing down the steps I took to know where I went wrong and sharing that info with a bunch of strangers on the Internet to keep me from making the same mistake again. As long as you learn something you haven’t failed completely. And if you manage to avoid making the exact same mistake again then you haven’t failed at all.