Sharing Failure as a Learning Model

Earlier this week there was a great tweet from my friends over at Juniper Networks about mistakes we’ve made in networking.

It got some interactions with the community, which is always nice, but it got me to thinking about how we solve problems and learn from our mistakes. I feel that we’ve reached a point where we’re learning from the things we’ve screwed up but we’re not passing those lessons along like we used to.

Write It Down For the Future

Part of the reason why I started my blog was to capture ideas that had been floating in my head for a while. Troubleshooting steps or perhaps even ideas that I wanted to make sure I didn’t forget down the line. All of it was important to capture for the sake of posterity. After all, if you didn’t write it down did it even happen?

Along the way I found that the posts that got significant traction on my site were the ones that involved mistakes. Something I’d done that caused an issue or something I needed to look up through a lot of sources that I distilled down into an easy reference. These kinds of posts are the ones that fly right up to the top of the Google search results. They are how people know you. It could be a terminology post like defining trunks. Or perhaps it’s a question about why your SFP modules aren’t working in a switch.

Once I realized that people loved finding posts that solved problems I made sure to write more of them down. If I found a weird error message I made sure to figure out what it was and then put it up for everyone to find. When I documented weird behaviors of BPDUGuard and BPDUFilter that didn’t match the documentation I wrote it all down, including how I’d made a mistake in the way that I interpreted things. It was just part of the experience for me. Documenting my failures and my learning process could help someone in the future. My hope was that someone in the future would find my post and learn from it like I had.

Chit Chat Channels

It used to be that when you Googled error messages you got lots of results from forum sites or Reddit or other blogs detailing what went wrong and how you fixed it. I assume that is because, just like me, people were doing their research and figuring out what went wrong and then documenting the process. Today I feel like a lot of that type of conversation is missing. I know it can’t have gone away permanently because all networking engineers make mistakes and solve problems, and someone has to know where that knowledge went, right?

The answer came to me when I read a Reddit post about networking message boards. The suggestions in the comments weren’t about places to go to learn more. Instead, they linked to Slack channels or Discord servers where people talk about networking. That answer made me realize why the discourse around problem solving and learning from mistakes seems to have vanished.

Slack and Discord are great tools for communication. They’re also very private. I’m not talking about gatekeeping or restrictions on joining. I’m talking about the fact that the conversations that happen there don’t get posted anywhere else. You can join, ask about a problem, get advice, try it, see it fail, try something else, and succeed all without ever documenting a thing. Once you solve the problem you don’t have a paper trail of all the things you tried that didn’t work. You just have the solution that finally worked, and that’s that.

You know what you can’t do with Slack and Discord? Search them through Google. The logs are private. The free tiers delete older messages after a while. All that knowledge disappears into thin air. Unlike the Wisdom of the Ancients, the issues we solve in Slack are gone as soon as you hit your message limit. No one learns from the mistakes because it looks like no one has made them before.

Going the Extra Mile

I’m not advocating for removing Slack and Discord from our daily conversations. Instead, I’m proposing that when we do solve a hard problem or make a mistake that others might learn from, we write it up somewhere people can find it. It could be a blog post or a Reddit thread or some kind of indexable site somewhere.

Even the process of taking what you’ve done and consolidating it down into something that makes sense can be helpful. I saw X, tried Y and Z, and ended up doing B because it worked the best of all. Just the process of how you got to B through the other things that didn’t work will go a long way toward helping others. Yes, it can be a bit humbling and embarrassing to publish something that admits you made a mistake. But it’s also part of the way that we learn as humans. If others can see the path we took and understand why it doesn’t lead to a solution then we’ve effectively taught them too.


Tom’s Take

It may be a bit self-serving for me to say that more people need to be blogging about solutions and problems and such, but I feel that we don’t really learn from it unless we internalize it. That means figuring it out and writing it down. Whether it’s a discussion on a podcast or a back-and-forth conversation in Discord we need to find ways to get the words out into the world so that others can build on what we’ve accomplished. Google can’t search archives that aren’t on the web. If we want to leave a legacy for the DenverCoder10s of the future that means we do the work now of sharing our failures as well as our successes and letting the next generation learn from us.

Should We Embrace Points of Failure?


There was a tweet making the rounds this last week that gave me pause. Max Clark said that we should embrace single points of failure in the network. His post was an impassioned plea to the networking rock stars out there to drop redundancy from their networks and instead embrace these Single Points of Failure (SPoF). The main points Mr. Clark made boil down to a couple of major statements:

  1. Single-device networks are less complex and easier to manage and troubleshoot. Don’t have multiple devices when an all-in-one works better.
  2. Consumer-grade hardware is cheaper and easier to understand, therefore it’s better. Plus, if you need a backup you can just buy a second one and keep it on the shelf.

I’m sure most networking pros out there are practically bristling at these suggestions. Others may read through the original tweet and think this was a tongue-in-cheek post. Let’s look at the argument logically and understand why it has some merit but is ultimately flawed.

Missing Minutes Matter

I’m going to tackle the second point first. The idea that you can use cheaper gear and have cold standby equipment just sitting on the shelf is one that I’ve heard many times in the past. Why pay more money to have a hot spare or a redundant device when you can just stock a spare part and swap it out when necessary? If your network design decisions are driven completely by cost then this is probably the most appealing option.

I once worked with a technology director who insisted that we forgo our usual configuration of RAID-5 with a hot spare drive in the servers we deployed. His logic was that the hot spare drive was spinning without actually doing anything. If something did go wrong it was just as easy to take the spare drive off the shelf and slot it in, rather than run the risk that the spare might fail while sitting idle in the server. His logic seemed reasonable enough but there was one variable that he wasn’t thinking about.

Time is always the deciding factor in redundancy planning. In the world of backup and disaster recovery they use the acronym RTO, which stands for Recovery Time Objective. Essentially, how long are you willing to have your systems offline before the data is restored? Can you go days without getting your data back? Or do you need it back up and running within hours or even minutes? For some organizations the RTO could even be measured in mere seconds. Every reduction in RTO adds complexity and cost.

If you can go days without your data then a less expensive tape solution is best because it is the cheapest per byte stored and lasts forever. If your RTO is minutes or less you need to add hardware that replicates changes or mirrors the data between sites to ensure there is always an available copy somewhere out there. Time is the deciding factor here, just as it is in the redundancy example above.

Can your network tolerate hours of downtime while you swap in a part from off the shelf? Remember that you’re going to need to copy the configuration over to it and ensure it’s back up and running. If it is a consumer-grade device there probably isn’t an easy way to console in and paste the config. Maybe you can upload a file from the web GUI but the odds are pretty good that you’re looking at downtime at least in the half-hour range if not more. If your office can deal with that then Max’s suggestions should work just fine.
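For a sense of what that swap actually involves, here’s a rough sketch of restoring a saved configuration to a cold spare on an IOS-style device that does let you console in. The TFTP server address and filename are made up for illustration:

  Switch> enable
  Switch# copy tftp://192.0.2.10/branch-switch-backup.cfg startup-config
  Switch# reload

Even in the best case that’s several minutes of typing and rebooting, and it quietly assumes the backup exists, is current, and that someone is on site with a console cable. On a consumer box with only a web GUI, add more time for clicking through a restore wizard.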

Organizations that need to be back up and running in minutes rather than hours need fault tolerance in the network. Redundant paths for traffic or multiple devices to eliminate single points of failure are the only way to ensure that traffic keeps flowing in the event of a hardware failure. Sure, it’s more complicated to troubleshoot. But the time you spend making it work correctly is not time you’re going to spend copying configurations to a cold device while users and stakeholders are yelling at you to get things back online.
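To make the contrast concrete, here’s a minimal sketch of first-hop redundancy using HSRP on two IOS devices. The VLAN, addresses, and priorities are illustrative, not a recommendation:

  ! Device A - active gateway for VLAN 10
  interface Vlan10
   ip address 10.0.10.2 255.255.255.0
   standby 10 ip 10.0.10.1
   standby 10 priority 110
   standby 10 preempt

  ! Device B - standby gateway for VLAN 10
  interface Vlan10
   ip address 10.0.10.3 255.255.255.0
   standby 10 ip 10.0.10.1
   standby 10 priority 100

Clients point at the virtual address 10.0.10.1. If the active device dies the standby takes over in a few seconds, and nobody has to go digging through the spares shelf while the phones ring.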

All-In-Wonder

Let’s look at the first point here. Single box solutions are better because they are simple to manage and give you everything you could need. Why buy a separate switch, firewall, and access point when you can get them all in one package? This is the small office / branch office model. SD-WAN has even started moving down this path for smaller deployments by pushing all the devices you need into one footprint.

It’s not unlike the TVs you can buy in the big box stores that have DVD players, VHS players, and even some streaming services built in. They’re easy to use because there are no extra wires to plug in and no additional remote controls to lose. Everything works from one central location and it’s simple to manage. The package is a great solution when you only occasionally need to watch old VHS tapes or DVDs from your collection.

Of course, most people understand the drawbacks of this model. Those devices can break. They are much harder to repair when they’re all combined. Worse yet, if the DVD player breaks and you need to get it repaired you lose the TV completely during the process instead of just the DVD player. You also can’t upgrade the components individually. Want to trade out that DVD player for a Blu-Ray player? You can’t, unless you add a standalone one. Want to keep those streaming apps up-to-date? Better hope the TV has enough memory to stay current. Even state-of-the-art streaming boxes will eventually be incapable of running the latest version of popular software.

All-in-one devices are best left to the edges of the network. They function well in offices with a dozen or so workers. If something goes bad on the device it’s easier to just swap the whole thing instead of trying to repair the individual parts. That same kind of mentality doesn’t work quite so well in a larger data center. The fact that most of these unified devices don’t take rack mounting ears or fit into a standard data center rack should be a big hint that they aren’t designed for use in a place that keeps the networking pieces off of someone’s desk.


Tom’s Take

I smiled a bit when I read the tweet that started this whole post. I’m sure that the networks that Max has worked on work much better with consumer all-in-one devices. Simple configurations and cold spares are a perfectly acceptable solution for law offices or tag agencies or other places that don’t measure their downtime in thousands of dollars per second. I’m not saying he’s wrong. I’m saying that his solution doesn’t work everywhere. You can’t run the core of an ISP with some SMB switches. You also shouldn’t run your three-person law office with a Cat6500. You need to decide what factors are the most important for you. Don’t embrace failure without thought. Figure out how tolerant you or your customers are of failure and design around it as best you can. Once you can do that you’ll have a much better idea of how to build your network with the fewest points of failure.

Failure Is Fine, Learning Is Mandatory

“Failure is a harsh teacher because it gives the test first and the lesson afterward.” — Vernon Law

I’m seeing a thread going around on Twitter today that is encouraging people to share their stories of failure in their career. Maybe it was a time they created a security hole in a huge application. Perhaps it was creating a routing loop in a global corporation. Or maybe it was something as simple as getting confused about two mailboxes, deleting the wrong one, and realizing their mail platform doesn’t have undelete functionality.

We fail all the time. We try our hardest and whatever happens isn’t what we want. Some of those that fail just give up and assume that juggling isn’t for them or that they can never do a handstand. Others keep persevering through the pain and challenge and eventually succeed because they learn what they need to know in order to complete their tasks. Failure is common.

What is different is how we process the learning. Some people repeat the same mistakes over and over again because they never learn from them. In a professional setting, toggling the wrong switch when you create someone’s new account has a very low learning potential because it doesn’t affect you down the road. If you accidentally check a box that requires them to change their password every week you’re not going to care because it’s not on your account. However, if the person you do that to has some kind of power to make you switch it back or if the option puts your job in jeopardy you’re going to learn very quickly to change your behavior.

Object Failure

Here’s a quick one that illustrates how the motivation to learn from failure sometimes needs to be more than just “oops, I screwed up”. I’ll make it a bullet point list to save time:

  • Installed new phone system for school district
  • Used MGCP as the control protocol
  • Need to solve a PRI caller ID issue at the middle school
  • Gateway is at the high school
  • Need to see all the calls in the system
  • Type debug mgcp packet detail in a telnet session
  • A. Telnet. Session.
  • Router locks up tight and crashes
  • Hear receptionist from the other room say, “Did you just hang up on me?”
  • Panic
  • Panic some more
  • Jump in my car and break a couple of laws getting across town to restart router that I’m locked out of
  • Panic a lot in the five minutes it takes to reboot and reassociate with CallManager
  • Swear I will never do that again

Yes, I did the noob CCIE thing of debugging packets on a processing device in production because I underestimated the power of phone calls as well as my own stupidity. I got better!
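Part of getting better, for what it’s worth, was picking up a few guard rails for the next time a debug absolutely has to run on a production box. This is a sketch of the general idea rather than a recipe, and a heavy debug on a busy gateway is still risky:

  ! Give yourself an escape hatch before you touch anything
  Router# reload in 15

  ! Keep debug output off the console and out of your remote session
  Router# configure terminal
  Router(config)# no logging console
  Router(config)# no logging monitor
  Router(config)# logging buffered 1000000 debugging
  Router(config)# end

  ! Run the debug, read the buffer, then clean up
  Router# debug mgcp packets
  Router# show logging
  Router# undebug all
  Router# reload cancel

The scheduled reload means that even if the box wedges, it comes back on its own instead of waiting for someone to break traffic laws getting across town.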

But I promise that if I’d done this and it had only shut down one phone call or caused an issue for one small remote site I wouldn’t have learned a lesson. I might even still be doing that today to look at issues. The key here is that I shut down call processing for the entire school district for 20 minutes at the end of the school day. You have no idea how many elementary school parents call the front office at the end of the day. I know now.

Lessons have more impact with stress. It’s something we see whenever we train people how to behave in high-pressure situations. I once witnessed a car accident right in front of me on a busy highway and it took my brain almost ten seconds to process that I needed to call emergency services (911 in the US) even though I had spent the last four years programming that dial peer into phone systems and dialing it for Calling Line ID verification. I’d practiced calling 911 for years and when I had to do it for real I almost forgot what to do. We have to know how people are going to react under stress. Or at least anticipate how people are going to behave. Which is why I always configured 9.911 as a dial peer.
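For anyone who hasn’t lived in a voice gateway, here’s a minimal sketch of what that pair of POTS dial peers might look like. The PRI port number is made up and the details will vary by gateway:

  ! Callers who dial 911 directly
  dial-peer voice 911 pots
   description Emergency - dialed as 911
   destination-pattern 911
   port 0/0/0:23
   forward-digits all

  ! Callers who grab an outside line first and dial 9.911
  dial-peer voice 9911 pots
   description Emergency - dialed with the outside line prefix
   destination-pattern 9911
   prefix 911
   port 0/0/0:23

Under stress people dial what they’ve practiced, so the system has to accept either pattern and complete the call.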

Lessons Learned

The other important thing about failure is that you have to take stock of what you learn in the post-mortem. Even if it’s just an exercise you do for yourself. As soon as you realize you made a mistake you need to figure out how to learn from it and prevent that problem again. And don’t just say to yourself, “I’m never doing that again!” You need to think about what caused the issue and how you can ingrain the learning process into your brain.

Maybe it’s something simple like creating a command alias to prevent you from making the wrong typo again and deleting a file system. Maybe it’s forcing yourself to read popup dialog boxes as you click through the system to make sure you’re deleting the right file or formatting the right disk. Someone I used to work with would write down the name of the thing he was deleting and hold it up to the screen before he committed the command to be sure they matched. I never asked what brought that about but I’m sure it was a ton of stress.
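In IOS terms, one flavor of that guard rail is an exec alias, so the command you type twenty times a day can’t be fat-fingered into its destructive cousin. A small sketch, with an alias name you’d obviously pick for yourself:

  ! 'write memory' saves the config; 'write erase' wipes the startup config.
  ! Give yourself a save command that can't drift one word into the scary one.
  Router(config)# alias exec save copy running-config startup-config

It’s a tiny change, but it removes an entire class of typo from your day.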


Tom’s Take

I screw up. More often than even I realize. I try to learn as much as I can when I’m sifting through the ashes. Maybe it’s figuring out how I went wrong. Perhaps it’s learning why the thing I wanted to do didn’t work the way I wanted it to. It could even be as simple as writing down the steps I took to know where I went wrong and sharing that info with a bunch of strangers on the Internet to keep me from making the same mistake again. As long as you learn something you haven’t failed completely. And if you manage to avoid making the exact same mistake again then you haven’t failed at all.