Should We Embrace Points of Failure?

There was a tweet making the rounds this last week that gave me pause. Max Clark said that we should embrace single points of failure in the network. His post was an impassioned plea to networking rock stars out there to drop redundancy out of their networks and instead embrace these Single Points of Failure (SPoF). The main points Mr. Clark made boil down to a couple of major statements:

  1. Single-device networks are less complex and easier to manage and troubleshoot. Don’t have multiple devices when an all-in-one works better.
  2. Consumer-grade hardware is cheaper and easier to understand, therefore it’s better. Plus, if you need a backup you can just buy a second one and keep it on the shelf.

I’m sure most networking pros out there are practically bristling at these suggestions. Others may read through the original tweet and think this was a tongue-in-cheek post. Let’s look at the argument logically and understand why it has some merit but is ultimately flawed.

Missing Minutes Matter

I’m going to tackle the second point first. The idea that you can use cheaper gear and have cold standby equipment just sitting on the shelf is one that I’ve heard of many times in the past. Why pay more money to have a hot spare or a redundant device when you can just stock a spare part and swap it out when necessary? If your network design decisions are driven completely by cost then this is the most appealing thing you could probably do.

I once worked with a technology director who insisted that we forgo our usual configuration of RAID-5 with a hot spare drive in the servers we deployed. His logic was that the hot spare drive was spinning without actually doing anything. If something did go wrong, it was just as easy to take a spare drive off the shelf, slip it in, and fire it up then, rather than run the risk that the idle drive might fail while sitting in the server. His logic seemed reasonable enough but there was one variable that he wasn’t thinking about.

Time is always the deciding factor in redundancy planning. In the world of backup and disaster recovery they use the acronym RTO, which stands for Recovery Time Objective. Essentially, how long can your systems be offline before the data is restored? Can you go days without getting your data back? Or do you need it back up and running within hours or even minutes? For some organizations the RTO could even be measured in mere seconds. Every step down in RTO adds complexity and cost.

If you can go days without your data then a less expensive tape solution is best because it is among the cheapest options per byte stored and holds up well in long-term storage. If your RTO is minutes or less you need to add hardware that replicates changes or mirrors the data between sites to ensure there is always an available copy somewhere out there. Time is the deciding factor here, just as it is in the redundancy example above.

Can your network tolerate hours of downtime while you swap in a part from off the shelf? Remember that you’re going to need to copy the configuration over to it and ensure it’s back up and running. If it is a consumer-grade device there probably isn’t an easy way to console in and paste the config. Maybe you can upload a file from the web GUI but the odds are pretty good that you’re looking at downtime at least in the half-hour range if not more. If your office can deal with that then Max’s suggestions should work just fine.
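To put rough numbers on that question, here is a minimal sketch that adds up the steps of a cold-spare swap and checks the total against an RTO. The step names and every duration are hypothetical values chosen for illustration, not measurements; plug in your own.

```python
def cold_spare_recovery_minutes(detect, fetch, swap, restore_config, verify):
    """Total downtime for a cold-spare swap.

    The steps happen one after another, so the durations simply add up.
    All arguments are minutes.
    """
    return detect + fetch + swap + restore_config + verify


def meets_rto(recovery_minutes, rto_minutes):
    """True if the estimated recovery time fits inside the RTO."""
    return recovery_minutes <= rto_minutes


# Hypothetical durations in minutes; none of these come from real measurements.
recovery = cold_spare_recovery_minutes(
    detect=5,           # someone notices the outage
    fetch=10,           # find the spare on the shelf
    swap=10,            # unplug, replace, recable
    restore_config=15,  # upload the saved config through the web GUI
    verify=5,           # confirm traffic is flowing again
)

print(recovery)                              # 45
print(meets_rto(recovery, rto_minutes=240))  # True: a four-hour RTO is fine
print(meets_rto(recovery, rto_minutes=15))   # False: a 15-minute RTO is not
```

With these made-up numbers the swap lands in the 45-minute range, which is comfortable for an office with a four-hour RTO and far too slow for one that needs to be back in 15 minutes.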

For organizations that need to be back up and running in minutes rather than hours, you need to have fault tolerance in your network. Redundant paths for traffic or multiple devices to eliminate single points of failure are the only way to ensure that traffic keeps flowing in the event of a hardware failure. Sure, it’s more complicated to troubleshoot. But the time you spend making it work correctly is not time you’re going to spend copying configurations to a cold device while users and stakeholders are yelling at you to get things back online.
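A back-of-the-envelope way to see what redundancy buys you is the standard parallel availability formula. The sketch below assumes device failures are independent, which real networks only approximate (shared power, shared software bugs), and the 99.9% figure for a single device is a hypothetical input rather than a number from any vendor.

```python
HOURS_PER_YEAR = 24 * 365


def parallel_availability(single, count):
    """Availability of `count` redundant devices when any one can carry traffic.

    Assumes failures are independent: the group is down only when
    every device is down at the same time.
    """
    return 1 - (1 - single) ** count


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


single = 0.999  # hypothetical availability of one device
pair = parallel_availability(single, 2)

print(round(downtime_hours_per_year(single), 2))  # about 8.76 hours per year
print(round(downtime_hours_per_year(pair), 4))    # about 0.0088 hours per year
```

Under those assumptions a second device turns almost nine hours of expected downtime a year into well under a minute. The model ignores failover time and the extra troubleshooting complexity, so treat it as an upper bound on the benefit.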

All-In-Wonder

Let’s look at the first point here. Single box solutions are better because they are simple to manage and give you everything you could need. Why buy a separate switch, firewall, and access point when you can get them all in one package? This is the small office / branch office model. SD-WAN has even started moving down this path for smaller deployments by pushing all the devices you need into one footprint.

It’s not unlike the TVs you can buy in the big box stores that have DVD players, VHS players, and even some streaming services built in. They’re easy to use because there are no extra wires to plug in and no additional remote controls to lose. Everything works from one central location and it’s simple to manage. The package is a great solution when you need to watch old VHS tapes or DVDs from your collection infrequently.

Of course, most people understand the drawbacks of this model. Those devices can break. They are much harder to repair when they’re all combined. Worse yet, if the DVD player breaks and you need to get it repaired you lose the TV completely during the process instead of just the DVD player. You also can’t upgrade the components individually. Want to trade out that DVD for a Blu-Ray player? You can’t unless you install one on its own. Want to keep those streaming apps up-to-date? Better hope the TV has enough memory to keep current. Even state-of-the-art streaming boxes will eventually be incapable of running the latest version of popular software.

All-in-one devices are best left to the edges of the network. They function well in offices with a dozen or so workers. If something goes bad on the device it’s easier to just swap the whole thing instead of trying to repair the individual parts. That same kind of mentality doesn’t work quite so well in a larger data center. The fact that most of these unified devices don’t take rack mounting ears or fit into a standard data center rack should be a big hint that they aren’t designed for use in a place that keeps the networking pieces off of someone’s desk.


Tom’s Take

I smiled a bit when I read the tweet that started this whole post. I’m sure that the networks that Max has worked on work much better with consumer all-in-one devices. Simple configurations and cold spares are a perfectly acceptable solution for law offices or tag agencies or other places that don’t measure their downtime in thousands of dollars per second. I’m not saying he’s wrong. I’m saying that his solution doesn’t work everywhere. You can’t run the core of an ISP with some SMB switches. You shouldn’t run your three-person law office with a Cat6500. You need to decide what factors are the most important for you. Don’t embrace failure without thought. Figure out how tolerant you or your customers are of failure and design around it as best you can. Once you can do that you’ll have a much better idea of how to build your network with the fewest points of failure.