Redundancy Is Not Resiliency


Most people carry a spare tire in their car. It’s there in case you get a flat and need to change the tire before you can be on your way again. In my old VAR job I drove a lot away from home and to the middle of nowhere so I didn’t want to rely on roadside assistance. Instead I just grabbed the extra tire out of the back if I needed it and went on my way. However, the process wasn’t entirely hitless. Even the pit crew for a racing team needs time to change tires. I could probably get it done in 20 minutes with appropriate cursing but those were 20 minutes that I wasn’t doing anything else beyond fixing a tire.

Spare tires are redundant. You have an extra thing to replace something that isn’t working. IT operations teams are familiar with redundant systems. Maybe you have a cold spare on the shelf for a switch that might go down. You might have a cold or warm data center location for a disaster. You could even have redundant devices in your enterprise to help you get back in to your equipment if something causes it to go offline. Well, I say that you do. If you’re Meta/Facebook you didn’t have them this time last year.

Don’t mistake redundancy for resilience though. Like the tire analogy above you’re not going to be able to fix a flat while you’re driving. Yes, I’ve seen the crazy video online of people doing that but aside from stunt driving you’re going to have to take some downtime from your travel to fix the tire. Likewise, a redundant setup that includes cold spares or out-of-band devices that are connected directly to your network could incur downtime if they go offline and lock you out of your management system. Facebook probably thought their out-of-band control system worked just fine. Until it didn’t.

The Right Gear for Resilience

At Networking Field Day 29 last week we were fortunate to see Opengear present again for the second time. I’m familiar with them from all the way back at Networking Field Day 2 in 2011 so their journey through the changes of networking over the past decade has been great to see. They make out-of-band devices and they make them well. They’re one of the companies that immediately spring to mind when you think about solutions for getting access to devices outside the normal network access method.

As a VAR there were times that I needed to make calls to locations in order to reboot devices or get console access to fix an issue. Whether it was driving 3 hours to press F1 to clear a failed power supply message or racing across town to restore phone service after locking myself out of an SSH session there are numerous reasons why having actual physical access to the console is important. Until we perfect quantum teleportation we’re going to have to solve that problem with technology. Here’s a video from the Networking Field Day session that highlights some of the challenges and solutions that Opengear has available:

Ryan Hogg brings up a great argument here for redundancy versus resiliency. Are you managing your devices in-band? Or do you have a dedicated management network? And what’s your backup for that dedicated network if something goes offline? VLAN separation isn’t good enough. In the event of a failure mode, such as a bridging loop or another attack that takes a switch offline you won’t be able to access the management network if you can’t sent packets through it. If the tire goes flat you’re stopped until it’s fixed.

Opengear solves this problem in a number of ways. The first is of course providing a secondary access method to your network. Opengear console devices have a cellular backup function that can allow you to access them in the event of an outage, either from the internal network or from the Internet going down. I can think of a couple of times in my career where I would have loved to have been able to connect to a cellular interface to undo a change that just happened that had unintended consequences. Sometimes reload in 5 doesn’t quite do the job. Having a reliable way to connect to your core equipment makes life easy for network operating systems that don’t keep from making mistakes.

However, as mentioned, redundancy is not resiliency. It’s not enough for us to have access to fix the problem while everything is down and the world is on fire. We may be able to get back in and fix the issue without needing to drive to the site but the users in that location are still down while we’re working. SD-WAN devices have offered us diverse connectivity options for a number of years now. If the main broadband line goes down just fail back to the cellular connection for critical traffic until it comes back up. Easy to do now that we have the proper equipment to create circuit diversity.

As outlined in the video above, Opengear has the same capability as well. If you don’t have a fancy SD-WAN edge device you can still configure Opengear console devices to act as a secondary egress point. It’s as simple as configuring the network with an IP SLA to track the state of the WAN link and installing the cellular route in the routing table if that link goes down. Once configured your users can get out on the backup link while you’re coming in to fix whatever caused the issue. If it’s the ISP having the issue you can log a ticket and confirm things are working on-site without having to jump in a car to see what your users see.

Resilience Really Matters

One of the things that Opengear has always impressed me with is their litany of use cases for their devices. I can already think of a ton of ways that I could implement something like this for customers that need monitoring and resilient connectivity options. Remote offices are an obvious choice but so too are locations with terrible connectivity options.

If you are working in a location with spotty connectivity you can easily deploy an Opengear device to keep an eye on the network and/or servers as well as providing an extra way for the site to get back online in the event of an issue. If the WAN circuit goes down you can just hop over to the cellular link until you get it fixed. Opengear will tell you something happened and you can log into the Lighthouse central management system to go there and collect data. If configured correctly your users may not even realize they’re offline! We’re almost at the point of changing the tire while we’re driving.


Tom’s Take

I am often asked if I miss working on networking equipment since I rarely touch it these days. As soon as I’m compelled to answer that question I remember all the times I had to drive somewhere to fix an issue. Wasted time can never be recovered. Resources cost money whether it’s money for a device or time spent going to fix one. I look at the capabilities that a company like Opengear has today and I wish I had those fifteen years ago and could deploy them to places I know needed them. In my former line of work redundant things were hard to come by. Resilient options were much more appealing because they offered more than just a back plan in case of failure. You need to pick resiliency every time because otherwise you’re going to be losing time replacing that tire when you could be rolling along fixing it instead.


Disclaimer

Opengear was a presenter at Networking Field Day 29 on September 7, 2022. I am an employee of Tech Field Day, which is the company that managed the event. This blog post represents my own personal thoughts about the presentation and is not the opinion of Tech Field Day. Opengear did not provide compensation for this post or ask for editorial approval. This post is my perspective alone.

1 thought on “Redundancy Is Not Resiliency

  1. Pingback: Redundancy Is Not Resiliency - Tech Field Day

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s