Redundancy Is Not Resiliency

Most people carry a spare tire in their car. It’s there in case you get a flat and need to change the tire before you can be on your way again. In my old VAR job I drove a lot, far from home and often to the middle of nowhere, so I didn’t want to rely on roadside assistance. Instead I just grabbed the spare out of the back when I needed it and went on my way. However, the process wasn’t entirely hitless. Even the pit crew for a racing team needs time to change tires. I could probably get it done in 20 minutes with the appropriate amount of cursing, but those were 20 minutes when I wasn’t doing anything else besides fixing a tire.

Spare tires are redundant. You have an extra thing to replace something that isn’t working. IT operations teams are familiar with redundant systems. Maybe you have a cold spare on the shelf for a switch that might go down. You might have a cold or warm data center location for a disaster. You could even have redundant devices in your enterprise to help you get back into your equipment if something causes it to go offline. Well, I assume you do. If you’re Meta/Facebook, you didn’t have them this time last year.

Don’t mistake redundancy for resilience, though. As with the tire analogy above, you’re not going to be able to fix a flat while you’re driving. Yes, I’ve seen the crazy videos online of people doing exactly that, but aside from stunt driving you’re going to have to take some downtime from your travel to fix the tire. Likewise, a redundant setup built on cold spares or out-of-band devices connected directly to your network can still incur downtime if those devices go offline and lock you out of your management system. Facebook probably thought their out-of-band control system worked just fine. Until it didn’t.

The Right Gear for Resilience

At Networking Field Day 29 last week we were fortunate to see Opengear present for the second time. I’m familiar with them from all the way back at Networking Field Day 2 in 2011, so their journey through the changes in networking over the past decade has been great to see. They make out-of-band devices and they make them well. They’re one of the companies that immediately spring to mind when you think about solutions for getting access to devices outside the normal network path.

As a VAR engineer there were times when I needed to make service calls to remote locations just to reboot devices or get console access to fix an issue. Whether it was driving three hours to press F1 and clear a failed power supply message or racing across town to restore phone service after locking myself out of an SSH session, there are numerous reasons why having actual physical access to the console is important. Until we perfect quantum teleportation, we’re going to have to solve that problem with technology. Here’s a video from the Networking Field Day session that highlights some of the challenges and solutions that Opengear has available:

Ryan Hogg brings up a great argument here for redundancy versus resiliency. Are you managing your devices in-band? Or do you have a dedicated management network? And what’s your backup for that dedicated network if something goes offline? VLAN separation isn’t good enough. In a failure mode such as a bridging loop, or an attack that takes a switch offline, you won’t be able to reach the management network if you can’t send packets through it. If the tire goes flat, you’re stopped until it’s fixed.

Opengear solves this problem in a number of ways. The first, of course, is providing a secondary access method to your network. Opengear console devices have a cellular backup function that allows you to reach them in the event of an outage, whether it’s the internal network or the Internet connection that has gone down. I can think of a couple of times in my career when I would have loved to be able to connect to a cellular interface to undo a change that had unintended consequences. Sometimes reload in 5 doesn’t quite do the job. Having a reliable way to connect to your core equipment makes life easier for network operators who can’t always keep themselves from making mistakes.

However, as mentioned, redundancy is not resiliency. It’s not enough for us to have access to fix the problem while everything is down and the world is on fire. We may be able to get back in and fix the issue without needing to drive to the site, but the users in that location are still down while we’re working. SD-WAN devices have offered us diverse connectivity options for a number of years now. If the main broadband line goes down, just fail over to the cellular connection for critical traffic until it comes back up. That’s easy to do now that we have the proper equipment to create circuit diversity.

As outlined in the video above, Opengear offers the same capability. If you don’t have a fancy SD-WAN edge device, you can still configure Opengear console devices to act as a secondary egress point. It’s as simple as configuring the network with an IP SLA to track the state of the WAN link and installing the cellular route in the routing table if that link goes down, as sketched below. Once it’s configured, your users can get out on the backup link while you’re coming in to fix whatever caused the issue. If it’s the ISP having the issue, you can log a ticket and confirm things are working on-site without having to jump in a car to see what your users see.
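For the sake of illustration, here’s a rough sketch of what that tracking-and-failover pattern looks like in generic Cisco IOS syntax. The addresses, probe target, and administrative distance are placeholders I made up for the example, not anything Opengear ships, and the real configuration will depend on your equipment:

! Probe the primary WAN next hop every 10 seconds (placeholder addresses)
ip sla 1
 icmp-echo 203.0.113.1 source-interface GigabitEthernet0/0
 frequency 10
ip sla schedule 1 life forever start-time now
!
! Tie the primary default route to the health of that probe
track 1 ip sla 1 reachability
ip route 0.0.0.0 0.0.0.0 203.0.113.1 track 1
!
! Floating static route toward the cellular path, installed only
! when the tracked primary route is withdrawn
ip route 0.0.0.0 0.0.0.0 192.168.100.1 250

The specifics vary by platform, but the pattern is always the same: track the health of the primary path and let the backup route float into the table only when the primary disappears.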

Resilience Really Matters

One of the things Opengear has always impressed me with is the long list of use cases for their devices. I can already think of a ton of ways I could implement something like this for customers that need monitoring and resilient connectivity. Remote offices are an obvious choice, but so are locations with terrible connectivity options.

If you’re working in a location with spotty connectivity, you can easily deploy an Opengear device to keep an eye on the network and/or servers as well as provide an extra way for the site to get back online in the event of an issue. If the WAN circuit goes down, you can just hop over to the cellular link until you get it fixed. Opengear will tell you something happened, and you can log in to the Lighthouse central management system to collect data. If it’s configured correctly, your users may not even realize the primary link is offline! We’re almost at the point of changing the tire while we’re driving.


Tom’s Take

I am often asked if I miss working on networking equipment since I rarely touch it these days. As soon as I’m compelled to answer that question, I remember all the times I had to drive somewhere to fix an issue. Wasted time can never be recovered. Resources cost money, whether it’s money for a device or time spent going to fix one. I look at the capabilities a company like Opengear has today and wish I’d had them fifteen years ago so I could have deployed them to the places I knew needed them. In my former line of work, redundant things were hard to come by. Resilient options were much more appealing because they offered more than just a backup plan in case of failure. You need to pick resiliency every time, because otherwise you’re going to be losing time replacing that tire when you could be rolling along fixing it instead.


Disclaimer

Opengear was a presenter at Networking Field Day 29 on September 7, 2022. I am an employee of Tech Field Day, which is the company that managed the event. This blog post represents my own personal thoughts about the presentation and is not the opinion of Tech Field Day. Opengear did not provide compensation for this post or ask for editorial approval. This post is my perspective alone.

Who Pays The Price of Redundancy?

No doubt by now you’ve seen the big fire that took out a portion of the OVHcloud data center earlier this week. These kinds of things are difficult to deal with on a good day. This is why data centers have redundant power feeds, fire suppression systems, and the ability to get back up to full capacity quickly. Modern data centers are getting very good at ensuring they can stay up through most of the events that would impact an on-premises private data center.

One of the issues I saw that was ancillary to the OVHcloud outage was the small group of people who were frustrated that their systems went down when the fire knocked out the racks where their instances lived. More than a few comments insisted that clouds should not go down like this, asked about credit for the time spent offline, or otherwise complained about the unavailability. By and large, most of those complaining were running non-critical systems or were using the cheapest possible instances for their hosts.

Aside from the myopic idea that “cloud shouldn’t go down”, how do we deal with the fact that cloud redundancy doesn’t always translate to single-instance availability? I think we need to step back and educate people about their responsibilities to their own systems and about who ultimately pays for redundancy.

Backup Plans

I mentioned earlier that big accidents and incidents are the reason why public cloud data centers have redundant systems. They have separate power feeds, backup generators, extra cabling to keep cuts from taking down systems, redundant cooling to keep things cold, and even redundant network connectivity across a variety of providers.

What is the purpose of all of this redundancy? Is it for the customers or for the provider? All of it exists because the provider needs to ensure that it will take a massive issue to interrupt service to the customer. In a private data center you can be knocked offline when the primary link goes down or a UPS decides to die on you. In a public data center, a single dead rack PDU could knock out ten customers. So it’s in the provider’s best interest to have the right redundancy built in to keep those customers happy.

Public data centers pay for all of this redundancy to keep their customers coming back each month. If your data center gets knocked offline by some simple issue, you’d better believe you’re going to be shopping for a new partner. You’re paying for their redundancy with your monthly bill. Sure, that massive generator may be sitting there just in case they need it. But they’re recouping the cost of having it around by charging you a few extra cents each billing cycle.

Extending the Backup Bubble

What about providing redundancy for your applications and servers, though? Take the OVHcloud issue above: why doesn’t the provider just back my stuff up or move it to a different server when everything goes down? I mean, vMotion and DRS do it in my private data center. Can’t they just check a box and make it happen?

There are a few reasons why this doesn’t happen in the public cloud right now. The first is pretty easy. Having to back up, restore, replicate, and manage customer availability is going to take more than a few extra hands working on the customer infrastructure. Sure, they could configure vMotion (or something similar) to send your VMs to a different rack if one were to go offline. But who keeps tabs on that to make sure it happens? Who tests the failover to keep it consistent? What happens if there’s a split-brain scenario? Where is the password for that server stored?

You’re probably answering all of these questions off the top of your head because you’re the IT expert, right? So if the cloud provider is doing this for you, what are you going to be doing? Clicking “next” on the installation prompt? If your tasks are being done by some other engineer in the cloud, what are we paying you for again? Just like automation, having a cloud engineer do your job for you means we don’t need to pay you any longer.

The second reason is liability. Right now, if there’s an incident that knocks a cloud provider offline, they’re liable for the downtime for all of their customers. Most contracts have a force majeure clause built in that exempts the provider from liability for extraordinary circumstances, such as fire, weather, or even terrorist activity. That way the provider doesn’t have to pay you back for something that couldn’t have been foreseen. If there’s an outage caused by some technical issue, though, they will owe you for that one.

However, if the cloud provider starts managing your equipment and services for you, then they’re liable if there’s an outage. If they screw up the vMotion settings, or DRS gets too aggressive and migrates a VM to a bad host, who is responsible for the downtime? If it’s you, then you get yelled at and the company loses money. If it’s the provider managing things for you, then the provider gets yelled at, threatened, and possibly litigated against to recover the lost income. See now why no provider wants to touch your stuff? We had a rule when I worked at a VAR: be very careful about which problems you decide to fix outside the project scope. Because as soon as you touch them, they become your problems and you’re on the hook to fix them.

Lastly, the provider isn’t responsible for your redundancy for one other simple reason: you’re not paying them for it. If Amazon or Microsoft or Google offered a hosting package that included server replication, monitoring, and 99.9999% uptime for your application and its data, do you think it would cost the same as basic instance pricing? You’d better believe it wouldn’t! These companies would be happy to sell you exactly what you’re looking for, but you aren’t going to want to pay the price for it. It’s easy to build in the cost of a generator spread across hundreds or thousands of customers. But if you want someone backing your data up every day and validating it, you’re going to be paying the lion’s share of the cost. And most of the people using low-cost providers for non-critical workloads aren’t going to want to pay extra anyway.
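To put some purely illustrative numbers on that (these are made-up figures, not anyone’s actual pricing): a $300,000 generator amortized over five years across 2,500 tenants works out to roughly $2 per customer per month, which quietly disappears into everyone’s bill. A dedicated backup, replication, and restore-testing service for your workloads alone can’t be spread across anyone else, so that cost lands entirely on you.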


Tom’s Take

I get it. Cloud is nice and it’s always there. Cloud is easier to build out than your on-prem data center. Cloud makes life easy. However, cloud is not magic. Just because AWS doesn’t have outages every month doesn’t mean your servers won’t go down if you’re not backing them up. Just because you have availability zones you can use doesn’t mean your data is magically going to show up in Oregon because you want it to. You’re going to have to pay the cost to make your infrastructure redundant on top of the provider’s infrastructure. That means money invested in software, time invested in deployment, and workers invested in making sure it’s all there when you need it to catch your outages. Don’t assume the cloud is going to solve every one of your challenges. Because if it did, we’d be paying a lot more for it and IT workers wouldn’t be getting paid at all.

When Redundancy Strikes

Networking and systems professionals preach the value of redundancy. When we tell people to buy something, we really mean “buy two.” And when we say to buy two, we really mean buy four of them. We try to create backup routes and redundant failover paths, and we keep things from being used in a way that creates a single point of disaster. But what happens when something we’ve worked hard to set up causes us grief?

Built To Survive

The first problem I ran into was one I knew how to solve. I was installing a new Ubiquiti Security Gateway. I knew that as soon as I pulled my old edge router out, I was going to need to reset my cable modem in order to clear the ARP cache. That’s always a thing that needs to happen when you’re installing new equipment. Having done this many times, I knew the shortcut was to unplug my cable modem for a minute and plug it back in.

What I didn’t know this time was that the little redundant gremlin living in my cable modem was going to give me fits. After fifteen minutes of not getting the system to come back up the way that I wanted, I decided to unplug my modem from the wall instead of the back of the unit. That meant the lights on the front were visible to me. And that’s when I saw that the lights never went out when the modem was unplugged.

Turns out my modem has a battery pack installed, since it’s also the VoIP router for my home phone system. That battery pack is designed to keep the phones in the house running for a few minutes in a failover scenario. But it also meant the modem wasn’t letting go of the cached ARP entries. So all my efforts to make my modem recognize the new firewall were being stymied by the battery designed to keep my phone system redundant in case of a power outage.

The second issue came when I went to turn up a new Ubiquiti access point. I disconnected the old Meraki AP in my office and started mounting the bracket for the new AP. I had already warned my daughter that the Internet was going to go down. I also thought I might have to reprogram her device to use the new SSID I was creating. Imagine my surprise when both my laptop and her iPad were working just fine while I was hooking the new AP up.

Turns out, both devices did exactly what they were supposed to do. They connected to the other Meraki AP in the house and used it while the old one was offline. Once the new Ubiquiti AP came up, I had to go upstairs and unplug that Meraki to fail everything over to the new AP. It took some more programming to get everything running the way I wanted, but my wireless clients had done the job they were supposed to do. They failed over to the SSID they could see and kept on running until that SSID failed as well.

Finding Failure Fast

When you’re trying to troubleshoot around a problem, you need to make sure you’re taking redundancy into account as well. I’ve faced a few problems where trying to induce a failure or remove a configuration was met with difficulty because some other part of the network or system kept “replacing” my hard work with a backup copy. Or I was trying to figure out why packets were flowing around a trouble spot, or weren’t being inspected by a security device, only to find out that the path they were taking ran through a redundant device somewhere else in the network.

Redundancy is a good thing. Until it causes issues. Or until it makes your network behave in unpredictable ways. Most of the time, this can all be mitigated by good documentation practices. Being able to quickly figure out where the redundant paths in a network go is critical to diagnosing intermittent failures.

It’s not always as easy as pulling up a routing table, either. If the entire core is down, you could have traffic being routed at the edge with no way of knowing whether the redundant supervisors in the chassis are doing their job. You need to write everything down and know what hardware you’re dealing with. You need to document redundant power supplies, redundant management modules, and redundant switches so you can isolate problems and fix them without pulling your hair out.


Tom’s Take

I rarely got to work with redundant equipment when I was installing it through E-Rate. The government doesn’t believe in buying two things to do the job of one. So when I did get the opportunity to work with redundant configurations, I usually found myself trying to figure out why things were failing in ways I couldn’t predict. After a while, I realized that I needed to start making my own notes and doing some investigation before I actually started troubleshooting. And even then, like my cable modem’s battery, I ran into issues. Redundancy keeps you from shooting yourself in the foot. But it can also make you stab yourself in the eye in frustration.