Who Pays The Price of Redundancy?

No doubt by now you’ve seen the big fire that took out a portion of the OVHcloud data center earlier this week. These kinds of things are difficult to deal with on a good day. This is why data centers have reductant power feeds, fire suppression systems, and the ability to get back up to full capacity. Modern data centers are getting very good at ensuring they can stay up through most events that could impact an on-premises private data center.

One of the issues I saw that was ancillary to the OVHcloud outage was the small group of people that were frustrated that their systems went down when the fire knocked out the racks where their instances lived. More than a couple of comments mentioned that clouds should not go down like this or asked about credit for time spent being offline or some form of complaints about unavailability. By and large, most of those complaining were running non-critical systems or were using the cheapest possible instances for their hosts.

Aside from the myopia that “cloud shouldn’t go down”, how do we deal with this idea that cloud redundancy doesn’t always translate to single instance availability? I think we need to step back and educate people about their responsibilities to their own systems and who ultimately pays for redundancy.

Backup Plans

I mentioned earlier that big accidents or incidents are the reasons why public cloud data centers have redundant systems. They have separate power feeds, generator power backups, extra cabling to prevent cuts from taking down systems, redundant cooling systems to keep things cold, and even redundant network connectivity across a variety of providers.

What is the purpose of all of this redundancy? Is it for the customers or the provider? The reason why all this exists is because the provider needs to ensure that it will take a massive issue to interrupt service to the customer. In a private data center you can be knocked offline when the primary link goes down or a UPS decides to die on you. In a public data center you could knock out ten customers with a dead rack PDU. So it is in their best interests to ensure that they have the right redundancy built in to keep their customers happy.

Public data centers pay for all of this redundancy to keep their customers coming back each month. If your data center gets knocked offline for some simple issue you can better believe you’re going to be shopping for a new partner. You’re paying for their redundancy with your month billing cycle. Sure, that massive generator may be sitting there just in case they need it. But they’re recouping the cost of having it around by charging you a few extra cents each cycle.

Extending the Backup Bubble

What about providing redundancy for your applications and servers, though? Like the OVHcloud issue above, why doesn’t the provider just back my stuff up or move it to a different server when everything goes down? I mean, vMotion and DRS do it in my private data center. Can’t they just check a box and make it happen?

There are two main reasons why this doesn’t happen in public cloud right now. The first is pretty easy. Having to backup, restore, replicate, and manage customer availability is going to take more than a few extra hands working on the customer infrastructure. Sure, they could configure vMotion (or something similar) to send your VMs to a different rack if one were to go offline. But who keeps tabs on that to make sure it happens? Who tests the failover to keep it consistent? What happens if there is a split brain scenario? Where is the password for that server stored?

You’re probably answering all of these questions off the top of your head because you’re the IT expert, right? So if the cloud provider is doing this for you, what are you going to be doing? Clicking “next” on the installation prompt? If your tasks are being done by some other engineer in the cloud, what are we paying you for again? Just like automation, having a cloud engineer do your job for you means we don’t need to pay you any longer.

The second reason is liability. Right now, if there is an incident that knocks a cloud provider offline they’re liable for the downtime for all their customers. Most of the contracts have a force majeure clause built into them that exempts liability for extraordinary circumstances, such as fire, weather, or even terrorist activity. That way the provider doesn’t need to pay you back for something there was no way to have foreseen. if there is some kind of outage caused by some technical issue then they will owe you for that one.

However, if the cloud provider starts managing your equipment and services for you then they are liable if there is an outage. If they screw up the vMotion settings or DRS starts getting too aggressive and migrates a VM to a bad host who is responsible for the downtime? If it’s you then you get yelled at and the company loses money. If it’s the provider managing for you then the provider gets yelled at, threatened, and possibility litigated to recover the lost income. See now why no provider wants to touch your stuff? We used to have a rule when I worked at a VAR that you better be very careful about which problems you decided to fix out of the project scope. Because as soon as you touch those they become your problems and you’re on the hook to fix them.

Lastly, the provider isn’t responsible for your redundancy for one other simple reason: you’re not paying them for it. If Amazon or Microsoft or Google offered a hosting package that included server replication and monitoring and 99.9999% uptime of your application data do you think it would cost the same as the basic instance pricing? You’d better believe it wouldn’t! These companies would be happy to sell you just what you’re looking for but you aren’t going to want to pay the price for it. It’s easy to build in the cost of a generator spread across hundreds or thousands of customers. But if you want someone backing your data up every day and validating it you’re going to be paying the lion’s share of the cost. And most of the people using low-cost providers for non-critical workloads aren’t going to want to pay extra anyway.


Tom’s Take

I get it. Cloud is nice and it’s always there. Cloud is easier to build out than your on-prem data center. Cloud makes life easy. However, cloud is not magic. Just because AWS doesn’t have outages every month doesn’t mean your servers won’t go down if you’re not backing them up. Just because you have availability zones you can use doesn’t mean data is magically going to show up in Oregon because you want it to. You’re going to have to pay the cost to make your infrastructure redundant on top of the provider infrastructure. That means money invested in software, time invested in deployment, and workers invested in making sure it all checks out when you need it to be there to catch your outages. Don’t assume the cloud is going to solve every one of your challenges. Because if it did we’d be paying a lot more for it and IT workers wouldn’t be getting paid at all.

When Redundancy Strikes

Networking and systems professionals preach the value of redundancy. When we tell people to buy something, we really mean “buy two”. And when we say to buy two, we really mean buy four of them. We try to create backup routes, redundant failover paths, and we keep things from being used in a way that creates a single point of disaster. But, what happens when something we’ve worked hard to set up causes us grief?

Built To Survive

The first problem I ran into was one I knew how to solve. I was installing a new Ubiquiti Security Gateway. I knew that as soon as I pulled my old edge router out that I was going to need to reset my cable modem in order to clear the ARP cache. That’s always a thing that needs to happen when you’re installing new equipment. Having done this many times, I knew the shortcut method was to unplug my cable modem for a minute and plug it back in.

What I didn’t know this time was that the little redundant gremlin living in my cable modem was going to give me fits. After fifteen minutes of not getting the system to come back up the way that I wanted, I decided to unplug my modem from the wall instead of the back of the unit. That meant the lights on the front were visible to me. And that’s when I saw that the lights never went out when the modem was unplugged.

Turns out that my modem has a battery pack installed since it’s a VoIP router for my home phone system as well. That battery pack was designed to run the phones in the house for a few minutes in a failover scenario. But it also meant that the modem wasn’t letting go of the cached ARP entries either. So, all my efforts to make my modem take the new firewall were being stymied by the battery designed to keep my phone system redundant in case of a power outage.

The second issue came when I went to turn up a new Ubiquiti access point. I disconnected the old Meraki AP in my office and started mounting the bracket for the new AP. I had already warned my daughter that the Internet was going to go down. I also thought I might have to reprogram her device to use the new SSID I was creating. Imagine my surprise when both my laptop and her iPad were working just fine while I was hooking the new AP up.

Turns out, both devices did exactly what they were supposed to do. They connected to the other Meraki AP in the house and used it while the old one was offline. Once the new Ubiquiti AP came up, I had to go upstairs and unplug the Meraki to fail everything back to the new AP. It took some more programming to get everything running the way that I wanted, but my wireless card had done the job it was supposed to do. It failed to the SSID it could see and kept on running until that SSID failed as well.

Finding Failure Fast

When you’re trying to troubleshoot around a problem, you need to make sure that you’re taking redundancy into account as well. I’ve faced a few problems in my life when trying to induce failure or remove a configuration issue was met with difficulty because of some other part of the network or system “replacing” my hard work with a backup copy. Or, I was trying to figure out why packets were flowing around a trouble spot or not being inspected by a security device only to find out that the path they were taking was through a redundant device somewhere else in the network.

Redundancy is a good thing. Until it causes issues. Or until it makes your network behave in such a way as to be unpredictable. Most of the time, this can all be mitigated by good documentation practices. Being able to figure out quickly where the redundant paths in a network are going is critical to diagnosing intermittent failures.

It’s not always as easy as pulling up a routing table either. If the entire core is down you could be seeing traffic routing happening at the edge with no way of knowing the redundant supervisors in the chassis are doing their job. You need to write everything down and know what hardware you’re dealing with. You need to document redundant power supplies, redundant management modules, and redundant switches so you can isolate problems and fix them without pulling your hair out.


Tom’s Take

I rarely got to work with redundant equipment when I was installing it through E-Rate. The government doesn’t believe in buying two things to do the job of one. So, when I did get the opportunity to work with redundant configurations I usually found myself trying to figure out why things were failing in a way I could predict. After a while, I realized that I needed to start making my own notes and doing some investigation before I actually started troubleshooting. And even then, like my cable modem’s battery, I ran into issues. Redundancy keeps you from shooting yourself in the foot. But it can also make you stab yourself in the eye in frustration.