No doubt by now you’ve seen the big fire that took out a portion of the OVHcloud data center earlier this week. These kinds of things are difficult to deal with on a good day. This is why data centers have reductant power feeds, fire suppression systems, and the ability to get back up to full capacity. Modern data centers are getting very good at ensuring they can stay up through most events that could impact an on-premises private data center.
One of the issues I saw that was ancillary to the OVHcloud outage was the small group of people that were frustrated that their systems went down when the fire knocked out the racks where their instances lived. More than a couple of comments mentioned that clouds should not go down like this or asked about credit for time spent being offline or some form of complaints about unavailability. By and large, most of those complaining were running non-critical systems or were using the cheapest possible instances for their hosts.
Aside from the myopia that “cloud shouldn’t go down”, how do we deal with this idea that cloud redundancy doesn’t always translate to single instance availability? I think we need to step back and educate people about their responsibilities to their own systems and who ultimately pays for redundancy.
I mentioned earlier that big accidents or incidents are the reasons why public cloud data centers have redundant systems. They have separate power feeds, generator power backups, extra cabling to prevent cuts from taking down systems, redundant cooling systems to keep things cold, and even redundant network connectivity across a variety of providers.
What is the purpose of all of this redundancy? Is it for the customers or the provider? The reason why all this exists is because the provider needs to ensure that it will take a massive issue to interrupt service to the customer. In a private data center you can be knocked offline when the primary link goes down or a UPS decides to die on you. In a public data center you could knock out ten customers with a dead rack PDU. So it is in their best interests to ensure that they have the right redundancy built in to keep their customers happy.
Public data centers pay for all of this redundancy to keep their customers coming back each month. If your data center gets knocked offline for some simple issue you can better believe you’re going to be shopping for a new partner. You’re paying for their redundancy with your month billing cycle. Sure, that massive generator may be sitting there just in case they need it. But they’re recouping the cost of having it around by charging you a few extra cents each cycle.
Extending the Backup Bubble
What about providing redundancy for your applications and servers, though? Like the OVHcloud issue above, why doesn’t the provider just back my stuff up or move it to a different server when everything goes down? I mean, vMotion and DRS do it in my private data center. Can’t they just check a box and make it happen?
There are two main reasons why this doesn’t happen in public cloud right now. The first is pretty easy. Having to backup, restore, replicate, and manage customer availability is going to take more than a few extra hands working on the customer infrastructure. Sure, they could configure vMotion (or something similar) to send your VMs to a different rack if one were to go offline. But who keeps tabs on that to make sure it happens? Who tests the failover to keep it consistent? What happens if there is a split brain scenario? Where is the password for that server stored?
You’re probably answering all of these questions off the top of your head because you’re the IT expert, right? So if the cloud provider is doing this for you, what are you going to be doing? Clicking “next” on the installation prompt? If your tasks are being done by some other engineer in the cloud, what are we paying you for again? Just like automation, having a cloud engineer do your job for you means we don’t need to pay you any longer.
The second reason is liability. Right now, if there is an incident that knocks a cloud provider offline they’re liable for the downtime for all their customers. Most of the contracts have a force majeure clause built into them that exempts liability for extraordinary circumstances, such as fire, weather, or even terrorist activity. That way the provider doesn’t need to pay you back for something there was no way to have foreseen. if there is some kind of outage caused by some technical issue then they will owe you for that one.
However, if the cloud provider starts managing your equipment and services for you then they are liable if there is an outage. If they screw up the vMotion settings or DRS starts getting too aggressive and migrates a VM to a bad host who is responsible for the downtime? If it’s you then you get yelled at and the company loses money. If it’s the provider managing for you then the provider gets yelled at, threatened, and possibility litigated to recover the lost income. See now why no provider wants to touch your stuff? We used to have a rule when I worked at a VAR that you better be very careful about which problems you decided to fix out of the project scope. Because as soon as you touch those they become your problems and you’re on the hook to fix them.
Lastly, the provider isn’t responsible for your redundancy for one other simple reason: you’re not paying them for it. If Amazon or Microsoft or Google offered a hosting package that included server replication and monitoring and 99.9999% uptime of your application data do you think it would cost the same as the basic instance pricing? You’d better believe it wouldn’t! These companies would be happy to sell you just what you’re looking for but you aren’t going to want to pay the price for it. It’s easy to build in the cost of a generator spread across hundreds or thousands of customers. But if you want someone backing your data up every day and validating it you’re going to be paying the lion’s share of the cost. And most of the people using low-cost providers for non-critical workloads aren’t going to want to pay extra anyway.
I get it. Cloud is nice and it’s always there. Cloud is easier to build out than your on-prem data center. Cloud makes life easy. However, cloud is not magic. Just because AWS doesn’t have outages every month doesn’t mean your servers won’t go down if you’re not backing them up. Just because you have availability zones you can use doesn’t mean data is magically going to show up in Oregon because you want it to. You’re going to have to pay the cost to make your infrastructure redundant on top of the provider infrastructure. That means money invested in software, time invested in deployment, and workers invested in making sure it all checks out when you need it to be there to catch your outages. Don’t assume the cloud is going to solve every one of your challenges. Because if it did we’d be paying a lot more for it and IT workers wouldn’t be getting paid at all.