You may have heard in the news this week that there was a big issue with Southwest Airlines this holiday season. The issues are myriad and this is going to make for some great case studies for students in the future. However, one thing I wanted to touch on briefly in this whole debacle was the issue of a cascade failure.
The short version is that a weather disruption in the flight schedule became a much bigger problem when the process for rescheduling the flight crews was overwhelmed. Turns out that even after the big computer system upgrades and all the IT work that has gone into putting together a modern airfare booking system that one process was still very manual. The air crew rescheduling department was relatively small in nature and couldn’t keep up with the demands placed on it by the disruptions. It got to the point where Southwest had to reduce their number of flights in order to get the system back to normal.
Worst Case Scenario
I’m not an expert at airline scheduling but I have spent a lot of time planning for disaster recovery. One of the things that we focus on more than anything else is the recovery aspect. The whole purpose of the plan is to get things up and running, right? That requires a look at the big picture of what data needs to be saved and where it needs to be stored and how people are going to do it. In the above example the focus of the airline was getting passengers booked on flights as soon as possible.
However, the details matter just as much as the big picture. If you don’t know the process at every step of the way you’re going to find that the weakest link in the chain is the one that breaks. All the upgrades in the world for remote storage or immutable snapshots won’t matter if someone doesn’t have a key to the data center to turn everything on. Just ask the engineers at Facebook that didn’t realize the door controls for the data center relied on the internal systems that were unreachable during their 2021 outage.
How can you catch these little details? How can you be so sure that everything isn’t going to fall apart because someone forgot the keys to the closet with the disaster recovery binder? The answer, of course, is testing. You’re going to have to test every aspect of the plan from top to bottom. Most everyone will agree that you have to test everything properly to make sure no one problem overwhelms the system. However, that’s where this whole thing falls apart.
Forest for the Trees
If you’re just looking at the individual aspects of your disaster recovery plan in a vacuum you’re going to have a miserable time of it when things go wrong. Unit testing is a popular way to look at the components of the plan to ensure they work without incurring too much cost or too many resources for the testing. However, unit testing alone doesn’t look at the way the details integrate.
That’s where integration testing comes into play. It’s not enough to check the individual pieces. Maybe the computer system is good at rescheduling passengers and balancing the gate assignments. However, if they can’t get on the plane because the system doesn’t think there is a crew due to the way the process interacts with a different area then you don’t have a functional system no matter how great one part of it is.
Disaster recovery tests can be done at the unit level to make sure new modules or processes are solid but you have to make sure you have integrated full tests at least once every six months or so. You have to find the holes in the system caused by the interactions of the details. Sure, the generators might fire up on cue but what if someone is parked in front of the fuel delivery area? What if the keys to the backup cages are on someone’s keychain instead of in a central area? These are questions you want to answer before everyone is running around with their hair on fire.
More importantly, when this happens you need to document all of it. If there is a particular integration that fails you need to write it down and discuss it with your teams. Understand why it happened and put process and procedure in place to cover it. Then make sure that everyone is aware the plan was updated. If people think that something has to be done a certain way because that’s the way it’s always done they’re going to keep doing it that way until they are told differently. Communication is key in any kind of adverse situation.
Lastly, be honest when you’re evaluating these process failures. Don’t try to explain it away or minimize the impact. If someone didn’t do their job then make sure everyone knows what needed to happen and how it failed. If a system doesn’t work properly then analyze the system and fix it. Don’t throw blame where it’s not warranted but don’t explain it away to salve an ego. You need to make the process work and make it work every time so that you don’t run into issues again.
Disasters happen. If you’re really lucky you will have something in place to keep the disaster from spiraling out of control. The plural of lucky is good and in order to be good you need to analyze how the process works in concert with every component to make sure there are no weak links. If the chain breaks like it did for Southwest you’ll be very lucky to lose money and customers. If you’re not lucky you’ll lose a lot more.