How are you supposed to handle outages? What happens when everything around you goes upside down in an instant? How much communication is “too much”? Or “not enough”? And is all of this written down now instead of being figured out when the world is on fire?
You might have noticed that Webex Teams spent most of this week down. Hard. Well, you might not have noticed if you used Microsoft Teams, Slack, or any other messaging service that wasn’t offline. Webex Teams went offline about 8:00pm EDT Monday night. At first, most people just thought it was a momentary outage and things would be back up. However, as the hours wore on and Cisco started updating the incident page with more info, it soon became apparent that Teams was not coming back soon. In fact, it took until Thursday for most of the functions to be restored from whatever knocked them offline.
What happened? Well, most companies don’t like to admit what exactly went wrong. For every Cloudflare or other provider that posts full disclosures of outages on their site, there are many more companies that will eventually release a statement with the least amount of technical detail possible to avoid any embarrassment. Cisco is currently in the latter category, with most guesses landing on some sort of errant patch that mucked things up big time behind the scenes.
It’s easy to see when big services go offline. If Netflix or Facebook is down hard, it can impact the way we go about our lives. On the occasions when our work tools like Slack or Google Docs are inoperable, it impacts our productivity more than our personal lives. But each and every outage does have some lessons that we can take away and learn for our own IT infrastructure or software operations. Don’t assume that companies that big, with redundancy everywhere, are immune to outages.
Stepping Through The Minefield
How do you handle your own outage? Well, sometimes it does involve eating some humble pie.
- Communicate – This one is both easy and hard. You need to tell people what’s up. You need to let everyone know things aren’t working right and you’re working to make them right. Sometimes that means telling people exactly what’s affected. Maybe you can log into Facebook but not Chat or Messages. Tell people what they’re going to see. If you don’t communicate, you’re going to have people guessing. That’s not good.
- Triage – Figure out what’s wrong. Make educated guesses if nothing stands out. Usually, the culprits are big changes that were just made or perhaps there is something external that is affecting your performance. The key is to isolate and get things back as soon as possible. That’s why big upgrades always have a backout plan. In the event that things go sideways, you need to get back to functional as soon as you can. New features that are offline aren’t as good as tried-and-true stuff that’s reachable.
- Honest Post-Mortem – This is the hardest part. Once you have things back in place, you have to figure out why they broke. This is usually where people start running for the hills and getting evasive. Did someone apply a patch at the wrong time? Did a microcode update get loaded to the wrong system? How can this be prevented in the future? The answers to these questions are often hard to get because the people who were affected and innocent often want to find the guilty parties and blame someone so they don’t look suspect. The guilty parties want to avoid blame and hide in public with the rest of the team. You won’t be able to get to the bottom of things unless you find out what went wrong and correct it. If it’s a process, fix it. If it’s a person, help them. If it’s a strange confluence of unrelated events that created the perfect storm, make sure that can never happen again.
- Communicate (Again) – This is usually where things fall over for most companies. Even the best ones get really good at figuring out how to prevent problems. However, most of them rarely tell anyone else what happened. They hide it all and hope that no one ever asks about anything. Yet, transparency is key in today’s world. Services that bounce up and down for no reason are seen as unstable. Explaining why they were unstable is the only way you can give people faith that they’re going to stay available. Once you’ve figured out what went wrong and who did it, you need to tell someone what happened. Because the alternative is going to be second guessing and theories that don’t help anyone.
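If it helps to make the "tell people exactly what’s affected" idea concrete, here is a minimal sketch of what a single status page entry could carry. Everything in it is an assumption made up for illustration: the StatusUpdate class, its field names, the wording of the message, and the placeholder timestamp. It is not how Cisco or anyone else actually builds an incident page.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class StatusUpdate:
    """One entry on a hypothetical public incident page."""

    summary: str                 # plain-language description of the problem
    affected: list[str]          # features users should expect to fail
    working: list[str]           # features confirmed to still work
    next_update_minutes: int     # when people can expect to hear from you again
    posted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def render(self) -> str:
        """Build the text that would be posted for users to read."""
        lines = [
            f"[{self.posted_at:%Y-%m-%d %H:%M} UTC] {self.summary}",
            "Not working: " + (", ".join(self.affected) or "nothing known"),
            "Still working: " + (", ".join(self.working) or "nothing confirmed"),
            f"Next update in {self.next_update_minutes} minutes.",
        ]
        return "\n".join(lines)


# Placeholder values only; the timestamp is fixed so the output is repeatable.
update = StatusUpdate(
    summary="We are investigating a messaging outage.",
    affected=["Chat", "Messages"],
    working=["Login"],
    next_update_minutes=30,
    posted_at=datetime(2020, 1, 1, 20, 0, tzinfo=timezone.utc),
)
print(update.render())
```

The point of the shape is that every update has to say what is broken, what still works, and when the next update is coming, so nobody is left guessing.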
I don’t envy the people at Cisco that spent their entire week working to get Webex Teams back up and running. I do appreciate their work. But I want to figure out where they went wrong. I want to learn. I want to say to myself, “Never do that thing that they did.” Or maybe it’s a strange situation that can be avoided down the road. The key is communication. We have to know what happened and how to avoid it. That’s the real learning experience when failure comes around. Not the fix, but the future of never letting it happen again.