I wanted to wait to put out a hot take on the Facebook issues from earlier this week because failures of this magnitude always have details that come out well after the actual excitement is done. A company like Facebook isn’t going to do the kind of in-depth post-mortem that we might like to see but the amount of information coming out from other areas does point to some interesting circumstances causing this situation.
Let me start off the whole thing by reiterating something important: Your network looks absolutely nothing like Facebook. The scale of what goes on there is unimaginable to the normal person. The average person has no conception of what one billion looks like. Likewise, the scale of the networking that goes on at Facebook is beyond the ken of most networking professionals. I’m not saying this to make your network feel inferior. More that I’m trying to help you understand that your network operations resemble those at Facebook in the same way that a model airplane resembles a space shuttle. They’re alike on the surface only.
Facebook has unique challenges that they have to face in their own way. Network automation there isn’t a bonus. It’s a necessity. The way they deploy changes and analyze results doesn’t look anything like any software we’ve ever used. I remember moderating a panel that had a Facebook networking person talking about some of the challenges they faced all the way back in 2013:
That technology that Najam Ahmad is talking about is two or three generations removed for what is being used today. They don’t manage switches. They manage racks and rows. They don’t buy off-the-shelf software to do things. They write their own tools to scale the way they need them to scale. It’s not unlike a blacksmith making a tool for a very specific use case that would never be useful to any non-blacksmith.
Ludicrous Case Scenarios
One of the things that compounded the problems at Facebook was the inability to see what the worst case scenario could bring. The little clever things that Facebook has done to make their lives easier and improve reaction times ended up harming them in the end. I’ve talked before about how Facebook writes things from a standpoint of unlimited resources. They build their data centers as if the network will always be available and bandwidth is an unlimited resource that never has contention. The average Facebook programmer likely never lived in a world where a dial-up modem was high-speed Internet connectivity.
To that end, the way they build the rest of their architecture around those presumptions creates the possibility of absurd failure conditions. Take the report of the door entry system. According to reports part of the reason why things were slow to come back up was because the door entry system for the Facebook data centers wouldn’t allow access to the people that knew how to revert the changes that caused the issue. Usually, the card readers will retain their last good configuration in the event of a power outage to ensure that people with badges can access the system. It could be that the ones at Facebook work differently or just went down with the rest of their network. But whatever the case the card readers weren’t allowing people into the data center. Another report says that the doors didn’t even have the ability to be opened by a key. That’s the kind of planning you do when you’ve never had to break open a locked door.
Likewise, I find the situation with the DNS servers to be equally crazy. Per other reports the DNS servers at Facebook are constantly monitoring connectivity to the internal network. If that goes down for some reason the DNS servers withdraw the BGP routes being advertised for the Facebook AS until the issue is resolved. That’s what caused the outage from the outside world. Why would you do this? Sure, it’s clever to basically have your infrastructure withdraw the routing info in case you’re offline to ensure that users aren’t hammering your system with massive amounts of retries. But why put that decision in the hands of your DNS servers? Why not have some other more reliable system do it instead?
I get that the mantra at Facebook has always been “fail fast” and that their architecture is built in such a way as to encourage individual systems to go down independently of others. That’s why Messenger can be down but the feed stays up or why WhatsApp can have issues but you can still use Instagram. However, why was their no test of “what happens when it all goes down?” It could be that the idea of the entire network going offline is unthinkable to the average engineer. It could also be that the response to the whole network going down all at once was to just shut everything down anyway. But what about the plan for getting back online? Or, worse yet, what about all the things that impacted the ability to get back online?
Fruits of the Poisoned Tree
That’s where the other part of my rant comes into play. It’s not enough that Facebook didn’t think ahead to plan on a failure of this magnitude. It’s also that their teams didn’t think of what would be impacted when it happened. The door entry system. The remote tools used to maintain the networking equipment. The ability for anyone inside the building to do anything. There was no plan for what could happen when every system went down all at once. Whether that was because no one knew how interdependent those services were or because no one could think of a time when everything would go down all at once is immaterial. You need to plan for the worst and figure out what dependencies look like.
Amazon learned this the hard way a few years ago when US-East-1 went offline. No one believed it at the time because the status dashboard still showed green lights. The problem? The board was hosted on the zone that went down and the lights couldn’t change! That problem was remedied soon afterwards but it was a chuckle-worthy issue for sure.
Perhaps it’s because I work in an area where disasters are a bit more common but I’ve always tried to think ahead to where the issues could crop up and how to combat them. What if you lose power completely? What if your network connection is offline for an extended period? What if the same tornado that takes our your main data center also wipes out your backup tapes? It might seem a bit crazy to consider these things but the alternative is not having an answer in the off chance it happens.
In the case of Facebook, the question should have been “what happens if a rogue configuration deployment takes us down?” The answer better not be “roll it back” because you’re not thinking far enough ahead. With the scale of their systems it isn’t hard to create a change to knock a bunch of it offline quickly. Most of the controls that are put in place are designed to prevent that from happening but you need to have a plan for what to do if it does. No one expects a disaster. But you still need to know what to do if one happens.
Thus Endeth The Lesson
What we need to take away from this is that our best intentions can’t defeat the unexpected. Most major providers were silent on the schadenfreude of the situation because they know they could have been the one to suffer from it. You may not have a network like Facebook but you can absolutely take away some lessons from this situation.
You need to have a plan. You need to have a printed copy of that plan. It needs to be stored in a place where people can find it. It needs to be written in a way that people that find it can implement it step-by-step. You need to keep it updated to reflect changes. You need to practice for disaster and quit assuming that everything will keep working correctly 100% of the time. And you need to have a backup plan for everything in your environment. What if the doors seal shut? What if the person with the keys to unlock the racks is missing? How do we ensure the systems don’t come back up in a degraded state before they’re ready. The list is endless but that’s only because you haven’t started writing it yet.
There is going to be a ton of digital ink spilled on this outage. People are going to ask questions that don’t have answers and pontificate about how it could have been avoided. Hell, I’m doing it right now. However, I think the issues that compounded the problems are ones that can be addressed no matter what technology you’re using. Backup plans are important for everything you do, from campouts to dishwasher installations to social media websites. You need to plan for the worst and make sure that the people you work with know where to find the answers when everything fails. This is the best kind of learning experience because so many eyes are on it. Take what you can from this and apply it where needed in your enterprise. Your network may not look anything like Facebook, but with some planning now you don’t have to worry about it crashing like theirs did either.