Remember that Slack outage earlier this month? The one that hit right when we all got back from vacation and tried to jump on to share cat memes and emojis? We all chalked it up to gremlins and went back to working through our pile of email until it came back up. The post-mortem came out yesterday, and two things in it stood out to me. Both have implications for reliability planning and how we handle the worst-case scenarios we come up with.
It’s Out of Our Hands
The first thing that came up in the report was that the specific cause of the outage was an AWS Transit Gateway that couldn't scale fast enough to handle the demand spike when we all went back to work on the morning of January 4th. What, the cloud can't scale?
The cloud is practically limitless when it comes to resources. We can create instances with massive CPU allocations, huge storage pools, or even fat networking pipelines. What we can't do is create them instantly. No matter how much we need, it takes time to do the basic provisioning to get everything up and running. It's the old story about eating an elephant: you do it one bite at a time. Normally we tell that story to talk about breaking a task down into smaller parts. In this case, it's a reminder that even the biggest thing out there has to be dealt with in small pieces as well.
Slack learned this lesson the hard way. Why? Because they couldn't foresee a time when their service was so popular that the traffic rushing to their servers would crush the transit gateways. Other companies have learned it the hard way too. Disney+ crawled on launch day because of demand. The release of a new game that requires downloading and patching on Day One also puts stress on servers. Even the lines outside department stores on Black Friday (in less pandemic-driven years) are examples of what happens when capacity planning doesn't meet demand.
When you plan for your worst-case scenario, you have to think the unthinkable. It's not enough to ask yourself what might happen if everyone logs on at the same time; you also need to ask what happens if they try and something goes wrong. I spent a lot of time in my former job thinking about simple little exercises like VDI boot storms, where office workers can push a storage system to the breaking point just by turning all their machines on at the same time. It's the equivalent of being on a shared network resource like a cable modem during the Super Bowl. There aren't any resources left for you to use.
When we plan for capacity, we have to realize that even our most optimistic projections of usage are going to look conservative if we take off or go viral. Rather than guessing what that might be, take an hour every six months and readjust your projections. See how fast you're growing. Plan for that crazy scenario where everyone decides to log on at the same time on a day when no one has had their coffee yet. And be ready for what happens when someone throws a wrench into the middle of the process.
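To make that concrete, here's a back-of-the-napkin sketch of the kind of projection I mean. Every number in it is invented for illustration, not anything from Slack's report; the idea is simply to compound your observed growth forward and then multiply by a surge factor for the morning when everyone shows up at once.

```python
# A back-of-the-napkin projection with invented numbers (nothing from Slack's
# report): compound the observed peak forward, then assume a surge where
# everyone logs on at once after a holiday.

def projected_worst_case(current_peak, growth_per_half_year, half_years, surge_factor):
    organic_growth = current_peak * (1 + growth_per_half_year) ** half_years
    return organic_growth * surge_factor

current_peak = 500_000      # hypothetical concurrent users at today's busiest hour
growth = 0.30               # assumed 30% growth every six months
half_years = 2              # project one year out
surge = 3                   # "first Monday back from vacation" multiplier

need = projected_worst_case(current_peak, growth, half_years, surge)
headroom = 1_500_000        # hypothetical capacity you can spin up quickly

print(f"Worst-case peak in a year: {need:,.0f} connections")
if need > headroom:
    print("Pre-provision now -- scaling during the spike is what bit Slack.")
```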
What Happens When It Goes Wrong?
The second thing in the Slack post-mortem that was just as worrisome was the behavior of the application when it hit a connection timeout. The app started waiting for the pathway to open up again. And guess what happened when AWS was able to scale the transit gateway? Slack clients started hammering the servers with connection requests. The result was something akin to a Distributed Denial-of-Service (DDoS) attack.
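The post-mortem doesn't publish the client code, but the failure mode it describes looks roughly like the naive retry loop below. This is a sketch for illustration only; the host and port are placeholders.

```python
# A sketch of the failure mode, not Slack's actual client code: retry the
# connection in a tight loop with no delay. One client doing this is harmless.
# A few million clients doing it the instant the path reopens looks like a DDoS.
import socket

def naive_reconnect(host: str, port: int) -> socket.socket:
    while True:
        try:
            return socket.create_connection((host, port), timeout=5)
        except OSError:
            pass  # no backoff, no jitter -- retry immediately
```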
Why would the Slack client do this? I think it has something to do with the way the developers coded it: they didn't anticipate every client trying to reconnect all at once. That's not entirely unsound thinking, to be honest. How could we live in a world where every Slack user would be disconnected all at once? Yet we do, and it did, and look what happened.
Ethernet figured this part out a long time ago. The CSMA/CD method for detecting collisions on a layer 2 connection has an ingenious solution for what happens when a collision is detected. Once a station realizes there was a problem on the wire, it stops transmitting and calculates a random backoff timer based on the number of collisions it has detected. Once that timer has expired, it attempts to transmit again. Because a collision requires at least two stations, every station involved does this. The random element of the timer calculation ensures that the likelihood of both stations choosing to transmit again at the same moment is very, very low.
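For reference, the classic truncated binary exponential backoff works something like the sketch below. The slot time shown is the 51.2 microsecond value from 10 Mbps Ethernet; the code is illustrative, not pulled from any standard's reference implementation.

```python
# Truncated binary exponential backoff, roughly as CSMA/CD specifies it:
# after the Nth collision, wait a random number of slot times between
# 0 and 2^min(N, 10) - 1 before transmitting again (and give up after 16 tries).
import random

SLOT_TIME_US = 51.2  # one slot time on 10 Mbps Ethernet, in microseconds

def backoff_delay_us(collision_count: int) -> float:
    k = min(collision_count, 10)            # the exponent is capped at 10
    slots = random.randint(0, 2**k - 1)     # each station rolls its own dice
    return slots * SLOT_TIME_US

# Two colliding stations each compute this independently, so the odds of
# them picking the same delay shrink as collisions pile up:
print(backoff_delay_us(1), backoff_delay_us(3))
```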
If Ethernet behaved the way the Slack client did, we would never resolve collisions. If every station on a layer 2 network immediately tried to retransmit without a backoff timer, the bus would be jammed constantly. The architects of the protocol figured out that every station needs a cooling-off period to let the wire clear before trying again. And that period needs to be different for every station so the retries don't overlap.
Slack really needs to take this idea into account. Rather than pouncing on a connection as soon as it's available, there needs to be a backoff timer that keeps the servers from being swamped. Even a few hundred milliseconds of random delay per client could have kept this outage from getting as big as it did. Slack didn't plan beyond the worst-case scenario because they never conceived of their worst-case scenario coming to pass. How could it get worse than something we couldn't imagine happening?
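One common way to build that in is exponential backoff with jitter on the reconnect path. The sketch below is hypothetical, not Slack's implementation, and the timing constants are made up for illustration.

```python
# A sketch of a client-side reconnect with exponential backoff and full jitter.
# Not Slack's code -- just the general shape of a reconnect that won't stampede.
import random
import socket
import time

def reconnect_with_backoff(host: str, port: int,
                           base: float = 0.5, cap: float = 60.0) -> socket.socket:
    attempt = 0
    while True:
        try:
            return socket.create_connection((host, port), timeout=5)
        except OSError:
            # Sleep a random amount between 0 and the current ceiling so that
            # clients disconnected at the same moment don't all retry together.
            ceiling = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, ceiling))
            attempt += 1
```

Even a cheap version of this spreads a simultaneous reconnect across seconds instead of milliseconds, which is exactly what the Ethernet backoff timer accomplishes on the wire.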
Tom’s Take
If you design systems or call yourself a reliability engineer, you need to develop a hobby of coming up with disastrous scenarios. Think of the worst possible way for something to fail. Now imagine it getting worse. Assume that nothing will work properly when there is a recovery attempt. Plan for things to be so bad that you're in a room on fire trying to put everything out while you're also on fire. It sounds very dramatic, but that's how bad it can get. If you're ready for that, then nothing will surprise you. You also need to make sure you're going back and thinking of new things all the time. You never know which piece is going to fail or how it will impact what you're working on. But thinking it through sometimes gives you an idea of where to go when it all goes wrong.