It’s a crazy idea to think that a network built to be completely decentralized and resilient can be so easily knocked offline in a matter of minutes. But that basically happened twice in the past couple of weeks. CloudFlare is a service provider that offers to sit in front of your website and provide all kinds of important services. They can prevent smaller sites from being knocked offline by an influx of traffic. They can provide security and DNS services for you. They’re quickly becoming an indispensable part of the way the Internet functions. And what happens when we all start to rely on one service too much?
Bad BGP Behavior
The first outage, on June 24, 2019, wasn’t the fault of CloudFlare. A small service provider in Pennsylvania decided to use a BGP Optimizer from Noction to do some route optimization inside their autonomous system (AS). That in and of itself shouldn’t have caused a problem. At least, not until someone leaked those routes to the greater Internet.
It was a comedy of errors. The provider in question announced their more specific routes to a customer, who in turn announced them upstream to Verizon. After that, all bets were off. Because those routes were more specific than the aggregates, they became the preferred routes. And when the whole world beats a path to your door to get to the rest of the world, you see issues.
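If you want to see why a more specific route wins, here is a minimal sketch of longest-prefix matching in Python. The prefixes come from a documentation address range and the next-hop labels are made up for illustration; none of them are the routes involved in the actual leak.

```python
import ipaddress


def best_route(routes, destination):
    """Return the (network, next_hop) pair with the longest matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # The most specific prefix (largest prefix length) is preferred.
    return max(matches, key=lambda m: m[0].prefixlen)


# Hypothetical routing table: one legitimate aggregate.
routes_before = {"198.51.100.0/24": "legitimate-origin"}

# The same table after two more specific halves of that aggregate leak in.
routes_after = {
    **routes_before,
    "198.51.100.0/25": "leaked-path",
    "198.51.100.128/25": "leaked-path",
}

print(best_route(routes_before, "198.51.100.10"))
print(best_route(routes_after, "198.51.100.10"))
```

The aggregate never went away. It just stopped being the best answer for any address inside it, so traffic followed the leaked path instead.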
Those issues caused CloudFlare to go offline. And when CloudFlare goes offline everyone starts having issues. The sites they are the front end for go offline, even if the path to those sites is still valid. That’s because CloudFlare is acting as the front end for your site when you use their service. It’s great because it means that when someone knocks your system offline or hits you with a ton of traffic you’re safe because CloudFlare can support a lot more bandwidth than you can, especially if you’re self-hosted. But if CloudFlare is out, you’re out of luck.
There was a pretty important lesson to be learned in all this and CloudFlare did an okay job of explaining some of those lessons. But the tone of their article was a bit standoffish and seemed to imply that the people whose responsibility it was to keep the Internet running should do a better job of keeping their house in order. For those of you playing along at home, you’ll realize that the irony overlords were immediately summoned to mete out justice to CloudFlare.
On July 2nd, CloudFlare went down again. This time, instead of seeing issues with routing packets or delays, users of the service were greeted with 502 Bad Gateway errors. Again, when CloudFlare is down your site is down even if you’re not offline. And then the speculation started. Was this another BGP hijack? Was CloudFlare being attacked? No one knew and most of the places you could go look were offline, including one of the biggest offline site detectors, which was a user of CloudFlare services.
CloudFlare eventually posted a blog post owning up to the fact that it wasn’t an attack or a BGP issue, but instead was the result of a bad web application firewall (WAF) rule being deployed globally in one go. A single regular expression (regex) was responsible for spiking the CPU utilization of the entirety of the CloudFlare network. And when all your CPUs are cranking along at 100% utilization across the board, you are effectively offline.
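To get a feel for how one regex can eat a CPU, here is a rough sketch of a tiny backtracking matcher that counts how many steps it takes. The pattern is a toy one with two adjacent wildcards, not the actual WAF rule, and the matcher is my own simplified stand-in for a real regex engine. It only understands literals, `.`, and `*`.

```python
def count_steps(pattern, text):
    """Full-match text against pattern with naive backtracking.

    Supports literal characters, '.', and '*' only.
    Returns (matched, steps), where steps counts matcher invocations.
    """
    steps = 0

    def same(p, t):
        return pattern[p] == "." or pattern[p] == text[t]

    def match_here(p, t):
        nonlocal steps
        steps += 1
        if p == len(pattern):
            return t == len(text)
        if p + 1 < len(pattern) and pattern[p + 1] == "*":
            # Greedy star: consume as much as possible, then back off
            # one character at a time until the rest of the pattern fits.
            i = t
            while i < len(text) and same(p, i):
                i += 1
            while i >= t:
                if match_here(p + 2, i):
                    return True
                i -= 1
            return False
        if t < len(text) and same(p, t):
            return match_here(p + 1, t + 1)
        return False

    return match_here(0, 0), steps


toy_pattern = ".*.*=.*"  # hypothetical pattern, chosen for illustration

# Input with no '=' forces the matcher to try every split between the
# two wildcards before giving up.
for n in (100, 200, 400):
    matched, steps = count_steps(toy_pattern, "x" * n)
    print(n, matched, steps)
```

Doubling the input length roughly quadruples the work for this pattern. Patterns with nested repetition can do far worse than that, and a WAF runs its rules against every request.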
In the post-mortem, CloudFlare had to eat a little crow and admit that their testing procedures for catching this particular issue were inadequate. To see the stance they took with Verizon and Noction just a week or so before and then to see how they had to admit that this one was all on them was a bit humbling for sure. But, more importantly, it shows that you have to be vigilant in every part of your organization to ensure that something you deploy isn’t going to cause havoc on the other side. Especially if you’re responsible for a large percentage of the traffic on the web.
I think CloudFlare is doing good work with their services. But I also think that too many people are relying on them to provide services that should be planned out and documented. It’s important to realize that no one service is going to provide all the things you need to stay resilient. You need to know how you’re keeping your site online and what your backup plan is when things go down.
And, if you’re running one of those services, you’d better be careful about running your mouth on the Internet.
Tom, I normally enjoy your blog posts. But this one was especially insightful. Thanks for sharing your views and commentary on this latest issue that affects us all. Well done.