What Happens When The Internet Breaks?

It seems crazy that a network built to be completely decentralized and resilient can be knocked offline in a matter of minutes. But that’s essentially what happened twice in the past couple of weeks. CloudFlare is a service provider that offers to sit in front of your website and provide all kinds of important services. They can keep smaller sites from being knocked offline by an influx of traffic. They can provide security and DNS services for you. They’re quickly becoming an indispensable part of the way the Internet functions. So what happens when we all start to rely on one service too much?

Bad BGP Behavior

The first outage, on June 24, 2019, wasn’t the fault of CloudFlare. A small service provider in Pennsylvania decided to use a BGP Optimizer from Noction to do some route optimization inside their autonomous system (AS). That in and of itself shouldn’t have caused a problem. At least, not until someone leaked those routes to the wider Internet.

It was a comedy of errors. The provider in question announced their more specific routes to a customer, which in turn announced them upstream to Verizon. After that, all bets were off. Because those routes were more specific than the aggregates, they became the preferred paths. And when the whole world beats a path to your door to get to the rest of the world, you see issues.
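
Why do more-specifics cause so much trouble? BGP treats a more-specific prefix as a separate route, and the forwarding table always uses longest-prefix match, so a leaked more-specific wins no matter how good the legitimate aggregate’s path is. Here’s a minimal longest-prefix-match sketch in Python; the prefixes and next-hop labels are invented for illustration, not taken from the actual incident.

```python
import ipaddress

# A toy routing table: a legitimate aggregate and a leaked more-specific.
# The "next hops" are just labels for illustration.
rib = {
    ipaddress.ip_network("104.16.0.0/13"): "legitimate aggregate via normal transit",
    ipaddress.ip_network("104.16.32.0/21"): "leaked more-specific via the small ISP",
}

def lookup(dst: str):
    """Longest-prefix match: the most specific covering route always wins."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in rib if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return best, rib[best]

best, via = lookup("104.16.35.1")
print(f"{best} -> {via}")   # the leaked /21 is selected, pulling traffic the wrong way
```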

Those issues caused CloudFlare to go offline. And when CloudFlare goes offline, everyone starts having issues. The sites they front go dark, even if the path to the origin servers is still valid. That’s because CloudFlare acts as the front end for your site when you use their service. Normally that’s great: when someone knocks your system offline or hits you with a ton of traffic, you’re safe, because CloudFlare can absorb a lot more bandwidth than you can, especially if you’re self-hosted. But if CloudFlare is out, you’re out of luck.
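
You can see this fronting for yourself: a site behind CloudFlare resolves to CloudFlare’s edge addresses, not to the origin server. Here’s a rough sketch that checks a domain’s A records against the ranges CloudFlare publishes at https://www.cloudflare.com/ips-v4; the domain below is a placeholder, so substitute a site you know is behind the service.

```python
import ipaddress
import socket
import urllib.request

DOMAIN = "example-behind-cloudflare.com"   # placeholder: a site you know uses CloudFlare

# Fetch CloudFlare's published IPv4 edge ranges.
with urllib.request.urlopen("https://www.cloudflare.com/ips-v4") as resp:
    cf_ranges = [ipaddress.ip_network(line.strip())
                 for line in resp.read().decode().splitlines() if line.strip()]

# Resolve the domain and see whether the answers land inside those ranges.
addrs = {info[4][0] for info in socket.getaddrinfo(DOMAIN, 443, socket.AF_INET)}
for addr in sorted(addrs):
    fronted = any(ipaddress.ip_address(addr) in net for net in cf_ranges)
    print(addr, "-> CloudFlare edge" if fronted else "-> not a CloudFlare range")
```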

There were some pretty important lessons to be learned in all this, and CloudFlare did an okay job of explaining them. But the tone of their article was a bit standoffish and seemed to imply that the people whose responsibility it is to keep the Internet running should do a better job of keeping their house in order. For those of you playing along at home, you’ll realize that the irony overlords were immediately summoned to mete out justice to CloudFlare.

Irregular Expression

On July 2nd, CloudFlare went down again. This time, instead of routing issues or delays, users of the service were greeted with 502 Bad Gateway errors. Again, when CloudFlare is down, your site is down even if your servers aren’t. And then the speculation started. Was this another BGP hijack? Was CloudFlare being attacked? No one knew, and most of the places you could go to check were themselves offline, including one of the biggest outage-detection sites, which is a user of CloudFlare services.

CloudFlare eventually posted a blog owning up to the fact that it wasn’t an attack or a BGP issue, but instead was the result of a bad web application firewall (WAF) rule being deployed globally in one go. A single regular expression (regex) was responsible for spiking the CPU utilization of the entire CloudFlare network. And when all your CPUs are cranking along at 100% utilization across the board, you are effectively offline.
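
How does one regex take down a fleet of servers? Catastrophic backtracking. CloudFlare’s actual rule was much longer, but the expensive part reportedly boiled down to a construct of the shape .*.*=.*, which gets dramatically slower as the input grows when there is no = to find. Here’s a minimal sketch of the effect using Python’s backtracking regex engine; the payload sizes are arbitrary.

```python
import re
import time

# Similar in shape to the problematic construct, not CloudFlare's actual rule.
pattern = re.compile(r".*.*=.*")

for size in (10_000, 20_000, 40_000):
    payload = "x" * size            # worst case: no "=" anywhere, so the engine backtracks
    start = time.perf_counter()
    pattern.match(payload)          # anchored match; re.search retries at every offset and is even worse
    elapsed = time.perf_counter() - start
    print(f"{size:>6} chars: {elapsed:.2f}s")
```

On a backtracking engine like this, doubling the input roughly quadruples the match time, so a rule that looks fine against short test strings can pin every CPU once real request bodies hit it.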

In the post-mortem, CloudFlare had to eat a little crow and admit that their testing procedures were inadequate to catch this particular issue. To see the stance they took with Verizon and Noction just a week or so before, and then to see them admit that this one was all on them, was a bit humbling for sure. But, more importantly, it shows that you have to be vigilant in every part of your organization to ensure that a change you deploy isn’t going to wreak havoc somewhere else. Especially if you’re responsible for a large percentage of the traffic on the web.


Tom’s Take

I think CloudFlare is doing good work with their services. But I also think that too many people are relying on them to provide services that should be planned out and documented. It’s important to realize that no one service is going to provide all the things you need to stay resilient. You need to know how you’re keeping your site online and what your backup plan is when things go down.

And, if you’re running one of those services, you’d better be careful about running your mouth on the Internet.

Reclaiming 1.1.1.1 For The Internet

Hopefully by now you’ve seen the announcement that CloudFlare has opened a new DNS service at the address of 1.1.1.1. We covered a bit of it on this week’s episode of the Gestalt IT Rundown. Next to Gmail, it’s probably the best April Fool’s announcement I’ve seen. However, it would seem that the Internet isn’t quite ready for a DNS resolver service that’s easy to remember. And that’s thanks in part to the accumulation of bad address hygiene.

Not So Random Numbers

The address range of 1/8 is owned by APNIC. They’ve had it for many years now but have never announced it publicly. Nor have they ever made any assignments of addresses in that space to clients or customers. In a world where IPv4 space is at a premium, why would an RIR choose to lose 16 million addresses?

Edit: As pointed out by Dale Carder of ES.net in a comment below, APNIC has been assigning address space out of 1/8 since 2010. However, the most commonly leaked prefixes in that range, which are difficult to assign because of bogus announcements, come from 1.0.0.0/14.

As it turns out, 1/8 is a pretty bad address space for two reasons: 1.1.1.1 and 1.2.3.4. These two addresses are responsible for most of the inadvertent announcements in the entire 1/8 space. 1.2.3.4 is easy to figure out. It’s the most common example IP address given when talking about something. Don’t believe me? Google is your friend. Instead of using 192.0.2.0/24 like we should, we use the most common PIN, password, and luggage combination in the world. But at least 1.2.3.4 makes sense.
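
If you’re ever unsure which addresses are safe to use as filler, the standard library can tell you. This quick sketch shows that 192.0.2.0/24 (along with the other RFC 5737 documentation ranges) is reserved and never globally routed, while 1.1.1.1 and 1.2.3.4 are ordinary routable addresses:

```python
import ipaddress

# RFC 5737 reserves these three blocks strictly for documentation and examples.
DOC_RANGES = [ipaddress.ip_network(n) for n in
              ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")]

for addr in ("1.1.1.1", "1.2.3.4", "192.0.2.1"):
    ip = ipaddress.ip_address(addr)
    is_doc = any(ip in net for net in DOC_RANGES)
    verdict = "documentation-only, safe for examples" if is_doc else "globally routable, don't use as filler"
    print(f"{addr}: {verdict}")
```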

Why is 1.1.1.1 so popular? Well, the first reason is Airespace wireless controllers. Airespace uses 1.1.1.1 as the default virtual interface address for just about everything. Here’s a good explanation from Andrew von Nagy. When Airespace was sold to Cisco, this became a very popular address in Cisco wireless networks. Except now that it’s in use as a public DNS resolver, there are issues with keeping it configured that way. The wireless experts I’ve talked to recommend changing that address to 192.0.2.1, since that address is reserved for documentation and will never be globally routable.

The other place where 1.1.1.1 seems to be used quite frequently is Cisco ASA failover interfaces. Cisco documentation recommended using 1.1.1.1 for the primary ASA failover interface and 1.1.1.2 for the secondary. The heartbeats between those two interfaces are active as long as the units are paired together. But while those addresses are configured and reachable locally, any traffic destined for the real, globally routable 1.1.1.1 and 1.1.1.2 gets black-holed. ASAs should probably be using 192.0.2.1 and 192.0.2.2 instead, but beware that reconfiguring them will likely require downtime.

The 1.1.1.1 address confusion doesn’t stop there. Some systems, like Nomadix gateways, use it as the default logout address. Vodafone used to use it as an image caching server. ISPs are blocking it upstream in some ACLs. Some security organizations even recommend dropping traffic to 1/8 as a bogon prevention measure. There’s every chance that 1.1.1.1 is going to be dropped by something in your network or along the transit path.
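
If you keep text backups of your device configurations (WLC exports, ASA configs, captive portal settings, ACLs), a simple sweep for the problem addresses is a good first audit. This is only a sketch, and the directory path is a placeholder for wherever your config backups live:

```python
import pathlib
import re

# Addresses that tend to be hard-coded in configs and now collide with real services.
SUSPECT = re.compile(r"\b1\.1\.1\.[12]\b|\b1\.2\.3\.4\b")
CONFIG_DIR = pathlib.Path("./device-configs")   # placeholder: your config backup directory

for path in CONFIG_DIR.rglob("*"):
    if not path.is_file():
        continue
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
        if SUSPECT.search(line):
            print(f"{path}:{lineno}: {line.strip()}")
```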

Planning Not To Fail

So, how are you going to proceed if you really, really want to use CloudFlare DNS? Well, the first step is to make sure that you don’t have 1.1.1.1 configured anywhere in your network. That means checking WLAN controllers, firewalls, and example configurations. Odds are good you’re running RFC 1918 space internally, but you should try to ping 1.1.1.1 anyway. If you can ping it, then you should traceroute to the address. If the traceroute leaves your local network, you probably have a good path.
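
Here’s what that check might look like scripted, assuming a Linux-ish host where ping and traceroute are installed (the flags differ on Windows and macOS):

```python
import subprocess

def reachable(addr: str) -> bool:
    # -c 3: three echo requests, -W 2: wait up to two seconds per reply (Linux ping flags)
    result = subprocess.run(["ping", "-c", "3", "-W", "2", addr],
                            capture_output=True, text=True)
    return result.returncode == 0

if reachable("1.1.1.1"):
    print("1.1.1.1 answers; now check whether the path actually leaves your network:")
    trace = subprocess.run(["traceroute", "-n", "-m", "10", "1.1.1.1"],
                           capture_output=True, text=True)
    print(trace.stdout)
else:
    print("No reply: something local (a WLC, an ASA, an ACL) is probably claiming 1.1.1.1")
```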

Once you’ve determined that you’re capable of reaching 1.1.1.1, you need to test it. Configure it on a test machine or VM and make sure it’s actually resolving addresses for you. Better safe than sorry. Once you know it’s working the way you want, configure it on your internal DNS servers as a forwarder. You still want internal control of DNS thanks to things like Active Directory. But configuring it as a forwarder means you can take advantage of all the features CloudFlare is building into the system while still retaining anything you’ve done locally.
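
A quick resolution test from that machine might look like this, assuming the dnspython package (pip install dnspython) is available; the hostnames queried are just examples:

```python
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)   # ignore the local resolver settings
resolver.nameservers = ["1.1.1.1"]
resolver.lifetime = 3.0                             # give up after three seconds

for name in ("www.google.com", "example.com"):
    try:
        answer = resolver.resolve(name, "A")
        print(name, "->", ", ".join(r.address for r in answer))
    except Exception as exc:    # timeouts, SERVFAIL, local interception, etc.
        print(name, "-> lookup failed:", exc)
```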


Tom’s Take

I’m glad CloudFlare and APNIC are reclaiming 1.1.1.1 for some useful purpose. CloudFlare can take the traffic load of all the horribly misconfigured systems in the world. APNIC can use this setup to do some analytics work to find out exactly how screwed up things are. I wouldn’t be shocked to see something similar happen to 1.2.3.4 in the future if this bet pays off.

I’ve been using 1.1.1.1 since April 2nd and it works. It’s fast and hasn’t broken yet, which is the best that you can hope for from a DNS server. I’m sure I’ll play around with some of the advanced features as they come online but for now I’m just happy that one of the most recognizable IP addresses in the world is working for me.

Memcached DDoS – There’s Still Time to Save Your Mind

In case you haven’t heard, there’s a new vector for Distributed Denial of Service (DDoS) attacks out there right now and it’s pretty massive. The first mention I saw this week was from Cloudflare, where they detailed that they were seeing a huge influx of traffic from UDP port 11211. That’s the port used by memcached, a database caching system.

Surprisingly, or not, there were thousands of companies that had left UDP/11211 open to the entire Internet. And, by design, memcached responds to anyone that queries that port. Carefully crafted packets can also be amplified into massive responses; in Cloudflare’s testing, they were able to send a 15 byte packet and get a 134KB response. Because the protocol runs over UDP, those huge responses can be aimed at a forged source address, which is exactly what made life miserable for Cloudflare and, now, GitHub, which got blasted with the largest DDoS attack on record.
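
If you want to see whether one of your own memcached hosts answers over UDP, and how lopsided the request-to-response ratio is, a probe like this works. The host below is a placeholder for a server you operate; don’t point this at systems that aren’t yours.

```python
import socket
import struct

HOST = "198.51.100.10"   # placeholder: one of your own memcached servers
PORT = 11211

# memcached's UDP frame header: request ID, sequence number, total datagrams,
# reserved (two bytes each, network byte order), followed by the ASCII command.
header = struct.pack("!HHHH", 0, 0, 1, 0)
request = header + b"stats\r\n"          # 8 + 7 = 15 bytes on the wire

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2)
sock.sendto(request, (HOST, PORT))

received = 0
try:
    while True:
        data, _ = sock.recvfrom(65535)
        received += len(data)
except socket.timeout:
    pass

print(f"sent {len(request)} bytes, got {received} bytes back")
if received:
    print("UDP 11211 is answering; restrict it at the edge or start memcached with -U 0")
```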

How can you fix this problem in your network? There are many steps you can take, whether you are a system admin or a network admin:

  • Go to Shodan and see if you’re affected. Just plug in your company’s IP address ranges and have it search for UDP 11211 (there’s a scripted version of this check sketched after this list). If you pop up, you need to find out why memcached is exposed to the internet.
  • If memcached isn’t supposed to be publicly available, you need to block it at the edge. Don’t let anyone connect to UDP port 11211 on any device inside your network from outside of it. That sounds like a no-brainer, but you’d be surprised how many firewall rules aren’t carefully crafted in that way.
  • If you have to have memcached exposed, make sure you talk to that team and find out what their bandwidth requirements are for the application. If it’s something small-ish, create a policer or QoS policy that rate limits the memcached traffic so there’s no way it can exceed that amount. And if that amount is more than 100Mbit of traffic, you need to have an entirely different discussion with your developers.
  • From Cloudflare’s blog, you can disable UDP on memcached at startup by adding the -U 0 flag. Make sure you check with the team that uses it before you disable it, though, so you don’t break something.
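
For the Shodan check mentioned in the first bullet, the official shodan Python package can script it, assuming you have an API key; the network range below is a placeholder for your own public address space.

```python
import shodan

API_KEY = "YOUR_SHODAN_API_KEY"    # placeholder
MY_RANGE = "203.0.113.0/24"        # placeholder: your public address space

api = shodan.Shodan(API_KEY)
results = api.search(f"port:11211 net:{MY_RANGE}")

print(f"{results['total']} exposed memcached instance(s) found in {MY_RANGE}")
for match in results["matches"]:
    print(match["ip_str"], match.get("org", ""))
```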

Tom’s Take

Exposing unnecessary services to the Internet is asking for trouble. Given an infinite amount of time, a thousand monkeys on typewriters will create a Shakespearean play that details how to exploit that service for a massive DDoS attack. Protocols are built to be helpful, and that helpfulness doesn’t make our jobs any easier. They respond to what they hear and deliver what they’re asked. We have to prevent bad actors from getting away with things at the network and system level, because application developers rarely ask “may I” before turning on every feature to make users happy.

Make sure you check your memcached settings today and immunize yourself against this problem. If GitHub got blasted with 1.3Tbps of traffic this week, there’s no telling who’s going to get hit next.