BGP Hell Is Other People


If you configure a newsreader to alert you every time someone hijacks prefixes belonging to a BGP autonomous system (AS), it will probably go off at least once a week. The most recent incident was on the first of April, courtesy of Rostelecom. But they're not the only offender. They're just the latest. Incidents of people redirecting BGP traffic, either by accident or by design, are becoming more frequent. And as we rely more and more on things like cloud computing and online applications to do our daily work and live our lives, the impact of these hijacks is becoming more critical.

Professional-Grade Protocol

BGP isn't the oldest thing on the Internet. RFC 1105 is the original specification of Border Gateway Protocol. The version that we use today, BGP-4, is documented in RFC 4271. It's a protocol that has enjoyed a long history of revisions and a reviled history of making network engineers' lives difficult. But why is that? How can a routing protocol be so critical and yet so obtuse?

My friend Marko Milivojevic famously stated in his CCIE training career that, "BGP isn't a routing protocol. It's a policy engine." When you look at BGP's decisions in this light, they make a lot more sense. BGP isn't necessarily concerned with the minutiae of figuring out exactly how to get somewhere. Sure, it has a table of prefixes that it uses to make decisions about where to forward packets. Almost every protocol does this. But BGP is different because it's so customizable.

Have you ever tried to influence a route in RIP or OSPF? It's not exactly easy. RIP is almost impossible to manipulate outside of things like route poisoning or just turning off interfaces. Sometimes the simplest things are the most hardened. OSPF gives us a lot more knobs to play with, like interface bandwidth and interface cost. We can tweak and twerk those values to our heart's content to make packets flow a certain direction. But those values don't give us a lot of influence outside of a specific area. If you've ever had to memorize the minutiae of OSPF not-so-stubby areas, ASBRs, and the difference between Type 5 and Type 7 LSAs, you know that these topics were all but created for certification exams.
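To make the OSPF knob concrete: OSPF picks the path with the lowest total cost, and many implementations derive each interface's cost from its bandwidth. The sketch below assumes the Cisco-style convention (cost = reference bandwidth / interface bandwidth, truncated, minimum 1, default reference of 100 Mbps); the function names are hypothetical and other vendors use different defaults.

```python
# Hypothetical sketch of Cisco-style OSPF cost calculation.
# Assumption: cost = reference bandwidth / interface bandwidth, truncated
# to an integer and never lower than 1. Other vendors use other defaults.

DEFAULT_REFERENCE_BW_MBPS = 100  # Cisco IOS default reference bandwidth


def ospf_cost(interface_bw_mbps: float,
              reference_bw_mbps: float = DEFAULT_REFERENCE_BW_MBPS) -> int:
    """Return the OSPF cost of a single interface."""
    if interface_bw_mbps <= 0:
        raise ValueError("interface bandwidth must be positive")
    return max(1, int(reference_bw_mbps // interface_bw_mbps))


def path_cost(link_bandwidths_mbps: list[float],
              reference_bw_mbps: float = DEFAULT_REFERENCE_BW_MBPS) -> int:
    """OSPF prefers the path with the lowest sum of interface costs."""
    return sum(ospf_cost(bw, reference_bw_mbps) for bw in link_bandwidths_mbps)


print(ospf_cost(1.544))                            # T1 link: 64
print(ospf_cost(100), ospf_cost(1000))             # FastEthernet and GigE: 1 1
print(ospf_cost(1000, reference_bw_mbps=100_000))  # raised reference: 100
```

The second line shows why operators often raise the reference bandwidth: with the default, every link of 100 Mbps or faster looks identical to OSPF. Either way, the knob only shifts costs inside the link-state domain; it says nothing about policy toward other networks.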

But what about BGP? How can you influence routes in BGP? Oh, man! How much time do you have??? We can manipulate things all sorts of ways!

  • Weight the routes to prefer one over another
  • Set LOCAL_PREFERENCE to pick which route to use in a multiple exit system
  • Configure multi-exit discriminator (MED) values
  • AS Path Prepending to reduce the likelihood of a path being chosen
  • Manipulate the underlying routing protocol to make certain routes look more preferred
  • Just change the router ID to something crazy low to break all the other ties in the system
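Most of those knobs map onto steps of the BGP best-path decision, which is evaluated in order until one step breaks the tie. The sketch below is a simplified model loosely following the Cisco ordering: it assumes MED is always compared (real routers compare it only between paths from the same neighboring AS by default) and it omits steps such as "locally originated" and "oldest eBGP path". The `Path` class and its fields are hypothetical.

```python
from dataclasses import dataclass, field
from ipaddress import IPv4Address

# Lower rank is preferred: IGP < EGP < INCOMPLETE.
ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass
class Path:
    name: str
    weight: int = 0                 # Cisco-specific, never leaves the router
    local_pref: int = 100           # shared within our own AS
    as_path: list[int] = field(default_factory=list)
    origin: str = "igp"
    med: int = 0                    # hint from the neighboring AS
    ebgp: bool = True               # learned from an external peer?
    igp_metric: int = 0             # underlying IGP cost to the next hop
    router_id: str = "0.0.0.0"      # final tie-breaker


def preference_key(p: Path) -> tuple:
    """Smaller tuples win, so 'higher is better' attributes are negated."""
    return (
        -p.weight,                       # 1. highest weight
        -p.local_pref,                   # 2. highest LOCAL_PREF
        len(p.as_path),                  # 3. shortest AS_PATH
        ORIGIN_RANK[p.origin],           # 4. lowest origin code
        p.med,                           # 5. lowest MED (simplified: always compared)
        0 if p.ebgp else 1,              # 6. eBGP over iBGP
        p.igp_metric,                    # 7. lowest IGP metric to next hop
        int(IPv4Address(p.router_id)),   # 8. lowest router ID
    )


def best_path(paths: list[Path]) -> Path:
    return min(paths, key=preference_key)


# Documentation ASNs (RFC 5398) stand in for real networks.
isp_a = Path("via ISP-A", as_path=[64501, 64510])
isp_b = Path("via ISP-B", as_path=[64502, 64503, 64510], local_pref=200)
print(best_path([isp_a, isp_b]).name)  # via ISP-B: LOCAL_PREF beats AS_PATH length
```

Encoding the steps as a tuple makes the ordering explicit: an attribute only matters when every attribute before it is tied, which is exactly why a crazy-low router ID wins only after everything else is equal.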

That’s a lot of knobs! Why on earth would you do that to someone? Because professionals need options.

Optional Awfulness

BGP is one of those curious things that seems to have been built without guardrails because it's never used by accident. You've probably seen something similar in the real world whenever a person removes a safety feature or modifies a device to increase performance or remove an annoyance designed to slow them down. It could be anything from wrapping a bandana around a safety switch lockout to keep something running to pulling the trigger guard off a nail gun so you don't keep hitting it with your fingers. Some professionals believe that safety features aren't keeping them safe as much as they are slowing them down. Something as simple as removing the safety from a pellet gun can have dire consequences in the name of efficiency.

So, how does this apply to our new favorite policy engine that happens to route packets? Quite a bit, actually. There is no system of guardrails that keeps you from making dumb choices. Accidentally paste your own AS into the AS Path? That's going to be considered in the routing decision. Make a typo for an AS that doesn't exist in the routing table? That's going into the formula, too. Announce to the entire world that you have the best path to an AS on the other side of the planet? BGP is happy to send traffic your way.
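One common way a bogus announcement attracts traffic does not even need better path attributes: routers forward on longest-prefix match, so a more-specific prefix wins over the covering route regardless of how the covering route was selected. The sketch below models only that final forwarding lookup, not BGP path selection; the `lookup` helper and the table contents are hypothetical, using documentation prefixes (RFC 5737) and ASNs (RFC 5398).

```python
from ipaddress import IPv4Network, ip_address, ip_network
from typing import Optional


def lookup(rib: dict[IPv4Network, str], destination: str) -> Optional[str]:
    """Return who receives traffic for destination, using longest-prefix match."""
    dst = ip_address(destination)
    matches = [net for net in rib if dst in net]
    if not matches:
        return None
    most_specific = max(matches, key=lambda net: net.prefixlen)
    return rib[most_specific]


rib = {ip_network("203.0.113.0/24"): "legitimate origin AS64500"}
print(lookup(rib, "203.0.113.10"))   # legitimate origin AS64500

# Someone announces a more-specific /25 covering half of the block.
rib[ip_network("203.0.113.0/25")] = "hijacker AS64511"
print(lookup(rib, "203.0.113.10"))   # hijacker AS64511
print(lookup(rib, "203.0.113.200"))  # legitimate origin AS64500 (outside the /25)
```

Nothing in the lookup asks whether the /25 announcement was legitimate. That question has to be answered earlier, by whatever filtering the receiving network applies before a route is accepted.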

BGP assumes that professionals are programming it, which means it's not going to try to stop you from shooting yourself in the foot. And because the knobs exposed by the engine are many and varied, you can spend a lot of time troubleshooting just how half of a cloud provider's traffic came barreling through your network for the last hour. CCIEs spend a lot of time memorizing BGP path selection because every step matters when trying to figure out why BGP is acting up. Likewise, knowing where the knobs are best utilized means knowing how to influence path selection. AS Path prepending is probably the best example of this. If you want to put your AS number in there a hundred times to influence path selection, go for it. Why? Because it's something you can do. Or, more explicitly, something you aren't prohibited from doing.
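Prepending works because AS_PATH length is one of the path selection steps: every extra copy of your AS makes that path look one hop longer to everyone downstream. The helper below is a hypothetical sketch under the assumption that a router adds its own AS once when advertising to an eBGP peer and that prepending repeats it `extra_prepends` more times; ASN 64500 is a documentation number.

```python
def advertised_as_path(received_path: list[int], local_as: int,
                       extra_prepends: int = 0) -> list[int]:
    """AS_PATH an eBGP neighbor sees from us.

    A router always adds its own AS once when advertising to an eBGP peer;
    prepending simply adds it extra_prepends more times.
    """
    if extra_prepends < 0:
        raise ValueError("extra_prepends cannot be negative")
    return [local_as] * (1 + extra_prepends) + received_path


MY_AS = 64500  # documentation ASN standing in for our own

# We originate the prefix ourselves (empty received path) over two links.
primary = advertised_as_path([], MY_AS)
backup = advertised_as_path([], MY_AS, extra_prepends=3)
print(primary)  # [64500]
print(backup)   # [64500, 64500, 64500, 64500]

# A neighbor that only compares AS_PATH length picks the primary link.
choice = min([("primary", primary), ("backup", backup)], key=lambda kv: len(kv[1]))
print(choice[0])  # primary
```

Note that prepending is only a suggestion to other networks. A neighbor evaluates its own LOCAL_PREF before AS_PATH length, so its local policy can override however many copies you add.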

Which leads to the problem of route hijacking. BGP is going to do what you tell it to do because it assumes you're not trying to do anything nefarious or stupid when you program it. Like an automation script, BGP is going to do whatever the policy engine instructs it to do as quickly as possible. Allowing for normal propagation delays, BGP will sync things up within a matter of minutes. Maybe a few hours. Which means it's not hard to watch a mistake cascade through the Internet. Or, in the case of people who are doing less-than-legal things, to watch the fruits of their labors succeed.

BGP is no more inherently bad than a catwalk without a handrail is inherently evil. Yes, the situation you find yourself in is less than ideal. Sure, something bad can happen if you screw up or do something you're not supposed to. But blaming the protocol or the object or the situation is not going to fix the issue. We really only have two options at this point:

  • Better educate our users and engineers about how to use BGP, and ensure that only qualified people are able to work on it
  • Create controls in BGP that limit the ability to use certain knobs and options in order to improve security and reliability
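One existing example of the second kind of control is RPKI route origin validation (RFC 6811): prefix holders publish Route Origin Authorizations (ROAs) stating which AS may originate a prefix and how specific the announcement may be, and routers classify each route as Valid, Invalid, or NotFound. The sketch below is a simplified, hypothetical model of that classification; real deployments fetch validated ROAs from RPKI validator software rather than hard-coding them, and the prefixes and ASNs are documentation values.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network, ip_network


@dataclass(frozen=True)
class ROA:
    """Route Origin Authorization: who may originate a prefix, and how specific."""
    prefix: IPv4Network
    max_length: int
    origin_as: int


def validate_origin(route_prefix: str, origin_as: int, roas: list[ROA]) -> str:
    """Classify a route as 'Valid', 'Invalid', or 'NotFound' (RFC 6811 terms)."""
    route = ip_network(route_prefix)
    # A ROA "covers" the route if the route falls inside the ROA's prefix.
    covering = [r for r in roas
                if r.prefix.version == route.version and route.subnet_of(r.prefix)]
    if not covering:
        return "NotFound"  # nobody has published intent for this address space
    for roa in covering:
        if roa.origin_as == origin_as and route.prefixlen <= roa.max_length:
            return "Valid"
    return "Invalid"  # covered, but wrong origin AS or too specific


roas = [ROA(ip_network("203.0.113.0/24"), max_length=24, origin_as=64500)]

print(validate_origin("203.0.113.0/24", 64500, roas))   # Valid
print(validate_origin("203.0.113.0/24", 64511, roas))   # Invalid: wrong origin
print(validate_origin("203.0.113.0/25", 64500, roas))   # Invalid: exceeds maxLength
print(validate_origin("198.51.100.0/24", 64500, roas))  # NotFound
```

This check would catch both the wrong-origin announcement and the more-specific /25 from the hijack scenario, provided receiving networks act on Invalid results. It only examines the origin AS, though, so a forged path that ends in the legitimate origin AS still passes.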

Tom’s Take

I'm a proponent of both of those options. We need to ensure that people have the right training. However, we also need to ensure that nefarious actors are locked out, that we are protected from making dumb mistakes, and that our errors aren't propagated at light speed through the dark corners of the Internet. We can't fix everything wrong with BGP, but it's the best option we have right now. Hellish though it may be, we have to find a way to make a better combination of the protocol and the people who use it.
