There was an interesting article last week from Fastly talking about using BGP to scale their network. This was but the latest in a long line of discussions around using BGP as a transport protocol between areas of the data center, even down to the Top-of-Rack (ToR) switch level. LinkedIn made a huge splash with it a few months ago with their Project Altair solution. Now it seems company after company is racing to implement BGP as the solution to their transport woes. And all because developers have finally pulled their heads out of the sand.
BGP Under Every Rock And Tree
BGP is a very scalable protocol. It’s used the world over to exchange routes and keep the Internet running smoothly. But it has other power as well. It can be extended to operate in other ways beyond the original specification. Unlike rigid protocols like RIP or OSPF, BGP was designed in part to be extended and expanded as needs changes. IS-IS is a very similar protocol in that respect. It can be upgraded and adjusted to work with both old and new systems at the same time. Both can be extended without the need to change protocol versions midstream or introduce segmented systems that would run like ships in the night.
This isn’t the first time that someone has talked about running BGP to the ToR switch either. Facebook mentioned in this video almost three years ago. Back then they were solving some interesting issues in their own data center. Now, those changes from the hyperscale world are filtering into the real world. Networking teams are seeking to solve scaling issues without resorting to overlay networks or other types of workarounds. The desire to fix everything wrong with layer 2 has led to a revelation of sorts. The real reason why BGP is able to work so well as a replacement for layer 2 isn’t because we’ve solved some mystical networking conundrum. It’s because we finally figured out how to build applications that don’t break because of the network.
Apps As Far As The Eye Can See
The whole reason when layer 2 networks are the primary unit of data center measurement has absolutely nothing to do with VMware. VMware vMotion behaves the way that it does because legacy applications hate having their addresses changed during communications. Most networking professionals know that MAC addresses have a tenuous association to IP addresses, which is what allows the gratuitous ARP after a vMotion to work so well. But when you try to move an application across a layer 3 boundary, it never ends well.
When web scale companies started building their application stacks, they quickly realized that being pinned to a particular IP address was a recipe for disaster. Even typical DNS-based load balancing only seeks to distribute requests to a series of IP addresses behind some kind of application delivery controller. With legacy apps, you can’t load balance once a particular host has resolved a DNS name to an IP address. Once the gateway of the data center resolves that IP address to a MAC address, you’re pinned to that device until something upsets the balance.
Web scale apps like those built by Netflix or Facebook don’t operate by these rules. They have been built to be resilient from inception. Web scale apps don’t wait for next hop resolution protocols (NHRP) or kludgy load balancing mechanisms to fix their problems. They are built to do that themselves. When problems occur, the applications look around and find a way to reroute traffic. No crazy ARP tricks. No sly DNS. Just software taking care of itself.
The implications for network protocols are legion. If a web scale application can survive a layer 3 communications issue then we are no longer required to keep the entire data center as a layer 2 construct. If things like anycast can be used to pin geolocations closer to content that means we don’t need to worry about large failover domains. Just like Ivan Pepelnjak (@IOSHints) says in this post, you can build layer 3 failure domains that just work better.
BGP can work as your ToR strategy for route learning and path selection because you aren’t limited to forcing applications to communicate at layer 2. And other protocols that were created to fix limitations in layer 2, like TRILL or VXLAN, become an afterthought. Now, applications can talk to each other and fail back and forth as they need to without the need to worry about layer 2 doing anything other than what it was designed to do: link endpoints to devices designed to get traffic off the local network and into the wider world.
One of the things that SDN has promised us is a better way to network. I believe that the promise of making things better and easier is a noble goal. But the part that has bothered me since the beginning was that we’re still trying to solve everyone’s problems with the network. We don’t rearrange the power grid every time someone builds a better electrical device. We don’t replumb the house overtime we install a new sink. We find a way to make the new thing work with our old system.
That’s why the promise of using BGP as a ToR protocol is so exciting. It has very little to do with networking as we know it. Instead of trying to work miracles in the underlay, we build the best network we know how to build. And we let the developers and programmers do the rest.