Applications are king. Forget all the things you do to ensure proper routing in your data center. Forget the tweaks for OSPF sub-second failover or BGP optimal path selection. None of it matters to your users. If their login to Seibel or Salesforce or Netflix is slow today, you’ve failed. They are very vocal when it comes to telling you how much the network sucks today. How do we fix this?
Pathways Aren’t Perfect
The first problem is the cloud focus of applications. Once our packets leave our border routers it’s a giant game of chance as to how things are going to work next. The routing protocol games that govern the Internet are tried and true and straight out of RFC 1771(Yes, RFC 4271 supersedes it). BGP is a great tool with general purpose abilities. It’s becoming the choice for web scale applications like LinkedIn and Facebook. But it’s problematic for Internet routing. It scales well but doesn’t have the ability to make rapid decisions.
The stability of BGP is also the reason why it doesn’t react well to changes. In the old days, links could go up and down quickly. BGP was designed to avoid issues with link flaps. But today’s links are less likely to flap and more likely to need traffic moved around because of congestion or other factors. The pace that applications need to move traffic flows means that they tend to fight BGP instead of being relieved that it’s not slinging their traffic across different links.
BGP can be a good suggestion of path variables. That’s how Facebook uses it for global routing. But decisions need to be made on top of BGP much faster. That’s why cloud providers don’t rely on it beyond basic connectivity. Things like load balancers and other devices make up for this as best they can, but they are also points of failure in the network and have scalability limitations. So what can we do? How can we build something that can figure out how to make applications run better without the need to replace the entire routing infrastructure of the Internet?
GPS For Routing
One of the things that has some potential for fixing inefficiency with BGP and other basic routing protocols was highlighted during Networking Field Day 12 during the presentation from Teridion. They have a method for creating more efficiency between endpoints thanks to their agents. Founder Elad Rave explains more here:
I like the idea of getting “traffic conditions” from endpoints to avoid congestion. For users of cloud applications, those conditions are largely unknown. Even multipath routing confuses tried-and-true troubleshooting like traceroute. What needs to happen is a way to collect the data for congestion and other inputs and make faster decisions that aren’t beholden to the underlying routing structure.
Overlay networking has tried to do this for a while now. Build something that can take more than basic input and make decisions on that data. But overlays have issues with scaling, especially past the boundary of the enterprise network. Teridion has potential to help influence routing decisions in networks outside your control. Sadly, even the fastest enterprise network in the world is only as fast as an overloaded link between two level 3 interconnects on the way to a cloud application.
Teridion has the right idea here. Alternate pathways need to be identified and utilized. But that data needs to be evaluated and updated regularly. Much like the issues with Waze dumping traffic into residential neighborhoods when major arteries get congested, traffic monitors could cause overloads on alternate links if shifts happen unexpectedly.
The other reason why I like Teridion is because they are doing things without hardware boxes or the need to install software anywhere but the end host. Anyone working with cloud-based applications knows that the provider is very unlikely to provide anything outside of their standard offerings for you. And even if they manage, there is going to be a huge price tag. More often than not, that feature request will become a selling point for a new service in time that may be of marginal benefit until everyone starts using it. Then application performance goes down again. Since Teridion is optimizing communications between hosts it’s a win for everyone.
I think Teridion is on to something here. Crowdsourcing is the best way to gather information about traffic. Giving packets a better destination with shorter travel times means better application performance. Better performance means happier users. Happier users means more time spent solving other problems that have symptoms that aren’t “It’s slow” or “Your network sucks”. And that makes everyone happier. Even grumpy old network engineers.
Teridion was a presenter during Networking Field Day 12 in San Francisco, CA. As a participant in Networking Field Day 12, my travel and lodging expenses were covered by Tech Field Day for the duration of the event. Teridion did not ask for nor where they promised any kind of consideration in the writing of this post. My conclusions here represent my thoughts and opinions about them and are mine and mine alone.
Just nail two diverse connections to the end provider, AWS or whomever. Choose determinism over a probabilistic result. It’s a cost of doing business.
More than one provider? Then the conversation gets more complex, but not unsolvable.