The Puzzle of Peering with Kentik

If you’ve worked at an ISP, or even just closely with one, you’ve probably heard the term peering quite a bit. Peering is essentially a reciprocal agreement between two providers to provide access to each other’s networks. Provider A agrees to allow Provider B to send traffic over and through their network in exchange for the same access in the other direction. Sounds easy, right? On a technical level it is pretty easy. You simply set up a BGP session with the partner provider, make sure all the settings match, and you’ve got things rolling.
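
To make that concrete, here’s a minimal Python sketch (illustrative only, using made-up documentation-range ASNs and addresses) of the handful of values both sides have to agree on before a session will even come up:

    def session_will_establish(a, b):
        """Rough sanity check: each side must point at the other's AS and address,
        share at least one address family, and agree on authentication."""
        return (a["peer_as"] == b["local_as"] and b["peer_as"] == a["local_as"]
                and a["peer_ip"] == b["local_ip"] and b["peer_ip"] == a["local_ip"]
                and bool(a["families"] & b["families"])
                and a["password"] == b["password"])

    provider_a = {"local_as": 64496, "local_ip": "192.0.2.1", "peer_as": 64511,
                  "peer_ip": "203.0.113.2", "families": {"ipv4-unicast"},
                  "password": "peering-key"}
    provider_b = {"local_as": 64511, "local_ip": "203.0.113.2", "peer_as": 64496,
                  "peer_ip": "192.0.2.1", "families": {"ipv4-unicast", "ipv6-unicast"},
                  "password": "peering-key"}
    print(session_will_establish(provider_a, provider_b))  # True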

The technical part isn’t usually where peering gets complicated. Instead it’s almost always related to the business side of things. The policy and negotiations that have to happen for a good peering agreement take way more time than hammering out some BGP configuration stanzas. The amount of traffic to be sent, the latency requirements, and even the cost of the agreement all have to be figured out before the first hello packet can be exchanged. The agreement is always up for negotiation too, since traffic patterns can change before you realize it and put you at a disadvantage.

Peerless Data Collection

If you want to get the most out of your peering arrangement you need to know what’s going on. You need to have statistics about the key points of your agreement. You need to know if you’re holding up your end of the bargain, as well as whether the company you’re working with is holding up theirs. If you walk into a peering negotiation without the right data you’re going to be at a disadvantage right away.

For example, did your partner company take all of the traffic they agreed to accept in the peering contract? Or did they have issues that forced you to send the traffic along a different route? Were you forced to send that traffic across a different route that had a higher cost? Did your users complain about network speed because of congestion outside of your control? If you can’t put your finger on the answers to these questions quickly you’re going to find yourself with lots of angry users and customers, not to mention peering partners that want answers from you as well.

Recently I had a chance to listen to a great presentation from Kentik during Networking Field Day: Service Provider. Nina Bargisten laid out some of the challenges that Kentik customers face with peering arrangements and how Kentik is helping to solve them:

One of the points that Nina discusses is that capacity planning is a huge undertaking for ISPs. With the supply chain issues that we’re currently facing in 2022 it’s not easy to order equipment to alleviate congestion problems. Even under somewhat normal circumstances it’s not likely that an ISP is going to go out and order a lot of new hardware just to deal with congestion. They might change some policies to route traffic in different directions, but ultimately the decision has to be made about how to get customer packets through and out of their network to the final destination.

Peering agreements can help with congestion. Adding more exit points to your network means some flows can exit through a different provider and either get to their destination faster or prevent a larger connection from being overwhelmed and congested. It’s not unlike having multiple options for arriving at a destination when driving. Some streets are better for smaller amounts of traffic compared to larger highways and interstates that provide high-speed travel.

As mentioned above, it’s critical that you have data on your traffic and its performance. Are you sending everything through one route? Are you peering with providers that are getting less than half of the traffic load they agreed to take? These are all questions you have to ask to create a capacity plan. If you’re hearing complaints about congestion but you see that only two of your outbound connections are running at full capacity while the rest are sitting idle then you don’t have a congestion issue as much as you have a configuration problem to solve.
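
As a toy illustration (not how Kentik actually does it), you could imagine boiling flow data down to two ratios per peer: how much of the committed traffic you’re actually sending them, and how hot the physical link is running.

    def peer_report(peers):
        """peers: list of dicts with committed, actual, and capacity values in Gbps."""
        report = []
        for p in peers:
            commit_ratio = p["actual_gbps"] / p["committed_gbps"]  # using what we negotiated?
            link_util = p["actual_gbps"] / p["capacity_gbps"]      # is the link itself congested?
            report.append((p["name"], round(commit_ratio, 2), round(link_util, 2)))
        return report

    peers = [
        {"name": "peer-a", "committed_gbps": 40, "actual_gbps": 39, "capacity_gbps": 40},   # saturated
        {"name": "peer-b", "committed_gbps": 40, "actual_gbps": 4,  "capacity_gbps": 100},  # nearly idle
    ]
    print(peer_report(peers))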

Kentik’s solution allows you to see what’s going on and helps you make better decisions about the routes that traffic should be taking. As demonstrated above, their dashboard collects data from your network as well as many others and can tell you when you need to be configuring policies to send traffic to low-volume peers instead of relying on congested links. It will also help you see trends for when links become congested and allow you to set thresholds to divide your traffic appropriately before it becomes an issue.

There’s a lot more info in the video above to help you with your capacity planning and peering negotiations. It all comes down to a simple maxim: Information is key. You can’t solve these puzzles without knowing what you have and what you need. If you’re just going to keep throwing peering agreements at a problem until it goes away you’re going to fail. You won’t solve your real issues by just adding another connection that never gets used. Instead you can use Kentik’s platform to provide the kinds of insights that will help you create value for your customers and save money at the same time.


Tom’s Take

Service providers think about traffic differently than enterprise admins. They have to worry about it coming into the network and leaving again. Instead of worrying about a couple of links to the wider Internet they have to worry about dozens. If you think it’s hard keeping track of all that data for the enterprise you can just imagine how hard it is when you scale it up to the service provider level. Thankfully companies like Kentik are applying their expertise to provide actionable information to help you make the right choices and maybe even negotiate some better deals.

If you’d like to learn more, make sure you check out the full presentation on the Tech Field Day site or go to http://Kentik.com

The Value of Old Ideas

I had a fun exchange on Twitter this week that bears some additional thinking. Emirage (@Emirage6) tweeted a fun meme about learning BGP:

I retweeted it and a few people jumped in on the fun, including a couple that said it was better to configure BGP for reasons. This led to a blog post about routing protocols with even more great memes and a good dose of reality for anyone that isn’t a multi-CCIE.

Explain It Like I’m Five

I want you to call your mom and explain BGP to her. Go on and do that now because I’m curious to see how you’d open that conversation. Unless your mom is in networking already I’m willing to bet you’re going to have to start really, really basic. In fact, given the number of news organizations that don’t even know what the letters in the acronym stand for I’d guess you are going to have a hard time talking about the path selection process or leak maps or how sessions are established.

Now, try that same conversation with RIP. I bet it goes a lot smoother. Why? Because RIP is a simple protocol. We don’t worry about prefixes or AS Path prepending or other things when we talk about RIP. Because RIPv1 is the most basic routing protocol you can get. It learns about routes and sends the information on. It’s so simple that you can usually get all of the info about it out of the way in a couple of hours of instruction in a networking class.

So why do we still talk about RIP? It’s so old. Even the second version of the protocol is better. There’s no intelligence. No link state. You can’t even pick a successor route! We should never talk about RIPv1 again in any form, right? So what would you suggest we start with? OSPF? BGP?

The value of RIP is not in the protocol. Instead, the value is in the simplicity of the discussion. If you’ve never heard of a routing protocol before you don’t want to hear about all the complexity of something that runs half the Internet. At least you don’t want to hear about it at first. What you need to hear is how a routing protocol works. That’s why RIP is a great introduction. It does the very minimal basic functions that a routing protocol should do. It learns routes and tells other routers about them.

RIP is also a great way to explain why other routing protocols were developed. We use OSPF and EIGRP and other IGPs now because RIP doesn’t perform well outside of small networks. There are limitations around IP subnets and network diameter and “routing by rumor” that only make sense when you know how routing protocols are supposed to operate and how they can fall down. If you start with OSPF and learn about link-state first then you don’t understand how a router could ever have knowledge of a route that it doesn’t know about directly or learn about from a trusted source. In essence, RIP is a great first lesson because it is both bad and good.
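
For the curious, RIP-style “routing by rumor” fits in a few lines of Python. This is a deliberately stripped-down sketch (no split horizon, no timers, hop count only) just to show the mechanism: believe what your neighbor tells you, add one hop, and cap at 16 to mark a route unreachable.

    INFINITY = 16  # RIP's unreachable metric, which also caps the network diameter

    def receive_update(table, neighbor, advertised):
        """table: prefix -> (metric, next_hop). advertised: prefix -> metric from a neighbor."""
        for prefix, metric in advertised.items():
            candidate = min(metric + 1, INFINITY)
            current_metric, current_hop = table.get(prefix, (INFINITY, None))
            # Take the route if it's better, or always believe our current next hop
            if candidate < current_metric or current_hop == neighbor:
                table[prefix] = (candidate, neighbor)
        return table

    routes = {"10.1.0.0/16": (1, "directly connected")}
    receive_update(routes, "router-b", {"10.2.0.0/16": 1, "10.3.0.0/16": 2})
    print(routes)  # now knows 10.2.0.0/16 at 2 hops and 10.3.0.0/16 at 3 hops via router-b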

The Expert’s Conundrum

The other issue at hand is that experts tend to feel like complicated subjects they understand are easy to explain. If that were the case, then the Explain Like I’m Five subreddit shouldn’t exist. It turns out that trying to explain a complex topic to people without expert knowledge is super hard because they lack the frame of reference you need to help them understand. If you are a network engineer that doesn’t believe me then go ask a friend in the medical profession to explain how the endocrine system works. Don’t let them give you a simple explanation. Make them tell you like they’d teach a medical student.

We lose sight of the fact that complex topics can’t be mastered quickly and that a certain amount of introductory knowledge is needed to bring people along on the journey. We teach about older models and retired protocols because the knowledge they contain can help us understand why we moved away from them. It also helps us to have a baseline level of knowledge about things that permeate the way we do our jobs.

If we removed things that we never use from the teaching curriculum we’d never have the OSI model, since it is never implemented in its pure form in any protocol. We’d teach the TCP/IP model and just tell people that the other things don’t really matter. In fact, they do matter, because the pure OSI model takes other things into account that aren’t just focused on the network protocol stack. It may seem silly to us as experts to teach one thing when reality looks much different, but that’s how learning works.

We still teach older methods of network topologies like bus and ring even though we don’t use them much any more. We do this because entry-level people need to know why we arrived at the method we use now. Even with the move toward new setups like 3-tier network design and leaf/spine architecture you need to know where Ethernet started to understand why we are where we are today.


Tom’s Take

It’s always important to get a reminder that people are just starting out in the learning process. While it’s easy to be an expert and sit back and claim that iBGP is the best way to approach a routing protocol problem you also have to remember that learners sometimes just need a quick-and-dirty lab setup to test another function. If they’re going to spend hours configuring BGP neighbor relationships instead of just enabling RIP on all router interfaces and moving on to the next part then they’re not learning the right things. Knowledge is important, even if it’s outdated. We still teach RIP and Frame Relay and Token Ring in networking because people need to understand how they operate and why we’ve moved on. They may not ever configure them in practice but they may also never configure BGP either. The value of information doesn’t decrease because it’s old.

The Network Does Too Much

I’m at Networking Field Day this week and it’s good to be back in person around other brilliant engineers and companies. One of the other fun things that happens at Networking Field Day is that I get to chat with folks that help me think about things in new ways and come up with awesome ideas for networking blog posts.

One topic that came up quickly this week really got me thinking again about fragility and complexity. Thanks to Carl Fugate for reminding me about it. Essentially, networks are inherently unstable because they are doing far too much heavy lifting.

Swiss Army Design

Have you heard about the AxeSaw Reddit? It’s a page dedicated to finding silly tools that attempt to combine too many things into one package and make the overall tool less useful. Like a combination shovel and axe that isn’t easy to operate because you have to hold on to the shovel scoop as the handle for the axe, and so on. It’s a goofy take on the trend of trying to make things too compact at the expense of usability.

Networking has this issue as well. I’ve talked about it before here but nothing has really changed in the intervening time since that post five years ago. The developers that build applications and systems still rely on the network to help solve challenges that shouldn’t be solved in the network. Things like first hop redundancy protocols (FHRPs) are perfect examples. Back in the dark ages, systems didn’t know how to handle what happened when a gateway disappeared. They knew they needed to find a new one eventually when the gateway’s ARP entry timed out. However, for applications that couldn’t wait there needed to be a way to pick up the packets and keep them from timing out.

Great idea in theory, right? But what if the application knew how to handle that? Things like Cisco CallManager have been able to designate backup servers for years. Applications built for the cloud know how to fail over properly and work correctly when a peer disappears or a route to a resource fails. What happened? Why did we suddenly move from a model where you have to find a way to plan for failure with stupid switching tricks to a world where software just works as long as Amazon is online?

The answer is that we removed the ability for those stupid tricks to work without paying a high cost. You want to use Layer 2 tricks to fake an FHRP? If it’s even available in the cloud you’re going to be paying a fortune for it. AWS wants you to use the tools they’ve optimized for. If you want to do things the way you’ve always done them you can, but you need to pay for that privilege.

With cost being the primary driver for all things, increased costs for stupid switching tricks have now given way to better software development. Instead of paying thousands of dollars a month for a layer 2 connection to run something like HSRP you can instead just make the application start searching for a new server when the old one goes away. You can write it to use DNS instead of IP so you can load balance or load share. You can do many, many exciting and wonderful things to provide a better experience that you wouldn’t have even considered before because you just relied on the networking team to keep you from shooting yourself in the foot.

Network Complexity Crunch

If the cloud forces people to use their solutions for reliability and such, that means the network is going to go away, right? It’s the apocalypse for engineer jobs. We’re all going to get replaced by a DevOps script. And the hyperbole continues on.

The reality is that network engineers will still be as needed as highway engineers are, even though cars are a hundred times safer now than in the middle of the 20th century. Just because you’ve compensated for something you used to be forced to do doesn’t mean you don’t need to build the pathway. You still need roads and networks to connect things that want to communicate.

What it means for the engineering team is an increased focus on providing optimal, reliable communications. If we remove the need to deal with crazy ARP tricks and things like that we can focus on optimizing routing to provide multiple paths and ensure systems have better communications. We could even do crazy things like remove our reliance on legacy IP, because the applications will survive a transition when they aren’t beholden to an IP address or ARP to prevent failure.

Networking will become a utility. Like electricity or natural gas it won’t be visible unless it’s missing. Likewise, you don’t worry about the utility company solving issues about delivery to your home or business. You don’t ask them to provide backup pipelines or creative hacks to make it work. You are handed a pipeline that has bandwidth and the service is delivered. You don’t feel like the utility companies are outdated or useless because you’re getting what you pay for. And you don’t have to call them every time the heater doesn’t work or you flip a breaker. Because that infrastructure is on your side instead of theirs.


Tom’s Take

I’m ready for a brave new world where the network is featureless and boring. It’s an open highway with no airbags along the shoulder to prevent you from flying off the road. No drones designed to automatically pick you up and put you on the correct path to your destination if you missed your exit. The network is the pathway, not the system that provides all the connection. You need to rely on your systems and your applications to do the heavy lifting. Because we’ve finally found a solution that doesn’t allow the networking team to save the day we can absolutely build a world where the clients are responsible for their own behavior. The network needs to do what it was designed to do and no more. That’s how you solve complexity and fragility. Less features means less to break.

IP Class is Now in Session

You may have seen something making the rounds on Twitter this week about a couple of proposed drafts designed to alleviate the problems with IPv4 exhaustion by repurposing some old IP spaces that aren’t available for use right now. Specifically:

Ultimately, this is probably going to fail for a variety of reasons and looks like it’s more of a suggestion than anything else but I wanted to take a moment to talk about why this isn’t an effective way of fixing address issues.

Error Bearers

The first reason that the Schoen drafts are going to fail is that most of the operating systems in the world won’t allow you to use reserved spaces for a system address. Because we knew years ago that certain spaces were marked as non-usable, the logic was built into the system to disallow the use of those spaces. And even if the system isn’t configured to disallow that space there’s no guarantee the traffic is going to be transmitted.

Let’s take 127/8 as a good example. Was it a smart idea to mark 16 million addresses as loopback host-only space? Nope. But that ship has sailed and we aren’t going to be able to easily fix it. Too many systems will see any address with 127 in the first octet and assume it’s a loopback address, in much the same way that people have been known to assume the entire 192/8 address space is RFC 1918 reserved instead of just 192.168.0.0/16. Logic rules and people making decisions aren’t going to trust any space being used in that manner. Even if you did something creative like using NAT and only using it internally, you’re not going to be able to patch every version of every operating system in your organization.
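
You can see how deeply those assumptions are baked in from your own workstation. Python’s standard ipaddress module, to pick just one example, already classifies these ranges, and plenty of other stacks and libraries do the same thing:

    import ipaddress

    for addr in ["127.1.2.3", "10.0.0.1", "192.168.10.5", "240.0.0.1", "8.8.8.8"]:
        ip = ipaddress.ip_address(addr)
        print(f"{addr:>12}  loopback={ip.is_loopback}  private={ip.is_private}  "
              f"reserved={ip.is_reserved}  global={ip.is_global}")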

We modify rules all the time and then have to spend years updating those modifications. Take area codes in North America for example. The old rules used to say that an area code had to have a zero or a one for the middle digit – ([2-9][0-1][2-9]) to use the Cisco UCM parlance. If your middle digit was something other than a zero or a one it wasn’t a valid NANP area code. As we began to expand the phone system in 1995 we changed those rules and now have area codes with all manner of middle numbers.

What about prefixes? Those follow rules too. NANP prefixes must not start with a zero or a one – (area code) [2-9]XX-XXXX is the way they are coded. Prefixes that start with a zero or a one are invalid and can’t be used. If we suddenly decided that we needed to open up the numbers in existing area codes and include prefixes that start with those forbidden numbers we would need to reset all the dialing rules in systems all over the country. I know that I specifically programmed my CUCM servers to send an immediate error if you dialed a prefix with a zero or a one. And all of them would have to be manually reconfigured for such a change.
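
Here’s the same idea in a quick Python sketch: a dial plan coded to the pre-1995 area code rule (as described above) happily rejects perfectly valid modern area codes, which is exactly the kind of stale logic that has to be hunted down and patched.

    import re

    OLD_AREA_CODE = re.compile(r"^[2-9][01][2-9]$")   # pre-1995 rule: middle digit must be 0 or 1
    NEW_AREA_CODE = re.compile(r"^[2-9][0-9][0-9]$")  # today: any middle digit
    PREFIX        = re.compile(r"^[2-9][0-9][0-9]$")  # prefixes still can't start with 0 or 1

    for npa in ["405", "212", "972", "580"]:
        print(npa, "old rule:", bool(OLD_AREA_CODE.match(npa)),
                   "current rule:", bool(NEW_AREA_CODE.match(npa)))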

In much the same way, the address spaces that are reserved today as invalid would need to be patched out of systems from home computers to phones to networking equipment. And even if you think you got it all you’re going to miss one and wonder why it isn’t working. Worse yet, it might even silently fail because you may be able to transmit data to 95% of the systems out there but some intermediate system may discard your packets as invalid and never tell you what happened. You’ll spend hours or days chasing a problem you may not even be able to fix.

Avoiding the Solutions

The easiest way to look at these proposals is by understanding that people are really, really, really in love with IPv4. The effort needed to implement these reserved spaces would be better spent on IPv6 adoption, yet we still see drafts like these being submitted. There is a solution, but people don’t want to use it. The modern Internet relies so much on the cloud that it would be simple to enable IPv6 in your provider space and use your engineering talent to drive better adoption of that instead. We’re already seeing this in places where address space has been depleted for a while now.

It may feel easier to spend more effort to revitalize the IPv4 space we all know and love. It may even feel triumphant when we’re able to reclaim address space that was wasted and use it for something productive instead of just teaching that you can’t configure devices with those spaces. And millions of devices will have IP address space to use, or more accurately there will be millions of addresses available to sell to people that will waste them anyway. Then what?

The short term gain from opening up IPv4 space at the expense of not developing IPv6 adoption is a fallacy that will end in pain. We can keep putting policy duct tape on the IPv4 exhaustion problem but we are eventually going to hit a wall we can’t overcome. The math doesn’t work when your address space is only 32 bits in total. That’s why IPv6 expanded the amount of information in the address space.
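
The arithmetic is stark enough to fit at a Python prompt:

    >>> 2 ** 32                      # every possible IPv4 address
    4294967296
    >>> 2 ** 128                     # every possible IPv6 address
    340282366920938463463374607431768211456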

Sure, there have been mistakes in the way that IPv6 address space has been allocated and provisioned. Those mistakes would need to eventually be corrected and other configurations would need to be done in order to efficiently utilize the space. Again, the effort should be made to fix problems with a future-proof solution instead of trying our hardest to keep the lights on with the old system that’s falling apart for a few more years.


Tom’s Take

The race to find every last possible way to utilize the IPv4 space is exactly what I expected when we’re in the death throes of using it instead of IPv6. The easy solutions are done. The market and hunger for IPv4 space is only getting stronger. Instead of weaning the consumers off their existing setups and moving them to something future proof we’re feeding their needs for short term gains. If the purpose of this whole exercise was to get more address space to be rationed out for key systems to keep them online longer I might begrudgingly accept it. However, knowing that it would likely be opened up and fed to providers to be auctioned off in blocks to be ultimately wasted means all the extra effort is for no gain. These IETF drafts have a lot of issues and we’re better off letting them expire in May 2022. Because if we take up this cause and try to make them a reality we’re going to have to relearn a lot of lessons of the past we’ve forgotten.

BGP Hell Is Other People

If you configure a newsreader to alert you every time someone hijacks a BGP autonomous system (AS), it will probably go off at least once a week. The most recent one was on the first of April courtesy of Rostelecom. But they’re not the only one. They’re just the latest. Incidents of people redirecting BGP, either by accident or by design, are becoming more and more frequent. And as we rely more and more on things like cloud computing and online applications to do our daily work and live our lives, the impact of these hijacks is becoming more and more critical.

Professional-Grade Protocol

BGP isn’t the oldest thing on the Internet. RFC 1105 is the initial draft of Border Gateway Protocol. The version that we use today, BGP4, is documented in RFC 4271. It’s a protocol that has enjoyed a long history of revisions and a reviled history of making networking engineers’ lives difficult. But why is that? How can a routing protocol be so critical and yet obtuse?

My friend Marko Milivojevic famously stated in his CCIE training career that, “BGP isn’t a routing protocol. It’s a policy engine.” When you look at the decisions of BGP in this light it makes a lot more sense. BGP isn’t necessarily concerned with the minutiae of figuring out exactly how to get somewhere. Sure, it has a table of prefixes that it uses to make decisions about where to forward packets. Almost every protocol does this. But BGP is different because it’s so customizable.

Have you ever tried to influence a route in RIP or OSPF? It’s not exactly easy. RIP is almost impossible to manipulate outside of things like route poisoning or just turning off interfaces. Sometimes the simplest things are the most hardened. OSPF gives us a lot more knobs to play with, like interface bandwidth and interface cost. We can tweak and twerk those values to our heart’s content to make packets flow in a certain direction. But we don’t have a lot of influence outside of a specific area for those values. If you’ve ever had to memorize the minutiae of OSPF not-so-stubby areas, ASBRs, and the difference between Type 5 and Type 7 LSAs you know that these topics were all but created for certification exams.

But what about BGP? How can you influence routes in BGP? Oh, man! How much time do you have??? We can manipulate things all sorts of ways!

  • Weight the routes to prefer one over another
  • Set LOCAL_PREFERENCE to pick which route to use in a multiple exit system
  • Configure multi-exit discriminator (MED) values
  • AS Path Prepending to reduce the likelihood of a path being chosen
  • Manipulate the underlying routing protocol to make certain routes look more preferred
  • Just change the router ID to something crazy low to break all the other ties in the system

That’s a lot of knobs! Why on earth would you do that to someone? Because professionals need options.
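
To give a feel for how those knobs stack up, here’s a heavily simplified Python sketch of the comparison. Real BGP best-path selection has more steps (origin code, eBGP versus iBGP preference, IGP metric, and so on); the ordering of the knobs above is the point.

    def better_path(a, b):
        """Pick the preferred of two candidate paths, each a dict of attributes."""
        if a["weight"] != b["weight"]:              # Cisco-style weight: higher wins, local to the router
            return a if a["weight"] > b["weight"] else b
        if a["local_pref"] != b["local_pref"]:      # LOCAL_PREFERENCE: higher wins, shared across the AS
            return a if a["local_pref"] > b["local_pref"] else b
        if len(a["as_path"]) != len(b["as_path"]):  # shorter AS path wins, which is what prepending games
            return a if len(a["as_path"]) < len(b["as_path"]) else b
        if a["med"] != b["med"]:                    # MED: lower wins
            return a if a["med"] < b["med"] else b
        return a if a["router_id"] < b["router_id"] else b  # final tie-break: lowest router ID

    isp_a = {"weight": 0, "local_pref": 100, "as_path": [64496, 64499], "med": 0, "router_id": "10.0.0.1"}
    isp_b = {"weight": 0, "local_pref": 200, "as_path": [64497, 64498, 64499], "med": 0, "router_id": "10.0.0.2"}
    print(better_path(isp_a, isp_b)["router_id"])  # isp_b wins on LOCAL_PREFERENCE despite the longer path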

Optional Awfulness

BGP is one of those curious things that seems to be built without guardrails because it’s never used by accident. You’ve probably seen something similar in the real world whenever a person removes a safety feature or modifies a device to increase performance and remove an annoyance designed to slow them down. It could be anything from wrapping a bandana around a safety switch lockout to keep something running to pulling the trigger guard off a nail gun so you don’t keep hitting it with your fingers. Some professionals believe that safety features aren’t keeping them safe as much as they are slowing them down. Something as simple as removing the safety from a pellet gun can have dire consequences in the name of efficiency.

So, how does this apply to our new favorite policy engine that happens to route packets? Well, it applies quite a bit. There is no system of guardrails that keeps you from making dumb choices. Accidentally paste your own AS into the AS Path? That’s going to be a routing decision that is considered. Make a typo for an AS that doesn’t exist in the routing table? That’s going into the formula, too. Announcing to the entire world you have the best path to an AS somewhere on the other side of the world? BGP is happy to send traffic your way.

BGP assumes that professionals are programming it. Which means it’s not going to try to stop you from shooting off your own foot. And because the knobs exposed by the engine are so numerous and varied, you can spend a lot of time trying to troubleshoot just how half of a cloud provider’s traffic came barreling through your network for the last hour. CCIEs spend a lot of time memorizing BGP path selection because every step matters when trying to figure out why BGP is acting up. Likewise, knowing where the knobs are best utilized means knowing how to influence path selection. AS Path prepending is probably the best example of this. If you want to put that AS number in there a hundred times to influence path selection you go for it. Why? Because it’s something you can do. Or, more explicitly, something you aren’t prohibited from doing.

Which leads to the problem of route hijacking. BGP is going to do what you tell it to do because it assumes you’re not trying to do anything nefarious or stupid when you program it. Like an automation script, BGP is going to do whatever it is instructed to do by the policy engine as quickly as possible. Setting aside normal propagation delays, BGP will sync things up within a matter of minutes, maybe a few hours. Which means it’s not hard to watch a mistake cascade through the Internet. Or, in the case of people that are doing less-than-legal things, to watch the fruits of your labors succeed.

BGP isn’t inherently bad, any more than a catwalk without a handrail has evil intent. Yes, the situation you find yourself in is less than ideal. Sure, something bad can happen if you screw up or do something you’re not supposed to. But blaming the protocol or the object or the situation is not going to fix the issue. We really only have two options at this point:

  • Better educate our users and engineers about how to use BGP and ensure that only qualified people are able to work on it
  • Create controls in BGP that limit the ability to use certain knobs and options in order to provide more security and reliability options.

Tom’s Take

I’m a proponent of both of those options. We need to ensure that people have the right training. However, we also need to ensure that nefarious actors are locked out and that we are protected from making dumb mistakes or that our errors aren’t propagated at light speed through the dark corners of the Internet. We can’t fix everything wrong with BGP but it’s the best option we have right now. Hellish though it may be, we have to find a way to make a better combination of the protocol and the people that use it.

SD-WAN and Technical Debt

Back during Networking Field Day 22, I was having a fun conversation with Phil Gervasi (@Network_Phil) and Carl Fugate (@CarlFugate) about SD-WAN and innovation. I mentioned that it was fascinating to see how SD-WAN companies kept innovating while bigger, more established companies that had bought into SD-WAN seemed to be having issues catching up. As our conversation continued I realized that technical debt plays a huge role in startup culture across the board, not just with SD-WAN. However, SD-WAN is a great example of technical debt to talk about here.

Any Color You Want In Black

Big companies have investments in supply chains. They have products that are designed in a certain way because it’s the least expensive way to develop the project or it involves using technology developed by the company that gives them a competitive advantage. Think about something like the Cisco Nexus 9000-series switches that launched with Cisco ACI. Every one of them came with the Insieme ASIC that was built to accelerate the policy component of ACI. Whether or not you wanted to use ACI or Insieme in your deployment, you were getting the ASIC in the switch.

Policies like this lead to unintentional constraints in development. Think back five years to Cisco’s IWAN solution. It was very much the precursor to SD-WAN. It was a collection of technologies like Performance Routing (PfR), Application Visibility and Control (AVC), Policy Based Routing (PBR), and Network Based Application Recognition (NBAR). If that alphabet soup of acronyms makes you break out in hives, you’re not alone. Cisco IWAN was a platform very much marked by potential and complexity.

Let’s step back and ask ourselves an important question: “Why?” Why was IWAN so complicated? Why was IWAN hard to deploy? Why did IWAN fail to capture a lot of market share and ride the wave that eventually became SD-WAN? Looking back, a lot of the choices that were made that eventually doomed IWAN can come down to existing technical debt. Cisco is a company that makes design decisions based on what they’ve been doing for a while.

I’m sure that the design criteria for IWAN came down to two points:

  1. It needs to run on IOS.
  2. It needs to be an ISR router.

That doesn’t sound like much. But imagine the constraints you run into with just those two limitations. You have a hardware platform that may not be suited for the kind of work you want to do. Maybe you want to take advantage of x86 chipset acceleration. Too bad. You have to run what’s in the ISR. Which means it could be underpowered. Or incapable of doing things like crypto acceleration for VPNs, which is important for building a mesh of encrypted tunnels. Or maybe you need some flexibility to build a better detection platform for applications. Except you have to use IOS. Which uses NBAR. And anything you write to extend NBAR has to run on their platforms going forward. Which means you need to account for every possible permutation of hardware that IOS runs on. Which is problematic at best.

See how technical debt can creep in from the most simplistic of sources? All we wanted to do was build a platform to connect WANs together easily. Now we’re mired in a years-old hardware choice and an aging software platform that can’t help us do what needs to be done. Is it any wonder why IWAN didn’t succeed in the original form? Or why so many people involved with the first generation of SD-WAN startups were involved with IWAN, even if just tangentially?

Debt-Free Development

Now, let’s look at a startup like CloudGenix, who was a presenter at Networking Field Day 22 and was recently acquired by Palo Alto Networks. They started off on a different path when they founded the startup. They knew what they wanted to accomplish. They had a vision for what would later be called SD-WAN. But instead of shoehorning it into an existing platform, they had the freedom to build what they wanted.

No need to keep the ISR platform? Great. That means you can build on x86 hardware to make your software more universally deployable on a variety of boxes. Speaking of boxes, using commercial off-the-shelf (COTS) equipment means you can buy some very small devices to run the software. You don’t need a system designed to use ATM modules or T1 connections. If all your little system is ever going to use is Ethernet there’s no reason to include expansion at all. Maybe USB for something like a 4G/LTE modem. But those USB ports are baked into the board already.

A little side note here that came from Olivier Huynh Van of Gluware. You know the USB capabilities on a Cisco ISR? Yeah, the ISR chipset didn’t support USB natively. And it’s almost impossible to find USB that isn’t baked into an x86 board today. So Cisco had to add it to the ISR in a way that wasn’t 100% spec-supported. It’s essentially emulated in the OS. Which is why not every USB drive works in an ISR. Take that for what it’s worth.

Back to CloudGenix. Okay, so you have a platform you can build on. And you can build software that can run on any x86 device with Ethernet ports and USB devices. That means your software doesn’t need to do complicated things. It also means there are a lot of methods already out there for programming network operating systems for x86 hardware, such as Intel’s Data Plane Development Kit (DPDK). However CloudGenix chose to build their OS, they didn’t need to build everything completely from scratch. Even if they chose to do it there are still a ton of resources out there to help them get started. Which means you don’t have to restart your development every time you need to add a feature.

Also, the focus on building the functions you want into an OS you can bend to your needs means you don’t need to rely on other teams to build pieces of it. You can build your own GUI. You can make it look however you want. You can also make it operate in a manner that is easiest for your customer base. You don’t need to include every knob or button or bell and whistle. You can expose or hide functions as you wish. Don’t want customers to have tons of control over VPN creation or certificate authentication? You don’t need to worry about the GUI team exposing it without your permission. Simple and easy.

One other benefit of developing on platforms without technical debt? It’s easy to port your software from physical to virtual. CloudGenix was already successful in porting their software from physical hardware to the cloud thanks to CloudBlades. Could you imagine trying to get the original Cisco IWAN running in a cloud package for AWS or Azure? If those hives aren’t going crazy right now I’m sure you must have nerves of steel.


Tom’s Take

Technical debt is no joke. Every decision you make has consequences. And they may not be apparent for this generation of products. People you may never meet may have to live with your decisions as they try to build their vision. Sometimes you can work with those constraints. But more often than not brilliant people are going to jump ship and do it on their own. Not everyone is going to succeed. But for those that have the vision and drive and turn out something that works the rewards are legion. And that’s more than enough to pay off any debts, technical or not.

AI and Trivia

I didn’t get a chance to attend Networking Field Day Exclusive at Juniper NXTWORK 2019 this year but I did get to catch some of the great live videos that were recorded and posted here. Mist, now a Juniper Company, did a great job of talking about how they’re going to be extending their AI-driven networking into the realm of wired networking. They’ve been using their AI virtual assistant, named “Marvis”, for quite a while now to solve basic wireless issues for admins and engineers. With the technology moving toward the copper side of the house, I wanted to talk a bit about why this is important for the sanity of people everywhere.

Finding the Answer

Network and wireless engineers are walking storehouses of useless trivia knowledge. I know this because I am one. I remember the hello and dead timers for OSPF on NBMA networks. I remember how long it takes BGP to converge or what the default spanning tree bridge priority is for a switch. Where some of my friends can remember the batting average for all first basemen in the league in 1971, I can instead tell you all about LSA types and the magical EIGRP equation.

Why do we memorize this stuff? We live in a world with instant search at our fingertips. We can find anything we might need thanks to the omnipotent Google Search Box. As long as we can avoid sponsored results and ads we can find the answer to our question relatively quickly. So why do we require people to memorize esoteric trivia? Is it so we can win free drinks at the bar after we’re done troubleshooting?

The problem isn’t that we have to know the answer. It’s that we need to know the answer in order to ask the right question. More often than not we find ourselves stuck in the initial phase of figuring out the problem. The results are almost always the same – things aren’t working. Finding the cause isn’t always easy though. We have to find some nugget of information to latch onto in order to start the process.

One of my old favorites was trying to figure out why a network I was working with had a segmented spanning tree. One side of the network was working just fine but there were three switches daisy-chained together that didn’t. Investigations turned up very little. Google searches were failing me. It wasn’t until I keyed in on a couple of differences that I found out that I had improperly used a BPDU filtering command because of a scoping issue. Sure, it only took me two hours of searching to find it after I discovered the problem. But if I hadn’t memorized the BPDU filtering and guard commands and their behavior I wouldn’t have even known to ask about them. So it’s super important to know how every minutia of every protocol works, right?

Presenting the Right Questions

Not exactly. We, as human computers, memorize the answers so we can more efficiently search through our mental database for the right one. If the problem takes 5 minutes to present we can eliminate a bunch of causes. If it’s happening in layer 3 and not layer 2 we can toss out a bunch of other stuff. Our knowledge is allowing us to discard useless possibilities and focus on the right result.

And it’s horribly inefficient. I can attest to that given my various attempts to learn OSPF hello and dead timers through osmosis by falling asleep on my big CCNP Routing book. The answers don’t crawl off the page and into your brain no matter how loudly you snore into it. So I spent hours learning something that I might use two or three times in my career. There has to be a better way.

Not coincidentally, that’s where the AI-driven systems from Mist, and now Juniper, come into play. Marvis is wonderful at looking at symptoms and finding potential causes. It’s what we do as humans. Except Marvis has no inherent biases. It also doesn’t misremember the values for a given protocol or get confused about whether OSPF point-to-point networks are broadcast or not. Marvis just knows what it was programmed with. But it does learn.

Learning is the key to how these AI and machine learning (ML) driven systems have to operate. People tend to discount solutions because they think there’s no way it could be that solution this time. For example, a haiku:

It’s not DNS.
Could it be DNS?
It was DNS.

DNS is often the cause of our problems even if we usually discount it out of hand in the first five minutes of troubleshooting. Even if it was only DNS 50% of the time we would still toss out DNS as a root cause within the first five minutes because we’ve “trained” our brains to know what a DNS problem looks like without realizing how many things DNS can really affect.

But AI and ML don’t make these false correlations. Instead, they learn every time what the cause was. They can look at the network and see the failure state, present options based on the symptoms, and even if you don’t check in your changes they can analyze the network and figure out what change caused everything to start working again. Now, the next time the problem crops up, a system like Marvis can present you with a list of potential solutions with confidence levels. If DNS is at the top of the list, you might want to look into DNS first.
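
A toy version of that learning loop, purely illustrative and nothing like the real Marvis internals, is mostly bookkeeping: record which root cause actually fixed each symptom and surface the historical hit rate as a confidence score.

    from collections import Counter, defaultdict

    class CauseRanker:
        def __init__(self):
            self.history = defaultdict(Counter)  # symptom -> counts of confirmed root causes

        def record(self, symptom, root_cause):
            self.history[symptom][root_cause] += 1

        def suggest(self, symptom):
            seen = self.history[symptom]
            total = sum(seen.values()) or 1
            return [(cause, round(count / total, 2)) for cause, count in seen.most_common()]

    ranker = CauseRanker()
    for cause in ["DNS", "DNS", "duplex mismatch", "DNS", "expired certificate"]:
        ranker.record("users can't reach the app", cause)
    print(ranker.suggest("users can't reach the app"))
    # [('DNS', 0.6), ('duplex mismatch', 0.2), ('expired certificate', 0.2)]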

AI is going to make us all better troubleshooters because it’s going to make us all less reliant on poor memory. Instead of misremembering how a protocol should be configured, AI and ML will tell us how it should look. If something is causing routing loops or if layer 2 issues are happening because of duplex mismatches we’ll be able to see that quickly and have confidence it’s the right answer instead of just guessing and throwing things at the wall until they stick. Just like Google has supplanted the Cliff Clavin people at the bar that are storehouses of useless knowledge, so too will AI and ML reduce our dependence on know-it-alls that may not have all the answers.


Tom’s Take

I’m ready to be forgetful. I’m tired of playing “stump the chump” in troubleshooting with the network playing the part of the stumper and me playing the chump. I’ve memorized more useless knowledge than I ever care to recall in my life. But I don’t want to have to do the work any longer. Instead, I want to apply my gifts to training algorithms with more processing power than me to do all the heavy lifting. I’m more than happy to look up DNS and EIGRP timers rather than try to remember whether MTU and reliability are part of the K-values for this network.

The End of SD-WAN’s Party In China

As I was listening to Network Break Episode 257 from my friends at Packet Pushers, I heard Greg and Drew talking about a new development in China that could be the end of SD-WAN’s big influence there.

China has a new policy in place, according to Axios, that enforces a stricter cybersecurity stance for companies. Companies doing business in China or with offices in China must now allow Chinese officials to get into their networks to check for security issues as well as verifying the supply chain for network security.

In essence, this is saying that Chinese officials can have access to your networks at any time to check for security threats. But the subtext is a little less clear. Do they get to control the CPE as well? What about security constructs like VPNs? This article seems to indicate that as of January 1, 2020, there will be no intra-company VPNs authorized by any companies in China, whether Chinese or foreign businesses in China.

Tunnel Collapse

I talked with a company doing some global SD-WAN rollouts that included China all the way back in 2018. One of the things that was brought up in that interview was that China was an unknown for American companies because of the likelihood that the model would change in the future. MPLS is the current go-to connectivity for branch offices. However, because you can put an SD-WAN head-end unit there and build an encrypted tunnel back to your overseas HQ it wasn’t a huge deal.

SD-WAN is a wonderful way to ensure your branches are secure by default. Since CPE devices “phone home” and automatically build encrypted tunnels back to a central location, such as an HQ, you can be sure that as soon as the device powers on and establishes global connectivity that all traffic will be secure over your VPN until you change that policy.

Now, what happens with China’s new policy? All traffic must transit outside of a VPN. Things like web traffic aren’t as bad but what about email? Or traffic destined for places like AWS or Azure? It was an unmentioned fact that using SD-WAN VPNs to transit through the content filters in place in China was a way around issues that might arise from accessing resources inside of a very well secured country-wide network.

With the policy change and enforcement guidelines set forth to be enacted in 2020, this could be a very big deal for companies hoping to use SD-WAN in China. First and foremost, you can’t use your intra-company VPN functions any longer. That effectively means that your branch office can’t connect to the HQ or the rest of your corporate network. Given some of the questions around intellectual property issues in China that might not be a bad thing. However, it is going to cause issues for your users trying to access the mail and other support services. Especially if they are hosted somewhere that is going to create additional scrutiny.

The other potential issue is whether or not Chinese officials are even going to allow you to use CPE of your own choosing in the future. If the mandate is that officials should have access to your network for security concerns, who is to say they can’t just dictate what CPE you should use in order to facilitate that access? Larger companies can probably negotiate for some kind of on-site server that does network scanning. But smaller branches are likely going to need to have an all-in-one device at the head end doing all the work. The additional benefit for the Chinese is that control of the head-end CPE ensures that you can’t build a site-to-site VPN anywhere.

Peering Into The Future

Greg and Drew pontificate a bit on what this means for organizations from foreign countries doing business in China going forward. I tend to agree with them on a few points. I think you’re going to see a push for major companies to treat their Chinese offices like zero-trust endpoints. All communications will exchange minimal information. Networks won’t be directly connected, either by VPN substitute or otherwise.

Looking further down the road makes the plans even more murky. Is there a way that you can certify yourself to have a standard for cybersecurity? We have something similar with regulations here in the US, where we can submit compliance reports for various agencies and submit to audits or have audits performed by third parties. But if the government won’t take that as an answer how do you even go about providing the level of detail they want? If the answer is “you can’t”, then the larger discussion becomes whether or not you can comply with their regulations and reduce your business exposure while still making money in this market. And that’s a conversation no technology can solve.


Tom’s Take

SD-WAN gives us a wonderful set of features included in the package. Things like application inspection are wonderful to look at on a dashboard but I’ve always been a bigger fan of the automatic VPN service. I like knowing that as soon as I turn up my devices they become secure endpoints for all my traffic. Alas, all the technology in the world can be defeated by business or government regulation. If the rules say you can’t have a feature, you either have to play by the rules or quit playing the game. It’s up to businesses to decide how they’ll react going forward. But SD-WAN’s greatest feature may now have to be an unchecked box on that dashboard.

The Confluence of SD-WAN and Microsegmentation

If you had to pick two really hot topics in the networking space right now, you’d be hard-pressed to find two more discussed than SD-WAN and microsegmentation. SD-WAN is the former “king of the hill” in network engineering. I can remember having more conversations about SD-WAN in the last couple of years than anything else. But as the SD-WAN market has started to consolidate and iterate, a new challenger has arrived. Microsegmentation is the word of the day.

However, I think that SD-WAN and microsegmentation are quickly heading toward a merger of ideas and solutions. There are a lot of commonalities between the two technologies that make a lot of sense running together.

SD-WAN isn’t just about packet switching and routing any longer. That’s because networking people have quickly learned that packet-by-packet processing of traffic is inefficient. All of our older network analysis devices could only see things one IP packet at a time. But the new wave of devices think in terms of flows. They can analyze a stream of packets to figure out what’s going on. And what generates those flows?

Applications.

The key to the new wave of SD-WAN technology isn’t some kind of magic method of nailing up VPNs between branch offices. It’s not about adding new connectivity types. Instead, it’s about application identification. App identification is how SD-WAN does QoS now. The move to using app markers means a more holistic method of treating application traffic properly.
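
In rough sketch form (a toy illustration, not any vendor’s engine), “thinking in flows” just means aggregating packets on their 5-tuple and then hanging an application label on the flow:

    from collections import defaultdict

    def packets_to_flows(packets):
        """Group packets by 5-tuple. Each packet is a dict with src, dst, sport, dport, proto, size."""
        flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
        for pkt in packets:
            key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])
            flows[key]["packets"] += 1
            flows[key]["bytes"] += pkt["size"]
        return flows

    def classify(flow_key):
        """Toy mapping by destination port; real app identification inspects far more than this."""
        port_map = {443: "web/SaaS", 5060: "voice signaling", 445: "file services"}
        return port_map.get(flow_key[3], "unclassified")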

SD-WAN has significant value in application handling. I recently chatted with Kumar Ramachandran of CloudGenix and he echoed that part of the reason they’ve been seeing growth and recently received a Series C funding round is what they’re doing with applications. The battle of MPLS versus broadband has already been fought. The value isn’t going to come from edge boxes unless there is software that can help differentiate the solutions.

Segmenting Your Traffic

So, what does this have to do with microsegmentation? If you’ve been following that market, you already know that the answer is the application. Microsegmentation doesn’t work on a packet-by-packet basis either. It needs to see all the traffic flows from an application to figure out what is needed and what isn’t. Platforms that do this kind of work are big on figuring out which protocols should be talking to which hosts and shutting everything else down to secure that communication.
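
The policy model underneath most of these platforms is conceptually tiny: describe the flows the application actually needs and deny everything else. A hypothetical three-tier app might look like this (illustrative sketch only):

    # Only the conversations the application needs; everything else is implicitly denied.
    ALLOWED_FLOWS = {
        ("web-tier", "app-tier", "tcp", 8443),
        ("app-tier", "db-tier",  "tcp", 5432),
        ("app-tier", "dns",      "udp", 53),
    }

    def permit(src_group, dst_group, proto, dst_port):
        return (src_group, dst_group, proto, dst_port) in ALLOWED_FLOWS

    print(permit("web-tier", "db-tier", "tcp", 5432))  # False: web servers never talk to the database directly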

Microsegmentation is growing in the cloud world for sure. I’ve seen and talked to people from companies like Guardicore, Illumio, ShieldX, and Edgewise in recent months. Each of them has a slightly different approach to doing microsegmentation. But they all take the same basic approach from the start. The application is the basic building block of their technology.

With the growth of microsegmentation in the cloud market to help ensure traffic flows between hosts and sites is secured, it’s a no-brainer that the next big SD-WAN platform needs to add this functionality to their solution. I say this because it’s not that big of a leap to take the existing SD-WAN application analytics software that optimizes traffic flows over links and change it to restrict traffic flow with policy support.

For SD-WAN vendors, it’s another hedge against the inexorable march of traffic into the cloud. There are only so many Direct Connect analogs that you can build before Amazon decides to put you out of business. But, if you can integrate the security aspect of application analytics into your platform you can make your solution very sticky. Because that functionality is critical to meeting audit goals and ensuring compliance. And you’re going to wish you had it when the auditors come calling.


Tom’s Take

I don’t think the current generation of SD-WAN providers are quite ready to implement microsegmentation in their platforms. But I really wouldn’t be surprised to see it in the next revision of solutions. I also wonder if that means that some of the companies that have already purchased SD-WAN companies are going to look at that functionality. Perhaps it will be VMware building NSX microsegmentation on top of VeloCloud. Or maybe Cisco will include some of their microsegmentation from ACI in Viptela. They’re going to need to look at that strongly, because once the companies that are still on their own figure it out they’re going to be the go-to solution for companies looking to provide a good, secure migration path to the cloud. And all those roads lead to an SD-WAN device with microsegmentation capabilities.

QoS Is Dead. Long Live QoS!

Ah, good old Quality of Service. How often have we spent our time as networking professionals trying to discern the archaic texts of Szigeti to learn how to make you work? QoS is something that seemed so necessary to our networks years ago that we would spend hours upon hours trying to learn the best way to implement it for voice or bulk data traffic or some other reason. That was, until a funny thing happened. Until QoS became useless to us.

Rest In Peace and Queues

QoS didn’t die overnight. It didn’t wake up one morning without a home to go to. Instead, we slowly devalued and destroyed it over a period of years. We did it by focusing on the things that QoS was made for and then marginalizing them. Remember voice traffic?

We spent years installing voice over IP (VoIP) systems in our networks. And each of those systems needed QoS to function. We took our expertise in the arcane arts of queuing and applied it to the most finicky protocols we could find. And it worked. Our mystic knowledge made voice better! Our calls wouldn’t drop. Our packets arrived when they should. And the world was a happy place.

That is, until voice became pointless. When people started using mobile devices more and more instead of their desk phones, QoS wasn’t as important. When the steady stream of delay-sensitive packets moved to LTE instead of the IP network it wasn’t as critical to ensure that FTP and other protocols in the LAN didn’t interfere with it. Even when people started using QoS on their mobile devices the marking was totally inconsistent. George Stefanick (@WirelesssGuru) found that Wi-Fi calling was doing some weird packet marking anyway:

So, without a huge packet generation issue, QoS was relegated to some weird traffic shaping roles. Maybe it was video prioritization in places where people cared about video? Or perhaps it was creating a scavenger class for traffic in order to get rid of unwanted applications like BitTorrent. But overall QoS languished as an oddity as more and more enterprises saw their collaboration traffic moving to be dominated by mobile devices that didn’t need the old dark magic of QoS.

QoupS de Gras

The real end of QoS came about thanks to the cloud. While we spent all of our time trying to find ways to optimize applications running on our local enterprise networks, developers were busy optimizing applications to run somewhere else. The ideas were sound enough in principle. By moving applications to the cloud we could continually improve them and push features faster. By keeping all the bits off the local network we could scale massively. We could even collaborate together in real time from anywhere in the world!

But applications that live in the cloud live outside our control. QoS was always bounded by the borders of our own networks. Once a packet was launched into the great beyond of the Internet we couldn’t control what happened to it. ISPs weren’t bound to honor our packet markings without an SLA. In fact, in most cases the ISP would remark all our packets anyway just to ensure they didn’t mess with the ISP’s ideas of traffic shaping. And even those were rudimentary at best given how well QoS plays with MPLS in the real world.
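
That remarking problem is easy to demonstrate. Any application can ask for a DSCP value on its own socket (here’s a Linux/macOS Python example marking a datagram as Expedited Forwarding toward a documentation-range address), but nothing past your own edge is obligated to honor it:

    import socket

    EF = 46  # Expedited Forwarding, the classic voice marking
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # DSCP lives in the top six bits of the old TOS byte, hence the shift left by two.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EF << 2)
    sock.sendto(b"pretend this is an RTP packet", ("198.51.100.10", 4000))
    sock.close()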

But cloud-based applications don’t worry about quality of service. They scale as large as you want. And nothing short of a massive cloud outage will make them unavailable. Sure, there may be some slowness here and there but that’s nothing less than you’d expect to receive running a heavy application over your local LAN. The real genius of the cloud shift is that it forced developers to slim down applications and make them more responsive in places where they could be made to be more interactive. Now, applications felt snappier when they ran in remote locations. And if you’ve ever tried to use old versions of Outlook across slow links you know how critical that responsiveness can be.

The End is The Beginning

So, with cloud-based applications here to stay and collaboration all about mobile apps now, we can finally carve the tombstone for QoS right? Well, not quite.

As it turns out, we are still using lots and lots of QoS today in SD-WAN networks. We’re just not calling it that. Instead, we’ve upgraded the term to something more snappy, like “Application Visibility”. Under the hood, it’s not much different than the QoS that we’ve done for years. We’re still picking out the applications and figuring out how to optimize their traffic patterns to make them more responsive.

The key with the new wave of SD-WAN is that we’re marrying QoS to conditional routing. Now, instead of being at the mercy of the ISP link to the Internet we can do something else. We can push bulk traffic across slow cheap links and ensure that our critical business applications have all the space they want on the fast expensive ones instead. We can push our out-of-band traffic out of an attached 4G/LTE modem. We can even push our traffic across the Internet to a gateway closer to the SaaS provider with better performance. That last bit is an especially delicious piece of irony, since it basically serves the same purpose as Tail-end Hop Off did back in the voice days.
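
A cartoon version of that conditional routing logic (made-up link names and numbers, nothing vendor-specific) is just a policy lookup per application class:

    LINKS = [
        {"name": "mpls",      "up": True, "cost_per_gb": 8.00,  "latency_ms": 20},
        {"name": "broadband", "up": True, "cost_per_gb": 0.50,  "latency_ms": 35},
        {"name": "lte",       "up": True, "cost_per_gb": 15.00, "latency_ms": 60},
    ]

    def pick_link(app_class, links=LINKS):
        """Realtime apps get the lowest-latency path, bulk gets the cheapest,
        out-of-band management rides the LTE modem if it's available."""
        usable = [l for l in links if l["up"]]
        if app_class == "realtime":
            return min(usable, key=lambda l: l["latency_ms"])
        if app_class == "management":
            return next((l for l in usable if l["name"] == "lte"), usable[0])
        return min(usable, key=lambda l: l["cost_per_gb"])

    print(pick_link("realtime")["name"])   # mpls
    print(pick_link("bulk")["name"])       # broadband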

And how does all this magical new QoS work on the Internet outside our control? That’s the real magic. It’s all tunnels! Yes, in order to make sure that we get our traffic where it needs to be in SD-WAN we simply prioritize it going out of the router and wrap it all in a tunnel to the next device. Everything moves along the Internet and the hop-by-hop treatment really doesn’t matter in the long run. We’re instead optimizing transit through our network based on other factors besides DSCP markings. Sure, when the traffic arrives on the other side it can be optimized based on those values. However, in the real world the only thing that most users really care about is how fast they can get their application to perform on their local machine. And if SD-WAN can point them to the fastest SaaS gateway, they’ll be happy people.


Tom’s Take

QoS suffered the same fate as Ska music and NCIS. It never really went away even when people stopped caring about it as much as they did when it was the hot new thing on the block. Instead, the need for QoS disappeared when our traffic usage moved away from what it was designed to augment. Sure, SD-WAN has brought it back in a new form, QoS 2.0 if you will, but the need for what we used to spend hours of time doing with ancient tomes of knowledge is long gone. We should have a quiet service for QoS and acknowledge all that it has done for us. And then get ready to invite it back to the party in the form that it will take in the cloud future of tomorrow.