Cisco and Viptela – The Price of Development Debt

Cisco finally pulled themselves into the SD-WAN market by acquiring Viptela on Monday. Viptela was considered to be one of, if not the leading SD-WAN vendor in the market. That Cisco decided to pick them as an acquisition target isn’t completely surprising. But one might wonder why?

IWANna New Debt

Cisco’s premier strategy for SD-WAN up until last week was IWAN. This is their catch-all solution designed to take the various component pieces being offered by SD-WAN solutions and replicate them on Cisco hardware. IWAN has served as a vehicle for Cisco to push things like the APIC-EM solution, Cisco ONE licensing, and a variety of other enhanced technologies like NBAR and PfR.

Cisco has packaged these technologies together because they have spent a couple of decades building these protocols up to be the best at what they do in the industry. NBAR was the key to application QoS years ago. PfR and OER were the genesis of Cisco having the ability to intelligently route packets to destinations. These protocols have formed the cornerstone of their platform for many, many years.

So why is IWAN such a mess? If you have best-of-breed technology built into a router that makes packets fly across the Internet at lightning speed, how is it that companies like Viptela were eating Cisco's lunch in the SD-WAN space? It's because those same best-of-breed protocols are to blame for the jigsaw puzzle of IWAN.

If you are the product manager for a protocol like NBAR or PfR, you want it to be adopted by as many people as possible. Wide adoption guarantees you're going to have a job tomorrow or even next year. The people working on EIGRP and OSPF are safe. But if you get left behind technologically, you're in for rough seas. Just ask the folks that managed LANE. But if you can attach yourself to a movement that's got some steam, you're in the driver's seat.

At the same time, you want your protocol or product to be the best at what it does. And sometimes being the best means you don’t compromise. That’s great when you are the only thing running on the system. But when you’re trying to get protocols to work together to create something bigger, you often find that compromises are not just a good idea, they’re necessary. But how do you handle it when the product manager for NBAR and the product manager for IP SLA get into a screaming match over who is going to blink first?

Using existing protocols and products is a great idea because it means you don’t have to reinvent the wheel every time you design something. But, with that wheel comes the technical debt of development. Given the chance to reuse something that thousands, if not millions, of dollars of R&D has gone into, companies like Cisco will jump at the chance to get some more longevity out of a protocol.

Not Pokey, But Gumby

Now, let's look at a scrappy startup like Viptela. They have to build their protocols from the ground up. Maybe they have the opportunity to leverage some open source projects or some basic protocol implementations to get off the ground. That means they are starting from essentially square one. It also means they are starting off with very little technical and development debt.

When Viptela builds their application monitoring stack or their IPSec VPN stack, they aren’t trying to build the best protocol for every possible situation that could ever be encountered by a wide variety of customers. They are just trying to build a protocol that works. And not just a protocol that works on its own. They want a protocol that works with everything else they are building.

When you’re forced to do everything from scratch, you find that you avoid making some of the same choices that you were forced to make years ago. The lack of technical and development debt also means you can take a new direction with things. Don’t want to support pre-shared key IPSec VPNs? Don’t build it into the protocol. Don’t care to have some of the quirks of PfR? Build something different that meets your needs. You have complete control.

Flexibility is why SD-WAN vendors were able to dominate the market for the past two years. They were able to adapt and change quickly because they didn't need to keep trying to make systems integrate on top of the tech and dev debt they incurred during the product lifecycle. That lets them concentrate on features that customers want, not on trying to integrate features that management has decreed must be included because the product manager was convincing in the last QBR.


Tom’s Take

In the end, the acquisition of Viptela by Cisco was as much about reduction of technical and development debt in their SD-WAN offerings as it was trying to get ahead in the game. They needed something that could be used as-is without the need to rely on any internal development processes. I alluded to this during our Network Collective Off-The-Cuff show. Without the spin-out model available any longer, Cisco is going to have to start making tough decisions to get things like this done. Either those decisions are made via reduction of business units without integration or through larger dollar signs to acquire solutions to provide the cohesion they need.

Building Reliability

Systems are inherently reliable. Until they aren't. On a long enough timeline, even the most reliable system will eventually fail. How you manage that failure says a lot about the way you build your system or application. So why is it that we're so focused on never failing?

Ten Feet Tall And Bulletproof

No system is infallible. Networks go down. Cloud services get knocked offline. Even Facebook, which represents “the Internet” for a large number of people, has days when it’s unreachable. When we examine these outages, we often find issues at the core of the system that cause services to be unreachable. In the most recent case of Amazon’s cloud system, it was a typo in a script that executed faster than it could be stopped.
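Here's a back-of-the-napkin illustration of the kind of guard rail that keeps a fat-fingered input from executing fleet-wide before anyone can stop it. This is a sketch of the general idea, not Amazon's actual tooling; the function, hosts, and numbers are all made up.

```python
# Hypothetical guard for a destructive maintenance script: refuse to touch
# more than a small fraction of the fleet in one pass, no matter what the
# operator (or a typo) asked for. Names and numbers are invented.
def decommission(requested_hosts, fleet_size, max_fraction=0.05):
    limit = max(1, int(fleet_size * max_fraction))
    if len(requested_hosts) > limit:
        raise SystemExit(
            f"refusing to remove {len(requested_hosts)} hosts; "
            f"cap is {limit} per run, so rerun in smaller batches"
        )
    for host in requested_hosts:
        print(f"would decommission {host}")  # dry-run output only

# decommission(["web-01", "web-02"], fleet_size=500)
```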

It could also be a failure of the system to anticipate increased loads when minor failures happen. If systems aren’t built to take on additional load when the worst happens, you’re going to see bigger outages. That is a particular thorn in the side of large cloud providers like Amazon and Google. It’s also something that network architects need to be aware of when building redundant pathways to handle problems.
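What does "built to take on additional load" actually mean? A rough sketch is simple N+1 math: ask whether the surviving links can absorb peak traffic when any one of them fails. The capacities and utilization ceiling below are illustrative assumptions, not a design rule.

```python
def survives_single_failure(link_capacities_gbps, peak_load_gbps, max_util=0.8):
    """Return True if the surviving links can carry peak load after any one
    link fails, while staying under a utilization ceiling. Illustrative only;
    real designs also consider failover timing and how ECMP spreads flows."""
    for failed in range(len(link_capacities_gbps)):
        remaining = sum(c for i, c in enumerate(link_capacities_gbps) if i != failed)
        if peak_load_gbps > remaining * max_util:
            return False
    return True

# Three 10G uplinks carrying 18G at peak: losing one leaves 20G, but at an
# 80% ceiling only 16G is usable, so this design fails the check.
print(survives_single_failure([10, 10, 10], 18))      # False
print(survives_single_failure([10, 10, 10, 10], 18))  # True
```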

Take, for example, a recent demo during Aruba Atmosphere 2017. During the Day 2 keynote, CTO Partha Narasimhan wowed the crowd in the room when he disclosed that they had been doing a controller upgrade during the morning talk. Users had been tweeting, surfing, and using the Internet without much notice from anyone aside from the most technical wireless minds in the room. Even they could only see some strange AP roaming behavior as an indicator of the controllers upgrading the APs.

Aruba showed that they built a resilient network that could survive a simulated major outage caused by a rolling upgrade. They've done everything they can to ensure uptime no matter what happens. But the bigger question for architects and engineers is “why are we solving the problem for others?”

Why Dodge Bullets When You Don’t Have To?

As amazing as it is to build a system that can survive production upgrades with no impact on users, what are we really building when we create these networks? Are we encouraging our users to respect our technology advantage in the network or other systems? Are we telling our application developers that they can count on us to keep the lights on when anything goes wrong? Or are we instead sending the message that we will keep scrambling to prevent issues in applications from being noticeable?

Building a resilient network is easy. Making something reliable isn't rocket science. But creating a network that is going to stay up for a long, long time without any outages is very expensive and process intensive. Engineering something to never be down requires layers of exception handling and backup systems that are as reliable as their primary counterparts.

A favorite story from the storage world involves recovery. When you initially ask a customer what their recovery point objective (RPO) is in a system, the answer is almost always “zero” or “as low as you can make it”. When the numbers are put together to include redundant or dual-active systems with replication and data assurance, the price tag of the solution is usually enough to start a new round of discussion along the lines of “how reliable can you make it for this budget?”
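To make that sticker shock concrete, here's a toy calculation of what an asynchronous replication interval means in terms of data at risk. The interval and write rate are invented for the example.

```python
def worst_case_data_loss_mb(replication_interval_min, write_rate_mb_per_min):
    """Rough worst case for asynchronous replication: everything written
    since the last replication cycle is at risk. Numbers are invented."""
    return replication_interval_min * write_rate_mb_per_min

# Replicating every 15 minutes at 200 MB/min of writes puts roughly 3 GB at
# risk. Driving that toward zero means synchronous, dual-active replication,
# which is exactly where the price tag in the story comes from.
print(worst_case_data_loss_mb(15, 200), "MB at risk")
```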

In the networking and systems world, we don't have the luxury of sticker shock when it comes to creating reliability. Storage systems can have longer RPOs because lost data is gone forever, and taking the time to ensure it is properly recovered is important. But data in transit can be retransmitted. That's at the heart of TCP. So it's expected that networks have near-instantaneous RPOs at no extra cost. If you don't believe that, ask yourself what happens when you tell someone the network went down because there's only one router or switch connecting devices together.

Instead of making systems ultra-reliable and absolving users and developers from thought, networks and systems should be as reliable as they can be made without huge additional costs. That reliability should be stated emphatically without wiggle room. These constraints should inform developers writing code so that exception handling can be built in to prevent issues when the inevitable outage occurs. Knowing your limitations is the first step to creating an atmosphere to overcome them.
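Here's what "exception handling built in" looks like in its simplest form: the application retries around a stated, bounded network failure instead of assuming the network never blinks. This is a minimal sketch using only the Python standard library, with a placeholder URL.

```python
import time
import urllib.request

def fetch_with_retries(url, attempts=4, base_delay=0.5, timeout=2.0):
    """Fetch a URL, retrying with exponential backoff instead of assuming
    the network never fails. The URL and timings are placeholders."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            time.sleep(base_delay * (2 ** attempt))  # wait 0.5s, 1s, 2s, ...

# data = fetch_with_retries("https://internal-api.example.com/health")
```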

A lesson comes from the programmers of old. When you have a limited amount of RAM, storage, or compute cycles, you can write very tight code. DOS programs didn’t need access to a cloud worth of compute. Mainframes could execute programs written on punch cards. The limitations were simple and could be overcome with proper problem solving. As compute and memory resources have exploded, so too have code bases. Rather than giving developers the limitless capabilities of the cloud without restraint, perhaps creating some limits is the proper way to ensure that reliability stays in the app instead of being bolted on to the network.


Tom’s Take

We had a lot of fun recording this roundtable. We talked about Aruba's controller upgrade and building reliable wireless networks. But I think we also need to make sure we're aware that continually creating protocols and other constructs in the underlay won't solve application programming problems. Things like vMotion set networking and application development back a decade. Giving developers a magic solution to avoid building proper exception handling doesn't make better developers. Instead, it puts the burden of uptime back on the networking team. And we would rather build the best network we can instead of building something that can solve every problem that could ever possibly be created.

Sorting Through SD-WAN


SD-WAN has finally arrived. We're no longer talking about it in terms of whether or not it is a thing that's going to happen, but a thing that will happen provided the budgets are right. But while the concept of SD-WAN is certain, one must start to wonder about what's going to happen to the providers of SD-WAN services.

Any Which Way You Can

I've written a lot about SDN and SD-WAN. SD-WAN is the best example of how SDN should be marketed to people. Instead of talking about features like APIs, orchestration, and programmability, you need to focus on the right hook. Do you sell a food processor by talking about how many attachments it has? Or do you sell a Swiss Army knife by talking about all the crazy screwdrivers it holds? Or do you simply boil it down to “This thing makes your life easier”?

The most successful companies have made the “easier” pitch the way forward. Throwing a kitchen sink at people doesn't make them buy a whole kitchen. But showing them how easy and automated you can make installation and management will sell boxes by the truckload. You have to appeal to the pain that SD-WAN was created to solve: WANs are hard, and SD-WAN makes them easy.

But that only works if your SD-WAN solution is easy in the first place. The biggest, most obvious target is Cisco IWAN. I will be the first to argue that the reason Cisco hasn't captured the SD-WAN market is because IWAN isn't SD-WAN. It's a series of existing technologies that were brought together to try and make an SD-WAN competitor. IWAN has all the technical credibility of a laboratory full of amazing machine parts. What it lacks is any kind of ability to tie all that together easily.

IWAN is a moving target. Which platform should I use? Do I need this software to make it run correctly? How do I do zero-touch deployments? Or traffic control? How do I plug a 4G/LTE modem into the router? The answer to each of these questions involves typing commands or buying additional software features. That's not the way to attack the complexity of WANs. In fact, it feeds into that complexity even more.

Cisco needs to look at a true SD-WAN technology. That likely means acquisition. Sure, it’s going to be a huge pain to integrate an acquisition with other components like APIC-EM, but given the lead that other competitors have right now, it’s time for Cisco to come up with a solution that knocks the socks off their longtime customers. Or face the very real possibility of not having longtime customers any longer.

Every Which Way But Loose

The first-generation providers of SD-WAN bounced onto the scene to pick up the pieces from IWAN. Names like Viptela, VeloCloud, CloudGenix, Versa Networks, and more. But, aside from all managing to build roughly the same platform with very similar features, they've hit a mighty big wall. They need to start making money in order for these gambles to pay off. Some have customers. Others are managing the migration into other services, like tailoring their offerings toward service providers. Still others are ripe acquisition targets for companies that lack an SD-WAN strategy, like HPE or Dell. I expect to see some fallout from the first-generation providers consolidating this year.

The second generation providers, like Riverbed and Silver Peak, all have something in common. They are building on a business they’ve already proven. It’s no coincidence that both Riverbed and Silver Peak are the most well-known names in WAN optimization. How well known? Even major Cisco partners will argue that they sell these two “best of breed” offerings over Cisco’s own WAAS solution. Riverbed and Silver Peak have a definite advantage because they have a lot of existing customers that rely on WAN optimization. That market alone is going to net them a significant number of customers over the next few years. They can easily sell SD-WAN as the perfect addition to make WAN optimization even easier.

The third category of SD-WAN providers is the latecomers. I still can't believe it, but I've been reading about providers that aren't traditional companies trying to get into the space. Talk about being the ninth horse in an eight-horse race. Honestly, at this point you're better off plowing your investment money into something else, like Internet of Things or Virtual Reality. There's precious little room among the existing first-generation providers and the second-generation stalwarts. At best, all you can hope for is a quick exit. At worst, your “novel” technology will be snapped up for pennies after you're bankrupt and liquidating everything but the standing desks.


Tom’s Take

Why am I excited about the arrival of SD-WAN? Because now I can finally stop talking about it! In all seriousness, when the boardroom starts talking about something, that means it's past the point of being a hobby project and has become a real debate. SD-WAN is going to change one of the most irritating aspects of networking technology for us. I can remember trying to study for my CCNP and cramming all the DSL and T1 knowledge a person could fit into a brain into my head. Now, it's all point-and-click and done. IPSec VPNs, traffic analytics, and application identification are so easy it's scary. That's the power of SD-WAN to me. Easy to use and easy to extend. I think that the landscape of providers of SD-WAN technologies is going to look vastly different by the end of 2017. But SD-WAN is going to be here for the long haul.

Is The Rise Of SD-WAN Thanks To Ethernet?


SD-WAN has exploded in the market. Everywhere I turn, I see companies touting their new strategy for reducing WAN complexity, encrypting data in flight, and even doing analytics on traffic to help build QoS policies and traffic shaping for critical links. The first demo I ever watched for SDN was a WAN routing demo that chose best paths based on cost and time-of-day. It was simple then, but that kind of thinking has exploded in the last 5 years. And it’s all thanks to our lovable old friend, Ethernet.
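The logic behind that early demo is almost trivially small when you write it out. Here's a toy version of cost-and-time-of-day path selection; the link names, prices, and business-hours window are all invented for illustration.

```python
from datetime import datetime

# Invented link names, prices, and business-hours window.
links = {
    "mpls":      {"cost_per_gb": 0.40, "latency_ms": 20},
    "broadband": {"cost_per_gb": 0.05, "latency_ms": 45},
}

def pick_path(now=None):
    now = now or datetime.now()
    if 8 <= now.hour < 18:
        # Business hours: pay for the low-latency path while users are working.
        return min(links, key=lambda name: links[name]["latency_ms"])
    # Off hours: the cheapest path wins for backups and bulk transfers.
    return min(links, key=lambda name: links[name]["cost_per_gb"])

print(pick_path())
```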

Those Old Serials

When I started in networking, my knowledge was pretty limited to switches and other layer 2 devices. I plugged in the cables, and the things all worked. As I expanded up the OSI model, I started understanding how routers worked. I knew about moving packets between different layer 3 areas and how they controlled broadcast storms. This was also around the time when layer 3 switching was becoming a big thing in the campus. How was I supposed to figure out when I should be using a big router with 2-3 interfaces versus a switch that had lots of interfaces and could route just as well?

The key for me was media types. Layer 3 switching worked very well as long as you were only connecting Ethernet cables to the device. Switches were purpose built for UTP cable connectivity. That works really well for campus networks with Cat 5/5e/6 cabling. Switched Virtual Interfaces (SVIs) can handle a large amount of the routing traffic.

For WAN connectivity, routers were a must, because only routers were modular in a way that accepted cards for different media types. When I started my journey on WAN connectivity, I was setting up T1 lines. Sometimes they had an old-fashioned serial connector like this:

[Image: old-fashioned serial T1 connector]

Those connected to external CSU/DSU modules. Those were a pain to configure and had multiple points of failure. Eventually, we moved up in the world to integrated CSU/DSU modules that looked like this:

[Image: two-port T1/E1 EHWIC with integrated CSU/DSU]

Those are really awesome because all the configuration is done on the interface. They also take regular UTP cables instead of those crazy V.35 monsters.

[Image: Cisco V.35 serial cable]

But those UTP cables weren’t Ethernet. Those were still designed to be used as serial connections.

It wasn’t until the rise of MPLS circuits and Transparent LAN services that Ethernet became the dominant force in WAN connectivity. I can still remember turning up my first managed circuit and thinking, “You mean I can use both FastEthernet interfaces? No cards? Wow!”.

Today, Ethernet dominates the landscape of connectivity. Serial WAN interfaces are relegated to backwater areas where you can’t get “real WAN connectivity”. And in most of those cases, the desire to use an old, slow serial circuit can be superseded by a 4G/LTE USB modem that can be purchased from almost any carrier. It would appear that serial has joined the same Heap of History as token ring, ARCnet, and other venerable connectivity options.

Rise, Ethernet

The ubiquity of Ethernet is a huge boon to SD-WAN vendors. They no longer have to create custom connectivity options for their appliances. They can provide 3-4 Ethernet interfaces and 2-3 USB slots and cover a wide range of options. This also allows them to simplify their board designs. No more modular chassis. No crazy requirements for WIC slots, NM slots, or any other crazy terminology that Cisco WAN engineers are all too familiar with.

Ethernet makes sense for SD-WAN vendors because they aren’t concerned with media types. All their intelligence resides in the software running on the box. They’d rather focus on creating automatic certificate-based IPsec VPNs than figuring out the clock rate on a T1 line. Hardware is not their end goal. It is much easier to order a reference board from Intel and plug it into a box than trying to configure a serial connector and make a custom integration.

Even SD-WAN vendors that are chasing after the service provider market are benefitting from Ethernet ubiquity. Service providers may still run serial connections in their networks, but management of those interfaces at the customer side is a huge pain. They require specialized technical abilities. It’s expensive to manage and difficult to troubleshoot remotely. Putting Ethernet handoffs at the CPE side makes life much easier. In addition, making those handoffs Ethernet makes it much easier to offer in-line service appliances, like those of SD-WAN vendors. It’s a good choice all around.

Serial connectivity isn’t going away any time soon. It fills an important purpose for high-speed connectivity where fiber isn’t an option. It’s also still a huge part of the install base for circuits, especially in rural areas or places where new WAN circuits aren’t easily run. Traditional routers with modular interfaces are still going to service a large number of customers. But Ethernet connectivity is quickly growing to levels where it will eclipse these legacy serial circuits soon. And the advantage for SD-WAN vendors can only grow with it.


Tom’s Take

Ethernet isn't the only reason SD-WAN has succeeded. Ease of use, a huge feature set, and flexibility are the real reasons why SD-WAN has moved past the concept stage and into deployment. WAN optimization now has SD-WAN components. Service providers are looking to offer it as a value-added service. SD-WAN has won out on the merits of the technology. But the underlying hardware and connectivity was radically simplified in the last 5-7 years to allow SD-WAN architects and designers to focus on the software side of things instead of the difficulties of building complicated serial interfaces. SD-WAN may not owe its entire existence to Ethernet, but it got a huge push in the right direction for sure.

Cloud Apps And Pathways


Applications are king. Forget all the things you do to ensure proper routing in your data center. Forget the tweaks for OSPF sub-second failover or BGP optimal path selection. None of it matters to your users. If their login to Siebel or Salesforce or Netflix is slow today, you've failed. They are very vocal when it comes to telling you how much the network sucks today. How do we fix this?

Pathways Aren’t Perfect

The first problem is the cloud focus of applications. Once our packets leave our border routers, it's a giant game of chance as to how things are going to work next. The routing protocol games that govern the Internet are tried and true and straight out of RFC 1771 (yes, RFC 4271 supersedes it). BGP is a great tool with general purpose abilities. It's becoming the protocol of choice for web scale companies like LinkedIn and Facebook. But it's problematic for Internet routing. It scales well but doesn't have the ability to make rapid decisions.

The stability of BGP is also the reason why it doesn’t react well to changes. In the old days, links could go up and down quickly. BGP was designed to avoid issues with link flaps. But today’s links are less likely to flap and more likely to need traffic moved around because of congestion or other factors. The pace that applications need to move traffic flows means that they tend to fight BGP instead of being relieved that it’s not slinging their traffic across different links.

BGP can be a good source of path suggestions. That's how Facebook uses it for global routing. But decisions need to be made on top of BGP much faster. That's why cloud providers don't rely on it beyond basic connectivity. Things like load balancers and other devices make up for this as best they can, but they are also points of failure in the network and have scalability limitations. So what can we do? How can we build something that can figure out how to make applications run better without the need to replace the entire routing infrastructure of the Internet?
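The idea of making decisions on top of BGP can be sketched in a few lines: keep BGP for reachability, then choose among the feasible exits using fresh latency measurements instead of AS-path length alone. Everything in this example is made-up data and my own toy logic, not any vendor's algorithm.

```python
# Made-up data: the set of BGP-feasible exits plus recent latency probes.
feasible_exits = {
    "transit-a": {"as_path_len": 3, "probes_ms": [42, 44, 41]},
    "transit-b": {"as_path_len": 4, "probes_ms": [19, 21, 23]},
}

def best_exit(exits):
    """Prefer the lowest recent median latency; fall back to shortest AS path
    when probes are missing, which is roughly plain BGP behavior."""
    def score(item):
        name, attrs = item
        probes = sorted(attrs["probes_ms"])
        median = probes[len(probes) // 2] if probes else float("inf")
        return (median, attrs["as_path_len"])
    return min(exits.items(), key=score)[0]

print(best_exit(feasible_exits))  # "transit-b", despite the longer AS path
```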

GPS For Routing

One of the things with some potential to fix the inefficiency of BGP and other basic routing protocols was highlighted at Networking Field Day 12 during the presentation from Teridion. They have a method for creating more efficiency between endpoints thanks to their agents. Founder Elad Rave explains more here:

I like the idea of getting “traffic conditions” from endpoints to avoid congestion. For users of cloud applications, those conditions are largely unknown. Even multipath routing confuses tried-and-true troubleshooting like traceroute. What needs to happen is a way to collect the data for congestion and other inputs and make faster decisions that aren’t beholden to the underlying routing structure.

Overlay networking has tried to do this for a while now. Build something that can take more than basic input and make decisions on that data. But overlays have issues with scaling, especially past the boundary of the enterprise network. Teridion has potential to help influence routing decisions in networks outside your control. Sadly, even the fastest enterprise network in the world is only as fast as an overloaded link between two level 3 interconnects on the way to a cloud application.

Teridion has the right idea here. Alternate pathways need to be identified and utilized. But that data needs to be evaluated and updated regularly. Much like the issues with Waze dumping traffic into residential neighborhoods when major arteries get congested, traffic monitors could cause overloads on alternate links if shifts happen unexpectedly.
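Here's a hypothetical sketch of that crowdsourced approach, with a simple improvement threshold so traffic doesn't shift the moment an alternate path looks marginally better. This is my illustration of the concept, not Teridion's implementation.

```python
# Invented relay names and RTT reports from endpoints.
current_relay = "relay-us-east"
reported_rtt_ms = {"relay-us-east": 95, "relay-us-central": 70, "relay-eu-west": 140}

def choose_relay(current, rtts, improvement=0.2):
    """Only shift traffic when an alternate is clearly better (20% here),
    so flows don't stampede onto a path that is only marginally faster."""
    best = min(rtts, key=rtts.get)
    if best != current and rtts[best] < rtts[current] * (1 - improvement):
        return best
    return current

print(choose_relay(current_relay, reported_rtt_ms))  # "relay-us-central"
```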

The other reason why I like Teridion is because they are doing things without hardware boxes or the need to install software anywhere but the end host. Anyone working with cloud-based applications knows that the provider is very unlikely to provide anything outside of their standard offerings for you. And even if they do, there is going to be a huge price tag. More often than not, that feature request will, in time, become a selling point for a new service that may be of marginal benefit until everyone starts using it. Then application performance goes down again. Since Teridion is optimizing communications between hosts, it's a win for everyone.


Tom’s Take

I think Teridion is on to something here. Crowdsourcing is the best way to gather information about traffic. Giving packets a better destination with shorter travel times means better application performance. Better performance means happier users. Happier users means more time spent solving other problems that have symptoms that aren’t “It’s slow” or “Your network sucks”. And that makes everyone happier. Even grumpy old network engineers.

Disclaimer

Teridion was a presenter during Networking Field Day 12 in San Francisco, CA. As a participant in Networking Field Day 12, my travel and lodging expenses were covered by Tech Field Day for the duration of the event. Teridion did not ask for nor were they promised any kind of consideration in the writing of this post. My conclusions here represent my thoughts and opinions about them and are mine and mine alone.

 

BGP: The Application Networking Dream


There was an interesting article last week from Fastly talking about using BGP to scale their network. This was but the latest in a long line of discussions around using BGP as a transport protocol between areas of the data center, even down to the Top-of-Rack (ToR) switch level. LinkedIn made a huge splash with it a few months ago with their Project Altair solution. Now it seems company after company is racing to implement BGP as the solution to their transport woes. And all because developers have finally pulled their heads out of the sand.

BGP Under Every Rock And Tree

BGP is a very scalable protocol. It's used the world over to exchange routes and keep the Internet running smoothly. But it has other power as well. It can be extended to operate in other ways beyond the original specification. Unlike rigid protocols like RIP or OSPF, BGP was designed in part to be extended and expanded as needs change. IS-IS is a very similar protocol in that respect. It can be upgraded and adjusted to work with both old and new systems at the same time. Both can be extended without the need to change protocol versions midstream or introduce segmented systems that would run like ships in the night.

This isn't the first time that someone has talked about running BGP to the ToR switch either. Facebook mentioned it in this video almost three years ago. Back then they were solving some interesting issues in their own data center. Now, those changes from the hyperscale world are filtering into the real world. Networking teams are seeking to solve scaling issues without resorting to overlay networks or other types of workarounds. The desire to fix everything wrong with layer 2 has led to a revelation of sorts. The real reason why BGP is able to work so well as a replacement for layer 2 isn't because we've solved some mystical networking conundrum. It's because we finally figured out how to build applications that don't break because of the network.

Apps As Far As The Eye Can See

The whole reason why layer 2 networks are the primary unit of data center measurement has absolutely nothing to do with VMware. VMware vMotion behaves the way that it does because legacy applications hate having their addresses changed during communications. Most networking professionals know that MAC addresses have a tenuous association to IP addresses, which is what allows the gratuitous ARP after a vMotion to work so well. But when you try to move an application across a layer 3 boundary, it never ends well.

When web scale companies started building their application stacks, they quickly realized that being pinned to a particular IP address was a recipe for disaster. Even typical DNS-based load balancing only seeks to distribute requests to a series of IP addresses behind some kind of application delivery controller. With legacy apps, you can’t load balance once a particular host has resolved a DNS name to an IP address. Once the gateway of the data center resolves that IP address to a MAC address, you’re pinned to that device until something upsets the balance.

Web scale apps like those built by Netflix or Facebook don’t operate by these rules. They have been built to be resilient from inception. Web scale apps don’t wait for next hop resolution protocols (NHRP) or kludgy load balancing mechanisms to fix their problems. They are built to do that themselves. When problems occur, the applications look around and find a way to reroute traffic. No crazy ARP tricks. No sly DNS. Just software taking care of itself.
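In its simplest form, that self-healing behavior is just the application walking a list of replicas instead of pinning itself to one resolved address. A minimal sketch with placeholder hostnames:

```python
import socket

def first_reachable(endpoints, port=443, timeout=1.0):
    """Application-level failover: try each replica in turn and use the first
    one that answers, instead of staying pinned to a single resolved address.
    Hostnames are placeholders."""
    for host in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host
        except OSError:
            continue  # unreachable: move on, no ARP tricks or sly DNS needed
    raise RuntimeError("no endpoint reachable")

# replicas = ["app-1.example.com", "app-2.example.com", "app-3.example.com"]
# serve_from = first_reachable(replicas)
```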

The implications for network protocols are legion. If a web scale application can survive a layer 3 communications issue, then we are no longer required to keep the entire data center as a layer 2 construct. If things like anycast can be used to pin geolocations closer to content, that means we don't need to worry about large failover domains. Just like Ivan Pepelnjak (@IOSHints) says in this post, you can build layer 3 failure domains that just work better.

BGP can work as your ToR strategy for route learning and path selection because you aren’t limited to forcing applications to communicate at layer 2. And other protocols that were created to fix limitations in layer 2, like TRILL or VXLAN, become an afterthought. Now, applications can talk to each other and fail back and forth as they need to without the need to worry about layer 2 doing anything other than what it was designed to do: link endpoints to devices designed to get traffic off the local network and into the wider world.


Tom’s Take

One of the things that SDN has promised us is a better way to network. I believe that the promise of making things better and easier is a noble goal. But the part that has bothered me since the beginning was that we're still trying to solve everyone's problems with the network. We don't rearrange the power grid every time someone builds a better electrical device. We don't replumb the house every time we install a new sink. We find a way to make the new thing work with our old system.

That’s why the promise of using BGP as a ToR protocol is so exciting. It has very little to do with networking as we know it. Instead of trying to work miracles in the underlay, we build the best network we know how to build. And we let the developers and programmers do the rest.

Automating Change With Help From Fibonacci


A few recent conversations that I’ve seen and had with professionals about automation have been very enlightening. It all started with a post on StackExchange about an unsuspecting user that tried to automate a cleanup process with Ansible and accidentally erased the entire server farm at a service provider. The post was later determined to be a viral marketing hoax but was quite believable to the community because of the power of automation to make bad ideas spread very quickly.

Better The Devil You Know

Everyone in networking has been in a place where they've typed in something they shouldn't have. Maybe you removed the management network you were using to access the switch, or created an access list that locked you out of something. Or perhaps you typed an errant debug command that forced you to drive an hour to reboot a switch that was no longer responding. All of these things seem to happen to people as part of the learning process.

But how many times have we typed something in to create a change and found that it broke more than we expected? Like changing a native VLAN on a trunk and bringing down a link we didn’t intend to affect? These unforeseen accidents are the kinds of problems that can easily be magnified by scripts or automation.

I wrote a post for SolarWinds a couple of months ago about people blaming tools. In it, there was a story about a person that uploaded the wrong switch firmware to a server and used an automated tool to kick off an upgrade of his entire network. Only after the first switch failed to return to normal did he realize that he had downloaded the incorrect firmware to the server. And the command he used to kick off the upgrade was not the safe version of the command that checks for compatibility. Instead, it was the quick version of the command that copied the firmware directly into flash and rebooted the switch without confirmation.

While people are quick to blame tools for letting mistakes race through the network, it should also be recognized that those issues would have been mistakes no matter what. Just because a system is capable of being automated doesn't mean that your commands are exempt from being checked and rechecked. Too often a typo or an added word somewhere in the mix causes unintended chaos because we didn't take the time to make sure there were no problems ahead of time.
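The checking and rechecking can itself be automated. Here's a hedged sketch of the kind of pre-flight validation that should pass before any tool is allowed to touch the fleet; the hashes, model strings, and function names are hypothetical.

```python
import hashlib

def preflight_check(image_path, expected_sha256, expected_model, image_model):
    """Sanity checks to run before any automated, fleet-wide upgrade. The
    expected hash and model strings would come from the vendor's release
    notes; every name here is a placeholder."""
    with open(image_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != expected_sha256:
        raise ValueError("firmware image is corrupt or the wrong file")
    if image_model != expected_model:
        raise ValueError(f"image is for {image_model}, but the fleet is {expected_model}")
    return True

# Only after preflight_check(...) passes should the automation tool be allowed
# to touch the first device, and only the first device (see the next section).
```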

Fibonacci Style

I've always tried to do testing and regression in a controlled manner. For some places that have simulators and test networks to try out changes, this method might still work. But for those of us that tend to fly by the seat of our pants on production devices, it's best to artificially limit the damage before it becomes too great. Note that this method works with automation systems too, provided you are controlling the logic behind them.

When you go out to test a network-wide change or perform an upgrade, pick one device as your guinea pig. It shouldn't be something pushing massive production traffic like a core switch. Something isolated in the corner of the network usually works just fine. Test your change there outside production hours and with a fully documented backout plan. Once you've implemented it, don't touch anything else for at least a day. This gives the routing tables time to settle down completely and all of the aging timers a chance to expire and tables to recompile. Only then can you be sure that you're not dealing with cached information.

Now that you've done it once, you're ready to make it live on 10,000 devices, right? Absolutely not. Now that you've proven that the change doesn't cause your system to implode and take the network with it, you pick another single device and do it again. This time, you either pick a neighbor device to the first one or something on the other side of the network. The other side of the network ensures that changes don't ripple across between devices over the 24-hour watch period. On the other hand, if the change involves direct connectivity changes between two devices, you should test them together to be sure that the links stay up. It's much easier to recover one failed device than 40 or 400.

Once you follow the same procedure with the second upgrade and get clean results, it's time to move to doing two devices at once. If you have a fancy automation system like Ansible or Puppet, this is where you will be determining that the system is capable of handling two devices at once. If you're still using scripts, this is where you make sure you're pasting the right info into each window. Some networks don't like two devices changing information at the same time. Your routing table shouldn't be so unstable that a change like this would cause problems, but you never know. You will know when you're done with this step.

Now that you've proven that you can make changes without cratering a switch, a link, or the entire network all at once, you can continue. Move on to three devices, then five, then eight. You'll notice that these rollout plans follow the Fibonacci sequence. This is no accident. Just like the appearance of these numbers in nature and math, having a Fibonacci rollout plan helps you evaluate changes and roll back problems before they grow out of hand. Just because you have the power to change the entire network at once doesn't mean that you should.
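If you want the rollout plan itself in code, the batching logic is tiny. Here's a sketch that yields Fibonacci-sized batches of devices, with the soak time between batches left to whatever process (or automation tool) drives it.

```python
def fibonacci_rollout(devices):
    """Yield upgrade batches sized 1, 1, 2, 3, 5, 8, ... so a bad change is
    caught while the blast radius is still small. The soak period between
    batches (the 24-hour watch above) happens outside this function."""
    size, next_size = 1, 1
    index = 0
    while index < len(devices):
        yield devices[index:index + size]
        index += size
        size, next_size = next_size, size + next_size

fleet = [f"switch-{n:02d}" for n in range(1, 21)]
for batch in fibonacci_rollout(fleet):
    print(batch)
# ['switch-01'], ['switch-02'], ['switch-03', 'switch-04'],
# then batches of 3, 5, and 8 until the fleet is covered.
```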


Tom’s Take

Automation isn't the bad guy here. We are. We are fallible creatures that make mistakes. Before, those mistakes were limited to how fast we could log into a switch. Today, computers allow us to make those mistakes much faster and on a larger scale. The long-term solution to the problem is to test every change completely ahead of time and hope that your testing catches all the problems. If it doesn't, then you had better hope your backout plan works. But by introducing a rolling system similar to the Fibonacci sequence above, I think you'll find that mistakes will be caught sooner and problems will be rectified before they are amplified. And if nothing else, you'll have lots of time to explain to your junior admin all about the wonders of Fibonacci in nature.