The Complexity Conundrum

Complexity is the enemy of understanding. Think about how much time you spend in your day trying to simplify things. Complexity is the reason why things like Reddit’s Explain Like I’m Five exist. We strive in our daily lives to find ways to simplify the way things are done. Well, except in networking.

Building On Shifting Sands

Networking hasn’t always been a super complex thing. Back when bridges tied together two sections of Ethernet, networking was fairly simple. We’ve spent years trying to make the network do bigger and better things faster with less input. Routing protocols have become more complicated. Network topologies grow and become harder to understand. Protocols do magical things with very little documentation beyond “Pure Freaking Magic”.

Part of this comes from applications. I’ve made my feelings on application development clear. Ivan Pepelnjak had some great comments on this post as well from Steve Chalmers and Derick Winkworth (@CloudToad). I especially like this one:

https://twitter.com/cloudtoad/status/750830744902127616

Derick is right. The application developers have forced us to make networking do more and more, faster, and with less human involvement, all to meet crazy continuous improvement and continuous development goalposts. Networking, when built properly, is a static object like the electrical grid or a plumbing system. Application developers want it to move and change and breathe with their needs when they need to spin up 10,000 containers for three minutes to run a test or increase bandwidth 100x to support a rollout of a video streaming app or a sticker-based IM program designed to run during a sports championship.

We’ve risen to meet this challenge with what we’ve had to work with. In part, it’s because we don’t like being the scapegoat for every problem in the data center. We tire of sitting next to the storage admins and complaining about the breakneck pace of IT changes. We have embraced software enhancements and tried to find ways to automate, orchestrate, and accelerate. Which is great in theory. But in reality, we’re just covering over the problem.

Abstract Complexity

The solution to our software networking issues seems simple on the surface. Want to automate? Add a layer to abstract away the complexity. Want to build an orchestration system on top of that? Easy to do with another layer of abstraction to tie automation systems together. Want to make it all go faster? Abstract away!

“All problems in computer science can be solved with another layer of indirection.”

This is a quote from David Wheeler often attributed to Butler Lampson. It’s absolutely true. Developers, engineers, and systems builders keep adding layers of abstraction and indirection on top of complex systems and proclaiming that everything is now easier because it looks simple. But what happens when the abstraction breaks down?

Automobiles are a perfect example of this. Not too many years ago, automobiles were relatively simple things. Sure, internal combustion engines aren’t toys. But most mechanics could disassemble the engine and fix most issues with a wrench and some knowledge. Today’s cars have computers, diagnostics systems, and require lots and lots of dedicated tools to even diagnose the problem, let alone fix it. We’ve traded simplicity and ease of repairability for the appearance of “simple” which conceals a huge amount of complexity under the surface.

To refer back to the Lampson/Wheeler quote, the completion of it is, “Except, of course, for the problem of too many indirections.” Even forty years ago it was understood that too many layers of abstraction would eventually lead to problems. We are quickly reaching this point in networking today. With all the reliance on complex tools providing an overwhelming amount of data about every point of the network, we find ourselves forced to use dashboards and data lakes to keep up with the rapid pace of changes dictated to the network by systems integrations being driven by developer desires and not sound network systems thinking.

Networking professionals can’t keep up. Just as other systems now must be maintained by algorithms to keep pace, so too does the network find itself being run by software instead of augmented by it. Even if people wanted to make a change they would be unable to do so because validating those changes manually would cause issues or interactions that could create havoc later on.

Simple Solutions

So how do we fix the issues? Can we just scrap it all and start over? Sadly, the answer here is a resounding “no”. We have to keep moving the network forward to match pace with the rest of IT. But we can do our part to cut down on the amount of complexity and abstraction being created in the process. Documentation is as critical as ever. Engineers and architects need to make sure to write down all the changes they make as well as their proposed designs for adding services and creating new features. Developers writing for the network need to document their APIs and their programs liberally so that troubleshooting and extension are easily accomplished instead of just guessing about what something is or isn’t supposed to be doing.

When the time comes to build something new, instead of trying to plaster over it with an abstraction, we need to break things down into their basic components and understand what we’re trying to accomplish. We need to augment existing systems instead of building new ones on top of the old to make things look easy. When we can extend existing ideas or augment them in such a way as to coexist then we can worry less about hiding problems and more about solving them.


Tom’s Take

Abstraction has a place, just like NAT. It’s when things spiral out of control and hide the very problems we’re trying to fix that it becomes an abomination. Rather than piling things on the top of the issue and trying to hide it away until the inevitable day when everything comes crashing down, we should instead do the opposite. Don’t hide it, expose it instead. Understand the complexity and solve the problem with simplicity. Yes, the solution itself may require some hard thinking and some pretty elegant programming. But in the end that means that you will really understand things and solve the complexity conundrum.

The 25GbE Datacenter Pipeline

SDN may have made networking more exciting thanks to making hardware less important than it has been in the past, but that’s not to say that hardware isn’t important at all. The certainty with which new hardware will come out and make things a little bit faster than before is right there with death and taxes. One of the big announcements yesterday from Hewlett Packard Enterprise (HPE) during HPE Discover was support for a new 25GbE / 100GbE switch architecture built around the FlexFabric 5950 and 12900 products. This may be the tipping point for things.

The Speeds of the Many

I haven’t always been high on 25GbE. Almost two years ago I couldn’t see the point. Things haven’t gotten much different in the last 24 months from a speed perspective. So why the change now? What makes this 25GbE offering any different from the nascent ideas presented by Arista?

First and foremost, the 25GbE released by HPE this week is based on the Broadcom Tomahawk chipset. When 25GbE was first presented, it was a collection of vendors trying to convince you to upgrade to their slightly faster Ethernet. But in the past two years, most of the merchant offerings on the market have coalesced around using Broadcom as the primary chipset. That means that odds are good your favorite switching platform is running Trident 2 or Trident 2+ under the hood.

With Broadcom backing the silicon, that means wider adoption of the specification. Why would anyone buy 25GbE from Brocade or Dell or HPE if the only vendor supporting it was that vendor of choice? If you can’t ever be certain that you’ll have support for the hardware in three or five years’ time, making an investment today seems silly. Broadcom’s backing means that eventually everyone will be adopting 25GbE.

Likewise, one of my other impediments to adoption was the lack of server NICs to ramp hosts to 25GbE. Having fast access ports means nothing if the servers can’t take advantage of them. HPE addressed this with the release of FlexFabric networking adapters that can run 25GbE. More importantly, those adapters (and switches) can run at 10GbE as well. This means that adoption of higher bandwidth is no longer an all-or-nothing proposition. You don’t have to abandon your existing investment to get to 25GbE right away. You don’t have to build a lab pod to test things and then sneak it into production. You can just buy a 5950 today and clock the ports down to 10GbE while you await the availability and purchasing cycle to buy 25GbE NICs. Then you can flip some switches in the next maintenance window and be running at 25GbE speeds. And you can leave some ports enabled at 10GbE to ensure that there is maximum backwards compatibility.

The Needs of the Few

Odds are good that 25GbE isn’t going to be right for you today. HPE is even telling people that 25GbE only really makes sense in a few deployment scenarios, among which are large container-based hosts running thousands of virtual apps, flash storage arrays that use Ethernet as a backplane, or specialized high-performance computing (HPC) tricks with RDMA and such. That means the odds are good that you won’t need 25GbE first thing tomorrow morning.

However, the need for 25GbE is going to be there. As applications grow more bandwidth hungry and data centers keep shrinking in footprint, the network hardware you do have left needs to work harder and faster to accomplish more with less. If the network really is destined to become a faceless underlay that serves as a utility for applications, it needs to run flat out fast to ensure that developers can’t start blaming their utility company for problems. Multi-core server architectures and flash storage have solved two of the three legs of this problem. 25GbE host connectivity, and the 100GbE backbone connectivity tied to it, solve the other side of the equation so everything balances properly.

Don’t look at 25GbE as an immediate panacea for your problems. Instead, put it on a timeline with your other server needs and see what the adoption rate looks like going forward. If server NICs are bought in large quantities, that will drive manufacturers to push the technology onto the server boards. If there is enough need for connectivity at these speeds the switch vendors will start larger adoption of Tomahawk chipsets. That cycle will push things forward much faster than the 10GbE / 40GbE marathon that’s been going on for the past six years.


Tom’s Take

I think HPE is taking a big leap with 25GbE. Until the Dell/EMC merger is completed they won’t find themselves in a position to adopt Tomahawk quickly in the Force10 line. That means the need to grab 25GbE server NICs won’t materialize if there’s nothing to connect them. Cisco won’t care either way so long as switches are purchased and all other networking vendors don’t sell servers. So that leaves HPE to either push this forward or fall off the edge of the cliff. Time will tell how this will all work out, but it would be nice to see HPE get a win here and make the network the least of application developer problems.

Disclaimer

I was a guest of Hewlett Packard Enterprise for HPE Discover 2016. They paid for my travel, hotel, and meals during the event. While I was briefed on the solution discussed here and many others, there was no expectation of coverage of the topics discussed. HPE did not ask for, nor were they guaranteed any consideration in the writing of this article. The conclusions and analysis contained herein are mine and mine alone.

Flash Needs a Highway

Last week at Storage Field Day 10, I got a chance to see Pure Storage and their new FlashBlade product. Storage is an interesting creature, especially now that we’ve got flash memory technology changing the way we think about high performance. Flash transformed the industry from slow spinning gyroscopes of rust into a flat out drag race to see who could provide enough input/output operations per second (IOPS) to get to the moon and back.

Take a look at this video about the hardware architecture behind FlashBlade:

It’s pretty impressive. Very fast flash storage on blades that can outrun just about anything on the market. But this post isn’t really about storage. It’s about transport.

Life Is A Network Highway

Look at the backplane of the FlashBlade chassis. It’s not something custom or even typical for a unit like that. The key is when the presenter says that the architecture of the unit is more like a blade server chassis than a traditional SAN. In essence, Pure has taken the concept of a midplane and used it very effectively here. But their choice of midplane is interesting in this case.

Pure is using the Broadcom Trident II switch as their networking midplane for FlashBlade. That’s pretty telling from a hardware perspective. Trident II runs a large majority of switches in the market today that are merchant silicon based. Broadcom is essentially becoming the Intel of the switch market. They are supplying arms to everyone that wants to build something quickly at low cost without doing any kind of custom research and development of their own silicon manufacturing.

Using a Trident II in the backplane of the FlashBlade means that Pure evaluated all the alternatives and found that putting something merchant-based in the midplane is cost effective and provides the performance profile necessary to meet the needs of flash storage. Saturating backplanes with IOPS can be accomplished. But as we learned from Coho Data, it takes a lot of CPU horsepower to create a flash storage system that can saturate 10Gig Ethernet links.

I Am Speed

Using Trident II as a midplane or backplane for devices like this has huge implications. First and foremost, networking technology has a proven track record. If Trident II wasn’t a stable and reliable platform, no one would have used it in their products. And given that almost everyone in the networking space has a Trident platform for sale, it speaks volumes about reliability.

Second, Trident II is available. Broadcom is throwing these units off the assembly line as fast as they can. That means that there’s no worry about silicon shortages or plant shutdowns or any one of a number of things that can affect custom manufacturing. Even if a company wants to look at a custom fabrication, it could take months or even years to bring things online. By going with a reference design like Trident II, you can have your software engineers doing the hard work of building a system to support your hardware. That speeds time to market.

Third, Trident is a modular platform. That part can’t be overstated even though I think it wasn’t called out very much in the presentation from Pure. By having a midplane that is built as a removable module, it’s very easy to replace it should problems arise. That’s the point of field replaceable units (FRUs). But in today’s market, it’s just as easy to create a system that can run multiple different platforms as well. The blade chassis idea extends equally to both blades and mid or backplanes.

Imagine being able to order a Tomahawk-based controller unit for FlashBlade that only requires you to swap the units at the back of the system. Now, that investment in 10Gig blade connectivity with 40Gig uplinks just became 25Gig blade connectivity with 100Gig uplinks to the network. All for the cost of two network controller blades. There may be some software that needs to be written to make the transition smooth for the consumers in the system, but the hardware is more than capable of supporting a change like that.


Tom’s Take

I was thrilled to see Pure Storage building a storage platform that tightly integrates with networking the way that FlashBlade does. This is how the networking stack is going to be completely integrated with storage and compute. We should still look at things through the lens of APIs and programmability, but making networking a consistent transport layer for all things in the datacenter is a good start.

The funny thing about making something a consistent transport layer is that by design it has to be extensible. That means more and more folks are going to be driving those pieces into the network. Software can be created on top of this common transport to differentiate, much like we’re seeing with network operating systems right now. Even Pure was able to create a different kind of transport protocol to do the heavy lifting at low latency.

It’s funny that it took a presentation from a storage company to make me see the value of the network as something agnostic. Perhaps I just needed some perspective from the other side of the fence.

Pushing Everyone’s Buttons In IT

We have officially reached the point in our long and storied IT careers where we, as old fogies, have earned the right to complain about the next generation of users and professionals. Just as the gray beards before us complained about the way we did things, so too is it our turn to moan about the state of affairs. Today, I’d like to point out how driving IT to the point of pushing simple buttons is destroying the way we do things.

Easy Buttons

The fact that IT work has been able to be distilled into a series of simple button pushing exercises is very thrilling. We’ve spent a lot of time and effort building devices and frameworks that take the hard part out of building devices and frameworks. We no longer have to invent languages to build things or hardware to do things. Instead, we can refine our programming capabilities or use general purpose hardware in new combinations to provide environments for our users.

That’s one of the things that is driving people to the cloud. Cloud isn’t just about exciting hardware or keeping your data in other places. It is just as much about predictable, repeatable frameworks and workflows that let people accomplish tasks without much thought. Just like the assembly line of the eighties, we are taking boring repetitive tasks away from people and giving them to machines that don’t get bored and love repetition. Doing that drives costs down and makes people more productive.

But what happens when those processes break down? What happens when the magic smoke is let out of the machine? That’s when the real issues start cropping up. Perhaps the framework developers are great at figuring out what exactly went wrong in their solution. However, they’re likely clueless about all of the things that their solution is built upon. Imagine if a mechanic couldn’t diagnose the engine or the various subsystems of your car? You’d have a fit, right?

The troubleshooting level of modern framework engineering only extends to the edge of their solution. Once you get into a problem with a service they’ve leveraged, such as AWS, then the buck is passed and the troubleshooting must move to a new team. The problem with having your teams building next-generation IT services on top of existing things is that they lose visibility into the things they didn’t build. It becomes a vicious cycle when those first-level services aren’t reliable enough to keep your solution running.

Push A Button, Push A Button!

As bad as things are for IT departments in a push-button world, it’s even crazier for users in the same boat. That’s because users understand two extremes of service delivery. Either the thing is working or it isn’t. There is no in between. In the old days of non-instant IT, users would just keep asking until a thing was done.

Today, things are much different. Automation and orchestration have allowed frameworks and platforms to instantly create services. That means users have been able to create their own platform for building things. So anything that isn’t instant is broken. Take, for example, a recent discussion of bandwidth from the spring ONUG meeting. An IT professional was frustrated that it took the better part of a day to transfer 1 petabyte of information across the country. He wasn’t upset that it failed, mind you. He was mad because it didn’t happen in five minutes. Non-instant things are broken.
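To put some rough numbers on that complaint, here is a minimal back-of-the-envelope sketch. It assumes a decimal petabyte (10^15 bytes) and a perfectly utilized link with no protocol overhead. The link speeds are purely illustrative, since the ONUG anecdote doesn’t say what circuit was actually in use, and the function names are made up for this example.

```python
# Back-of-the-envelope transfer math for moving 1 PB across a WAN.
# Assumptions (not from the anecdote): decimal petabyte, ideal link
# utilization, and illustrative link speeds.

PETABYTE_BITS = 10**15 * 8  # 1 PB expressed in bits


def transfer_seconds(size_bits: float, link_bps: float) -> float:
    """Time to move size_bits over a link running flat out at link_bps."""
    return size_bits / link_bps


def required_bps(size_bits: float, seconds: float) -> float:
    """Sustained rate needed to move size_bits within the given time."""
    return size_bits / seconds


if __name__ == "__main__":
    for gbps in (10, 40, 100):
        hours = transfer_seconds(PETABYTE_BITS, gbps * 10**9) / 3600
        print(f"{gbps:>3} Gb/s link: {hours:7.1f} hours")

    five_minutes = 5 * 60
    tbps = required_bps(PETABYTE_BITS, five_minutes) / 10**12
    print(f"Needed for a five-minute transfer: {tbps:.2f} Tb/s")
```

Under those assumptions, a single 100 Gb/s path needs a little over 22 hours, which lines up with “the better part of a day”. Finishing in five minutes would take a sustained rate in the tens of terabits per second.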

The story repeats itself over and over again. Networking resources that can’t be automatically provisioned are broken. Cloud services that need more than a few minutes to spin up aren’t working correctly. Mobile apps and sites that don’t instantly pop up on your phone must be working incorrectly. The patience level of a user isn’t even a tenth of that of the average IT professional. IT pros know why something is running slow. Users fall back on slow things being broken.


Tom’s Take

The problem is investment. Given an infinite amount of funding, everything can be fast. But people are trying to find ways to not pay for things in today’s IT world. Maybe they want to pay for AWS because it works. But OpenStack gives the promise of AWS for free. Except things like OpenStack and Ansible aren’t free. The currency you use to pay for them is time.

Time is just as precious a commodity as money. We don’t get refunds on time. We can’t ask for discounts on time. We can only invest and hope that there is a payoff down the road. The real outcome of this time investment looks very similar to the environment we have today. Things should just work and allow us to build new things on top of them. But that missing time investment is the key to the whole enterprise. If we don’t spend the time building and extending things, then we won’t have the expertise we need to fix things when they break. That time investment also helps everyone appreciate just how hard it is to build things in the first place. And that’s something no button will ever duplicate.

The Death of TRILL

Networking has come a long way in the last few years. We’ve realized that hardware and ASICs aren’t the constant that we could rely on to make decisions in the next three to five years. We’ve thrown in with software and the quick development cycles that allow us to iterate and roll out new features weekly or even daily. But the hardware versus software battle has played out a little differently than we all expected. And the primary casualty of that battle was TRILL.

Symbiotic Relationship

Transparent Interconnection of Lots of Links (TRILL) was proposed as a solution to the complexity of spanning tree. Radia Perlman realized that her bridging loop solution wouldn’t scale in modern networks. So she worked with the IETF to solve the problem with TRILL. We also received Shortest Path Bridging (SPB) from the IEEE along the way as an alternative solution to the layer 2 issues with spanning tree. The motive was sound, but the industry has rejected the premise entirely.

Large layer 2 networks have all kinds of issues. ARP traffic, broadcast amplification, and numerous other issues plague layer 2 when it tries to scale to multiple hundreds or a few thousand nodes. The general rule of thumb is that layer 2 broadcast networks should never get larger than 250-500 nodes lest problems start occurring. And in theory that works rather well. But in practice we have issues at the software level.
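One way to make that rule of thumb concrete is to translate it into IPv4 subnet sizes. This is a minimal sketch that assumes one IPv4 subnet per broadcast domain, which is common practice but not something the rule itself requires. The 10.0.0.0 base address and the helper names are only placeholders for illustration.

```python
# Map the 250-500 node rule of thumb onto IPv4 prefix lengths.
# Assumption: one IPv4 subnet per layer 2 broadcast domain.
import ipaddress


def usable_hosts(prefix_len: int) -> int:
    """Usable host addresses in a subnet (excludes network and broadcast)."""
    net = ipaddress.ip_network(f"10.0.0.0/{prefix_len}")
    return net.num_addresses - 2


def smallest_prefix_for(nodes: int) -> int:
    """Longest prefix (smallest subnet) that still fits the given node count."""
    for prefix_len in range(30, 7, -1):
        if usable_hosts(prefix_len) >= nodes:
            return prefix_len
    raise ValueError("node count does not fit in a /8")


if __name__ == "__main__":
    for nodes in (250, 500):
        prefix = smallest_prefix_for(nodes)
        print(f"{nodes} nodes fit in a /{prefix} ({usable_hosts(prefix)} usable hosts)")
```

Read this way, the guideline amounts to keeping each broadcast domain somewhere between a /24 and a /23.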

Applications are inherently complicated. Software written in the pre-Netflix era of public cloud adoption doesn’t like it when the underlay changes. So things like IP addresses and ARP entries were assumed to be static. If those data points change you have chaos in the software. That’s why we have vMotion.

At the core, vMotion is a way for software to mitigate hardware instability. As I outlined previously, we’ve been fixing hardware with software for a while now. vMotion could ensure that applications behaved properly when they needed to be moved to a different server or even a different data center. But it also required the network to be flat to overcome limitations in things like ARP or IP. And so we went on a merry journey of making data centers as flat as possible.

The problem came when we realized that data centers could only be so flat before they collapsed in on themselves. ARP and spanning tree limited the amount of traffic in layer 2 and those limits were impossible to overcome. Loops had to be prevented, yet the simplest solution disabled bandwidth needed to make things run smoothly. That caused IEEE and IETF to come up with their layer 2 solutions that used IS-IS, a routing protocol with its roots in CLNS, to solve loops. And it was a great idea in theory.

The Joining

In reality, hardware can’t be spun that fast. TRILL was used as a reference platform for proprietary protocols like FabricPath and VCS. All the important things were there but they were locked into hardware that couldn’t be easily integrated into other solutions. We found ourselves solving problem after problem in hardware.

Users became fed up. They started exploring other options. They finally decided that hardware wasn’t the answer. And so they looked to software. And that’s where we started seeing the emergence of overlay networking. Protocols like VXLAN and NV-GRE emerged to tunnel layer 2 frames over layer 3 networks. As Ivan Pepelnjak is fond of saying, layer 3 transport solves all of the issues with scaling. And even the most unruly application behaves when it thinks everything is running on layer 2.
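To show how little is involved in the tunneling itself, here is a minimal sketch of the 8-byte VXLAN header as laid out in RFC 7348. It only builds and parses the header that sits in front of the inner Ethernet frame. The outer Ethernet, IP, and UDP headers are left out, and the function names and dummy frame are hypothetical, chosen just for this illustration.

```python
# Sketch of VXLAN encapsulation: an 8-byte header prepended to the
# original layer 2 frame, carried inside UDP (outer headers omitted here).
import struct

VXLAN_UDP_PORT = 4789        # IANA-assigned destination port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid


def vxlan_header(vni: int) -> bytes:
    """Build the VXLAN header for a 24-bit VXLAN Network Identifier."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: 8 flag bits followed by 24 reserved bits.
    # Word 2: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prefix an Ethernet frame with a VXLAN header (this is the UDP payload)."""
    return vxlan_header(vni) + inner_frame


def decapsulate(udp_payload: bytes) -> tuple[int, bytes]:
    """Return (vni, inner_frame) from a VXLAN UDP payload."""
    if len(udp_payload) < 8:
        raise ValueError("payload shorter than a VXLAN header")
    flags_word, vni_word = struct.unpack("!II", udp_payload[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return vni_word >> 8, udp_payload[8:]


if __name__ == "__main__":
    frame = bytes(14) + b"payload"  # placeholder inner frame, not real traffic
    vni, inner = decapsulate(encapsulate(5000, frame))
    print(vni, inner == frame)
```

The 24-bit identifier allows roughly 16 million segments, compared with the 4,094 usable IDs in a 12-bit VLAN tag, and the layer 3 underlay never needs to know about any of them.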

Protocols like VXLAN solved an immediate need. They removed limitations in hardware. Tunnels and fabrics used novel software approaches to solve insurmountable hardware problems. An elegant solution for a thorny problem. Now, instead of waiting for a new hardware spin to fix scaling issues, customers could deploy solutions to fix the issues inherent in hardware on their own schedule.

This is the moment where software defined networking (SDN) took hold of the market. Not when words like automation and orchestration started being thrown about. No, SDN became a real thing when it enabled customers to solve problems without buying more physical devices.


Tom’s Take

Looking back, we realize now that building large layer 2 networks wasn’t the best idea. We know that layer 3 scales much better, and the number of providers and end users running BGP to top-of-rack (ToR) switches bears that out. It took us too long to figure out that the best solution to a problem sometimes takes a bit of thought to implement.

Virtualization is always going to be limited by the infrastructure it’s running on. Applications are only as smart as the programmer. But we’ve reached the point where developers aren’t counting on having access to layer 2 protocols that solve stupid decision making. Instead, we have to understand that the most resilient way to fix problems is in the software. Whether that’s VXLAN, NV-GRE, or a real dev team not relying on the network to solve bad design decisions.

Is Interop Dead?

I’m at Interop this week talking all things networking with a great group of people. There are quite a few members of the community here presenting, listening and discussing. There’s a great exchange of ideas flowing back and forth. Yet one thing I keep hearing in quiet corners of the room is a hushed discussion of the continued viability of Interop as a conference. Is it time to write the Interop obituary?

Only Mostly Dead

Some of the arguments are as old as tech itself. People claim that getting vendors to interoperate today is an afterthought thanks to protocols like OSPF. All of the important bits in a network are standardized now. Use of APIs and other open technologies is driving vendors to play nice with each other. The need to show up in a faraway place and do the work has long passed.

There’s also the discussion around the bigger conferences out in the world. Vendor conferences like Cisco Live and VMworld draw tens of thousands. New product announcements are dropping left and right during these events. People also want to fracture into tool-specific events like OpenStack Summit or DockerCon. Or the various analyst events or company days that happen every month. Why have a conference like Interop when others do it bigger?

Lastly, there’s the argument that the idea of an expo floor is long gone. Why should we get a booth on the show floor to show off our solution? Why not just approach people directly? Why spend money to be there like everyone else? Companies that stand out get noticed. Companies that leverage social media and SEO get the business. Not companies that buy space on a floor somewhere.

Exaggerated Rumors

Let’s jump on these one at a time.

First, claiming that all vendors play nice today thanks to standard protocols is misguided at best and downright silly at worst. Just because OSPF is standard doesn’t mean that people aren’t going to bend it to their own needs. Remember totally stubby areas? Those were very Cisco-specific for a long time. And throwing APIs into the discussion muddies the waters even further. Just because you have an API attached to your software doesn’t mean you can interoperate. How well is your API documented? Do I need to buy SDK access to get into it? Is my code portable? Do you do something silly like not supporting Python but loving things like Java and Perl?

Interoperability is a problem that hasn’t been solved completely. Even if everything works on paper and in PowerPoint slides, there’s still an acid test when you plug it all in for real and make it work. That moment of relief when two different vendors’ devices come up with the same protocols and everything talks is a magical time. You can’t replicate that in documentation. There’s still a huge need to have companies show up and physically prove that things work in a neutral setting versus a heavily slanted bakeoff lab somewhere.

Secondly, let’s look at those other conferences. Sure, Cisco Live has 20,000 people. That all want to talk about Cisco. There isn’t a whole lot of room for Juniper, Brocade, or HPE to be there. Differing opinions aren’t welcome. At best they are discussed in hushed tones in the corners of the room (sound familiar?). Mindshare is important, but so are honest discussions. As for smaller conferences focused on specific tools like OpenStack Summit, you quickly see the limits of things when you start talking to attendees about other things like pieces of the stack not addressed by the solution. There’s importance in being able to talk about all the parts without being overly myopic on one part.

The other piece of the other conferences refers back to a conversation that happened on Twitter last month about community content in large vendor conferences. There was talk from a number of people about how vendor conferences freeze out community content because people pay big bucks to come hear a Technical Marketing Engineer read slides about configuring a feature. It’s a far cry from the deep discussion and analysis that you get in other places. How do you work around bugs in code? What happens when a feature is missing? Communities solve problems. Big conferences do a bad job of getting community involvement outside of expo floor crawls and keynotes. If they didn’t, we wouldn’t need the amazing work of vBrownBag and Security BSides. Community matters to people.

Thirdly, the expo floor discussion. I’ll admit that I find expo floors to be personally challenging, but to eschew them in favor of a completely social strategy or relying on SEO to pop up on people’s radar is a recipe for disaster. Being able to physically talk to a person is still a valuable part of the process. How do you feel about their approach? Are they happy with the technology? Does it work in the physical demo? Do you get the sense that they are pushing too hard or trying to close you on a sale without giving you what you need? That’s an experience you don’t get via email or DM on Twitter. Whether or not my personal feelings matter, the truth is that the expo does still matter.


Tom’s Take

I’m biased. I’ve loved Interop for a number of years even before I got involved in the programming of it. The idea that people show up to prove definitively that their stuff works with all the other stuff is a gut check like no other. In the last couple of years the planners of the conference have worked hard to make sure that the programming has reflected the kinds of things that people need to be aware of on the horizon and what they need to be learning. Less focus on vendor sales pitches and more on independent content. Lean and mean and ready to fight the rest of the conference world to prove that content is still king.

As long as I’m involved, I can promise that any rumors of Interop’s impending death will stay just that: talk and rumor. There’s still a lot of life left in this grand event to make it matter to people that should matter.

Automating Change With Help From Fibonacci

A few recent conversations that I’ve seen and had with professionals about automation have been very enlightening. It all started with a post on StackExchange about an unsuspecting user that tried to automate a cleanup process with Ansible and accidentally erased the entire server farm at a service provider. The post was later determined to be a viral marketing hoax but was quite believable to the community because of the power of automation to make bad ideas spread very quickly.

Better The Devil You Know

Everyone in networking has been in a place where they’ve typed in something they shouldn’t have. Maybe you removed the management network you were using to access the switch, or created an access list that denied the wrong packets and locked you out of something. Or perhaps you typed an errant debug command that forced you to drive an hour to reboot a switch that was no longer responding. All of these things seem to happen to people as part of the learning process.

But how many times have we typed something in to create a change and found that it broke more than we expected? Like changing a native VLAN on a trunk and bringing down a link we didn’t intend to affect? These unforeseen accidents are the kinds of problems that can easily be magnified by scripts or automation.

I wrote a post for SolarWinds a couple of months ago about people blaming tools. In it, there was a story about a person that uploaded the wrong switch firmware to a server and used an automated tool to kick off an upgrade of his entire network. Only after the first switch failed to return to normal did he realize that he had downloaded an incorrect firmware to the server. And the command he used to kick off the upgrade was not the safe version of the command that checks for compatibility. Instead, it was the quick version of the command that copied the firmware directly into flash and rebooted the switch without confirmation.

While people are quick to blame tools for making mistakes race through the network, those issues would be mistakes no matter what. Just because a system is capable of being automated doesn’t mean that your commands are exempt from being checked and rechecked. Too often a typo or an added word somewhere in the mix causes unintended chaos because we didn’t take the time to make sure there were no problems ahead of time.
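As a rough illustration of checking before kicking anything off, here is a minimal Python sketch of a pre-flight check that refuses to proceed when a firmware image doesn’t match the hardware it’s about to land on. Everything in it is an assumption made for the example: the `FirmwareImage` class, the `preflight_errors` function, and the idea that you keep a list of supported models per image are all hypothetical, not part of any vendor tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirmwareImage:
    # Hypothetical record of an image and the hardware models it supports.
    filename: str
    supported_models: frozenset


def preflight_errors(image, targets):
    """Return a list of human-readable problems; an empty list means go.

    targets maps a device name to its hardware model (hypothetical inventory).
    """
    errors = []
    if not targets:
        errors.append("no target devices selected")
    for name, model in sorted(targets.items()):
        if model not in image.supported_models:
            errors.append(
                f"{name}: model {model} is not supported by {image.filename}"
            )
    return errors
```

The point is only that the automation should stop and complain when the list isn’t empty, rather than trusting that whoever staged the file got it right.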

Fibonacci Style

I’ve always tried to do testing and regression in a controlled manner. For some places that have simulators and test networks to try out changes this method might still work. But for those of us that tend to fly by the seat of our pants on production devices, it’s best to artificially limit the damage before it becomes too great. Note that this method works with automation systems too provided you are controlling the logic behind it.

When you go out to test a network-wide change or perform an upgrade, pick one device as your guinea pig. It shouldn’t be something pushing massive production traffic like a core switch. Something isolated in the corner of the network usually works just fine. Test your change there outside production hours and with a fully documented backout plan. Once you’ve implemented it, don’t touch anything else for at least a day. This gives the routing tables time to settle down completely and all of the aging timers a chance to expire and tables to recompile. Only then can you be sure that you’re not dealing with cached information.

Now that you’ve done it once, you’re ready to make it live to 10,000 devices, right? Absolutely not. Now that you’ve proven that the change doesn’t cause your system to implode and take the network with it, you pick another single device and do it again. This time, you either pick a neighbor device to the first one or something on the other side of the network. The other side of the network ensures that changes don’t ripple across between devices over the 24-hour watch period. On the other hand, if the change involves direct connectivity changes between two devices you should test them to be sure that the links stay up. Much easier to recover one failed device than 40 or 400.

Once you follow the same procedure with the second upgrade and get clean results, it’s time to move to doing two devices at once. If you have a fancy automation system like Ansible or Puppet this is where you will be determining that the system is capable of handling two devices at once. If you’re still using scripts, this is where you make sure you’re pasting the right info into each window. Some networks don’t like two devices changing information at the same time. Your routing table shouldn’t be so unstable that a change like this would cause problems but you never know. You will know when you’re done with this.

Now that you’ve proven that you can make changes without cratering a switch, a link, or the entire network all at once, you can continue. Move on to three devices, then five, then eight. You’ll notice that these rollout plans are following the Fibonacci Sequence. This is no accident. Just like the appearance of these numbers in nature and math, having a Fibonacci rollout plan helps you evaluate changes and rollback problems before they grow out of hand. Just because you have the power to change the entire network at once doesn’t mean that you should.
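If you’re controlling the logic behind an automation system, the rollout plan above is easy to express in code. What follows is a minimal Python sketch, not a finished tool: `fibonacci_batches`, `rollout`, `apply_change`, and `is_healthy` are hypothetical names invented for this example, and the day-long wait between batches is left out entirely.

```python
from typing import Iterator, List, Sequence


def fibonacci_batches(devices: Sequence[str]) -> Iterator[List[str]]:
    """Yield devices in batches sized 1, 1, 2, 3, 5, 8, ... until none remain."""
    size, next_size = 1, 1
    start = 0
    while start < len(devices):
        # The final batch may be smaller than the Fibonacci size.
        yield list(devices[start:start + size])
        start += size
        size, next_size = next_size, size + next_size


def rollout(devices, apply_change, is_healthy):
    """Apply a change batch by batch, stopping at the first unhealthy batch.

    apply_change and is_healthy are hypothetical callables supplied by the
    caller. A real plan would also wait out the watch period between batches.
    Returns the list of devices that were changed.
    """
    changed = []
    for batch in fibonacci_batches(devices):
        for device in batch:
            apply_change(device)
            changed.append(device)
        if not all(is_healthy(device) for device in batch):
            break  # stop and back out instead of widening the damage
    return changed
```

With twelve devices this produces batches of 1, 1, 2, 3, and 5. If something breaks in the third batch, only four devices have been touched, which is a lot easier to recover than the whole network.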


Tom’s Take

Automation isn’t the bad guy here. We are. We are fallible creatures that make mistakes. Before, those mistakes were limited to how fast we could log into a switch. Today, computers allow us to make those mistakes much faster on a larger scale. The long-term solution to the problem is to test every change ahead of time completely and hope that your test catches all the problems. If it doesn’t then you had better hope your backout plan works. But by introducing a rolling system similar to the Fibonacci sequence above, I think you’ll find that mistakes will be caught sooner and problems will be rectified before they are amplified. And if nothing else, you’ll have lots of time to explain to your junior admin all about the wonders of Fibonacci in nature.

Adapting Applications And Avoiding Acrobatic Adjustments

A couple of months ago, I was on a panel at TechUnplugged where we talked about scaling systems to large sizes. Here’s a link to the video of that panel:

One of the things that we discussed in that panel was applications. Toward the end of the discussion we got into a bit of a back-and-forth about applications and the systems they run on. I feel like it’s time to develop those ideas a bit more.

The Achilles’ Heel

My comments about legacy applications are pointed. If a company is spending thousands of dollars and multiple hours of time in the engineering team to reconfigure the network or the storage systems to support an old application, my response was simple: go out of business.

It does sound a bit flippant to think that a company making a profit should just close the shutters and walk away. But that’s just the problem that we’re facing in the market today. We’ve spent an inordinate amount of time creating bespoke, custom networks and systems to support applications that were written years, or even decades, ago in alien environments.

We do it every day without thinking. We have to install this specific Java version on the server to get the payroll application running. Don’t install the new security updates on this server because it breaks the HR database. We have to have the door security server running in the same VLAN as the camera system or they won’t communicate. The list seems to be endless.

The problem is not that we’re jumping through hoops to fix all of these things. The problem is really that we shouldn’t have to do it. Look at all of the work that’s being done in agile development today. Companies are creating programs that do things that we thought were impossible just a few years ago. Features are being added rapidly without compromising stability. Applications can be deployed and run in the cloud or on-site. It’s a brave new world if you’re willing to throw out the basic assumption that applications are immutable.

The Arrow of Apollo

Everyone has made concessions to an application at some point in their career. We’ve done very dumb things to make software work. We do it so often that it becomes second nature. Entire product features have been developed because applications are too stupid to work in certain scenarios.

Imagine if you will what happens when you’re told that the new HR database doesn’t like it when the database is slow to respond. You likely start thinking of all the ways that you can make the network respond faster or reroute traffic to prevent this scenario from happening. You’re willing to actively destroy and reconfigure network architecture because one software program doesn’t like latency.

How far are you willing to take it? In some cases the answers are pretty scary. Cisco has a line of industrial switches with integrated NAT because some systems don’t know how to handle not being hard-coded with IP addresses. I’ve disabled roaming between access points in a wireless network because the medical records system couldn’t handle roaming to a new AP as it made the database connection close instantly. I’ve even had to disable QoS in a network because the extra latency it was causing in the CRM application was making the sales people cranky.

All of these solutions to problems assumed that the application that was having the problem can’t be fixed. What would happen if we turned that idea back around and fought the fight differently? What if we started the discussion of application problems by working through application programming issues instead? What if the fault of the system wasn’t at the network level or the compute level but instead was at the software level because the development team did a bad job of error handling?

With all the talk of making resources more like the cloud, where the underlay systems exist in a void apart from the software running on top of them, we do have to draw the line at bad application programming. Netflix can’t assume that Amazon will create a special channel link to make application recovery faster. They have to write their software in such a way that it can survive problems in the underlay and avoid them.

Application developers need to stop thinking that the underlying systems will always work perfectly and fix any problems that might crop up. Instead, those applications that are being written need to account for issues and solve their own problems. The Netflix Chaos Monkey ensures that application folks don’t get complacent and code their programs to work when other things are absent. If Netflix stopped working for anything less than AWS going down they wouldn’t have the subscriber base they enjoy today. Netflix makes money because they don’t assume that the networks and systems will always be there. They make money because they know how to react when those systems aren’t there.
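To make the idea of software solving its own problems a little more concrete, here is a minimal Python sketch of an application retrying a flaky dependency with a growing delay and then falling back to something degraded but usable. It is an illustration under assumptions, not how Netflix does it: `call_with_retry`, `operation`, and `fallback` are hypothetical names, and treating `ConnectionError` as the only recoverable failure is a simplification.

```python
import time


def call_with_retry(operation, fallback, attempts=3, base_delay=0.1,
                    sleep=time.sleep):
    """Try operation up to `attempts` times, then return fallback().

    The delay doubles after each failure (base_delay, 2 * base_delay, ...).
    `sleep` is injectable so the wait can be skipped in tests.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # Only wait if another attempt is coming.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed: degrade gracefully instead of crashing.
    return fallback()
```

A dropped database connection during a wireless roam, for example, becomes a short pause and a reconnect inside the application rather than a reason to redesign the network around it.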


Tom’s Take

The new era of networking and mobile computing has embraced software. Containers have shown us how we can create instances that can be destroyed and recreated as needed. Applications on mobile devices work like they are supposed to because we can’t get into those subsystems to make workarounds for things. It’s time for other application developers to realize that too. We need devs that are willing to write logic into their programs to take control when things go wrong. Everything that we can do to limit custom network configurations for applications takes us one step closer to the utopia we’ve always wanted: software that can run anywhere and scale as much as needed. Because resilient software is better than trying to fix the problems in hardware any day.

Intel and the Network Arms Race

Networking is undergoing a huge transformation. Software is surely a huge driver for enabling technology to grow by leaps and bounds and increase functionality. But the hardware underneath is growing just as much. We don’t seem to notice as much because the port speeds we deal with on a regular basis haven’t gotten much faster than the specs we read about years ago. But the chips behind the ports are where the real action is right now.

Fueling The Engines Of Forwarding

Intel has jumped into networking with both feet and is looking to land on someone. Their work on the Data Plane Development Kit (DPDK) is helping developers write code that is highly portable across CPU architectures. We used to deal with specific microprocessors in unique configurations. A good example is Dynamips.

Most everyone is familiar with this program or the projects that it spawned, Dynagen and GNS3. Dynamips worked at first because it emulated the MIPS processor found in Cisco 7200 routers. It just happened that the software used the same code for those routers all the way up to the first releases of the 15.x train. Dynamips allowed for the emulation of Cisco router software but it was very, very slow. It almost didn’t allow for packets to be processed. And most of the advanced switching features didn’t work at all thanks to ASICs.

Running networking code on generic x86 processors doesn’t provide the kinds of performance that you need in a network switching millions of packets per second. That’s why DPDK is helping developers accelerate their network packet forwarding to approach the levels of custom ASICs. This means that a company could write software for a switch using Intel CPUs as the base of the system and expect to get good performance out of it.

Not only can you write code that’s almost as good as the custom stuff network vendors are creating, but you can also have a relative assurance that the code will be portable. Look at the pfSense project. It can run on some very basic hardware. But the same code can also run on a Xeon if you happen to have one of those lying around. That performance boost means a lot more packet switching and processing. No modifications to the code needed. That’s a powerful way to make sure that your operating system doesn’t need radical modifications to work across a variety of platforms, from SMB and ROBO all the way to an enterprise core device.

Fighting The Good Fight

The other reason behind Intel’s drive to get DPDK to everyone is to fight off the advances of Broadcom. It used to be that the term merchant silicon meant using off-the-shelf parts instead of rolling your own chips. Now, it means “anything made by Broadcom that we bought instead of making”. Look at your favorite switching vendor and the odds are better than average that the chipset inside their most popular switches is a Broadcom Trident, Trident 2, or even a Tomahawk. Yes, even the Cisco Nexus 9000 runs on Broadcom.

Broadcom is working their way to the position of arms dealer to the networking world. It soon won’t matter what switch wins because they will all be the same. That’s part of the reason for the major differentiation in software recently. If you have the same engine powering all the switches, your performance is limited by that engine. You also have to find a way to make yourself stand out when everything on the market has the exact same packet forwarding specs.

Intel knows how powerful it is to become the arms dealer in a market. They own the desktop, laptop, and server market space. Their only real competition is AMD, and one could be forgiven for arguing that the only reason AMD hasn’t gone under yet is through a combination of video card sales and Intel making sure they won’t get in trouble for having a monopoly. But Intel also knows what it feels like to miss the boat on a chip transition. Intel missed the mobile device market, which is now ruled by ARM and custom SoC manufacturing. Intel needs to pull off a win in the networking space with DPDK to ensure that the switches running in the data center tomorrow are powered by x86, not Broadcom.


Tom’s Take

Intel’s on the right track to make some gains in networking. Their new Xeon chips with lots and lots of cores can do parallel processing of workloads. Their contributions to CoreOS will help accelerate the adoption of containers, which are becoming a standard part of development. But the real value for Intel is helping developers create portable networking code that can be deployed on a variety of devices. That enables all kinds of new things to come, from system scaling to cloud deployment and beyond.

Video And The Death Of Dialog

I was reading a trivia article the other day about the excellent movie Sex, Lies, and Videotape when a comment by the director, Steven Soderbergh, caught my eye. The quote, from this article, talks about how people use video as a way to distance themselves from events. Soderbergh used it as a metaphor in a movie made in 1989. In today’s society, I think video is having this kind of impact on our careers and our discourse in a much bigger way.

Writing It Down In Pictures

People have become huge consumers of video. YouTube gets massive amounts of traffic. Devices have video recording capabilities built in. It’s not uncommon to see a GoPro camera attached to anything and everything and see people posting videos online of things that happen.

My son is a huge fan of videos about watching other people play video games. He’ll watch hours of video of someone playing a game and narrating the experience. When I tell him that he’s capable of playing the game himself he just tells me, “It’s not as fun that way Dad.” I, too, have noticed that a lot of things that would normally have been written down are narrated as videos today.

A great example of this is the Stuck in Traffic video blog series from J Wolfgang Goerlich (@JWGoerlich). These videos are great examples of things that would have been blog posts just a few years ago but have become videos that were narrated and posted to a channel for people to consume. This is also the way that podcasts have risen to dominate the attention of people looking to consume information. But video requires a bit more attention as compared to audio-only discussions.

One of the big issues I see with videos is that they are not living, breathing documents. They exist as they are created with no way to modify the content short of destroying it and recreating it. If I write a blog post and make a factual error, it’s very easy to fix that issue. I can write a note about how I made a mistake and someone pointed it out. Or, in some cases, I can write a whole new post about the error and how I figured out what was wrong.

But video is different. I find all too often that people make factual errors in videos by either misstating something or being plain mistaken. Usually these errors aren’t corrected in the process because the subject is unaware of things. But instead of being able to correct it with a follow up or a postscript, most video producers are forced to cover the incorrect comment with a large annotation in the video window pointing out that they were wrong and force the viewer to pay attention to the comment and not the spoken word or written word in the original video.

The lack of ability to correct problems and create living documents is a huge one for video creators. Errors can’t be easily fixed. On platforms like YouTube you can’t even upload a new video in place of the old one without destroying all of your views and comments. It makes mea culpas a huge pain. There’s no process to fix things unless you catch them before posting.

On Broadcast With No Mic

The other thing that bothers me the most about video is actually very similar to the reason why I hate keynotes. The whole process of broadcasting a message without soliciting feedback is irritating at best. With a blog post, you can have comments and discussions and even more posts about subjects that go on to create commentary. Videos are static. You can’t start watching one and then come back to it later like you can with a blog post. Of course videos can be paused, but the vast majority of video creators do their best to minimize the amount of dead air in a video.

It’s also very difficult to create discussion with videos. Instead of being able to address points one at a time and create dialog there, you are forced to address or refute points in a series with no stopping. That makes it easy to overlook things without realizing that you missed a great idea or you could have summed things up with a very easy point somewhere else.

What’s missing is the ability to let conversations develop. If you think of Slack and email as the ultimate form of conversation, video is the ultimate form of one-sided discussion. Television, movies, and other video sources are designed to deliver content with no regard for feedback going the other direction. That is the key that is missing to make video be something beyond a simple broadcast medium.


Tom’s Take

Video is a tool designed to get your views across with minimal input from viewers. It took me a while to realize this until I heard my son “closing” a video with the standard “like, share, and subscribe” type of sign off. There wasn’t any mention of leaving comments or creating a video reply. It was really at that point that I realized that video blogs and channels are the pinnacle of insulating us from the audience. All a creator needs to do is post videos and turn off comments and you can almost guarantee that they can continue creating messages that people will hear but never be able to respond to.