DPUs Could Change The Network Forever

You wouldn’t think that AWS re:Invent would be a big week for networking, would you? Most of the announcements focus on everything related to the data center, and teasing out the networking-specific pieces isn’t easy. That’s why I found mention of a new-ish protocol in an unrelated article to be fascinating.

In this Register piece about CPUs there’s a mention of the Nitro DPU. More importantly, there’s also a reference to something that Amazon has apparently been working on for the last couple of years. It turns out that the world’s largest online bookstore and data center company is looking to get rid of TCP.

Rebuilding Transport

The new protocol was developed in 2020. Referred to as Scalable Reliable Datagram (SRD), it was built to solve specific challenges Amazon was seeing related to performance in their cloud. Amazon decided that TCP posed big enough problems for them that they needed to be addressed.

The first was that dropped packets required retransmission. In an environment like the Internet that makes sense. You want to get the data you lost. However, when TCP was developed fifty years ago, the amount of data that could be lost in transit was tiny compared to the flows of today. Likewise, TCP doesn’t really know how to take advantage of multipath flows. That’s because any packet that arrives out-of-order stalls the stream until it can be reassembled in order for the operating system to read it. That makes total sense when you think about the relative power of a computer back in the 70s and 80s. If the CPU is already super busy trying to do other things, you don’t want it to have to deal with a messy stream of packets too. Halting the flow until it’s reassembled makes sense.

Today, Amazon would rather have the packets arrive over multiple different paths and be reassembled higher up the stack. Instead of forcing the system to stop transmitting the entire flow until the lost or out-of-order pieces are fixed, they’d rather get the whole thing delivered and start putting the puzzle together while the I/O interface is processing the next flow. That’s because the size of the flows is creating communications issues between systems. They would rather accept slightly higher latency in exchange for better overall throughput.
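
If you want to picture what that looks like in code, here’s a toy Python sketch of the receive side. This is my own illustration, not Amazon’s implementation, which lives in Nitro hardware: packets carry sequence numbers, arrive in any order over any path, and the receiver hands up contiguous data as the gaps fill instead of stalling the whole stream.

```python
# Toy model of out-of-order reassembly, in the spirit of SRD's receive
# side. An illustration only, not Amazon's actual implementation.

class ReorderBuffer:
    """Accept packets in any order; release contiguous data as gaps fill."""

    def __init__(self):
        self.next_seq = 0   # next sequence number the application expects
        self.pending = {}   # out-of-order packets parked by sequence number

    def receive(self, seq, payload):
        """Called for every arriving packet, regardless of path or order."""
        self.pending[seq] = payload
        delivered = []
        # Drain everything that is now contiguous. No stall of the whole
        # stream; just fill the gap when it arrives and move on.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered

buf = ReorderBuffer()
print(buf.receive(1, "B"))   # [] -- gap at 0, parked, nothing stalls upstream
print(buf.receive(2, "C"))   # [] -- still parked behind the gap
print(buf.receive(0, "A"))   # ['A', 'B', 'C'] -- gap filled, all released
```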

DPUs Or Bust

If this is so obvious, why has no one rewritten TCP before? Well, we hinted at it when we discussed the fact that TCP stops the flow to sort out the mess. With something like SRD, the CPU would be fully tasked with reassembling flows because networking communications are constant. If you’re hoping that whatever TCP’s successor is will just magically sort things out, you’re going to find a system that is quite busy with things that shouldn’t be taking so much time. The perceived gains in performance are going to evaporate.

The key to how SRD actually works lies not in the protocol but in the way it’s implemented in hardware. SRD only works if you’re using an AWS Nitro DPU. When Amazon says they want the packets to be reassembled “up the stack” they’re really saying they want the DPU to do the work of putting the pieces back together before presenting the reassembled packet to the system. The system itself doesn’t know the packets arrived out of order. The system doesn’t even know how the packets arrived. It just knows it sent data somewhere else and that it arrived with no errors.

The magic here is the DPU. TCP works for the purpose it was designed for. If you’re transmitting packets over a wide area in serial fashion and there’s packet loss on the link, TCP is the way to go. Amazon SRD only works with Nitro-configured systems in AWS. It’s a given that many servers are now in AWS and more than a few are going to have this additional piece of hardware installed and configured. The value is that having this enabled is going to increase performance. But there’s a cost.

I think back to configuring jumbo Ethernet frames for a storage array back in 2011. Enabling 9000-byte frames between the switch and the array really increased performance. However, if you plugged the array in anywhere else, or plugged anything else into that port on the switch, it broke. Why? Because 9000-byte Ethernet isn’t the standard. It’s a special configuration that only works when explicitly enabled.
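
If you want to avoid finding that out the hard way, you can test whether jumbo frames actually survive a path end-to-end by pinging with the don’t-fragment bit set and a jumbo-sized payload. A quick Linux-flavored sketch, with a placeholder hostname:

```python
# Rough check for end-to-end jumbo frame support using Linux ping.
# -M do sets the don't-fragment bit; 8972 = 9000 - 20 (IP) - 8 (ICMP).
import subprocess

def supports_jumbo(host, mtu=9000):
    payload = mtu - 28  # subtract IP and ICMP header overhead
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-s", str(payload), host],
        capture_output=True,
    )
    return result.returncode == 0

# 'storage-array.example.com' is a placeholder for your array's address.
if not supports_jumbo("storage-array.example.com"):
    print("Jumbo frames are NOT passing end-to-end. Check every hop's MTU.")
```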

Likewise, SRD works well within Amazon. You need to specifically enable it on your servers, and those performance gains aren’t going to happen if you need to talk to something that isn’t SRD-enabled or isn’t configured with a Nitro DPU. Because the hard work is being done by the DPU, you don’t have to worry about a custom hardware configuration in your application ruining your ability to migrate. However, if you’re relying on a specific performance level in the app to make things happen, like database lookups or writing files to a SAN, you’re not going to be able to move that app to another cloud without incurring a performance penalty.


Tom’s Take

I get that companies like Amazon are heavily invested in Ethernet but want higher performance. I also get that DPUs are going to enable us to do things with systems that we’ve never even really considered before, like getting rid of TCP and optimizing how packets are sent through the network. It’s a grand idea, but it not only breaks compatibility, it creates a gap in performance expectations as well. We already see this in the wireless space. People want gigabit file transfers with Wi-Fi 6 and are disappointed because their WAN connection isn’t that fast. If Amazon is promising super fast communications between systems, users are going to be disappointed when they don’t get that in other clouds or between non-DPU systems. It creates lock-in and sets expectations that can’t be met. The future of networking is going to involve DPUs. But in order to really change the way we do things we need to work with the standards we have and make sure everyone can benefit.

Hedgehog – The Network OS Distro?

You’ve probably seen by now that there’s a new entrant into the market for network operating systems. Hedgehog came out of stealth mode this week to fanfare from the networking community. If you read through the website you might question why I labeled them as a network operating system. While they aren’t technically the OS, I think it’s more important to look at them as an OS distribution.

Cacophony of Choice

Hedgehog starts from a very simple premise. Cloud networking is where we’re all headed. Whether you’re running entirely on-premises, fully in the public cloud, or in some kind of super-multi-hybrid cloud offering, you’re all chasing the same thing. You want a stable system that acts as a force multiplier for your operations teams, reducing the time it takes for users to get their builds deployed. It’s been said before, but the idea of cloud is to get IT out of the way of the business.

Streamlining processes means automating a lot of the things that were formerly done by people. That means building repeatable and consistent tools to make that happen. If you’ve ever worked on AWS or Google Cloud, you know you have lots of access to that tooling. Perhaps it’s not as full-featured as rolling your own toolset, but that’s the tradeoff for running in someone else’s cloud. Notice that I left Microsoft Azure off that list.

Azure’s networking stack has been built on SONiC, a Linux-based NOS built to scale in the cloud and solve the challenges that Microsoft has faced in their hyperscale data centers. They’ve poured resources into making SONiC exactly what they needed. However, one of the challenges when that happens is that some of those things don’t scale down to the enterprise. I’m not saying you shouldn’t use SONiC. I’m saying that it’s not easy to adapt SONiC to what you want to do if you’re not Microsoft.

Speedy Adoption

In a way, it’s the same problem that Linux faced 25 years ago. If you really wanted to run it on a system you could download the source code and compile the kernel yourself. However, a kernel doesn’t do much without software running on top of it. How could I write blog posts or check the time or get on the Internet without other applications? That need for additional resources to make the process of using Linux easier and more complete is where we got the rise of the Linux distribution, often shortened to distro.

Distros made adopting Linux easier. You didn’t have to go out and find sources for programs to run on your system after compiling the kernel. You could just install everything like a big package and get going. It also meant that you could swap out programs and tools much more easily than on other operating systems. Ever tried to get rid of Notepad on Windows? It’s practically a system tool. On the other hand, I think most Linux users can tell me their five favorite text editors off the top of their head. The system is very extensible.

Hedgehog acts like the distro of yore for SONiC. It makes the process of obtaining the OS much easier than it would be otherwise and includes a toolset that amplifies your networking experience. The power of cloud networking comes from optimization and orchestration. Hedgehog gives you that capability. It allows you to run the same kinds of tooling that you would use in the cloud on your enterprise data center networking devices.

If you’re starting to standardize on Kubernetes to make your applications more portable to the cloud, then Hedgehog and SONiC can help you. If you’re looking to build more edge computing functionality into the stack, Hedgehog has you covered. The Hedgehog team is building the orchestration capabilities that are needed in the enterprise right now to help you leverage SONiC. Because that tooling doesn’t exist outside of Microsoft right now, you can believe that the Hedgehog team is addressing the needs of enterprise operations teams. They are the drivers for SONiC adoption. Making sure they can take care of daily tasks is paramount.

The distro launched Linux into the enterprise. Clunky DIY tooling can only scale so far. If you want to be serious about adopting a cloud-first mentality in your organization, you need to make sure you’re using proven tools that scale and don’t fall apart every time you try to make a minor change. Your data center isn’t Facebook or Google or Azure. However, the lessons learned there and the way we apply them at the enterprise level will go a long way toward providing the advantages of the cloud for everyday use. Thanks to companies like Hedgehog that are concentrating on the way to bring that to market, we have a chance to see it sooner than we’d hoped.

To learn more about Hedgehog and how they make SONiC easier for the enterprise, make sure to check out their website at https://githedgehog.com/

Who Pays The Price of Redundancy?

No doubt by now you’ve seen the big fire that took out a portion of the OVHcloud data center earlier this week. These kinds of things are difficult to deal with on a good day. This is why data centers have redundant power feeds, fire suppression systems, and the ability to get back up to full capacity. Modern data centers are getting very good at ensuring they can stay up through most events that could impact an on-premises private data center.

One of the issues I saw that was ancillary to the OVHcloud outage was the small group of people frustrated that their systems went down when the fire knocked out the racks where their instances lived. More than a couple of comments said that clouds should not go down like this, asked about credit for time spent offline, or otherwise complained about unavailability. By and large, most of those complaining were running non-critical systems or were using the cheapest possible instances for their hosts.

Aside from the myopia that “cloud shouldn’t go down”, how do we deal with this idea that cloud redundancy doesn’t always translate to single instance availability? I think we need to step back and educate people about their responsibilities to their own systems and who ultimately pays for redundancy.

Backup Plans

I mentioned earlier that big accidents or incidents are the reasons why public cloud data centers have redundant systems. They have separate power feeds, generator power backups, extra cabling to prevent cuts from taking down systems, redundant cooling systems to keep things cold, and even redundant network connectivity across a variety of providers.

What is the purpose of all of this redundancy? Is it for the customers or the provider? The reason all this exists is that the provider needs to ensure it will take a massive issue to interrupt service to the customer. In a private data center you can be knocked offline when the primary link goes down or a UPS decides to die on you. In a public data center you could knock out ten customers with a dead rack PDU. So it is in the provider’s best interest to ensure they have the right redundancy built in to keep their customers happy.

Public data centers pay for all of this redundancy to keep their customers coming back each month. If your data center gets knocked offline by some simple issue, you can better believe you’re going to be shopping for a new partner. You’re paying for their redundancy with your monthly billing cycle. Sure, that massive generator may be sitting there just in case they need it. But they’re recouping the cost of having it around by charging you a few extra cents each cycle.
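
The back-of-the-napkin math makes the point. With completely made-up numbers, a big-ticket redundancy item spread across a large customer base barely registers on any single bill:

```python
# Back-of-the-napkin amortization with made-up numbers.
generator_cost = 250_000        # hypothetical price of a backup generator
useful_life_months = 120        # ten years of service
customers = 2_000               # tenants sharing the facility

per_customer_monthly = generator_cost / useful_life_months / customers
print(f"${per_customer_monthly:.2f} per customer per month")  # ~$1.04
```

Change any of those numbers and the per-customer figure barely moves. That’s the economics of shared redundancy.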

Extending the Backup Bubble

What about providing redundancy for your applications and servers, though? Like the OVHcloud issue above, why doesn’t the provider just back my stuff up or move it to a different server when everything goes down? I mean, vMotion and DRS do it in my private data center. Can’t they just check a box and make it happen?

There are two main reasons why this doesn’t happen in public cloud right now. The first is pretty easy. Having to back up, restore, replicate, and manage customer availability is going to take more than a few extra hands working on the customer infrastructure. Sure, they could configure vMotion (or something similar) to send your VMs to a different rack if one were to go offline. But who keeps tabs on that to make sure it happens? Who tests the failover to keep it consistent? What happens if there is a split-brain scenario? Where is the password for that server stored?

You’re probably answering all of these questions off the top of your head because you’re the IT expert, right? So if the cloud provider is doing this for you, what are you going to be doing? Clicking “next” on the installation prompt? If your tasks are being done by some other engineer in the cloud, what are we paying you for again? Just like automation, having a cloud engineer do your job for you means we don’t need to pay you any longer.

The second reason is liability. Right now, if there is an incident that knocks a cloud provider offline, they’re liable for the downtime for all their customers. Most of the contracts have a force majeure clause built into them that exempts liability for extraordinary circumstances, such as fire, weather, or even terrorist activity. That way the provider doesn’t need to pay you back for something there was no way to have foreseen. If there is some kind of outage caused by a technical issue, then they will owe you for that one.

However, if the cloud provider starts managing your equipment and services for you, then they are liable if there is an outage. If they screw up the vMotion settings or DRS starts getting too aggressive and migrates a VM to a bad host, who is responsible for the downtime? If it’s you, then you get yelled at and the company loses money. If it’s the provider managing it for you, then the provider gets yelled at, threatened, and possibly litigated against to recover the lost income. See now why no provider wants to touch your stuff? We used to have a rule when I worked at a VAR: be very careful about which problems you decide to fix outside the project scope. Because as soon as you touch those they become your problems and you’re on the hook to fix them.

Lastly, the provider isn’t responsible for your redundancy for one other simple reason: you’re not paying them for it. If Amazon or Microsoft or Google offered a hosting package that included server replication and monitoring and 99.9999% uptime of your application data do you think it would cost the same as the basic instance pricing? You’d better believe it wouldn’t! These companies would be happy to sell you just what you’re looking for but you aren’t going to want to pay the price for it. It’s easy to build in the cost of a generator spread across hundreds or thousands of customers. But if you want someone backing your data up every day and validating it you’re going to be paying the lion’s share of the cost. And most of the people using low-cost providers for non-critical workloads aren’t going to want to pay extra anyway.
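
It helps to know what those nines actually buy, too. The arithmetic is simple enough to sketch:

```python
# Allowed downtime per year for a given availability target.
SECONDS_PER_YEAR = 365 * 24 * 3600

for nines in ["99.9", "99.99", "99.999", "99.9999"]:
    availability = float(nines) / 100
    downtime = SECONDS_PER_YEAR * (1 - availability)
    print(f"{nines}% -> {downtime / 60:7.1f} minutes/year")
```

That last line works out to about thirty seconds of downtime a year. Nobody is delivering that to your application at basic instance pricing.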


Tom’s Take

I get it. Cloud is nice and it’s always there. Cloud is easier to build out than your on-prem data center. Cloud makes life easy. However, cloud is not magic. Just because AWS doesn’t have outages every month doesn’t mean your servers won’t go down if you’re not backing them up. Just because you have availability zones you can use doesn’t mean data is magically going to show up in Oregon because you want it to. You’re going to have to pay the cost to make your infrastructure redundant on top of the provider infrastructure. That means money invested in software, time invested in deployment, and workers invested in making sure it all checks out when you need it to be there to catch your outages. Don’t assume the cloud is going to solve every one of your challenges. Because if it did we’d be paying a lot more for it and IT workers wouldn’t be getting paid at all.

Opening Up Remote Access with Opengear

The Opengear OM2200

If you had told me last year at this time that remote management of devices would be a huge thing in 2020 I might have agreed but laughed quietly. We were traveling down the path of simultaneously removing hardware from our organizations and deploying IoT devices that could be managed easily from the cloud. We didn’t need to access stuff like we did in the past. Even if we did, it was easy to just SSH or console into the system from a jump box inside the corporate firewall. After all, who wants to work on something when you’re not in the office?

Um, yeah. Surprise, surprise.

Turns out 2020 is the Year of Having Our Hair Lit On Fire. Which is a catchy song someone should record. But it’s also the year where we have learned how to stand up 100% Work From Home VPN setups within a week, deploy architecture to the cloud and refactor on the fly to help employees stay productive, and institute massive change freezes in the corporate data center because no one can drive in to do a reboot if someone forgets to do commit confirmed or reload in 5.
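
For those who haven’t leaned on that particular safety net: the trick is to schedule a reboot before you touch anything and cancel it only once you’ve confirmed you still have access. A rough sketch with the Netmiko library, with placeholder device details; check your platform’s exact reload syntax before trusting this:

```python
# Sketch of the "reload in 5" safety net for risky remote changes, using
# Netmiko. Device details and commands are illustrative; verify the
# reload behavior on your platform before relying on this.
from netmiko import ConnectHandler

device = {
    "device_type": "cisco_ios",
    "host": "192.0.2.10",        # placeholder management address
    "username": "admin",
    "password": "hunter2",       # use real credential handling in practice
}

conn = ConnectHandler(**device)

# Schedule a reboot in 5 minutes BEFORE making the risky change.
conn.send_command_timing("reload in 5")
conn.send_command_timing("\n")   # confirm the scheduled reload

conn.send_config_set(["interface GigabitEthernet0/1", "no shutdown"])

# Still reachable? Cancel the pending reload. If the change cut us off,
# the box reboots to its saved config and we get back in.
conn.send_command_timing("reload cancel")
conn.disconnect()
```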

Remote management has always been something that was nice to have. Now it’s something that you can’t live without. If you didn’t have a strategy for doing it before or you’re still working with technology that requires octal cables to work, it’s time you jumped into the 21st Century.

High Gear

Opengear is a company that has presented a lot at Tech Field Day. I remember seeing them for the first time when I was a delegate many, many years ago. As I have grown with Tech Field Day, so too have they. I’ve seen them embrace new technologies like cloud management and 4G/LTE connectivity. I’ve heard the crazy stories about fish farms and Australian emergency call boxes and even some stuff that’s too crazy to repeat. But the theme remains the same throughout it all. Opengear is trying to help admins keep their boxes running even if they can’t be there to touch them.

Flash forward to the Hair On Fire year, and Opengear is still coming right along. During the recent Tech Field Day Virtual Cisco Live Experience in June, they showed off their latest sweet, sweet hardware offerings. Rob Waldie did a great job talking about their new line of NetOps Console servers in this video:

Now, I know what you’re thinking. NetOps? Really? Trying to cash in on the marketing hype? I would have gone down that road if they hadn’t shown off some of the cool things these new devices can do.

How about support for Docker containerized apps? Pretty sure that qualifies as NetOps, doesn’t it? Now your remote console appliance is capable of doing things like running automation scripts and triggering complex logic when something happens. And, because containers are the way the cloud runs now, you can deploy any number of applications to the console server with ease. It’s about as close to an App Store model as you’re going to find, with some nerd knobs for good measure.
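
Assuming the appliance exposes a standard Docker Engine endpoint (a detail you’d want to confirm with Opengear), deploying an app to it looks like deploying a container anywhere else. A hedged sketch with the Docker SDK for Python, using a hypothetical image name:

```python
# Hedged sketch: deploying a containerized automation app to a console
# server, assuming it exposes a standard Docker Engine API endpoint.
import docker

# Placeholder address for the console server's Docker API.
client = docker.DockerClient(base_url="tcp://oob-console.example.com:2375")

# Run a small automation container that watches for events and reacts.
# "netops/port-watcher" is a hypothetical image name.
client.containers.run(
    "netops/port-watcher",
    detach=True,
    restart_policy={"Name": "always"},  # survive appliance reboots
)

print([c.name for c in client.containers.list()])
```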

That’s not all though. The new line of console appliances also comes with an embedded trusted platform module (TPM) chip. You’ve probably seen these on laptops or other mobile devices. They do a great job of securing the integrity of the device. That’s super important if you’re going to deploy console servers into insecure locations. It means no one can grab your device and do things they shouldn’t, like tapping traffic or otherwise trying to compromise security.

Last but not least, there’s an option for 64GB of flash storage on the device. I like this because it means I can do creative things like back up configurations to the storage device on a regular basis just in case of an outage. If and when something happens I can just remote to the Opengear server, console to the device, and put the config back where it needs to be. Pretty handy if you have a device with a dying flash card or something that is subject to power issues on a regular basis. And with an LTE-A global cellular modem, you don’t have to worry about shipping the box to a country where it won’t work.
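
Here’s roughly the kind of thing I mean, sketched with Netmiko. The addresses, credentials, and flash mount point are all placeholders: pull the running config on a schedule and stash a timestamped copy on that local storage.

```python
# Sketch: periodic config backup to local storage, e.g. run on the
# console server itself via cron. Addresses and paths are placeholders.
from datetime import datetime
from pathlib import Path
from netmiko import ConnectHandler

DEVICES = [
    {"device_type": "cisco_ios", "host": "192.0.2.21",
     "username": "backup", "password": "hunter2"},
]
BACKUP_DIR = Path("/mnt/flash/config-backups")   # hypothetical mount point

BACKUP_DIR.mkdir(parents=True, exist_ok=True)
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")

for dev in DEVICES:
    conn = ConnectHandler(**dev)
    config = conn.send_command("show running-config")
    (BACKUP_DIR / f"{dev['host']}-{stamp}.cfg").write_text(config)
    conn.disconnect()
```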


Tom’s Take

I realize that we’re not going to be quarantined forever. But this is a chance for us to see how much we can get done without being in the office. Remember all those budgets for fancy office chairs and the coffee service? They could go to buying Opengear console servers so we can manage devices without truck rolls. Money well spent on reducing the need for human intervention also means a healthier workforce. I trust my family to stay safe in our interactions. But if I have to show up at a customer site to reboot a box? That’s taking chances even under the best of circumstances. And the fewer chances we take in the short term, the healthier the long-term outlook becomes.

We may never get back to the world we had before. And we may never even find ourselves in a 100% Remote Work environment. But Opengear gives us options that we need in order to find a compromise somewhere in the middle.

If you’d like more information about Opengear’s remote access solutions, make sure you check out their website at http://Opengear.com

Disclaimer: As a staff member of Tech Field Day, I was present during Opengear’s virtual presentation. This post represents my own thoughts and opinions of their presentation. Opengear did not provide any compensation for this post, nor did they request any special consideration when writing it. The conclusions contained herein are mine alone and do not represent the views of my employer.

Is Bandwidth A Precious Resource?

During a recent episode of the Packet Pushers Podcast, Greg and Drew talked about the fact that bandwidth just keeps increasing and we live in a world where the solution to most problems is to just increase the pipeline to the data center or to the Internet. I came into networking after the heady days of ISDN lines everywhere and trying to do traffic shaping on slow frame relay links. But I also believe that we’re going to quickly find ourselves in a pickle when it comes to bandwidth.

Too Depressing

My grandparents were alive during the Great Depression. They remembered what it was like to struggle to find food or make ends meet. That one singular experience transformed the way they lived their lives. If you have a relative or know of someone that lived through that time, you have probably noticed they have some interesting habits. They may keep lots of cash on hand stored in various places around the house. They may do things like peel the labels from jelly jars and use the jars as cups. They may even go to great lengths to preserve as much as they can for reuse later “just in case”.

It’s not uncommon for this to happen in the IT world as well. People that have been marked by crazy circumstances develop defense mechanisms against them. Maybe it’s making a second commit to a configuration to ensure it’s correct before being deployed. Maybe it’s always taking a text backup of a switchport before shutting it down in case an old bug wipes it clean. Or, as it relates to the topic above, maybe it’s a network engineer that grew up on slow ISDN circuits trying to optimize links for traffic when there is absolutely no need to do so.

People work with what they’re familiar with. If they treat every link as slow and prone to congestion, they’ll configure their QoS policies and other shaping features as if they were necessary to keep a 128k link alive under a massive traffic load, even if the link has so much bandwidth that it will never trigger the congestion features.

Rational Exuberance

The flip side of the Great Depression grandparent is the relative that grew up during a time when everything was perfect and there was nothing to worry about. The term coined by Alan Greenspan for this phenomenon was Irrational Exuberance, his way of suggesting that the stock market of 1996 was overvalued. It also carries the connotation that people will believe everything is perfectly fine and dandy right up to the point when the rug gets pulled out from underneath them.

Going back to our bandwidth example, think about a network engineer that has only ever known a world like we have today, where bandwidth is plentiful and easily available. I can remember installing phone systems for a school that had gigabit fiber connectivity between sites. QoS policies were non-existent for them because they weren’t needed. When you have all the pipeline you can use, you don’t worry about restraining yourself. You have a plethora of bandwidth at your disposal.

However, you also have an issue with budgeting. Turns out that there’s no such thing as truly unlimited bandwidth. You’re always going to hit a cap somewhere. It could be the uplink from the server to the switch. Maybe it’s the uplink between switches on the leaf-spine fabric. It could even be the WAN circuit that connects you to the Internet and the public cloud. You’re going to hit a roadblock somewhere. And if you haven’t planned for that you’re going to be in trouble. Because you’re going to realize that you should have been planning for the day when your options ran out.

Building For Today

If you’re looking at a modern enterprise, you need to understand some truths.

  1. Bandwidth is plentiful. Until it isn’t. You can always buy bigger switches. Run more fiber. Create cross-connects to increase bandwidth for east-west traffic. But once you hit the wall of running out of switch ports or places to pull fiber you’re going to be done no matter what.
  2. No Matter How Much You Have, It Won’t Be Enough. I learned this one at IBM back in 2001. We had a T3 that ran the entire campus in Minnesota. They were starting to get constrained on bandwidth so they paid a ridiculous amount of money to have another one installed to increase the bandwidth for users. We saturated it in just a couple of months. No matter how big the circuit or how many you install, you’re eventually going to run out of room. And if you don’t plan for that you’re going to be in a world of trouble.
  3. Plan For A Rainy Day. If you read the above, you know you need a plan in place for the day you run out of “unlimited” bandwidth. You need to have QoS policies ready to go. Application inspection engines can be deployed in monitor mode, ready to be enabled at a moment’s notice to restrict usage and prioritize important traffic. Remember that QoS doesn’t magically create bandwidth from nothing. Instead, it optimizes what you have and ensures that it can be used properly by the applications that need it. So you have to know what’s critical and what can be left to best effort. That means doing the groundwork ahead of time so there are no surprises; a sketch of what staging such a policy might look like follows this list. You have to be vigilant too. Who would have expected last year that video conference traffic would be as important as it is today?
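
Staging that groundwork doesn’t have to be fancy. Here’s a hedged IOS-flavored sketch of a rainy-day policy kept ready to push with Netmiko; the class names, markings, and percentages are mine, for illustration only:

```python
# Hedged sketch: a QoS policy kept ready to push when the pipe fills up.
# Class names, markings, and percentages are illustrative only.
from netmiko import ConnectHandler

RAINY_DAY_POLICY = [
    "class-map match-any CRITICAL-APPS",
    " match dscp af41",                  # e.g. video conferencing
    "policy-map WAN-EDGE",
    " class CRITICAL-APPS",
    "  bandwidth percent 40",
    " class class-default",
    "  fair-queue",
    "interface GigabitEthernet0/0",      # placeholder WAN interface
    " service-policy output WAN-EDGE",
]

conn = ConnectHandler(device_type="cisco_ios", host="192.0.2.1",
                      username="admin", password="hunter2")
conn.send_config_set(RAINY_DAY_POLICY)
conn.disconnect()
```

Keep something like this in version control and it’s a five-minute change instead of a panicked design session on the day the link saturates.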

Tom’s Take

Bandwidth is just like any other resource. It’s not infinite. It’s not unlimited. It only appears that way because we haven’t found a way to fill up that pipe yet. For every protocol that tries to be a good steward and not waste bandwidth like OSPF, you have a newer protocol like Open/R that has never known the harsh tundra of an ISDN line. We can make the bandwidth look effectively limitless but only by virtue of putting smart limits in place early and understanding how to make things work more smoothly when the time comes. Bandwidth is precious and you can make it work for you with the right outlook.

Is ACI Coming For The CLI?

I’m soon to depart from Cisco Live Barcelona. It’s been a long week of fun presentations. While I’m going to avoid using the words intent and context in this post, there is one thing I saw repeatedly that grabbed my attention. ACI is eating Cisco’s world. And it’s coming for something else very soon.

Devourer Of Interfaces

Application-Centric Infrastructure has been out for a while and it’s meeting with relative success in the data center. It’s going up against VMware NSX and winning in a fair number of deals. For every person that I talk to that can’t stand it I hear from someone gushing about it. ACI is making headway as the tip of the spear when it comes to Cisco’s software-based networking architecture.

Don’t believe me? Check out some of the sessions from Cisco Live this year. Especially the Software-Defined Access and DNA Assurance ones. You’re going to hear context and intent a lot, as those are the key words for this new strategy. You know what else you’re going to hear a lot?

Contract. Endpoint Group (EPG). Policy.

If you’re familiar with ACI, you know what those words mean. You see the parallels between the data center and the push in the campus to embrace SD-Access. If you know how to create a contract for an EPG in ACI, then doing it in DNA Center is just as easy.

If you’ve never learned ACI before, you can dive right in with the new DNA Center training and get started. And when you finally figure out what you’re doing, you can not only use those skills to program your campus LAN, you can extend them into the data center network as well thanks to consistent terminology.
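
To make the shared vocabulary concrete, here’s roughly what creating those objects looks like against the APIC REST API. This is a sketch with illustrative names; check the exact payload classes against Cisco’s documentation before using it for anything real:

```python
# Rough sketch of creating a tenant and a contract through the APIC
# REST API. Object names are illustrative; verify the payload classes
# against Cisco's APIC documentation before relying on this.
import requests

APIC = "https://apic.example.com"     # placeholder controller address
session = requests.Session()
session.verify = False                # lab only; use real certs in prod

# Authenticate and let the session cookie carry the token.
session.post(f"{APIC}/api/aaaLogin.json", json={
    "aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}
})

# A tenant with a contract hanging off of it. An EPG would then provide
# or consume this contract to permit traffic between endpoint groups.
payload = {
    "fvTenant": {
        "attributes": {"name": "BlogDemo"},
        "children": [
            {"vzBrCP": {"attributes": {"name": "web-to-db"}}}
        ],
    }
}
resp = session.post(f"{APIC}/api/mod/uni.json", json=payload)
print(resp.status_code)
```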

It’s almost like Cisco is trying to introduce a standard set of terms that can be used to describe consistent behaviors across groups of devices for the purpose of cross training engineers. Now, where have we seen that before?

Bye Bye, CLI

Oh yeah. And, while you’re at it, don’t forget that Arista “lost” a copyright case against Cisco for the CLI and didn’t get fined. Even without the legal ramifications, the Cisco-based CLI has been living on borrowed time for quite a while.

APIs and Python make programming networks easy. Provided you know Python, that is. That’s great for DevOps folks looking to pick up another couple of libraries and get those VLANs tamed. But it doesn’t help people that are looking to expand their skillset without learning an entirely new language. People scared by semicolons and strict syntax structure.

That’s the real reason Cisco is pushing the ACI terminology down into DNA Center and beyond. This is their strategy for finally getting rid of the CLI across their devices. Now, instead of dealing with question marks and telnet/SSH sessions, you’re going to orchestrate policies and EPGs from your central database. Everything falls into place after that.

Maybe DNA Center does some fancy Python stuff on the back end to handle older devices. Maybe there’s even some crazy command interpreters literally force-feeding syntax to an ancient router. But the end goal is to get people into the tools used to orchestrate. And that day means that Cisco will have a central location from which to build. No more archaic terminal windows. No more console cables. Just the clean purity of the user interface built by Insieme and heavily influenced by Cisco UCS Director.


Tom’s Take

Nothing goes away because it’s too old. I still have a VCR in my house. I don’t even use it any longer. It sits in a closet for the day that my wife decides she wants to watch our wedding video. And then I spend an hour hooking it up. But, one of these days I’m going to take that tape and transfer it to our Plex server. The intent is still the same – my wife gets to watch videos. But I didn’t tell her not to use the VCR. Instead, I will give her a better way to accomplish her task. And on that day, I can retire that old VCR to the same pile as the CLI. Because I think the ACI-based terminology that Cisco is focusing on is the beginning of the end of the CLI as we know it.

VMware and VeloCloud: A Hedge Against Hyperconvergence?

VMware announced on Thursday that they are buying VeloCloud. This was a big move in the market that immediately set off a huge discussion about the implications. I had originally thought AT&T would buy VeloCloud based on their past relationship, but the acquisition of Vyatta from Brocade over the summer should have been a hint that wasn’t going to happen. Instead, VMware swooped in and picked up the company for an undisclosed amount.

The conversations have been going wild so far. Everyone wants to know how this is going to affect the relationship with Cisco, especially given that Cisco put money into VeloCloud in both 2016 and 2017. Given the acquisition of Viptela by Cisco earlier this year, it’s easy to see that these two companies might find themselves competing for market share in the SD-WAN space. However, I think that this is actually a different play from VMware. One that’s striking back at hyperconverged vendors.

Adding The Value

If you look at the marketing coming out of hyperconvergence vendors right now, you’ll see there’s a lot of discussion around platform. Fast storage, small footprints, and the ability to deploy anywhere. Hyperconverged solutions are also starting to focus on the hot new trends in compute, like containers. Along the way this means that traditional workloads that run on VMware ESX hypervisors aren’t getting the spotlight they once did.

In fact, the leading hyperconvergence vendor Nutanix has been aggressively selling their own hypervisor, Acropolis, as a competitor to VMware. They tout new features and easy configuration as the major reasons to use Acropolis over ESX. The push by Nutanix is to get their customers off of ESX and onto Acropolis to get a share of the VMware budget that companies are currently paying.

For VMware, it’s a tough sell to keep their customers on ESX. There’s a very big ecosystem of software out there that runs on ESX, but if you can replicate a large portion of it natively like Acropolis and other hypervisors do there’s not much of a reason to stick with ESX. And if the VMware solution is more expensive over time you will find yourself choosing the cheaper alternative when the negotiations come up for renewal.

For VMware NSX, it’s an even harder road. Most of the organizations that I’ve seen deploying hyperconverged solutions are not huge enterprises with massive centralized data centers. Instead, they are the kind of small-to-medium businesses that need some functions but are very budget conscious. They’re also very geographically diverse, with smaller branch offices taking the place of a few massive headquarters locations. While NSX has some advantages for these companies, it’s not the best fit for them. NSX works optimally in a data center with high-speed links and a well-built underlay network.

vWAN with VeloCloud

So how is VeloCloud going to play into this? VeloCloud already has a lot of advantages that make them a great complement to VMware’s model. They have built-in multi-tenancy. Their service delivery is virtualized. They were already looking to move toward service providers as their primary market, both network service providers and managed service providers. It sounds like their interests align quite well with VMware already.

The key advantage for VMware with VeloCloud is how it will allow NSX to extend into the branch. Remember how I said that NSX loves an environment with a stable underlay? That’s what VeloCloud can deliver. A stable, encrypted VPN underlay. An underlay that can be managed from one central location, or in the future, perhaps even a vCenter plugin. That gives VeloCloud a huge advantage to build the underlay to get connectivity between branches.

Now, with an underlay built out, NSX can be pushed down into the branch. Branches can now use all the great features of NSX like analytics, some of which will be bolstered by VeloCloud, as well as microsegmentation and other heretofore unseen features in the branch. The large headquarters data center is now available in a smaller remote size for branches. That’s a huge advantage for organizations that need those features in places that don’t have data centers.

And the pitch against using other hypervisors with your hyperconverged solution? NSX works best with ESX. Now you can argue that there is real value in keeping ESX on your remote branches, value that isn’t just costs or features you may one day hope to use if your WAN connection gets upgraded to ludicrous speed. Instead, VeloCloud can be deployed between your HQ or main office and your remote site to bring those NSX functions down into your environment over a secure tunnel.

While this does compete a bit with Cisco from a delivery standpoint, it still isn’t a complete overlap. In this scenario, VeloCloud is a service delivery platform for NSX and not a piece of hardware at the edge. Absent VeloCloud, this kind of setup could still be replicated with a Cisco Viptela box running the underlay and NSX riding on top in the overlay. But I think the market that VMware is going after is going to be building this from the ground up with VMware solutions from the start.


Tom’s Take

Not every issue is “Us vs. Them”. I get that VMware and Cisco seem to be spending more time moving closer together on the networking side of things. SD-WAN is a technology that was inevitably going to bring Cisco into conflict with someone. The third generation of SD-WAN vendors is really companies that didn’t have a proper offering buying up all the first-generation startups. Viptela and VeloCloud are now off the market and they’ll soon be integral parts of their respective parents’ strategies going forward. Whether VeloCloud is focused on enabling cloud connectivity for VMware or retaking the branch from the hyperconverged vendors is going to play out in the next few months. But instead of focusing on conflict with anyone else, VeloCloud should be judged by the value it brings to VMware in the near term.

Back In The Saddle Of A Horse Of A Different Color

I’ve been asked a few times in the past year if I missed being behind a CLI screen or if I ever got a hankering to configure some networking gear. The answer is a guarded “yes”, but not for the reason that you think.

Type Casting

CCIEs are keyboard jockeys. Well, the R&S folks are for sure. Every exam has quirks, but the R&S folks have quirky QWERTY keyboard madness. We spend a lot of time not just learning commands but learning how to input them quickly without typos. So we spend a lot of time with keys and a lot less time with the mouse poking around in a GUI.

However, the trend in networking has been to move away from these kinds of input methods. Take the new Aruba 8400, for instance. The ArubaOS-CX platform that runs it seems to have been built to require the least amount of keyboard input possible. The whole system runs with an API backend and presents a GUI that is a series of API calls. There is a CLI, but anything that you can do there can easily be replicated elsewhere by some other function.

Why would a company do this? To eliminate wasted effort. Think to yourself how many times you’ve typed the same series of commands into a switch. VLAN configuration, vty configs, PortFast settings. The list goes on and on. Most of us even have some kind of notepad that we keep the skeleton configs in so we can paste them into a console port to get a switch up and running quickly. That’s what Puppet was designed to replace!

By using APIs and other input methods, Aruba and other companies are hoping that we can build tools that either accept the minimum input necessary to configure switches or eliminate a large portion of the retyping necessary to build them in the first place. It’s not the first command you type into a switch that kills you. It’s the 45th time you paste the command in. It’s the 68th time you get bored typing the same set of arguments from a remote terminal and accidentally mess one up in a way that requires a physical presence on site to reset your mistake.
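
This is exactly the toil that a few lines of templating kill. A minimal sketch with Jinja2: the skeleton config lives in one place, and the only input a human supplies is the handful of values that actually change per switch.

```python
# Minimal sketch: render a skeleton switch config from the few values
# that actually vary, instead of retyping (and fat-fingering) it.
from jinja2 import Template

SKELETON = Template("""\
hostname {{ hostname }}
vlan {{ vlan }}
 name {{ vlan_name }}
interface range {{ access_ports }}
 switchport access vlan {{ vlan }}
 spanning-tree portfast
""")

print(SKELETON.render(
    hostname="idf2-sw01",            # the only inputs a human supplies
    vlan=110,
    vlan_name="USERS",
    access_ports="GigabitEthernet1/0/1-24",
))
```

Render it, review it, and paste or push it through whatever API the platform gives you. The typo surface shrinks to four values.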

Typing is boring, error prone, and costs significant time for little gain. Building scripts, programs, and platforms that take care of all that messy input for us makes us more productive. But it also changes the way we look at systems.

Bird’s Eye Views

The other reason why my fondness for keyboard jockeying isn’t as great as it could be is because of the way that my perspective has shifted thanks to the new aspects of networking technology that I focus on. I tell people that I’m less of an engineer now and more of an architect. I see how the technologies fit together. I see why they need to complement each other. I may not be able to configure a virtual link without documentation or turn up a storage LUN like I used to, but I understand why flash SSDs are important and how APIs are going to change things.

This goes all the way back to my conversations at VMunderground years ago about shifting the focus of networking and where people will go. You remember? The “ditch digger” discussion?


This is more true now than ever before. There are always going to be people racking and stacking, or doing basic types of configuration. These folks are usually trained with basic knowledge of their task and no vision outside of their job role. Networking apprentices or journeymen, as the case may be. Maybe one out of ten or one out of twenty of them are going to want to move up to something bigger or better.

But for the people that read blogs like this regularly the shift has happened. We don’t think in single switches or routers. We don’t worry about a single access point in a closet. We think in terms of systems. We configure routing protocols across multiple systems. We don’t worry about a single port VLAN issue. Instead, we’re trying to configure layer 2 DCI extensions or bring racks and pods online at the same time. Our visibility matters more than our typing skills.

That’s why the next wave of devices like the Aruba 8400 and the Software Defined Access things coming from Cisco are more important than simple checkboxes on a feature sheet. They remove the visibility of protocols and products and instead give us platforms that need to be configured for maximum effect. The gap between the people that “rack and stack” and those that build the architecture that runs the organization has grown, but only because the middle ground of administration is changing so fast that it’s tough to keep up.


Tom’s Take

If I were to change jobs tomorrow I’m sure that I could get back in the saddle with a couple of weeks of hard study. But the question I keep asking myself is “Why would I want to?” I’ve learned that my value doesn’t come from my typing speed or my encyclopedia of networking command arguments any more. It comes from a greater knowledge of making networking work better and integrate more tightly into the organization. I’m a resource, not a reactionary. And so when I look to what I would end up doing in a new role I see myself learning more and more about Python and automation and less about what new features were added in the latest OSPF release on Cisco IOS. Because knowing how to integrate technology at a high level is more valuable to everyone than just knowing the commands to type to turn the lights on.

Changing The Baby With The Bathwater In IT

If you’re sitting in a presentation about the “new IT”, there’s bound to be a guest speaker talking about the digital transformation or service provider shift in their organization. You can see this coming. It’s a polished speaker, usually a CIO or VP. They talk about how, with the help of the vendor on stage with them, they were able to rapidly transform their infrastructure into something modern while at the same time changing processes to accommodate faster IT response and more productive workers, increasing revenue and transforming IT from a cost center into a profit center. The key components are simple:

  1. Buy new infrastructure from $vendor
  2. Transform all processes to be more agile, productive, and better.

Why do those things always happen in concert?

Spring Cleaning

Infrastructure grows old. That’s a fact of life. Outside of some very specialized hardware, no one is using the same desktop they had ten years ago. No enterprise is still running Windows 2000 server on an IBM NetFinity server. No one is still using 10Mbps Ethernet over Thinnet to connect their offices. Hardware marches on. So when we buy new things, we as technology professionals need to find a way to integrate them into our existing technology stack.

Processes, on the other hand, are very slow to change. I can remember dealing with process issues when I was an intern for IBM many, many years ago. The process we had for deploying a new workstation had many, many reboots involved. The deployment team worked out a new strategy to streamline deployments and make things run faster. We brought our plan to the head of deployments. From there, we had to:

  • Run tests to prove that it was faster
  • Verify that the process wasn’t compromised in any way
  • Type up new procedures in formal language to match the existing docs
  • Then submit them for ISO approval

And when all those conditions were met, we could finally start using our process. All in all, with aggressive testing, it still took two months.

Processes are things that are thought to be carved in stone, never to be modified or changed in any way for the rest of time. Unless the stones break or something major causes a process change. Usually, that major change is a whole truckload of new equipment showing up on the back dock attached to a consultant telling IT there is a better way (TM) to do things.

Ceteris Paribus

Ceteris Paribus is a Latin term that means “all else unchanged”. We use it when we talk about an equation with multiple variables and the need to hold all but one of them constant to be able to measure changes appropriately.

The funny thing about all these transformations is that it’s hard to track what actually made improvements when you’re changing so many things at once. If the new hardware is three or four times faster than your old equipment, would it show that much improvement if you just used your old software and processes on it? How much faster could your workloads execute with new CPUs and memory management techniques? How about collapsing your virtual infrastructure onto fewer and fewer physical servers because of advances there? Running old processes on new hardware can give you a very good idea of how good the hardware is. Does it meet the criteria for selection that you wanted when it was purchased? Or, better still, does it seem like you’re not getting the performance you paid for?

Likewise, how are you able to know for sure that the organization and process changes you implemented actually did anything? If you’re implementing them on new hardware how can you capture the impact? There’s no rule that says that new processes can only be implemented on new shiny hardware. Take a look at what Walmart is doing with OpenStack. They most certainly aren’t rushing out to buy tons and tons of new servers just for OpenStack integration. Instead, they are taking streamlined processes and implementing them on existing infrastructure to see the benefits. Then it’s easy to measure and say how much hardware you need to expand instead of overbuying for the process changes you make.


Tom’s Take

So, why do these two changes always seem to track with each other? The optimist in me wants to believe that it’s people deciding to make positive changes all at once to pull their organization into the future. Since any installation is disruptive, it’s better to take the huge disruption and retrain for the massive benefits down the road. It’s a rosy picture indeed.

The pessimist in me wonders if all these massive changes aren’t somehow tied to the fact that they always come with massive new hardware purchases from vendors. I would hope there isn’t someone behind the scenes with the ear of the CIO pushing massive changes in organization and processes for the sake of numbers. I would also sincerely hope that the idea isn’t to make huge organizational disruptions for the sake of “reducing overhead” or “helping tell the world your story” or, worse yet, “making our product look good because you did such a great job with all these changes”.

The optimist in me is hoping for the best. But the pessimist in me wonders if reality is a bit less rosy.

Extreme-ly Interesting Times In Networking

If you’re a fan of Extreme Networks, the last few months have been pretty exciting for you. Just yesterday, it was announced that Extreme is buying the data center networking business of Brocade for $55 million once the Broadcom acquisition happens. Combined with the $100 million acquisition of Avaya’s campus networking portfolio on March 7th and the purchase of Zebra Wireless (nee Motorola) last September, Extreme is pushing itself into the market as a major player. How is that going to impact the landscape?

Building A Better Business

Extreme has been a player in the wireless space for a while. Their acquisition of Enterasys helped vault them into the mix with other big wireless players. Now, the rounding out of the portfolio helps them compete across the board. They aren’t just limited to playing with stadium wifi and campus technologies now. The campus networking story that was brought in through Avaya was a must to help them compete with Aruba, A Hewlett Packard Enterprise Company. Aruba owns the assets of HPE’s campus networking business and has been leveraging them effectively.

The data center play was an interesting one to say the least. I’ve mused recently that Brocade’s data center business may end up lying fallow once Arris grabbed Ruckus. Brocade had some credibility in very large networks through VCS and the MLX router series, but outside of the education market and specialized SDN deployments it was rare to encounter them. Arista has really dug into Cisco’s market share here and the rest of the players seem to be content to wait out that battle. Juniper is back in the carrier business, and the rest seem to be focusing now on OCP and the pieces that flow logically from that, such as Six Pack, Backpack, and Whatever Facebook Thinks The Next Fast Switch Should Be Called That Ends In “Pack”.

Seeing Extreme come from nowhere to snap up the data center line from Brocade signals a new entrant into the data center crowd. Imagine, if you will, a mosh pit. Lots of people fighting for their own space to do their thing. Two people in the middle have decided to have an all-out fight over their space. Meanwhile, everyone else is standing around watching them. Finally, a new person enters the void of battle to do their thing on the side away from the fistfight that has captured everyone’s attention. This is where Extreme finds itself now.

Not Too Extreme

The key for Extreme now is to tell the “Full Stack” story to customers. Whereas before they had to hand off the high end to another “frenemy” and hope that it didn’t come back to bite them, now Extreme can sell all the way up and down the stack. They have some interesting ideas about SDN that will bear some watching as they begin to build them into their stack. The integration of VCS into their portfolio will take some time, as the way that Brocade does their fabric implementation is a bit different than the rest of the world.

This is also a warning call for the rest of the industry. It’s time to get off the sidelines and choose your position. Arista and Cisco won’t be fighting forever. Cisco is also reportedly looking to create a new OS to bring some functionality to older devices. That means they can continue to innovate while fighting against their competitors. The winner of the Cisco and Arista battle is inconsequential to the rest of the industry right now. Either Arista will be wiped off the map and a stronger Cisco will pick a new enemy, or Arista will hurt Cisco and pull even with them in the data center market, leaving more market share for others to gobble up.

Extreme stands a very good chance of picking up customers with their approach. Customers that wouldn’t have considered them in the past will be lining up to see how Avaya campus gear will integrate with Enterasys wireless and Brocade data center gear. It’s not all that different from the hodge-podge approach that many companies have picked for years to lower costs and avoid having a single vendor solution. Now, those lower cost options are available in a single line of purple boxes.


Tom’s Take

Who knew we were going to get a new entrant into the Networking Wars for the tidy sum of $155 million? It feels like it should have cost more than that, but given the number of companies holding fire sales to get rid of things they have to divest before a pending acquisition or dissolution, it really doesn’t come as much of a surprise. Someone had to buy these pieces and put them together. I think Extreme is going to turn some heads and make for some interesting conversations in the next few months. Don’t count them out just yet.