DevOps and the Infrastructure Dumpster Fire



We had a rousing discussion about DevOps at Cloud Field Day this week. The delegates talked about how DevOps was totally a thing and it was the way to go. Being the infrastructure guy, I had to take a bit of umbrage at their conclusions and go on a bit of a crusade myself to defend infrastructure from the predations of developers.

Stable, Boy

DevOps folks want to talk about continuous integration and continuous deployment (CI/CD) all the time. They want the freedom to make changes as needed to increase bandwidth, provision ports, and rearrange things to fit development timelines and such. It’s great that they have their thoughts and feelings about how responsive the network should be to their whims, but the truth of infrastructure today is that it’s on the verge of collapse every day of the week.

Networking is often a “best effort” type of configuration. We monkey around with something until it works, then roll it into production and hope it holds. As we keep building patches on top of patches or try to implement new features that require something to be disabled or bypassed, we create a house of cards that only stands until the first stiff wind. It’s far too easy to cause a network to fall over because of a change in a routing table or a series of bad decisions that aren’t enough to cause chaos unless done together.

Jason Nash (@TheJasonNash) said that DevOps is great because it means communication. Developers are no longer throwing things over the wall for Operations to deal with. The problem is that the boulder they historically threw over the wall, in the form of monolithic patches that caused downtime, has been replaced by a storm of arrows blotting out the sun. Each individual change isn’t enough to cause disaster, but three hundred of them together can cause massive issues.


Networks are rarely stable. Yes, routing tables are mostly stabilized so long as no one starts withdrawing routes. Layer 2 networks are stable only up to a certain size. The more complexity you pile on networks, the more fragile they become. The network is really only one fat-fingered VLAN definition or VTP server mode foul-up away from coming down around our ears. That’s not a system that can support massive automation and orchestration. Why?

The Utility of Stupid Networking

The network is a source of pain not because of finicky hardware, but because of applications and their developers. When software is written, we have to make it work. If that means reconfiguring the network to suit the application, so be it. Networking pros have been dealing with crap like this for decades. Want proof?

  1. Applications can’t talk to multiple gateways at a time on layer 2 networks. So let’s create a protocol to make two gateways operate as one with a fake MAC address to answer requests and ensure uptime. That’s how we got HSRP.
  2. Applications can’t survive having the IP address of the server changed. Instead of using so many other good ideas, we create vMotion to allow us to keep a server on the same layer 2 network and change the MAC <-> IP binding. vMotion and the layer 2 DCI issues that it has caused have kept networking in the dark for the last eight years.
  3. Applications that already run supposedly don’t need to be rewritten to work in the cloud. People want to port them as-is to save money. So cloud networking configurations are a nightmare because we have to support protocols that shouldn’t even be used for the sake of legacy application support.

This list could go on, but all these examples point to one truth: The application developers have relied upon the network to solve their problems for years. So the network is unstable because it’s being extended beyond its intended use case. Newer applications, like Netflix and Facebook, thrive in the cloud because they were written from the ground up to avoid using layer 2 DCI or even operating at layer 2 beyond the minimum amount necessary. They solve tricky problems like multi-host combinations and failover in the app instead of relying on protocols from the golden age of networking to fix it quietly behind the scenes.
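To make “failover in the app” a little more concrete, here is a minimal Python sketch of the idea. It is not how Netflix or Facebook actually do it; the function name, the exception class, and the hostnames are all made up for illustration. The point is only that the client walks a list of endpoints itself instead of counting on a virtual gateway or a stretched layer 2 segment to hide the failure.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Hypothetical error: every endpoint in the list failed."""


def call_with_failover(endpoints: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful result.

    `request` is any callable that takes an endpoint and either returns a
    result or raises OSError (ConnectionError and TimeoutError are both
    subclasses of OSError in Python 3).
    """
    errors: List[Tuple[str, OSError]] = []
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except OSError as exc:
            # This endpoint is unreachable; remember why and move on.
            errors.append((endpoint, exc))
    raise AllEndpointsFailed(errors)


if __name__ == "__main__":
    # Assumed, illustrative hostnames; the first one is "down".
    def fake_request(endpoint: str) -> str:
        if endpoint == "app-a.example.net":
            raise ConnectionError("connection refused")
        return "response from " + endpoint

    print(call_with_failover(["app-a.example.net", "app-b.example.net"], fake_request))
```

In a real application `request` would wrap an HTTP or RPC client, and you would probably add timeouts, backoff, and health checks. The trade-off is that the application owns the failure handling, which is exactly what lets the network underneath stay simple.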

The network needs to evolve past being a science project for protocols that aim to fix stupid application programming decisions. Instead, the network needs to evolve with an eye toward stability and reduced functionality to get there. Take away the ability to even try to do those stupid tricks and what you’re left with is a utility that is a resource for your developers. They can use it for transport without worrying about it crashing every day with some bug in a protocol no one has used in five years yet was still installed just in case someone turned on an old server accidentally.

Nowhere is this more apparent than cloud networking stacks like AWS or Microsoft Azure. There, the networking is as simplistic as possible. The burden for advanced functionality per group of users isn’t pushed into a realm where network admins need to risk outages to fix a busted application. Instead, the app developers can use the networking resources in a basic way to encourage them to think about failover and resiliency in a new way. It’s a brave new world!


Tom’s Take

I’ll admit that DevOps has potential. It gets the teams talking and helps analyze build processes and create more agile applications. But in order for DevOps to work the way it should, it’s going to need a stable platform to launch from. That means networking has to get its act together and remove the unnecessary things that can cause bad interactions. The current mess was caused in part by application developers taking the easy road and pushing against the networking team of wizards. When those wizards push back and offer reduced capabilities in exchange for more uptime and fewer issues, you should start to see app developers coming around to work with the infrastructure teams to get things done. And that is the best way to avoid an embarrassing situation that involves fire.


9 thoughts on “DevOps and the Infrastructure Dumpster Fire”

  1. Great article, Tom.

    As a network guy, I’ve seen first hand how the network gets pushed to its upper limit, even out of its intended functionality. This is how we ended up with stuff like OTV, LISP, etc. that are nightmarish to deploy and require specialized hardware/even sometimes skill sets. These are also the types of complexities that remove functional predictability and cause extended outages because of bugs (features?), lack of understanding, etc. Our dev team and I are working together to back out of these old habits for apps and fix them the “Right Way(tm)”. End result is a simpler, more resilient network where we can more easily predict how things will behave on a day-to-day basis. KISS is stable and KISS is good.

    The best takeaway from all this cloud, fabric, devops, etc. stuff going on nowadays is the new mantra that it’s the application that matters. Precisely why we need to start designing multi-site aware apps instead of being so tunnel-visioned into network/system availability (looking at you, vMotion.) Nobody cares about the network/system if the app is available. We design clean, autonomous network and system architectures, thus, a more stable service overall.

    I’ve been in this field for ~16 years now, and I’m good with not being the guy holding the tub of glue for once. 😉

  3. nice article… I would add one thing though. Cloud provides the illusion of simplicity. Cloud infrastructure is not simple. Consider that every cloud provider has teams of PhDs not only designing networks but designing their own gear. Serverless cloud is the absolute peak of contorting infrastructure to accommodate developers that want the least amount of friction or thought with respect to how the underlying infrastructure serves their applications. I guess what I’m saying is… cloud is simple for the consumer, it’s most certainly not simple for the cloud infrastructure people.

  4. Pingback: Is DevOps a Load of BS? | snoopj's Blog

  5. Pingback: Reaction: Devops and Dumpster Fires - 'net work

  6. Dev’s goal is to make things different (improved, faster, newer). Ops’ goal is stability, reliability, which making things “different” interferes with. That tension, between introducing instability and maintaining stability, has been a healthy check and balance throughout the history of IT. And because these objectives of Dev and Ops were, are and always will be different, attempts to unite them feel strange and artificial.

  7. Pingback: DevOps and the Infrastructure Dumpster Fire - Tech Field Day
