Pushing Everyone’s Buttons In IT

We have officially reached the point in our long and storied IT careers where we, as old fogies, have earned the right to complain about the next generation of users and professionals. Just as the gray beards before us complained about the way we did things, so too is it our turn to moan about the state of affairs. Today, I’d like to point out how driving IT to the point of pushing simple buttons is destroying the way we do things.

Easy Buttons

The fact that IT work has been able to be distilled into a series of simple button pushing exercises is very thrilling. We’ve spent a lot of time and effort building devices and frameworks that take the hard part out of building devices and frameworks. We no longer have to invent languages to build things or hardware to do things. Instead, we can refine our programming capabilities or use general purpose hardware in new combinations to provide environments for our users.

That’s one of the things that is driving people to the cloud. Cloud isn’t just about exciting hardware or keeping your data in other places. It is just as much about predictable, repeatable frameworks and workflows that let people accomplish tasks without much thought. Just like the assembly line of the eighties, we are taking boring, repetitive tasks away from people and giving them to machines that don’t get bored and love repetition. Doing that drives costs down and makes people more productive.

But what happens when those processes break down? What happens when the magic smoke is let out of the machine? That’s when the real issues start cropping up. Perhaps the framework developers are great at figuring out what exactly went wrong in their solution. However, they’re likely clueless about all of the things that their solution is built upon. Imagine if a mechanic couldn’t diagnose the engine or the various subsystems of your car? You’d have a fit, right?

The troubleshooting level of modern framework engineering only extends to the edge of their solution. Once you get into a problem with a service they’ve leveraged, such as AWS, then the buck is passed and the troubleshooting must move to a new team. The problem with having your teams building next-generation IT services on top of existing things is that they lose visibility into the things they didn’t build. It becomes a vicious cycle when those first-level services aren’t reliable enough to keep your solution running.

Push A Button, Push A Button!

As bad as things are for IT departments in a push-button world, it’s even crazier for users in the same boat. That’s because users understand two extremes of service delivery. Either the thing is working or it isn’t. There is no in between. In the old days of non-instant IT, users would just keep asking until a thing was done.

Today, things are much different. Automation and orchestration have allowed frameworks and platforms to instantly create services. That means users have been able to create their own platform for building things. So anything that isn’t instant is broken. Take, for example, a recent discussion of bandwidth from the spring ONUG meeting. An IT professional was frustrated that it took the better part of a day to transfer 1 petabyte of information across the country. He wasn’t upset that it failed, mind you. He was mad because it didn’t happen in five minutes. Non-instant things are broken.
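To put that complaint in perspective, here is a rough back-of-the-envelope sketch of the sustained throughput each expectation implies. It assumes a decimal petabyte (10^15 bytes), a full 24 hours for "the better part of a day," and no protocol overhead. None of those details came from the original discussion, and the function name is purely illustrative.

```python
def required_gbps(num_bytes: float, seconds: float) -> float:
    """Sustained throughput in gigabits per second needed to move num_bytes in seconds."""
    return num_bytes * 8 / seconds / 1e9


PETABYTE = 10**15  # assumption: decimal petabyte, not 2**50 bytes

five_minutes = required_gbps(PETABYTE, 5 * 60)
one_day = required_gbps(PETABYTE, 24 * 60 * 60)

print(f"1 PB in 5 minutes: {five_minutes:,.0f} Gbps")
print(f"1 PB in 24 hours:  {one_day:,.0f} Gbps")
```

Under those assumptions the five-minute expectation works out to roughly 27 Tbps of sustained throughput, while the day-long transfer already needs about 93 Gbps.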

The story repeats itself over and over again. Networking resources that can’t be automatically provisioned are broken. Cloud services that need more than a few minutes to spin up aren’t working correctly. Mobile apps and sites that don’t instantly pop up on your phone must be working incorrectly. The patience level of a user isn’t even a tenth of that of the average IT professional. IT pros know why something is running slow. Users fall back on slow things being broken.


Tom’s Take

The problem is investment. Given an infinite amount of funding, everything can be fast. But people are trying to find ways to not pay for things in today’s IT world. Maybe they want to pay for AWS because it works. But OpenStack gives the promise of AWS for free. Except things like OpenStack and Ansible aren’t free. The currency you use to pay for them is time.

Time is just as precious a commodity as money. We don’t get refunds on time. We can’t ask for discounts on time. We can only invest and hope that there is a payoff down the road. The real outcome of this time investment looks very similar to the environment we have today. Things should just work and allow us to build new things on top of them. But that missing time investment is the key to the whole enterprise. If we don’t spend the time building and extending things, then we won’t have the expertise we need to fix things when they break. That time investment also helps everyone appreciate just how hard it is to build things in the first place. And that’s something no button will ever duplicate.

BGP: The Application Networking Dream

There was an interesting article last week from Fastly talking about using BGP to scale their network. This was but the latest in a long line of discussions around using BGP as a transport protocol between areas of the data center, even down to the Top-of-Rack (ToR) switch level. LinkedIn made a huge splash with it a few months ago with their Project Altair solution. Now it seems company after company is racing to implement BGP as the solution to their transport woes. And all because developers have finally pulled their heads out of the sand.

BGP Under Every Rock And Tree

BGP is a very scalable protocol. It’s used the world over to exchange routes and keep the Internet running smoothly. But it has other power as well. It can be extended to operate in other ways beyond the original specification. Unlike rigid protocols like RIP or OSPF, BGP was designed in part to be extended and expanded as needs change. IS-IS is a very similar protocol in that respect. It can be upgraded and adjusted to work with both old and new systems at the same time. Both can be extended without the need to change protocol versions midstream or introduce segmented systems that would run like ships in the night.

This isn’t the first time that someone has talked about running BGP to the ToR switch either. Facebook mentioned it in this video almost three years ago. Back then they were solving some interesting issues in their own data center. Now, those changes from the hyperscale world are filtering into the real world. Networking teams are seeking to solve scaling issues without resorting to overlay networks or other types of workarounds. The desire to fix everything wrong with layer 2 has led to a revelation of sorts. The real reason why BGP is able to work so well as a replacement for layer 2 isn’t because we’ve solved some mystical networking conundrum. It’s because we finally figured out how to build applications that don’t break because of the network.

Apps As Far As The Eye Can See

The whole reason why layer 2 networks are the primary unit of data center measurement has absolutely nothing to do with VMware. VMware vMotion behaves the way that it does because legacy applications hate having their addresses changed during communications. Most networking professionals know that MAC addresses have a tenuous association to IP addresses, which is what allows the gratuitous ARP after a vMotion to work so well. But when you try to move an application across a layer 3 boundary, it never ends well.

When web scale companies started building their application stacks, they quickly realized that being pinned to a particular IP address was a recipe for disaster. Even typical DNS-based load balancing only seeks to distribute requests to a series of IP addresses behind some kind of application delivery controller. With legacy apps, you can’t load balance once a particular host has resolved a DNS name to an IP address. Once the gateway of the data center resolves that IP address to a MAC address, you’re pinned to that device until something upsets the balance.

Web scale apps like those built by Netflix or Facebook don’t operate by these rules. They have been built to be resilient from inception. Web scale apps don’t wait for next hop resolution protocols (NHRP) or kludgy load balancing mechanisms to fix their problems. They are built to do that themselves. When problems occur, the applications look around and find a way to reroute traffic. No crazy ARP tricks. No sly DNS. Just software taking care of itself.
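As a minimal sketch of what "software taking care of itself" can look like, the snippet below has the client walk a list of replicas and use the first one that passes a health check. The function name, the replica addresses, and the health check are all hypothetical. Real web scale applications use far richer mechanisms such as service discovery, retries with backoff, and circuit breakers.

```python
from typing import Callable, Iterable, Optional


def pick_endpoint(
    endpoints: Iterable[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first endpoint that passes the health check, or None if all fail."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical replicas in three different layer 3 segments.
replicas = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]
down = {"10.0.1.10"}  # pretend the first replica just failed

chosen = pick_endpoint(replicas, lambda ep: ep not in down)
print(chosen)
```

The point of the sketch is that the decision to move traffic lives in the application. The network only has to deliver packets to whichever address the software chooses.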

The implications for network protocols are legion. If a web scale application can survive a layer 3 communications issue then we are no longer required to keep the entire data center as a layer 2 construct. If things like anycast can be used to pin geolocations closer to content that means we don’t need to worry about large failover domains. Just like Ivan Pepelnjak (@IOSHints) says in this post, you can build layer 3 failure domains that just work better.

BGP can work as your ToR strategy for route learning and path selection because you aren’t limited to forcing applications to communicate at layer 2. And other protocols that were created to fix limitations in layer 2, like TRILL or VXLAN, become an afterthought. Now, applications can talk to each other and fail back and forth as they need to without the need to worry about layer 2 doing anything other than what it was designed to do: link endpoints to devices designed to get traffic off the local network and into the wider world.
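For readers who want a feel for the path selection part, here is a heavily simplified sketch that compares only two attributes: highest local preference first, then shortest AS path. The `Route` class, the prefixes, and the private AS numbers are illustrative assumptions. A real BGP implementation evaluates many more tie-breakers, and data center designs often add multipath on top.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    as_path: Tuple[int, ...]
    local_pref: int = 100  # 100 is a common default value


def best_path(candidates: List[Route]) -> Route:
    """Prefer the highest local preference, then the shortest AS path."""
    return min(candidates, key=lambda r: (-r.local_pref, len(r.as_path)))


# Two hypothetical paths a ToR switch might learn for the same prefix.
routes = [
    Route("10.20.0.0/24", "192.0.2.1", (65001, 65010)),
    Route("10.20.0.0/24", "192.0.2.2", (65002, 65003, 65010)),
]

best = best_path(routes)
print(best.next_hop)
```

With equal local preference the two-hop AS path wins. Raising `local_pref` on the longer path would override that, which is the kind of policy knob that makes BGP attractive to operators.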


Tom’s Take

One of the things that SDN has promised us is a better way to network. I believe that the promise of making things better and easier is a noble goal. But the part that has bothered me since the beginning was that we’re still trying to solve everyone’s problems with the network. We don’t rearrange the power grid every time someone builds a better electrical device. We don’t replumb the house every time we install a new sink. We find a way to make the new thing work with our old system.

That’s why the promise of using BGP as a ToR protocol is so exciting. It has very little to do with networking as we know it. Instead of trying to work miracles in the underlay, we build the best network we know how to build. And we let the developers and programmers do the rest.

The Death of TRILL

Networking has come a long way in the last few years. We’ve realized that hardware and ASICs aren’t the constant that we could rely on to make decisions in the next three to five years. We’ve thrown in with software and the quick development cycles that allow us to iterate and roll out new features weekly or even daily. But the hardware versus software battle has played out a little differently than we all expected. And the primary casualty of that battle was TRILL.

Symbiotic Relationship

Transparent Interconnection of Lots of Links (TRILL) was proposed as a solution to the complexity of spanning tree. Radia Perlman realized that her bridging loop solution wouldn’t scale in modern networks. So she worked with the IETF to solve the problem with TRILL. We also received Shortest Path Bridging (SPB) from the IEEE along the way as an alternative solution to the layer 2 issues with spanning tree. The motive was sound, but the industry has rejected the premise entirely.

Large layer 2 networks have all kinds of issues. ARP traffic, broadcast amplification, and many other issues plague layer 2 when it tries to scale to multiple hundreds or a few thousand nodes. The general rule of thumb is that layer 2 broadcast networks should never get larger than 250-500 nodes lest problems start occurring. And in theory that works rather well. But in practice we have issues at the software level.

Applications are inherently complicated. Software written in the pre-Netflix era of public cloud adoption doesn’t like it when the underlay changes. So things like IP addresses and ARP entries were assumed to be static. If those data points change you have chaos in the software. That’s why we have vMotion.

At the core, vMotion is a way for software to mitigate hardware instability. As I outlined previously, we’ve been fixing hardware with software for a while now. vMotion could ensure that applications behaved properly when they needed to be moved to a different server or even a different data center. But it also required the network to be flat to overcome limitations in things like ARP or IP. And so we went on a merry journey of making data centers as flat as possible.

The problem came when we realized that data centers could only be so flat before they collapsed in on themselves. ARP and spanning tree limited the amount of traffic in layer 2 and those limits were impossible to overcome. Loops had to be prevented, yet the simplest solution disabled bandwidth needed to make things run smoothly. That led the IEEE and IETF to come up with their layer 2 solutions, which used IS-IS to solve loops. And it was a great idea in theory.

The Joining

In reality, hardware can’t be spun that fast. TRILL was used as a reference platform for proprietary protocols like FabricPath and VCS. All the important things were there but they were locked into hardware that couldn’t be easily integrated into other solutions. We found ourselves solving problem after problem in hardware.

Users became fed up. They started exploring other options. They finally decided that hardware wasn’t the answer. And so they looked to software. And that’s where we started seeing the emergence of overlay networking. Protocols like VXLAN and NV-GRE emerged to tunnel layer 2 packets over layer 3 networks. As Ivan Pepelnjak is fond of saying, layer 3 transport solves all of the issues with scaling. And even the most unruly application behaves when it thinks everything is running on layer 2.
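To make the tunneling idea concrete, here is a minimal sketch of the 8-byte VXLAN header that gets prepended to an inner Ethernet frame, following the layout in RFC 7348: a flags byte with the VNI-valid bit set, a 24-bit VXLAN Network Identifier, and reserved bits set to zero. The helper names are illustrative. The outer UDP, IP, and Ethernet headers are left to the sending stack and are not built here.

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned destination port for VXLAN
VNI_VALID_FLAG = 0x08  # "I" flag: the VNI field carries a valid identifier


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VXLAN Network Identifier."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags byte followed by 24 reserved bits.
    # Word 2: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", VNI_VALID_FLAG << 24, vni << 8)


def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the VXLAN header; the result is the payload of the outer UDP datagram."""
    return vxlan_header(vni) + inner_frame


def vni_of(udp_payload: bytes) -> int:
    """Read the VNI back out of an encapsulated payload."""
    _, second_word = struct.unpack("!II", udp_payload[:8])
    return second_word >> 8


# A zero-filled stand-in for a 14-byte inner Ethernet header.
payload = encapsulate(b"\x00" * 14, vni=5000)
print(len(payload), vni_of(payload))
```

The application's layer 2 frame rides untouched inside the payload, which is why it still believes it is on a flat network while the underlay routes at layer 3.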

Protocols like VXLAN solved an immediate need. They removed limitations in hardware. Tunnels and fabrics used novel software approaches to solve insurmountable hardware problems. An elegant solution for a thorny problem. Now, instead of waiting for a new hardware spin to fix scaling issues, customers could deploy solutions to fix the issues inherent in hardware on their own schedule.

This is the moment where software defined networking (SDN) took hold of the market. Not when words like automation and orchestration started being thrown about. No, SDN became a real thing when it enabled customers to solve problems without buying more physical devices.


Tom’s Take

Looking back, we realize now that building large layer 2 networks wasn’t the best idea. Layer 3 scales much better, and the number of providers and end users running BGP to top-of-rack (ToR) switches bears that out. It took us too long to figure out that the best solution to a problem sometimes takes a bit of thought to implement.

Virtualization is always going to be limited by the infrastructure it’s running on. Applications are only as smart as the programmer. But we’ve reached the point where developers aren’t counting on having access to layer 2 protocols that solve stupid decision making. Instead, we have to understand that the most resilient way to fix problems is in the software. Whether that’s VXLAN, NV-GRE, or a real dev team not relying on the network to solve bad design decisions.

Is Interop Dead?

I’m at Interop this week talking all things networking with a great group of people. There are quite a few members of the community here presenting, listening and discussing. There’s a great exchange of ideas flowing back and forth. Yet one thing I keep hearing in quiet corners of the room is a hushed discussion of the continued viability of Interop as a conference. Is it time to write the Interop obituary?

Only Mostly Dead

Some of the arguments are as old as tech itself. People claim that getting vendors to interoperate today is an afterthought thanks to protocols like OSPF. All of the important bits in a network are standardized now. Use of APIs and other open technologies is driving vendors to play nice with each other. The need to show up in a faraway place and do the work has long passed.

There’s also the discussion around the bigger conferences out in the world. Vendor conferences like Cisco Live and VMworld draw tens of thousands. New product announcements are dropping left and right during these events. People also want to fracture into tool-specific events like OpenStack Summit or DockerCon. Or the various analyst events or company days that happen every month. Why have a conference like Interop when others do it bigger?

Lastly, there’s the argument that the idea of an expo floor is long gone. Why should we get a booth on the show floor to show off our solution? Why not just approach people directly? Why spend money to be there like everyone else? Companies that stand out get noticed. Companies that leverage social media and SEO get the business. Not companies that buy space on a floor somewhere.

Exaggerated Rumors

Let’s jump on these one at a time.

First, claiming that all vendors play nice today thanks to standard protocols is misguided at best and downright silly at worst. Just because OSPF is standard doesn’t mean that people aren’t going to bend it to their own needs. Remember totally stubby areas? Those were very Cisco-specific for a long time. And throwing APIs into the discussion muddies the waters even further. Just because you have an API attached to your software doesn’t mean you can interoperate. How well is your API documented? Do I need to buy SDK access to get into it? Is my code portable? Do you do something silly like not supporting Python but loving things like Java and Perl?

Interoperability is a problem that hasn’t been solved completely. Even if everything works on paper and in PowerPoint slides, there’s still an acid test when you plug it all in for real and make it work. That moment of relief when two different vendors’ devices come up with the same protocols and everything talks is a magical time. You can’t replicate that in documentation. There’s still a huge need to have companies show up and physically prove that things work in a neutral setting versus a heavily slanted bakeoff lab somewhere.

Secondly, let’s look at those other conferences. Sure, Cisco Live has 20,000 people. That all want to talk about Cisco. There isn’t a whole lot of room for Juniper, Brocade, or HPE to be there. Differing opinions aren’t welcome. At best they are discussed in hushed tones in the corners of the room (sound familiar?). Mindshare is important, but so are honest discussions. As for smaller conferences focused on specific tools like OpenStack Summit, you quickly see the limits of things when you start talking to attendees about other things like pieces of the stack not addressed by the solution. There’s importance in being able to talk about all the parts without being overly myopic on one part.

The other piece of the other conferences refers back to a conversation that happened on Twitter last month about community content in large vendor conferences. There was talk from a number of people about how vendor conferences freeze out community content because people pay big bucks to come hear a Technical Marketing Engineer read slides about configuring a feature. It’s a far cry from the deep discussion and analysis that you get in other places. How do you work around bugs in code? What happens when a feature is missing? Communities solve problems. Big conferences do a bad job of getting community involvement outside of expo floor crawls and keynotes. If they didn’t, we wouldn’t need the amazing work of vBrownBag and Security BSides. Community matters to people.

Thirdly, the expo floor discussion. I’ll admit that I find expo floors to be personally challenging, but to eschew them in favor of a completely social strategy or relying on SEO to pop up on people’s radar is a recipe for disaster. Being able to physically talk to a person is still a valuable part of the process. How do you feel about their approach? Are they happy with the technology? Does it work in the physical demo? Do you get the sense that they are pushing too hard or trying to close you on a sale without giving you what you need? That’s an experience you don’t get via email or DM on Twitter. Whether or not my personal feelings matter, the truth is that the expo does still matter.


Tom’s Take

I’m biased. I’ve loved Interop for a number of years even before I got involved in the programming of it. The idea that people show up to prove definitively that their stuff works with all the other stuff is a gut check like no other. In the last couple of years the planners of the conference have worked hard to make sure that the programming has reflected the kinds of things that people need to be aware of on the horizon and what they need to be learning. Less focus on vendor sales pitches and more on independent content. Lean and mean and ready to fight the rest of the conference world to prove that content is still king.

As long as I’m involved, I can promise that any rumors of Interop’s impending death will stay just that – talk and rumor. There’s still a lot of life left in this grand event to make it matter to people that should matter.