Linux and the Quest for Underlays


I’m at the OpenStack Summit this week, and there’s a lot of talk about building stacks and offering everything an organization needs to shift toward service provider models. It’s a far cry from the battles over software networking and hardware dominance that I’m so used to seeing in my space. But one thing came to mind that made me think a little harder about architecture and how important foundations are.

Brick By Brick

The foundation for the modern cloud doesn’t live in fancy orchestration software or data modeling. It’s not because a retailer built a self-service system or a search engine giant decided to build a cloud lab. The real reason we have a growing market for cloud providers today is because of Linux. Linux is the underpinning of so much technology today that it’s become nothing short of ubiquitous. Servers are built on it. Mobile operating systems use it. But no one knows that’s what they are using. It’s all just something running under the surface to enable applications to be processed on top.

Linux is the vodka of operating systems. It can run in a stripped-down manner on a variety of systems and leave very little trace behind. BSD is similar in this regard, but it doesn’t have the driver support from manufacturers or the ability to strip things down to the bare kernel and a few modifications. Linux gives vendors and operators the flexibility to create a software environment that boots and gets basic hardware working. The rest is up to the creativity of the people writing the applications on top.

Linux is the perfect underlay. It’s a foundation that is built upon without getting in the way of things running above it. It gives you predictable performance and a familiar environment. That’s one of the reasons why Cumulus Networks and Dell have embraced Linux as a way to create switch operating systems that get out of the way of packet processing and let you build on top of them as your needs grow and change.

Break The Walls Down

The key to building a good environment is a solid underlay, whether it be in systems or in networking. With reliable transport and operations taken care of, amazing things can be built. But that doesn’t mean that you need to build a silo around your particular area of the organization.

The shift to clouds and stacks and “new” forms of IT management isn’t going to happen if someone has built up a massive blockade. It will happen when you build a system that has common parts and themes and allows tools to work easily on multiple parts of the infrastructure.

That’s what’s made Linux such a lightning rod. If your monitoring tools can monitor servers, SANs, and switches with little to no modification you can concentrate your time on building on those pieces instead of writing and rewriting software to get you back to where you started in the first place. That’s how systems can be extensible and handle changes quickly and efficiently. That’s how you build a platform for other things.


Tom’s Take

I like building Lego sets. But I really like building them with the old-fashioned basic bricks, not the fancy new ones from licensed sets, because the old bricks were limited only by your creativity. You could move them around and put them anywhere because they were all the same. You could build amazing things with the right basic pieces.

Clouds and stacks aren’t all that dissimilar. We need to focus on building underlays of networking and compute systems with the same kinds of basic blocks if we ever hope to have something that we can build upon for the future. You may not be able to influence the design of systems at the most basic level when it comes to vendors and suppliers, but you can vote with your dollars to back the solutions that give you the flexibility to get your job done. I can promise you that when the revenue from proprietary, non-open underlay technologies goes down the suppliers will start asking you the questions you need to answer for them.

Automating Change With Help From Fibonacci


A few recent conversations that I’ve seen and had with professionals about automation have been very enlightening. It all started with a post on StackExchange about an unsuspecting user that tried to automate a cleanup process with Ansible and accidentally erased the entire server farm at a service provider. The post was later determined to be a viral marketing hoax but was quite believable to the community because of the power of automation to make bad ideas spread very quickly.

Better The Devil You Know

Everyone in networking has been in a place where they’ve typed in something they shouldn’t have. Maybe you removed the management network you were using to access the switch, or wrote an access list that locked you out of a device. Or perhaps you typed an errant debug command that forced you to drive an hour to reboot a switch that was no longer responding. All of these things seem to happen to people as part of the learning process.

But how many times have we typed something in to create a change and found that it broke more than we expected? Like changing a native VLAN on a trunk and bringing down a link we didn’t intend to affect? These unforeseen accidents are the kinds of problems that can easily be magnified by scripts or automation.

I wrote a post for SolarWinds a couple of months ago about people blaming tools. In it, there was a story about a person who uploaded the wrong switch firmware to a server and used an automated tool to kick off an upgrade of his entire network. Only after the first switch failed to return to normal did he realize that he had downloaded incorrect firmware to the server. And the command he used to kick off the upgrade was not the safe version of the command that checks for compatibility. Instead, it was the quick version of the command that copied the firmware directly into flash and rebooted the switch without confirmation.

While people are quick to blame tools for letting mistakes race through the network quickly, it should also be recognized that those issues would have been mistakes no matter what. Just because a system is capable of being automated doesn’t mean that your commands are exempt from being checked and rechecked. Too often a typo or an added word somewhere in the mix causes unintended chaos because we didn’t take the time to make sure there were no problems ahead of time.
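One cheap guardrail is tooling that shows its work before it acts, the way Ansible’s check mode does. Here’s a minimal Python sketch of the idea; push_config() and the device names are hypothetical stand-ins for whatever actually talks to your gear:

```python
import argparse

def push_config(device: str, config: str) -> None:
    # Hypothetical stand-in for a real automation call (Netmiko, NAPALM, etc.)
    print(f"pushing change to {device}")

parser = argparse.ArgumentParser(description="Change pusher that dry-runs by default")
parser.add_argument("--commit", action="store_true",
                    help="actually apply the change (default is a dry run)")
args = parser.parse_args()

change = "interface Vlan10\n no shutdown"
for device in ["sw-01", "sw-02"]:
    print(f"--- planned change for {device} ---\n{change}")
    if args.commit:
        push_config(device, change)
    else:
        print("dry run only; re-run with --commit to apply")
```

Making the destructive path opt-in means a fat-fingered run shows you the blast radius instead of creating it.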

Fibonacci Style

I’ve always tried to do testing and regression in a controlled manner. For some places that have simulators and test networks to try out changes, this method might still work. But for those of us that tend to fly by the seat of our pants on production devices, it’s best to artificially limit the damage before it becomes too great. Note that this method works with automation systems too, provided you are controlling the logic behind them.

When you go out to test a network-wide change or perform an upgrade, pick one device as your guinea pig. It shouldn’t be something pushing massive production traffic like a core switch. Something isolated in the corner of the network usually works just fine. Test your change there outside production hours and with a fully documented blackout plan. Once you’ve implemented it, don’t touch anything else for at least a day. This gives the routing tables time to settle down completely and all of the aging timers a chance to expire and tables to recompile. Only then can you be sure that you’re not dealing with cached information.

Now that you’ve done it once, you’re ready to make it live to 10,000 devices, right? Absolutely not. Now that you’ve proven that the change doesn’t cause your system to implode and take the network with it, you pick another single device and do it again. This time, you either pick a neighbor of the first device or something on the other side of the network. The other side of the network ensures that changes don’t ripple between devices over the 24-hour watch period. On the other hand, if the change involves direct connectivity between two devices, you should test them together to be sure that the links stay up. It’s much easier to recover one failed device than 40 or 400.

Once you follow the same procedure with the second upgrade and get clean results, it’s time to move to doing two devices at once. If you have a fancy automation system like Ansible or Puppet, this is where you determine that the system is capable of handling two devices at once. If you’re still using scripts, this is where you make sure you’re pasting the right info into each window. Some networks don’t like two devices changing information at the same time. Your routing table shouldn’t be so unstable that a change like this would cause problems, but you never know. You will know when you’re done with this step.

Now that you’ve proven that you can make changes without cratering a switch, a link, or the entire network all at once, you can continue. Move on to three devices, then five, then eight. You’ll notice that these rollout plans follow the Fibonacci sequence. This is no accident. Just like the appearance of these numbers in nature and math, a Fibonacci rollout plan helps you evaluate changes and roll back problems before they grow out of hand. Just because you have the power to change the entire network at once doesn’t mean that you should.
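If you hand this plan to an automation system, the batching logic only takes a few lines. Here’s a minimal Python sketch, assuming hypothetical apply_change() and health_check() helpers in place of your real tooling:

```python
import time

def apply_change(device: str) -> None:
    # Hypothetical placeholder for the real push (Ansible, Netmiko, etc.)
    print(f"applying change to {device}")

def health_check(device: str) -> bool:
    # Hypothetical placeholder: verify links, routes, and reachability
    return True

def fibonacci_batches(devices):
    """Yield batches of 1, 1, 2, 3, 5, 8, ... devices until the list runs out."""
    a, b = 1, 1
    i = 0
    while i < len(devices):
        yield devices[i:i + a]
        i += a
        a, b = b, a + b

DEVICES = ["sw-lab-01", "sw-far-09", "sw-edge-12", "sw-edge-13",
           "sw-dist-01", "sw-dist-02"]  # least critical devices first
SOAK = 24 * 60 * 60  # the 24-hour watch period between batches

for batch in fibonacci_batches(DEVICES):
    for device in batch:
        apply_change(device)
    time.sleep(SOAK)  # let routing tables settle and aging timers expire
    if not all(health_check(d) for d in batch):
        raise SystemExit(f"stop and roll back: {batch} failed health checks")
```

The point isn’t the code; it’s that the batch size, the soak period, and the stop-on-failure check are baked into the process instead of left to memory.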


Tom’s Take

Automation isn’t the bad guy here. We are. We are fallible creatures that make mistakes. Before, those mistakes were limited to how fast we could log into a switch. Today, computers allow us to make those mistakes much faster and on a larger scale. The long-term solution to the problem is to test every change ahead of time completely and hope that your testing catches all the problems. If it doesn’t, then you had better hope your blackout plan works. But by introducing a rolling system similar to the Fibonacci sequence above, I think you’ll find that mistakes will be caught sooner and problems will be rectified before they are amplified. And if nothing else, you’ll have lots of time to explain to your junior admin all about the wonders of Fibonacci in nature.

Adapting Applications And Avoiding Acrobatic Adjustments


A couple of months ago, I was on a panel at TechUnplugged where we talked about scaling systems to large sizes. Here’s a link to the video of that panel:

One of the things that we discussed in that panel was applications. Toward the end of the discussion we got into a bit of a back-and-forth about applications and the systems they run on. I feel like it’s time to develop those ideas a bit more.

The Achilles’ Heel

My comments about legacy applications are pointed. If a company is spending thousands of dollars and countless hours of engineering time to reconfigure the network or the storage systems to support an old application, my response is simple: go out of business.

It does sound a bit flippant to think that a company making a profit should just close the shutters and walk away. But that’s just the problem that we’re facing in the market today. We’ve spent an inordinate amount of time creating bespoke, custom networks and systems to support applications that were written years, or even decades, ago in alien environments.

We do it every day without thinking. We have to install this specific Java version on the server to get the payroll application running. Don’t install the new security updates on this server because it breaks the HR database. We have to have the door security server running in the same VLAN as the camera system or they won’t communicate. The list seems to be endless.

The problem is not that we’re jumping through hoops to fix all of these things. The problem is really that we shouldn’t have to do it. Look at all of the work that’s being done in agile development today. Companies are creating programs that do things that we thought were impossible just a few years ago. Features are being added rapidly without compromising stability. Applications can be deployed and run in the cloud or on-site. It’s a brave new world if you’re willing to throw out the basic assumption that applications are immutable.

The Arrow of Apollo

Everyone has made concessions to an application at some point in their career. We’ve done very dumb things to make software work. We do it so often that it becomes second nature. Entire product features have been developed because applications are too stupid to work in certain scenarios.

Imagine if you will what happens when you’re told that the new HR database doesn’t like it when the database is slow to respond. You likely start thinking of all the ways that you can make the network respond faster or reroute traffic to prevent this scenario from happening. You’re willing to actively destroy and reconfigure network architecture because one software program doesn’t like latency.

How far are you willing to take it? In some cases the answers are pretty scary. Cisco has a line of industrial switches with integrated NAT because some systems don’t know how to handle not being hard-coded with IP addresses. I’ve disabled roaming between access points in a wireless network because the medical records system couldn’t handle roaming to a new AP as it made the database connection close instantly. I’ve even had to disable QoS in a network because the extra latency it was causing in the CRM application was making the sales people cranky.

All of these solutions assume that the application having the problem can’t be fixed. What would happen if we turned that idea around and fought the fight differently? What if we started the discussion of application problems by working through application programming issues instead? What if the fault wasn’t at the network level or the compute level but at the software level, because the development team did a bad job of error handling?

With all the talk of making resources more like the cloud, where the underlay systems exist in a void apart from the software running on top of them, we do have to draw the line at bad application programming. Netflix can’t assume that Amazon will create a special channel link to make application recovery faster. They have to write their software in such a way that it can survive problems in the underlay and avoid them.

Application developers need to stop thinking that the underlying systems will always work perfectly and fix any problems that might crop up. Instead, those applications that are being written need to account for issues and solve their own problems. The Netflix Chaos Monkey ensures that application folks don’t get complacent and code their programs to work when other things are absent. If Netflix stopped working for anything less than AWS going down they wouldn’t have the subscriber base they enjoy today. Netflix makes money because they don’t assume that the networks and systems will always be there. They make money because they know how to react when those systems aren’t there.
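None of this is Netflix’s actual code, but the pattern is small enough to sketch. Here’s a toy Python example of an application that treats the underlay as unreliable, retrying with backoff and degrading gracefully instead of demanding a perfect network (the URL is a placeholder):

```python
import time
import urllib.request
from urllib.error import URLError

def fetch_with_retries(url: str, attempts: int = 4) -> bytes | None:
    """Assume the network will fail; back off and retry instead of giving up."""
    delay = 1.0
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.read()
        except (URLError, TimeoutError):
            time.sleep(delay)  # the underlay owes us nothing; wait and retry
            delay *= 2         # exponential backoff between attempts
    return None                # let the caller fall back to a degraded mode

data = fetch_with_retries("https://example.com/catalog")
if data is None:
    print("serving cached results; the app survives the underlay hiccup")
```

Retries, timeouts, and a fallback path are the application-level equivalent of a redundant link: boring, cheap, and the difference between a blip and an outage.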


Tom’s Take

The new era of networking and mobile computing has embraced software. Containers have shown us how we can create instances that can be destroyed and recreated as needed. Applications on mobile devices work like they are supposed to because we can’t get into those subsystems to make workarounds for things. It’s time for other application developers to realize that too. We need devs who are willing to write logic into their programs to take control when things go wrong. Everything we can do to limit custom network configurations for applications takes us one step closer to the utopia we’ve always wanted – software that can run anywhere and scale as much as needed. Because resilient software is better than trying to fix the problems in hardware any day.

Intel and the Network Arms Race


Networking is undergoing a huge transformation. Software is surely a huge driver for enabling technology to grow by leaps and bounds and increase functionality. But the hardware underneath is growing just as much. We don’t seem to notice as much because the port speeds we deal with on a regular basis haven’t gotten much faster than the specs we read about years ago. But the chips behind the ports are where the real action is right now.

Fueling The Engines Of Forwarding

Intel has jumped into networking with both feet and is looking to land on someone. Their work on the Data Plane Development Kit (DPDK) is helping developers write code that is highly portable across CPU architectures. We used to deal with specific microprocessors in unique configurations. A good example is Dynamips.

Most everyone is familiar with this program or the projects it spawned, Dynagen and GNS3. Dynamips worked at first because it emulated the MIPS processor found in Cisco 7200 routers. It just happened that Cisco used the same code for those routers all the way up to the first releases of the 15.x train. Dynamips allowed for the emulation of Cisco router software, but it was very, very slow. It could barely process packets at all. And most of the advanced switching features didn’t work because they relied on ASICs that couldn’t be emulated.

Running networking code on generic x86 processors doesn’t provide the kind of performance that you need in a network switching millions of packets per second. That’s why DPDK is helping developers accelerate their packet forwarding to approach the levels of custom ASICs. This means that a company could write software for a switch using Intel CPUs as the base of the system and expect to get good performance out of it.

Not only can you write code that’s almost as good as the custom stuff network vendors are creating, but you can also have a relative assurance that the code will be portable. Look at the pfSense project. It can run on some very basic hardware. But the same code can also run on a Xeon if you happen to have one of those lying around. That performance boost means a lot more packet switching and processing. No modifications to the code needed. That’s a powerful way to make sure that your operating system doesn’t need radical modifications to work across a variety of platforms, from SMB and ROBO all the way to an enterprise core device.

Fighting The Good Fight

The other reason behind Intel’s drive to get DPDK to everyone is to fight off the advances of Broadcom. It used to be that the term merchant silicon meant using off-the-shelf parts instead of rolling your own chips. Now, it means “anything made by Broadcom that we bought instead of making”. Look at your favorite switching vendor and the odds are better than average that the chipset inside their most popular switches is a Broadcom Trident, Trident 2, or even a Tomahawk. Yes, even the Cisco Nexus 9000 runs on Broadcom.

Broadcom is working their way to the position of arms dealer to the networking world. It soon won’t matter what switch wins because they will all be the same. That’s part of the reason for the major differentiation in software recently. If you have the same engine powering all the switches, your performance is limited by that engine. You also have to find a way to make yourself stand out when everything on the market has the exact same packet forwarding specs.

Intel knows how powerful it is to become the arms dealer in a market. They own the desktop, laptop, and server market space. Their only real competition is AMD, and one could be forgiven for arguing that the only reason AMD hasn’t gone under yet is through a combination of video card sales and Intel making sure they won’t get in trouble for having a monopoly. But Intel also knows what it feels like to miss the boat on a chip transition. Intel missed the mobile device market, which is now ruled by ARM and custom SoC manufacturing. Intel needs to pull off a win in the networking space with DPDK to ensure that the switches running in the data center tomorrow are powered by x86, not Broadcom.


Tom’s Take

Intel’s on the right track to make some gains in networking. Their new Xeon chips with lots and lots of cores can do parallel processing of workloads. Their contributions to CoreOS will help accelerate the adoption of containers, which are becoming a standard part of development. But the real value for Intel is helping developers create portable networking code that can be deployed on a variety of devices. That enables all kinds of new things to come, from system scaling to cloud deployment and beyond.