A couple of months ago, I was on a panel at TechUnplugged where we talked about scaling systems to large sizes. Here’s a link to the video of that panel:
One of the things that we discussed in that panel was applications. Toward the end of the discussion we got into a bit of a back-and-forth about applications and the systems they run on. I feel like it’s time to develop those ideas a bit more.
The Achilles’ Heel
My comments about legacy applications are pointed. If a company is spending thousands of dollars and countless hours of engineering time to reconfigure the network or the storage systems to support an old application, my response is simple: go out of business.
It does sound a bit flippant to think that a company making a profit should just close the shutters and walk away. But that’s just the problem that we’re facing in the market today. We’ve spent an inordinate amount of time creating bespoke, custom networks and systems to support applications that were written years, or even decades, ago in alien environments.
We do it every day without thinking. We have to install this specific Java version on the server to get the payroll application running. Don’t install the new security updates on this server because it breaks the HR database. We have to have the door security server running in the same VLAN as the camera system or they won’t communicate. The list seems to be endless.
The problem is not that we’re jumping through hoops to fix all of these things. The problem is really that we shouldn’t have to do it. Look at all of the work that’s being done in agile development today. Companies are creating programs that do things that we thought were impossible just a few years ago. Features are being added rapidly without compromising stability. Applications can be deployed and run in the cloud or on-site. It’s a brave new world if you’re willing to throw out the basic assumption that applications are immutable.
The Arrow of Apollo
Everyone has made concessions to an application at some point in their career. We’ve done very dumb things to make software work. We do it so often that it becomes second nature. Entire product features have been developed because applications are too stupid to work in certain scenarios.
Imagine, if you will, what happens when you're told that the new HR database doesn't like it when the network is slow to respond. You likely start thinking of all the ways you can make the network respond faster or reroute traffic to prevent that scenario from happening. You're willing to actively destroy and reconfigure network architecture because one software program doesn't like latency.
How far are you willing to take it? In some cases the answers are pretty scary. Cisco has a line of industrial switches with integrated NAT because some systems can't cope with anything other than hard-coded IP addresses. I've disabled roaming between access points in a wireless network because the medical records system couldn't handle moving to a new AP without the database connection closing instantly. I've even had to disable QoS in a network because the extra latency it caused in the CRM application was making the sales people cranky.
All of these solutions assume that the application having the problem can't be fixed. What would happen if we turned that idea around and fought the fight differently? What if we started the discussion of application problems by working through application programming issues instead? What if the fault wasn't at the network level or the compute level but at the software level, because the development team did a bad job of error handling?
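To put that in concrete terms, here is a rough sketch, in Python, of what fixing the problem in software looks like. The connect and execute calls are made-up placeholders standing in for whatever database driver is actually in use, not any vendor's real API. Instead of rebuilding the wireless network so a session never drops, the application simply reconnects and retries with a little backoff:

```python
import random
import time


class TransientDBError(Exception):
    """Raised when the database session drops mid-query (e.g. after roaming to a new AP)."""


def query_with_retry(connect, sql, attempts=5, base_delay=0.2):
    """Run a query, reconnecting and backing off instead of expecting a perfect network."""
    for attempt in range(attempts):
        try:
            conn = connect()          # re-establish the session if it was torn down
            return conn.execute(sql)
        except TransientDBError:
            if attempt == attempts - 1:
                raise                 # give up only after honest retries
            # Exponential backoff with jitter so clients don't all retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

A few lines of retry logic in the application is a lot cheaper than a forklift redesign of the network it happens to be running on.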
With all the talk of making resources more like the cloud, where the underlay systems exist in a void apart from the software running on top of them, we do have to draw the line at bad application programming. Netflix can't assume that Amazon will create a special channel link to make application recovery faster. They have to write their software in such a way that it can survive problems in the underlay and work around them.
Application developers need to stop assuming that the underlying systems will always work perfectly and that someone else will fix any problems that crop up. Instead, the applications being written need to account for failures and solve their own problems. The Netflix Chaos Monkey exists to keep application teams from getting complacent, forcing them to code programs that keep working when pieces of the infrastructure disappear. If Netflix stopped working for anything less than all of AWS going down, they wouldn't have the subscriber base they enjoy today. Netflix makes money because they don't assume that the networks and systems will always be there. They make money because they know how to react when those systems aren't there.
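To show what "react when those systems aren't there" can mean in code, here is a minimal sketch of a circuit breaker. The fetch_recommendations call is a hypothetical stand-in, not Netflix's actual code; the point is the pattern of backing off a failing dependency and serving a fallback until it recovers:

```python
import time


def fetch_recommendations():
    """Stand-in for a remote call that can time out when the underlay misbehaves."""
    raise TimeoutError("upstream service unreachable")


class CircuitBreaker:
    """Stop hammering a failing dependency and serve a fallback until it recovers."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        # While the breaker is open, skip the dependency until the timeout passes.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None      # half-open: give the dependency another chance
            self.failures = 0
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()


# Degrade to cached defaults instead of failing the whole page.
breaker = CircuitBreaker()
print(breaker.call(fetch_recommendations, lambda: ["cached", "defaults"]))
```

The application decides how to degrade gracefully; it doesn't ask the infrastructure to promise it will never fail.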
Tom’s Take
The new era of networking and mobile computing has embraced software. Containers have shown us how we can create instances that can be destroyed and recreated as needed. Applications on mobile devices work the way they're supposed to because we can't get into those subsystems to build workarounds. It's time for other application developers to realize that too. We need devs who are willing to write logic into their programs to take control when things go wrong. Everything we can do to limit custom network configurations for applications takes us one step closer to the utopia we've always wanted: software that can run anywhere and scale as much as needed. Resilient software beats trying to fix the problems in hardware any day.
I largely agree with your post, but this example supports the opposite of the point you’re trying to make. Networks where these switches are installed are the perfect example of situations in which the network should make every effort to conform to the needs of the application: gas plants, refineries, chemical plants, etc. that were built in the 80s or 90s were designed to operate more or less continuously for 30+ years, with very infrequent shutdowns. A plant shutdown is planned a year or more in advance; it costs thousands of hours of engineering time and millions of dollars. The last thing they need is fundamental network changes to systems that already work and are highly reliable, when switches that cost a few thousand dollars can work around networking problems without shutdowns, interruptions, or added complexity.