Systems are inherently reliable. Until they aren’t. On a long enough timeline, even the most reliable system will eventually fail. How you manage that failure says a lot about the way your build your system or application. So, why is it then that we’re so focused on failing?
Ten Feet Tall And Bulletproof
No system is infallible. Networks go down. Cloud services get knocked offline. Even Facebook, which represents “the Internet” for a large number of people, has days when it’s unreachable. When we examine these outages, we often find issues at the core of the system that cause services to be unreachable. In the most recent case of Amazon’s cloud system, it was a typo in a script that executed faster than it could be stopped.
It could also be a failure of the system to anticipate increased loads when minor failures happen. If systems aren’t built to take on additional load when the worst happens, you’re going to see bigger outages. That is a particular thorn in the side of large cloud providers like Amazon and Google. It’s also something that network architects need to be aware of when building redundant pathways to handle problems.
Take, for example, a recent demo during Aruba Atmosphere 2017. During the Day 2 keynote, CTO Partha Narasimhan wowed the crowd in the room when he disclosed that they had been doing a controller upgrade during the morning talk. Users had been tweeting, surfing, and using the Internet without much notice from anyone aside from the most technical wireless minds in the room. Even they could only see some strange AP roaming behavior as an indicator of the controllers upgrading the APs.
Aruba showed that they built a resilient network that could survive a simulated major outage cause by a rolling upgrade. They’ve done everything they can to ensure uptime no matter what happens. But the bigger question for architects and engineers is “why are we solving the problem for others?”
Why Dodge Bullets When You Don’t Have To?
As amazing as it is to build a system that can survive production upgrades with no impact on users, what are we really building when we create these networks? Are we encouraging our users to respect our technology advantage in the network or other systems? Are we telling our application developers that they can count on us to keep the lights on when anything goes wrong? Or are we instead sending the message that we will keep scrambling to prevent issues in applications from being noticeable?
Building a resilient network is easy. Making something reliable isn’t rocket science. But create a network that is going to stay up for a long, long time without any outages is very expensive and process intensive. Engineering something to never be down requires layers of exception handling and backup systems that are as reliable as their primary counterparts.
A favorite story from the storage world involves recovery. When you initially ask a customer what their recovery point objective (RPO) is in a system, the answer is almost always “zero” or “as low as you can make it”. When the numbers are put together to include redundant or dual-active systems with replication and data assurance, the price tag of the solution is usually enough to start a new round of discussion along the lines of “how reliable can you make it for this budget?”
In the networking and systems world, we don’t have the luxury of sticker shock when it comes to creating reliability. Storage systems can have longer RPOs because lost data is gone forever. Taking the time to ensure it is properly recovered is important. But data in transmission can be retransmitted. That’s at the heart of TCP. So it’s expected that networks have near-instantaneous RPOs for no extra costs. If you don’t believe that, ask yourself what happens when you tell someone the network went down because there’s only one router or switch connecting devices together.
Instead of making systems ultra-reliable and absolving users and developers from thought, networks and systems should be as reliable as they can be made without huge additional costs. That reliability should be stated emphatically without wiggle room. These constraints should inform developers writing code so that exception handling can be built in to prevent issues when the inevitable outage occurs. Knowing your limitations is the first step to creating an atmosphere to overcome them.
A lesson comes from the programmers of old. When you have a limited amount of RAM, storage, or compute cycles, you can write very tight code. DOS programs didn’t need access to a cloud worth of compute. Mainframes could execute programs written on punch cards. The limitations were simple and could be overcome with proper problem solving. As compute and memory resources have exploded, so too have code bases. Rather than giving developers the limitless capabilities of the cloud without restraint, perhaps creating some limits is the proper way to ensure that reliability stays in the app instead of being bolted on to the network.
We had a lot of fun recording this roundtable. We talked about Aruba’s controller upgrade and building reliable wireless networks. But I think we also need to make sure we’re aware that continually creating protocols and other constructs in the underlay won’t solve application programming problems. Things like vMotion set networking and application development back a decade. Giving developers a magic solution to avoid building proper exception handling doesn’t make better developers. Instead, it puts the burden of uptime back on the networking team. And we would rather build the best network we can instead of building something that can solve every problem that could every possibly be created.