I had the opportunity to sit in on a great briefing from Gremlin the other day about chaos engineering. Ken Nalbone (@KenNalbone) has a great review of their software and approach to things here. The more time I spent thinking about chaos engineering and IT, the more I realized that it has more in common with Murphy’s Law than we realize.
Anything That Can Go Wrong
If there’s more than one way to do a job and one of those ways will end in disaster, then somebody will do it that way. – Edward Murphy
Anything that can go wrong will go wrong. – Major John Paul Stapp
We live by the adage of Murphy’s Law in IT. Anything that can go wrong will go wrong. And usually it goes wrong at the worst possible time. Database queries will fail when you need them the most, usually at the height of something like Amazon Prime Day. Data center outages only seem to happen at 4 am on a Sunday during a holiday.
But why do things go wrong like this? Is it because the universe just has it out for IT people? Are we paying off karma from the fall of the Western Roman Empire? Or is it because we can’t anticipate some crazy things? Are we kidding ourselves that we can just manage Murphy and hope for the best?
As it turns out, this is why chaos engineering is so important. It doesn’t just make us realize that things are broken. It helps us understand how they will break in unique and different ways each time. A big reason it matters is that many large-scale failures aren’t the result of a single problem, but instead a collection of smaller things that build on each other.
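To put a rough number on that idea, here’s a toy sketch in Python. It isn’t Gremlin’s tooling or anyone’s real architecture, just a loop that injects small random faults into a chain of pretend services and counts how often the whole request dies:

```python
import random

# A toy fault-injection loop: not Gremlin's API, just a sketch of the idea.
# Each "service" usually works; the interesting part is how small, random
# faults combine to take down the whole request path.

def call_service(name, fault_rate):
    """Pretend service call that degrades at a given rate."""
    if random.random() < fault_rate:
        raise RuntimeError(f"{name} degraded")
    return f"{name} ok"

def handle_request(fault_rate):
    """A request that quietly depends on three services in a row."""
    return [call_service(name, fault_rate)
            for name in ("auth", "database", "status-dashboard")]

failures = 0
for _ in range(1000):
    try:
        handle_request(fault_rate=0.02)  # each dependency fails only 2% of the time
    except RuntimeError:
        failures += 1

print(f"failed requests: {failures} / 1000")
```

Each dependency misbehaves only 2% of the time, but chain three of them together and roughly one request in seventeen falls over. Small problems stack.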
One of my favorite stories about this collection of failures comes from a big Amazon Web Services (AWS) outage from last March. People were seeing problems in US-EAST-1 but they couldn’t nail down the issue. Worse yet, every time they logged into the Amazon dashboard they saw green lights for every service. As the minutes dragged on, it was eventually discovered that the lights were lying to everyone because Amazon hosted that page on AWS US-EAST-1. They couldn’t log in to reset the lights to show an outage! Not coincidentally, many other monitoring services were down as well because they were also hosted in the same region.
What does this teach us about chaos? Well, Murphy was in full effect for sure. Something went wrong and happened at a bad time. But it was also the worst possible time for Amazon to figure out that the status lights and dashboard systems were all hosted out of one region with no backup anywhere else. Perhaps they could have caught that with a system like Gremlin. Perhaps it would have gone under the radar until the worst possible moment like it did in real life. There’s no way to know for sure. Hopefully Amazon has fixed this little problem for now.
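If you wanted to hunt for that blind spot before it bites you, even a crude inventory check gets you most of the way. The mapping below is made up for illustration, and in real life you’d pull it from your asset inventory or infrastructure-as-code state, but the question it asks is the one that burned Amazon: does anything monitor a region it also lives in?

```python
# Hypothetical inventory data -- replace with whatever your real source of
# truth says about where each service and monitor actually runs.
hosting = {
    "prod-api":         "us-east-1",
    "status-dashboard": "us-east-1",   # uh oh
    "external-monitor": "eu-west-1",
}

monitors = {
    "status-dashboard": ["prod-api"],
    "external-monitor": ["prod-api", "status-dashboard"],
}

# Flag any monitor that shares a region with the thing it watches.
for monitor, targets in monitors.items():
    for target in targets:
        if hosting[monitor] == hosting[target]:
            print(f"{monitor} shares {hosting[monitor]} with {target}: "
                  "it goes dark in the exact outage it should report")
```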
People Will Do It Wrong
This also teaches us something about user behavior. One thing we hear frequently about patches or other glaring issues with software is “How was this not caught in testing?!?”
The flip side of that is that most of these corner-case issues were never tested in the first place. Testing focuses on the main functionality of a system. QA testers focus on the big-picture stuff first. Does the UI fall apart? Are all the buttons linked to a specific task? What happens when I click HELP on the login screen?
What does QA not test for? Well, lots of things that users actually do. Holding down random keystrokes while clicking buttons. Navigating to random pages and then bookmarking them without realizing that’s a bad idea. Typing the wrong information into a list box that passes validation and screws up the backend. The list of variations is endless.
How does this apply to chaos? Well, as it turns out, engineers and testers are pretty orderly people. We all look at problems and try to figure them out. We try combinations of things until we solve the issue. But everything is based on the idea that we’re trying things in specific combinations until we replicate the issue. We don’t realize that some of the random behavior we see comes from users doing things we can’t control.
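Here’s a rough sketch of what folding that user-shaped randomness into a test could look like. The validate_quantity() function is a made-up stand-in rather than anyone’s real code, but it shows the shape of the problem above: input that sails through validation and still wrecks whatever consumes it downstream.

```python
import random
import string

# A sketch of "testing like a user": instead of one scripted happy path,
# throw randomized input at the handler and see what sticks.

def validate_quantity(raw):
    """Naive validation: anything that parses as an int is 'fine'."""
    return int(raw)

def random_user_input():
    """Users type surprising things: negatives, huge numbers, stray junk."""
    return random.choice([
        str(random.randint(-10, 10)),
        str(random.randint(10**9, 10**12)),
        "  7  ",
        "007",
        "".join(random.choices(string.printable, k=5)),
    ])

surprises = []
for _ in range(200):
    raw = random_user_input()
    try:
        qty = validate_quantity(raw)
        if qty <= 0 or qty > 10_000:
            surprises.append((raw, qty))  # passed validation, hurts the backend
    except ValueError:
        pass  # rejected outright -- at least it failed loudly

print(f"{len(surprises)} inputs passed validation but would hurt downstream")
```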
Another story: I was editing a document the other day in a CMS and I saved the revisions I’d made as a draft post. When I went to check the post, it had inadvertently published itself. I didn’t want it to publish at that time, so I was perplexed. I knew I had clicked the save button, but I also knew I didn’t click the publish button. I looked through the documentation and couldn’t find any issues.
I put it out of my mind until it happened again a couple of weeks later. This time, I went back through every step I had just done. The only thing that was out of the ordinary compared to the last time was that I had saved the document with ⌘+S (CTRL+S for Windows), just like I’d taught myself to do for years. But, in this CMS, that shortcut saves and publishes the current document. Surprise!
Behavior that shouldn’t have triggered a problem did. Because no one ever tested for what might happen if someone used a familiar keystroke in a place where it wasn’t intended. This is what makes chaos engineering so difficult and rewarding. Because you can set up the system to test for those random things without needing to think about them. And when you figure out a new one, like whether or not ⌘+S can crash your system, you can add it to the list to be checked against everything!
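And that last part, adding the new surprise to the list, can be as simple as a regression test. The Editor class below is a hypothetical stand-in for whatever hooks your CMS actually exposes; the part that matters is the assertion that ⌘+S saves a draft without publishing it.

```python
import unittest

# Hypothetical editor object -- swap in whatever your CMS actually exposes.
class Editor:
    def __init__(self):
        self.status = "draft"

    def handle_shortcut(self, keys):
        if keys == "cmd+s":
            self.save()        # what the user expects
            # the CMS surprise was also calling publish() here
        return self.status

    def save(self):
        pass                   # persist the draft, leave status alone

    def publish(self):
        self.status = "published"

class TestSaveShortcut(unittest.TestCase):
    def test_cmd_s_saves_without_publishing(self):
        editor = Editor()
        editor.handle_shortcut("cmd+s")
        self.assertEqual(editor.status, "draft",
                         "⌘+S should save a draft, not publish it")

if __name__ == "__main__":
    unittest.main()
```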
Tom’s Take
I love reading and learning about chaos engineering. The idea that we purposely break things to make people think about building them correctly appeals to me. I find myself trying to figure out how to make better things and always find that I’m being stymied because I don’t think “outside the box”, which is a clever way of saying that I don’t think like a user. I need something that helps me understand how things will break in new and unique ways every time. Because while we can test for the big stuff, Murphy has a way of showing us what happens when we don’t sweat the small stuff.