The Process Will Save You

I had the opportunity to chat with my friend Chris Marget (@ChrisMarget) this week for the first time in a long while. It was good to catch up with all the things that have been going on and reminisce about the good old days. One of the topics that came up during our conversation was around working inside big organizations and the way that change processes are built.

I worked at IBM as an intern 20 years ago, and even back then the process to change things was arduous. My experience with it came from the deployment procedure for setting up a new laptop. When I arrived the task took an hour and required something like five reboots. By the time I left we had changed that process and gotten it down to half an hour and only two reboots. However, before the new directions could be approved as the official procedure, I had to test them and make sure they were faster and produced the same result. I was frustrated but ultimately learned a lot about the glacial pace of improvements in big organizations.

Slow and Steady Finishes the Race

Change processes work to slow down the chaos that comes from having so many things conspiring to cause disaster. Probably the most famous change management framework is the Information Technology Infrastructure Library (ITIL). That little four-letter word has caused a massive amount of headaches in the IT space. Stage 3 of ITIL, Service Transition, is the one that deals with changes to the infrastructure. There’s more to ITIL overall, including asset management and continual improvement, but usually anyone that takes ITIL’s name in vain is talking about the framework for change management.

This isn’t going to be a post about ITIL specifically but about process in general. What is your current change management process? If you’re in a medium-to-large-sized shop you probably have a system that requires you to submit changes, get them evaluated and approved, and then implement them on a schedule during a downtime window. If you’re in a small shop you probably just make changes on the fly and hope for the best. If you work in DevOps you probably call them “deployments” and they happen whenever someone pushes code. Whatever the actual name for it is, you have a process whether you realize it or not.

The true purpose of change management is to make sure that what you’re doing to the infrastructure isn’t going to break anything. As frustrating as it is to go through the process every time, that scrutiny is the whole point. You justify your changes and evaluate them for impact before scheduling them, as opposed to the “change it and find out” school of methodology.

Process is ugly and painful, but it keeps you from making simple mistakes. If every part of a change form needs to be filled out, you’re going to complete it to make sure you have all the information that’s needed. If the change requires you to validate things in a lab before implementation, then it’s forcing you to confirm that it’s not going to break anything along the way. There’s even a process exception for emergency changes that is focused on getting the system running again rather than on other concerns. But whatever the process is, it is designed to save you.
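To make that concrete, here’s a minimal sketch of the kind of gate a change form enforces. The record and field names are hypothetical and not taken from any particular ITIL tool; the point is simply that nothing gets approved until every blank is filled in, the lab validation is done, and a window is scheduled, with an explicit carve-out for emergencies.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ChangeRequest:
    """One record in a hypothetical change-management queue."""
    summary: str
    justification: str          # why the change is needed
    impact: str                 # what breaks if it goes wrong
    rollback_plan: str          # how to undo it
    tested_in_lab: bool = False
    emergency: bool = False
    window: Optional[datetime] = None   # approved downtime window

def ready_for_approval(cr: ChangeRequest) -> list[str]:
    """Return the list of reasons this change cannot be approved yet."""
    problems = []
    for name in ("summary", "justification", "impact", "rollback_plan"):
        if not getattr(cr, name).strip():
            problems.append(f"{name} is empty")
    if cr.emergency:
        # Emergency changes skip the lab and the window but still need the paper trail above.
        return problems
    if not cr.tested_in_lab:
        problems.append("change has not been validated in the lab")
    if cr.window is None:
        problems.append("no downtime window scheduled")
    return problems

if __name__ == "__main__":
    cr = ChangeRequest(summary="Upgrade core switch code",
                       justification="",
                       impact="Core outage if the upgrade fails",
                       rollback_plan="Boot the previous image from flash")
    print(ready_for_approval(cr))
    # ['justification is empty', 'change has not been validated in the lab',
    #  'no downtime window scheduled']
```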

ITIL isn’t a pain in the ass by accident. It’s purposely built to force you to justify and document at every step of the process. It’s built to keep you from creating disaster by helping you create the paper trail that will save you when everything goes wrong.

Saving Your Time But Not Your Sanity

I used to work with a great engineer named John Pross. John wrote up all the documentation for our migrations between versions of software, including Novell NetWare and Novell GroupWise. When it came time to upgrade our office GroupWise server there was some hesitation on the part of the executive suite because they were worried we were going to run into an error and lock them out of their email. The COO asked John if he had a process he followed for the migration. John’s response was perfect in my mind:

“Yes, and I treat every migration like the first one.”

What John meant was that he wasn’t going to skip steps or take shortcuts to make things go faster. Every part of the procedure was going to be followed to the letter. And if something came up that didn’t match what he thought the output should have been, the migration was going to stop until he solved that issue. John was methodical like that.

People like to take shortcuts. It’s in our nature to save time and energy however we can. But shortcuts are where the change process starts falling apart. If you do something different this time compared to the last ten times you’ve done it, because you’re in a hurry or you think it might be more efficient without having tested it, you’re opening yourself up to a world of trouble. Maybe not this time, but certainly down the road when you try to build on your shortcut even more. Because that’s the nature of what we do.

As soon as you start cutting corners and ignoring process, you rapidly increase the risk of creating massive issues. Think about something as simple as the Windows Server 2003 shutdown dialog box. People used to reboot a server on a whim. In Windows Server 2003, shutting the server down from the console required you to type in a reason. Most people that rebooted a server fell into two camps: those that followed their process and typed in the real reason for the reboot, and those that just typed “;Lea;lksjfa;ldkjfadfk” as the reason and then, six months later during the post-mortem on an issue, found themselves confused and cursing their snarky attitude toward reboot documentation.
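A tiny, purely illustrative check shows why that matters: if the reason field is going to be read in a post-mortem months later, it’s worth rejecting obvious keyboard mashing at the door. The heuristic below is hypothetical and far cruder than anything a real tool would use.

```python
def acceptable_reason(reason: str, min_words: int = 3) -> bool:
    """Reject shutdown reasons that are too short or look like keyboard mashing."""
    words = reason.split()
    if len(words) < min_words:
        return False
    # Keyboard mash tends to be one long "word" that is mostly punctuation.
    letters = sum(c.isalpha() for c in reason)
    return letters / max(len(reason), 1) > 0.7

print(acceptable_reason(";Lea;lksjfa;ldkjfadfk"))                    # False
print(acceptable_reason("Rebooting to clear a hung print spooler"))  # True
```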

Saving the Day

Change process saves you in two ways. The first is really apparent: it keeps you from making mistakes. By forcing you to figure out what needs to happen along the way and document the whole process from start to finish, you have all the information you need to make things successful. If there’s a chance to catch mistakes along the way, you’re going to have every opportunity to do so.

The second way change process saves you is when it fails. Yes, no process is perfect and there are more than a few times when the best intentions coupled with a flaw in the process created a beautiful disaster that gave everyone lots of opportunity to learn. The question always comes back to what was learned in that process.

Bad change failures usually lead to a sewer pipe of blame being pointed in your direction. People use process failures as a chance to deflect blame and avoid repercussions for doing something they shouldn’t have, or to try to increase their stock in the company. A truly honest failure analysis doesn’t blame anyone but the failed process and tries to find a way to fix it.

Chris told me in our conversation that he loved ITIL at one of his former jobs because every time it failed it led to a meaningful change in the process to avoid that failure in the future. This is why blameless post-mortem discussions are so important. If the people followed the process and the process still failed, the people aren’t at fault. The process is incorrect or flawed and needs to be adjusted.

It’s like a recipe. If the instructions tell you to cook something for a specific amount of time and it doesn’t come out right, who is to blame? Is it you, because you did what you were told? Is it the recipe? Is it the instructions? If you start with the idea that you followed the process correctly and then try to figure out where the process is wrong, you can fix the process for next time. Maybe you used a different kind of ingredient that needs more time. Or you made it thinner than normal, which meant the listed time cooked it too long. Whatever the result, you end up documenting the process and changing things for the future to prevent that mistake from happening again.

Of course, just like all good frameworks, change processes shouldn’t be changed without analysis. Because changing something just to save time or take a shortcut defeats the whole purpose! You need to justify why changes are necessary and prove they provide the same benefit with no additional exposure or potential loss. Otherwise you’re back to making changes and hoping you don’t get burned this time.


Tom’s Take

ITIL didn’t really become a popular thing until after I left IBM, but I’m sure if I were still there I’d be up to my eyeballs in it right now. Because ITIL was designed to keep keyboard cowboys like me from doing things we really shouldn’t be doing. Change management processes are designed to save us at every step of the way and make us catch our errors before they become outages. The process doesn’t exist to make our lives problematic. That’s like saying a seat belt in a car only exists to get in my way. It may be a pain when you’re dealing with it regularly, but when you need it you’re going to wish you’d been using it the whole time. Trust in the process and you will be saved.

It’s The Change Freeze Season

Everyone’s favorite time of the year is almost here! Is it because it’s the holiday season? Perhaps it’s the magic that happens at the end of the year? Or maybe it’s because there’s an even better reason to get excited!

Change Freeze Season!

That’s right. Some of you reading this started jumping up and down like Buddy the Elf at the thought of having a change freeze. There’s something truly magical about laying down the law about not touching anything in the system until after the end-of-year reports are run and certified. For some, this means a total freeze of non-critical changes from the first of December all the way through the New Year, maybe even into February. That’s a long time to have a frozen network. But why?

The Cold Shoulder

Change freezes are an easy thing to explain to the new admins. You simply don’t touch anything in the network during the freeze unless it’s broken. No tweaking. No experimenting. No improvements. Critical break/fix changes only. There had better be a ticket. There should be someone yelling that something’s not right. Otherwise you’re in for it.

There are a ton of reasons for this. The first is something I remember from my VAR days as Boredom Repellent. When you find yourself at the end of the year with nothing to do, you tend to get bored. After you’ve watched Die Hard for the fifteenth time this year, you decide it’s time to clear out your project backlog. Or maybe you’ve been doing some learning modules instead and you find a great blog post from one of your favorite writers about a Great Awesome Amazing Feature That Will Save You Days Of Work If You Just Enable This One Simple Command!

In either case, the Boredom Repellent becomes like pheromones for problems. Those backlogged projects take more time than you expected. That simple feature you just need to enable isn’t so simple. It might even involve an entire code upgrade train to enable it. Pretty soon you find yourself buried in a CLI mess with people screaming about very real downtime. Now, instead of being bored you’re working until the wee hours of the night because of something you did.

The second reason for change freezes at the end of the year is management. You know, the people that call and scream at you as soon as their email appears to be running slow. The people that run reports once a month at 6:00pm and then call you because they get a funny warning message on their screen. Those folks. Guess what? End-of-year is their time to shine in all their glory.

This is usually the time they are under the most stress. Those reports have to be reprinted. All the financials from the year need to be consolidated and verified. The taxes will need to be paid. And all that paperwork and pressure adds up to stress. The kind of stress that makes any imperfections in the network seem ten times more important than before. Report screen not show success within 10 ms? Problem. Printer run out of yellow toner? Network problem. Laptop go to sleep while someone went to lunch and now the entire report is gone? Must be your problem. And guess who gets to work around the clock to solve it with someone bearing down on them from on high?

Don’t Let It Go

The fact is that we can’t have people doing things in the network without tracking those changes back to reasons. That applies to adventurous architects wanting to squeeze the last ounce of performance out of that amazing new switch. And it goes double for the CFO demanding you put his traffic into AF41 so it gets to the server faster and his reports don’t take six hours to print.

It all comes back to the simple fact that we have no way to track changes in our network and no way of knowing what will happen when we make one live. It feels an awful lot like mashing buttons at random and hoping for the best.

Crazy, right? Yet every time we hit the Enter key, we are amazed at the results. Even for “modern” OSes with sanity checking, like Junos or IOS-XR, you have no way of knowing if a change you make on one device somewhere in the branch is going to crash OSPF or BGP for the entire organization. And even if there was a big loud warning popup that said, “ALERT: YOU ARE GOING TO BREAK EVERYTHING!!!”, odds are good we would just click past it.

Network automation and orchestration systems can prevent this. They can take control of change management out of the hands of bored engineers and wrap it in process and policy. And if the policy says Change Freeze, then that’s what you get. No changes. Likewise, if there is a critical need, like patching out a backdoor, the policy can be overridden and the override noted, so that if a bug in that code train causes issues eight months from now, you have documentation of the reason for the change when someone comes to chew you out.
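Here’s a rough sketch of what that policy gate might look like, assuming a hypothetical freeze window and a simple audit log rather than any specific orchestration product: non-emergency changes are refused outright during the freeze, and emergency changes only get through if a reason is recorded.

```python
from datetime import date

# Hypothetical freeze window: December 1 through January 31.
FREEZE_RANGES = [((12, 1), (12, 31)), ((1, 1), (1, 31))]

def in_freeze(today: date) -> bool:
    """True if the date falls inside the configured change-freeze window."""
    return any(start <= (today.month, today.day) <= end
               for start, end in FREEZE_RANGES)

def allow_change(today: date, emergency: bool, override_reason: str = "") -> bool:
    """Gate a change against the freeze policy.

    Non-emergency changes are refused during the freeze. Emergency changes
    are allowed, but only with a recorded reason so the audit trail exists
    when someone asks about it eight months later.
    """
    if not in_freeze(today):
        return True
    if emergency and override_reason.strip():
        print(f"AUDIT {today}: freeze overridden - {override_reason}")
        return True
    return False

print(allow_change(date(2017, 12, 15), emergency=False))   # False
print(allow_change(date(2017, 12, 15), emergency=True,
                   override_reason="Patching a remotely exploitable backdoor"))  # True
```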

Likewise, there are other solutions out there that prototype the entire network to figure out what will happen when you make a change. Companies like Forward Networks and Veriflow can model your network and assess the impact of a change before you commit to it. It’s the dream of a bored engineer, because you can run simulations to your heart’s content to find out whether two hours of code upgrades will really get you that 2% performance increase promised in that blog post. And for the CFO/CEO/CIO screaming at you to prioritize their traffic, these solutions can show them that most of that traffic is YouTube and Spotify, and that putting it in AF41 would cause massive issues for them.
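For illustration only, the idea looks something like this: build a model of the topology, apply the candidate change to a copy, and re-check the invariants you care about before anything touches production. The toy reachability model and invariant below are my own sketch, not how Forward Networks or Veriflow actually work under the hood.

```python
from collections import deque

def reachable(adjacency: dict, src: str) -> set:
    """All nodes reachable from src in a toy adjacency-map model of the network."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def verify_change(adjacency, change, invariants):
    """Apply a candidate change to a copy of the model and re-check the invariants.

    Returns the invariants the changed model would violate, so the change can
    be rejected before anyone touches a production device.
    """
    candidate = {node: set(neighbors) for node, neighbors in adjacency.items()}
    change(candidate)
    return [name for name, check in invariants.items() if not check(candidate)]

topology = {
    "branch": {"core1"},
    "core1": {"branch", "core2"},
    "core2": {"core1", "dc"},
    "dc": {"core2"},
}
invariants = {"branch can reach dc": lambda m: "dc" in reachable(m, "branch")}

def remove_core_link(model):
    # The candidate change: pull the link between the two core switches.
    model["core1"].discard("core2")
    model["core2"].discard("core1")

print(verify_change(topology, remove_core_link, invariants))  # ['branch can reach dc']
```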

What’s important is that you and the rest of the team realize that change freezes aren’t a solution to the problem of an unstable network. Instead, they treat the symptoms that crop up from the underlying disease: the network not being a deterministic system. Unlike some other machines, networks run just fine at sub-optimal performance levels. You can make massive mistakes that will live in a network for years and never show their ugly face. That is, until you make a small change that upsets the equilibrium and causes the whole system to fail, cascade-style, leaving you holding the keyboard, as it were.


Tom’s Take

I both love and hate Change Freeze season. I know it’s for the best, because any changes that get made during this time will ultimately result in long hours at work undoing those changes. I also know that the temptation to experiment with things is very, very strong this time of year. But I feel like Change Freeze season will soon go the way of the aluminum Christmas tree once we get change management and deterministic network modeling systems in place to verify changes on a system-wide basis instead of just sanity-checking configs at a device level. Tracking, prototyping, and verification will solve our change freeze problems eventually. And that will make it the most wonderful time of the year all year long.