Automating Change With Help From Fibonacci

FibonacciShell

A few recent conversations that I’ve seen and had with professionals about automation have been very enlightening. It all started with a post on StackExchange about an unsuspecting user that tried to automate a cleanup process with Ansible and accidentally erased the entire server farm at a service provider. The post was later determined to be a viral marketing hoax but was quite believable to the community because of the power of automation to make bad ideas spread very quickly.

Better The Devil You Know

Everyone in networking has been in a place where they’ve typed in something they shouldn’t have. Whether you removed the management network you were using to access the switch or created an access list that denied packets that locked you out of something. Or perhaps you typed an errant debug command that forced you to drive an hour to reboot a switch that was no longer responding. All of these things seem to happen to people as part of the learning process.

But how many times have we typed something in to create a change and found that it broke more than we expected? Like changing a native VLAN on a trunk and bringing down a link we didn’t intend to affect? These unforeseen accidents are the kinds of problems that can easily be magnified by scripts or automation.

I wrote a post about people blaming tools for SolarWinds a couple of months ago. In it, there was a story about a person that uploaded the wrong switch firmware to a server and used an automated tool to kick off an upgrade of his entire network. Only after the first switch failed to return to normal did he realize that he had downloaded an incorrect firmware to the server. And the command he used to kick off the upgrade was not the safe version of the command that checks for compatibility. Instead, it was the quick version of the command that copied the firmware directly into flash and rebooted the switch without confirmation.

While people are quick to blame tools for making mistakes race through the network quickly it should also be realized that those issues would be mistakes no matter what. Just because a system is capable of being automated doesn’t mean that your commands are exempt from being checked and rechecked. Too often a typo or an added word somewhere in the mix causes unintended chaos because we didn’t take the time to make sure there were no problems ahead of time.

Fibbonaci Style

I’ve always tried to do testing and regression in a controlled manner. For some places that have simulators and test networks to try out changes this method might still work. But for those of us that tend to fly by the seat of our pants on production devices, it’s best to artificially limit the damage before it becomes too great. Note that this method works with automation systems too provided you are controlling the logic behind it.

When you go out to test a network-wide change or perform an upgrade, pick one device as your guinea pig. It shouldn’t be something pushing massive production traffic like a core switch. Something isolated in the corner of the network usually works just fine. Test your change there outside production hours and with a fully documented blackout plan. Once you’ve implemented it, don’t touch anything else for at least a day. This gives the routing tables time to settle down completely and all of the aging timers a chance to expire and tables to recompile. Only then can you be sure that you’re not dealing with cached information.

Now that you’ve done it once, you’re ready to make it live to 10,000 devices, right? Abosolutely not. Now that you’ve proven that the change doesn’t cause your system to implode and take the network with it, you pick another single device and do it again. This time, you either pick a neighbor device to the first one or something on the other side of the network. The other side of the network ensures that changes don’t ripple across between devices over the 24-hour watch period. On the other hand, if the change involves direct connectivity changes between two devices you should test them to be sure that the links stay up. Much easier to recover one failed device than 40 or 400.

Once you follow the same procedure with the second upgrade and get clean results, it’s time to move to doing two devices at once. If you have a fancy automation system like Ansible or Puppet this is where you will be determining that the system is capable of handling two devices at once. If you’re still using scripts, this is where you make sure you’re pasting the right info into each window. Some networks don’t like two devices changing information at the same time. Your routing table shouldn’t be so unstable that a change like this would cause problems but you never know. You will know when you’re done with this.

Now that you’ve proven that you can make changes without cratering a switch, a link, or the entire network all at once, you can continue. Move on to three devices, then five, then eight. You’ll notice that these rollout plans are following the Fibonacci Sequence. This is no accident. Just like the appearance of these numbers in nature and math, having a Fibonacci rollout plan helps you evaluate changes and rollback problems before they grow out of hand. Just because you have the power to change the entire network at once doesn’t mean that you should.


Tom’s Take

Automation isn’t the bad guy here. We are. We are fallible creatures that make mistakes. Before, those mistakes were limited to how fast we could log into a switch. Today, computers allow us to make those mistakes much faster on a larger scale. The long-term solution to the problem is to test every change ahead of time completely and hope that your test catches all the problems. If it doesn’t then you had better hope your blackout plan works. But by introducing a rolling system similar to the Fibonacci sequence above. I think you’ll find that mistakes will be caught sooner and problems will be rectified before they are amplified. And if nothing else, you’ll have lots of time to explain to your junior admin all about the wonders of Fibonacci in nature.

Troubleshooting and Triage

When troubleshooting any major issue, people tend to feel a bit lost at first.  There is the crowd that wants to fix the immediate problem.  Then there is the group that wants to look at everything going on and address the root problem no matter how long it takes.  The key to troubleshooting is to realize how each of these approaches has their place and how they are both right and wrong at the same time.

The first approach is triage.  Think of it like a medical emergency room.  Their purpose is to fix the immediate symptoms and stabilize the patient.  Especially critical is the stabilization part.  You can’t fix a network that has bouncing routes or intermittent bridging loops.  Often the true root cause of the problem is buried beneath a pile of other symptoms.  Only when the immediate issues are resolved does the real problem surface.  Learning how to triage problems is a very important troubleshooting skill.  It gives a quick response while allowing the worst of the issue to be dealt with.

It’s important to remember that triage is just a quick fix.  Emergency rooms would never triage a patient without following up with a more in-depth consult or return visit.  Triage fails when engineers leave the patch in place and consider it the final solution.  Most times that I’ve seen this approach have been due to time constraints.  Rather than spending the time to research and test to find the true problem people are content to make the majority of the symptoms go away no matter how briefly.  It happens all the time.

“Just make it work for now.  We’ll fix it later.”

“If we configure it like this, will it stay up until the end of the quarter?”

“We don’t have time to debate this.  The CEO wants things up NOW!”

True in-depth troubleshooting is what happens when we have time and a clear way to solve the deeper root issues.  Deep troubleshooting figures out that the cause of a route flap is actually a bad Ethernet cable.  That’s not something you can easily determine from a quick analysis.  It takes time and effort to figure out.  When I worked on an inbound desktop help desk, we tested for CD-ROM failures by flipping the IDE cables back and forth on the IDE ports on the motherboard.  In part, this was to test to ensure the drive failure followed the switch of cables and ports.  In addition, it also tested the cable and port to make sure the dead drive wasn’t masking a bigger failure.  It took more time to do it properly but we never ran into an issue where a good CD-ROM drive was returned and the problem persisted.

In-depth troubleshooting can fail when there are so many problems masking the real issue that you start trying to fix the wrong problem.  Tunnel vision is easy to get when working on a problem.  If you tunnel in on an ancillary symptom and fail to fix the root cause you aren’t really doing much better than simple triage.  Just like a doctor, you need to ensure that you are treating the real problem under all the symptoms.  Remember not to be sidetracked by each small issue you uncover.  Fix them and keep digging for the real issue underneath it all.


Tom’s Take

I’ve had a lot of people comment that I was able to figure out problems quickly.  They also liked how I was able to “fix” things quickly.  That’s because I was very good at triage.  In my job as a VAR engineer, I didn’t really have time to dig deeper into the issue to uncover root cause.  Thankfully, a couple of the guys that I worked with were the exact opposite of me.  They loved digging into problems and pulling everything apart until they found the real issue.  They were labeled “slow” or “methodical” by some.  I loved working with them because the complemented my style perfectly.  I fix the big issues and make people happy.  They fix the underlying cause and keep them that way.  Just like ER doctors and specialists.  We both have our place.  It’s important to realize which is more important at a given time.