A few recent conversations that I’ve seen and had with professionals about automation have been very enlightening. It all started with a post on StackExchange about an unsuspecting user that tried to automate a cleanup process with Ansible and accidentally erased the entire server farm at a service provider. The post was later determined to be a viral marketing hoax but was quite believable to the community because of the power of automation to make bad ideas spread very quickly.
Better The Devil You Know
Everyone in networking has been in a place where they’ve typed in something they shouldn’t have. Whether you removed the management network you were using to access the switch or created an access list that denied packets that locked you out of something. Or perhaps you typed an errant debug command that forced you to drive an hour to reboot a switch that was no longer responding. All of these things seem to happen to people as part of the learning process.
But how many times have we typed something in to create a change and found that it broke more than we expected? Like changing a native VLAN on a trunk and bringing down a link we didn’t intend to affect? These unforeseen accidents are the kinds of problems that can easily be magnified by scripts or automation.
I wrote a post about people blaming tools for SolarWinds a couple of months ago. In it, there was a story about a person that uploaded the wrong switch firmware to a server and used an automated tool to kick off an upgrade of his entire network. Only after the first switch failed to return to normal did he realize that he had downloaded an incorrect firmware to the server. And the command he used to kick off the upgrade was not the safe version of the command that checks for compatibility. Instead, it was the quick version of the command that copied the firmware directly into flash and rebooted the switch without confirmation.
While people are quick to blame tools for making mistakes race through the network quickly it should also be realized that those issues would be mistakes no matter what. Just because a system is capable of being automated doesn’t mean that your commands are exempt from being checked and rechecked. Too often a typo or an added word somewhere in the mix causes unintended chaos because we didn’t take the time to make sure there were no problems ahead of time.
I’ve always tried to do testing and regression in a controlled manner. For some places that have simulators and test networks to try out changes this method might still work. But for those of us that tend to fly by the seat of our pants on production devices, it’s best to artificially limit the damage before it becomes too great. Note that this method works with automation systems too provided you are controlling the logic behind it.
When you go out to test a network-wide change or perform an upgrade, pick one device as your guinea pig. It shouldn’t be something pushing massive production traffic like a core switch. Something isolated in the corner of the network usually works just fine. Test your change there outside production hours and with a fully documented blackout plan. Once you’ve implemented it, don’t touch anything else for at least a day. This gives the routing tables time to settle down completely and all of the aging timers a chance to expire and tables to recompile. Only then can you be sure that you’re not dealing with cached information.
Now that you’ve done it once, you’re ready to make it live to 10,000 devices, right? Abosolutely not. Now that you’ve proven that the change doesn’t cause your system to implode and take the network with it, you pick another single device and do it again. This time, you either pick a neighbor device to the first one or something on the other side of the network. The other side of the network ensures that changes don’t ripple across between devices over the 24-hour watch period. On the other hand, if the change involves direct connectivity changes between two devices you should test them to be sure that the links stay up. Much easier to recover one failed device than 40 or 400.
Once you follow the same procedure with the second upgrade and get clean results, it’s time to move to doing two devices at once. If you have a fancy automation system like Ansible or Puppet this is where you will be determining that the system is capable of handling two devices at once. If you’re still using scripts, this is where you make sure you’re pasting the right info into each window. Some networks don’t like two devices changing information at the same time. Your routing table shouldn’t be so unstable that a change like this would cause problems but you never know. You will know when you’re done with this.
Now that you’ve proven that you can make changes without cratering a switch, a link, or the entire network all at once, you can continue. Move on to three devices, then five, then eight. You’ll notice that these rollout plans are following the Fibonacci Sequence. This is no accident. Just like the appearance of these numbers in nature and math, having a Fibonacci rollout plan helps you evaluate changes and rollback problems before they grow out of hand. Just because you have the power to change the entire network at once doesn’t mean that you should.
Automation isn’t the bad guy here. We are. We are fallible creatures that make mistakes. Before, those mistakes were limited to how fast we could log into a switch. Today, computers allow us to make those mistakes much faster on a larger scale. The long-term solution to the problem is to test every change ahead of time completely and hope that your test catches all the problems. If it doesn’t then you had better hope your blackout plan works. But by introducing a rolling system similar to the Fibonacci sequence above. I think you’ll find that mistakes will be caught sooner and problems will be rectified before they are amplified. And if nothing else, you’ll have lots of time to explain to your junior admin all about the wonders of Fibonacci in nature.