What Is Closed-Loop Automation?

During Networking Field Day 22 last week, a lot the questions that were directed at the presenters had to do with their automation systems. One term kept coming up that I was embarrassed to admit that I’d never heard of. Closed-loop automation is the end goal for these systems. But what is closed-loop automation? And why is it so important. I decided to do a little research and find out.

Open Up

To understand closed-loop systems, you have to understand open-loop systems first. Thankfully, those are really simple. Open-loop systems are those where the output isn’t directly affected by the control actions of the system. It’s a system where you’re going to get the output no matter how you control it. The easiest example is a clothes dryer. There are a multitude of settings that you can choose for a clothes dryer, including the timing of the cycle. But no matter what, the dryer will stop at the end of the cycle. There’s no sensor in a basic clothes dryer that senses the moisture level of the clothes and acts accordingly.

Open-loop systems are stable and consistent. Every time you turn on the dryer, it will run until it finishes. There’s no variable in the system that will change that. Aside from system failure, it’s going to run exactly 30 minutes every time it’s set to that cycle. It’s also not going to run unless you set the cycle. As my family will tell you, putting clothes in the dryer and not setting it will not result in magic happening.

Close It Off

In contrast, closed-loop systems have outputs that are dependent upon the control function of the system. If the control function requires something change in the system to achieve the desired output then it will change that thing to get there.

The most classic example of a closed-loop system is the HVAC system in your house. The control function is the thermostat. If you want the temperature in your house to be 70 degrees Fahrenheit (21 degrees C), you set the thermostat and let the system take care of things. If the temperature falls below the required setting, the heating unit will turn on and bring the temperature up to the required level before shutting off. In the summertime, rising above the temperature setting will cause the air conditioning compressor to kick on and cool things down.

Closed loop systems are great because you set them and forget them. Unlike the dryer example above I can set my thermostat and it will run even if I forget to go turn on the heater/AC. But they’re also more complicated to troubleshoot and figure out. As someone with very little practical knowledge of the operation of HVAC it’s rough to figure out if it’s the thermostat or the unit or some other relay somewhere that’s causing your house to be too warm or too cold.

Closed loop systems can also take more inputs given the right control settings. Using the same A/C example, I upgraded my thermostat from a basic model to one from Ecobee. Once I got it installed I had a lot of extra control over what I could do with it. For example, I could now have the settings in the house run based on time-of-day instead of just one basic setting all the time. If I wanted it colder at night I could tell the system to look at the time and change the setting until it was sunrise. I could also tell it to look for me to be home (using geolocation) and raise and lower the temperature if my geotoken, in this case my phone, wasn’t in the area. The possibilities are endless because the system is driven by those inputs.

Automatic for the Non-people

Let’s extend the idea of closed-loop systems to network automation. Now, you can make a system (the network) behave a certain way based on inputs to the control functions. This is a massive change from the steady-state that we’ve worked years to achieve. The system can now react to changes in state or inputs. Massive file transfer activity being done between two branch locations? Closed-loop automation can reprogram the edge SD-WAN gateways to implement QoS policies based on the traffic types to preserve bandwidth for voice calls or critical application traffic. When the transfers are done the system can clean up the policies.

Because closed-loop automation can do a wide variety of actions based on inputs, data becomes super valuable. The information your system is providing as feedback can create more stable results. Open-loop systems are super stable because they are incapable of change. They also run every time someone tells them to run. They require intervention. Closed-loop systems are capable of running without the need for people based solely on the data you get from the system. But they also have issues because bad data or inputs can cause the system to react in strange ways. For example, if the thermostat in a house is placed in direct sunlight or has an error that causes it to think the house is 90 degrees, the A/C compressor may kick on even if the house temperature is far, far below that. Data has to be correct for the system to work as intended.


Tom’s Take

The promise of closed-loop automation is exciting. The ability for the network to run without our help is music to my ears. But it also means we have to be more diligent about keeping the control functions of the system working properly with the correct data inputs. It also means we need to monitor the control system outputs to head off problems before they can impact the reliability of the system. I can’t wait to see how we continue to close the loop and create better, more responsive systems in the future.

The Development of DevNet’s Future

You’re probably familiar with Cisco DevNet. If not, DevNet is the place Cisco has embraced outreach to the developer community building for software-defined networking (SDN). Though initially cautious in getting into the software developer community, Cisco has embraced their new role and really opened up to help networking professionals embrace the new software normal in networking. But where is DevNet going to go from here?

Humble Beginnings

DevNet wasn’t always the darling of Cisco’s offerings. I can remember sitting in on some of the first discussions around Cisco OnePK and thinking to myself, “This is never going to work.”

My hesitation with Cisco’s first attempts to focus on software platforms came from two places. The first was what I saw as Cisco trying to figure out how to extend the platforms to include some programmability. It was more about saying they could do software and less about making that software easy to use or program against. The second place was actually the lack of a place to store all of this software knowledge. Programmers and developers are fickle lot and you have to have a repository where they can get access to the pieces they needed.

DevNet was that place that Cisco should have built from the start. It was a way to get people excited and involved in the process. But it wasn’t for everyone at first. If you don’t speak developer you’re going to feel lost. Even if you are completely fluent in networking and you know what you want to accomplish, just not how to get there. DevNet started off as the place to let the curious learn how to combine networking and programming.

The Ascent

DevNet really came into their own about 3 years ago. I use that timeline because that’s when I first heard that people were wanting to spend more time at Cisco Live in the DevNet Zone learning programming and other techniques and less time in traditional sessions. Considering the long history of Cisco Live that’s an impressive feat.

More importantly, DevNet changed the conversation for professionals. Instead of just being for the curious, DevNet became a place where anyone could go and find the information they needed. It became a resource. Not just a playground. Instead of poking around and playing with things it became a place to go and figure things out. Or a place to learn more about a new technology that you wanted to implement, like automation. If the regular sessions at Cisco Live were what you had to learn, DevNet is where you wanted to go and learn.

Susie Wee (@SusieWee) deserves all the credit in the world here. She has seen what the developer community needs to thrive inside of Cisco and she’s delivered it. She’s the kind of ambassador that can go between the various Cisco business units (BUs) and foster the kind of attitudes that people need to have to succeed. It’s no longer about turf wars or fiefdoms. Instead, it’s about leveraging a common platform for developers and networkers alike to find a common ground to build from. But even that’s not enough to complete the vision.

Narrow of Purpose, Wide of Vision

During Cisco Live 2019, I talked quite a bit with Susie and her team. And one of things that struck me from our conversations was not how DevNet was an open and amazing place. Or how they were adding sessions as fast as they could find instructors. It was that so many people weren’t taking advantage of it. That’s when I realized that DevNet needs to shift their focus. Instead of just providing a place for networking people to learn, they’re going to have to go on the offensive.

DevNet needs to enhance and increase their outreach programs. Being a static resource is fine when your audience is eager to learn and looking for answers. But those people have already flocked to the DevNet banner. For things to grow, DevNet needs to pull the laggards along. The people who think automation is just a fad. Or that SDN is in the Trough of Disillusionment from a pretty Gartner graphic. DevNet has momentum, and soon will have the certification program needed to help networking people show off their transformation to developers.


Tom’s Take

For DevNet to really succeed, they need to be grabbing people by the collar and dragging them to the new reality of networking. It’s not enough to give people a place to do research on nice-to-have projects. You’re going to have get the people engaged and motivated. That means committing resources to entry-level outreach. Maybe even building a DevNet Academy similar to the Cisco Academy. But it has to happen. Because the people that aren’t already at DevNet aren’t going to get there on their own. They need a push (or a pull) to find out what they don’t know that they don’t know.

IT And The Exception Mentality

If you work in IT, you probably have a lot in common with other IT people. You work long hours. You have the attitude that every problem can be fixed. You understand technology well enough to know how processes and systems work. It’s fairly common in our line of work because the best IT people tend to think logically and want to solve issues. But there’s something else that I see a lot in IT people. We tend to focus on the exceptions to the rules.

Odd Thing Out

A perfectly good example of this is automation. We’ve slowly been building toward a future when software and scripting does the menial work of network administration and engineering. We’ve invested dollars and hours into making interfaces into systems that allow us to repeat tasks over and over again without intervention. We see it in other areas, like paperwork processing and auto manufacturing. There are those in IT, especially in networking, that resist that change.

If you pin them down on it, sometimes the answers are cut and dried. Loss of job, immaturity of software, and even unfamiliarity with programming are common replies. However, I’ve also heard a more common response growing from people: What happens when the automation screws up? What happens when a system accidentally upgrades things when it shouldn’t? Or a system disappears because it was left out of the scripts?

The exception position is a very interesting one to me. In part, it stems from the fact that IT people are problem solvers. This is especially true if you’ve ever worked in support or troubleshooting. You only see things that don’t work correctly. Whether it’s software misconfiguration, hardware failure, or cosmic rays, there is always something that is acting screwy and it’s your job to fix it. So the systems that you see on a regular basis aren’t working right. They aren’t following the perfect order that they should be. Those issues are the exceptions.

And when you take someone that sees the exceptions all day long and tries to fix them, you start looking for the exceptions everywhere in everything. Instead of wondering how a queue moves properly at the grocery store you instead start looking at why people aren’t putting things on the conveyor belt properly. You start asking whey someone brought 17 items into the 15-items-or-less line. You see all the problems. And you start trying to solve the issues instead of letting the process work.

Let’s step back to our automation example. What happens when you put the wrong upgrade image on a device? Was there a control in place to prevent you from doing that? Did you go around it because you didn’t think it was the right way to do something? I’ve heard stories of the upgrade process not validating images because they couldn’t handle the errors if the image didn’t match the platform. So the process relies on a person to validate the image is correct in the first place. And that much human interaction in a system will still cause issues. Because people make mistakes.

Outlook on Failure

But whether or not a script is going to make an error doesn’t affect the way we look at them and worry. For IT people that are too focused on the details, the errors are the important part. They tend to forget the process. They don’t see the 99% of devices that were properly upgraded by the program. Instead, they only focus on the 1% that weren’t. They’re only interested in being proved right by their suspicions. Because they only see failed systems they need the validation that something didn’t go right.

This isn’t to say that focusing on the details is a bad thing. It’s almost necessary for a troubleshooter. You can’t figure out where a process breaks if you don’t understand every piece of it. But IT people need to understand the big picture too. We’re not automating a process because we want to create exceptions. We’re doing it because we want to reduce stress and prevent common errors. That doesn’t mean that all errors are going to go away overnight. But it does mean that we can prevent common ones from occurring.

That also means that the need for expert IT people to handle the exceptions is even more important. Because if we can handle the easy problems, like VLAN typos or putting in the wrong subnet range, you can better believe that the real problems are going to take a really smart person to figure out! That means the real value of an expert troubleshooter is using the details to figure out this exception. So instead of worrying about why something might go wrong, you can instead turn your attention to figuring out this new puzzle.


Tom’s Take

I know how hard it is to avoid focusing on exceptions because I’m one of the people that is constantly looking for them. What happens if this redistribution crashes everything? What happens if this switch isn’t ready for the upgrade? What happens if this one iPad can’t connect to the wireless. But I realize that the Brave New World of IT automation needs people that can focus on the exceptions. However, we need to focus on them after they happen instead of worrying about them before they occur.

Managing Automation – Fighting Fear of Job Justification

Dear Employees

 

We have decided to implement automation in our environment because robots and programs are way better than people. We will need you to justify your job in the next week or we will fire you and make you work in a really crappy job that doesn’t involve computers while we light cigars with dollar bills.

 

Sincerely, Management

The above letter is the interpretation of the professional staff of your organization when you send out the following email:

We are going to implement some automation concepts next week. What are some things you wish you could automate in your job?

Interpretations differ as to the intent of automation. Management likes the idea of their engineering staff being fully tasked and working on valuable projects. They want to know their people are doing something productive. And the people that aren’t doing productive stuff should either be finding something to do or finding a new job.

Professional staff likes being fully tasked and productive too. They want to be involved in jobs and tasks that do something cool or justify their existence to management. If their job doesn’t do that they get worried they won’t have it any longer.

So, where is the disconnect?

You Do Exist (Sort of)

The problem with these interpretations comes down to the job itself. Humans can get very good at repetitive, easy jobs. Assembly line works, quality testers, and even systems engineers are perfect examples of this. People love to do something over and over again that they are good at. They can be amazing when it comes to programming VLANs or typing out tweets for social media. And those are some pretty easy jobs!

Consistency is king when it comes to easy job tasks. When I can do it so well that I don’t have to think about things any more I’ve won. When it comes out the same way every time no matter how inattentive I am it’s even better. And if it’s a task that I can do at the same time or place every day or every week then I’m in heaven. Easy jobs that do themselves on a regular schedule are the key to being employed forever.

Automatic For The Programs

Where does that sound more familiar in today’s world? Why, automation of course! Automation is designed to make easy, repeatable jobs execute on a schedule or with a specific trigger. When that task can be done by a program that is always at work and never calls in sick or goes on vacation you can see the allure of it to management. You can also see the fear in the eyes of the professional that just found the perfect role full of easy jobs that can be scheduled on their calendar.

Hence the above interpretation of the automation email sample. People fear change. They fear automation taking away their jobs. Yet, the jobs that are perfect for automation are the kinds of things that shouldn’t be jobs in the first place. Professionals in a given discipline are much, much smarter than just doing something repetitively over and over again like VLAN modifications or IP addressing of interfaces. More importantly, automation provides consistency. Automation can be programmed to pull from the correct database or provide the correct configuration every time without worry of a transcription mistake or data entry in the wrong field.

People want these kinds of jobs because they afford two important things: justification of their existence and free time at work. The former ensures they get to have a paycheck. The latter gives them the chance to find some kind of joy in their job. Sure, you have some kind of repetitive task that you need to do every day like run a report or punch holes in a sheet of metal. But when you’re not doing that task you have the freedom to do fun stuff like learn about other parts of your job or just relax.

When you take away the job with automation, you take away the cover for the relaxation part. Now, you’ve upset the balance and forced people to find new things to do. And that means learning. Figuring out how to make tasks easy and repetitive again. And that’s not always possible. Hence the fear of automation and change.

Building A Better Path To Automation

How do we fix this mess? How can we motivate people to embrace automation? Well, it’s pretty simple:

  1. Help Your Team See The Need – If your teams think think they’re going to lose their jobs because of automation, they’re not going to embrace it. You need to show them that not only are they not going to lose their jobs but how automation will make their jobs easier and better. Remember to frame your arguments along the lines of removing mistakes and not needing to worry about justifying your existence in a role. That should encourage everyone to look for new challenges to overcome.
  2. Show the Value – This goes with the first part somewhat, but more than showing the need for automation with mistake reduction or schedule easing, you also need to show value. If a person has never made a mistake or has built their schedule around repetitive tasks they are going to hate automation. Show them what they can do now that their roles don’t have to focus on the old stuff they did. Help them look at where they can provide additional value. Even if it starts off by monitoring the automation platform to make sure it’s executing correctly. Maybe the value they can provide is finding new things to automate!
  3. Embrace the Future – Automation allows people to learn how to do new things. They can focus on new skills or roles that help support the business in a better way. More automation means more complexity to understand but also a chance for people to shine in new roles. The right people will see a challenge as something to be overcome. Help them set new goals. Help them get where they want to be. You’ll be surprised how quickly they will get there with the right leadership.

Tom’s Take

Automation isn’t going to steal jobs. It will force people to examine their tasks and decide how important they really are. The people that were covering their basic roles and trying to skate by are going to leave no matter what. Even if your automation push fails these marginal people are going to leave for greener pastures thanks to the examination of what they’re actually doing. Don’t let the pushback discourage you in the short term. Automation isn’t the goal. Automation is the tool to get you to the true goal of a smoother, more responsive team that accomplishes more and can reacher higher goals.

Automating Documentation

Tedium is the enemy of productivity. The fastest way for a task to not be done is to make it long, boring, and somewhat complicated. People who feel that something is tedious or repetitive are the ones more likely to marginalize a task. And I think I speak for the entire industry when I say that there is no task more tedious and boring than documentation. So how can we fix it?

Tell Me What You Did

I’m not a huge fan of documentation. When I decide on a plan of action, I rarely write it down step-by-step unless I’m trying to train someone. Even then, it looks more like notes with keywords instead of a narrative to follow. It’s a habit that has been borne out of years of firefighting in networks and calls to “do it faster”. The essential items of a task are refined and reduced until all that remains is the work and none of the ancillary items, like documentation.

Based on my previous life as a network engineer, I can honestly say that I’m not alone in this either. My old company made lots of money doing network discovery engagements. Sometimes these came because the previous admins walked out the door with no documentation. Other times, it was simply because the network had changed so much since the last person made any notes that what was going on didn’t resemble anything like what they thought it was supposed to look like.

This happens everywhere. It doesn’t take many instances of an network or systems professional telling themselves, “Oh, I’ll write it down later…” for later to never come. Devices get added, settings get changed, and not one word is ever written down. That’s the kind of chaos that causes disorganization at best and outages at worst. And I doubt there’s any networking pro out there that hasn’t been affected by bad documentation at one time or another.

So, how do we fix documentation? It’s tedious for sure. Requiring it as part of the process just invites people to find ways around it. And good documentation takes time. Is there a way to combine the lack of time, lack of requirement, and repetition and make documentation something that is done again? I think there is. And it requires a little help from process.

Not Too Late To Automate

Automation is a big thing right now. SDN is driving it. Network complexity is practically requiring it. Yet networking professionals are having a hard time embracing it. Why?

In part, networking pros don’t like to spend hours solving a problem that can be done in minutes. If you don’t believe me, watch one of the old SNL Nick Burns sketches. Nick is more likely to tell you to move than tell you how to fix your problem. Likewise, if a network pro is spending four hours writing an automation script that is supposed to execute a change that can be made in 20 minutes, they’re not going to want to do it. It’s just the nature of the job and the desire of the network professional to make every minute count.

So, how can we drive adoption of automation? As it turns out, automating documentation can be a huge driver. Automation of tedious tasks is exactly the thing that scripting and automation was designed to solve. Instead of focusing on the automation of the task, like adding VLANs to a set of switches, focus on the ability of the system to create documentation on the fly from the change.

Let’s walk through an example. In order for documentation to matter, it has to answer the 5 Ws. How can we automate that?

Let’s start with Who. Automation can create documentation saying user Hollingsworth made a change through an automated process. That helps the accounting side of the house figure out the person making changes in the network. If that person is actually a script, the Who can be changed to reflect that it was an automated process called by a person related to a change ticket. That gives everyone the ability to track the changes back to a given problem. And it can all be pulled in without user intervention.

What is also an easy automation task. List the configuration being applied. At first, the system can simply list the configuration to be programmed. But for menial and repetitive tasks like VLAN additions you can program the system with a real description like “Adding VLANs to $Switch to support $ticket”. Those variables can be autopopulated based on the work to be done. Again, we reference a ticket number in order to prove that these changes are coming from somewhere.

When is also critical. Are these changes happening in a maintenance window? Or did someone check them in in the middle of the day because they won’t cause any problems? (SPOILER ALERT: They will) By required a timestamp for changes, you can track which professionals are being cavalier with their change management. You can also find out if someone is getting into the system after hours to cause problems or attempt to compromise things. Even if the cause of the change is “immediately” due to downtime or emergency, knowing why it had to be checked in right away is a clue to finding problems that recur in the network.

Where is a two-pronged reason. It’s important to check where the changes are going to be applied. Is it going to be done to all switches in the organization? Or just a set in a remote office. Sanity checking via documentation will keep you from bricking your entire organization in one fell swoop. Likewise, knowing where the change is being checked in from is important. Is a remote office trying to change config on HQ switches? Is a remote engineer dialed in making changes related to an open support case? Is someone from a foreign nation making changes via VPN at 4:30am local time? In every case, you’d really want to know what’s going on before those changes get made.

Why is the one that will trip up everything. If you don’t believe me, I’d like to give you the top two reasons why Windows Server 2003 is shut down and rebooted with the shutdown justification dialog box:

  1. a;lkdjfalkdflasdfkjadlf;kja;d
  2. JUST ****ing SHUT DOWN!!!!!

People don’t like justifying their decisions. Even when I worked for Gateway 2000 on their national help desk, our required call documentation was a bit spotty when it came to justification for changes. Why did you decide to FDISK and reload? Why are you going into the registry to fix the icon colors? Change justification is half of documentation. It gives people something to audit. It gives people a way to look at things and figure out why you started down the path of a particular reasoning for problem solving. It also provides context for you after the fact when you can’t figure out why you did it the way you did.


Tom’s Take

Automation isn’t going to take away your job. Automation is going to do the jobs you hate doing. It’s going to make your life easier to concentrate on the tasks that need to be done by freeing you from the tasks that should be done and aren’t. If we can make automation document our networks for just six months, I think you’ll find the value in programming things to work this way. I also think you’ll be happier with the level of detail on your network. And once you can prove the value of automating just one task to your teams, I’m sure they’ll see the value of increasing automation all around.

Automating Your Job Away Isn’t Easy

programming

One of the most common complaints about SDN that comes from entry-level networking folks is that SDN is going to take their job away. People fear what SDN represents because it has the ability to replace their everyday tasks and put them out of a job. While this is nowhere close to reality, it’s a common enough argument that I hear it very often during Q&A sessions. How is it that SDN has the ability to ruin so many jobs? And how is it that we just now have found a way to do this?

Measure Twice

One of the biggest reasons that the automation portion of SDN has become so effective in today’s IT environment is that we can finally measure what it is that networks are supposed to be doing and how best to configure them. Think about the work that was done in the past to configure and troubleshoot networks. It’s often a very difficult task that involves a lot of intuition and guesswork. If you tried to explain to someone the best way to do things, you’d likely find yourself at a loss for words.

However, we’ve had boring, predictable standards for many years. Instead of cobbling together half-built networks and integrating them in the most obscene ways possible, we’ve instead worked toward planning and architecting things properly so they are built correctly from the ground up. No more guess work. No more last minute decisions that come back to haunt us years down the road. Those kinds of things are the basic building blocks for automation.

When something is built along the lines of predictable rules with proper adherence to standards, it’s something that can be understood by a non-human. Going all the way back to Basic Computing 101, the inputs of a system determine the outputs. More simply, Garbage In, Garbage Out. If your network configuration looks like a messy pile of barely operational commands it will only really work when a human can understand what’s going on. Machines don’t guess. They do exactly what they are told to do. Which means that they tend to break when the decisions aren’t clear.

Cut Once

When a system, script, or program can read inputs and make procedural decisions on those inputs, you can make some very powerful things happen. Provided, that is, that your chosen language is powerful enough to do those things. I’m reminded of a problem I worked on fifteen years ago during my internship at IBM. I needed to change the MTU size for a network adapter in the Windows 2000 registry. My programming language of choice wasn’t powerful enough for me to say something like, “Read these values into an array and change the last 2 or 3 to the following MTU”. So instead, I built a nested if statement that was about 15 levels deep to ensure I caught every possible permutation of the adapter binding order. It was messy. It was ugly. And it worked. But there was no way it would scale.

The most important thing to realize about SDN and automation is that we’ve moved past simply understanding basic values. We’ve finally graduated to a place where programs can make complex decisions based on a number of inputs. We’ve graduated from simple if-then-else constructs and up to a point where programs can take a number of inputs and make decisions based on them. Sure, in many cases the inputs are simple little things like tags or labels. But what we’re gaining is the ability to process more and more of those labels. We can create provisioning scripts that ensure that prod never talks to dev. We can automate turn-up of a new switch with multiple VLANs on different ports through the use of labels and object classes. We can even extrapolate this to a policy-based network language that we can use to build a task once and execute it over and over again on different hardware because we’re doing higher level processing instead of being hamstrung by specific device syntax.

Automation is going to cost some people their jobs. That’s a given. Just like every other manufacturing position, the menial tasks of assembling simple pieces or performing repetitive tasks can easily be accomplished by a machine or software construct. But writing those programs and working on those machines is a new kind of job in and of itself. A humorous anecdote from the auto industry says that the introduction of robots onto assembly lines caused many workers to complain and threaten to walk off the job. However, one worker picked up the manual for the robot and realized that he could easily start working on the it instead of the assembly line.


Tom’s Take

Automation isn’t a magic bullet to fix all your problems. It only works if things are ordered and structured in such a way that you can predictably repeat tasks over and over. And it’s not going to stop with one script or process. You need to continue to build, change, and extend your environment. Which means that your job of programming switches should now be looked at in light of building the programs that program switches. Does it mean that you need to forget the basics of networking? No, but it does mean that they way in which you think about them will change.

Automating Change With Help From Fibonacci

FibonacciShell

A few recent conversations that I’ve seen and had with professionals about automation have been very enlightening. It all started with a post on StackExchange about an unsuspecting user that tried to automate a cleanup process with Ansible and accidentally erased the entire server farm at a service provider. The post was later determined to be a viral marketing hoax but was quite believable to the community because of the power of automation to make bad ideas spread very quickly.

Better The Devil You Know

Everyone in networking has been in a place where they’ve typed in something they shouldn’t have. Whether you removed the management network you were using to access the switch or created an access list that denied packets that locked you out of something. Or perhaps you typed an errant debug command that forced you to drive an hour to reboot a switch that was no longer responding. All of these things seem to happen to people as part of the learning process.

But how many times have we typed something in to create a change and found that it broke more than we expected? Like changing a native VLAN on a trunk and bringing down a link we didn’t intend to affect? These unforeseen accidents are the kinds of problems that can easily be magnified by scripts or automation.

I wrote a post about people blaming tools for SolarWinds a couple of months ago. In it, there was a story about a person that uploaded the wrong switch firmware to a server and used an automated tool to kick off an upgrade of his entire network. Only after the first switch failed to return to normal did he realize that he had downloaded an incorrect firmware to the server. And the command he used to kick off the upgrade was not the safe version of the command that checks for compatibility. Instead, it was the quick version of the command that copied the firmware directly into flash and rebooted the switch without confirmation.

While people are quick to blame tools for making mistakes race through the network quickly it should also be realized that those issues would be mistakes no matter what. Just because a system is capable of being automated doesn’t mean that your commands are exempt from being checked and rechecked. Too often a typo or an added word somewhere in the mix causes unintended chaos because we didn’t take the time to make sure there were no problems ahead of time.

Fibbonaci Style

I’ve always tried to do testing and regression in a controlled manner. For some places that have simulators and test networks to try out changes this method might still work. But for those of us that tend to fly by the seat of our pants on production devices, it’s best to artificially limit the damage before it becomes too great. Note that this method works with automation systems too provided you are controlling the logic behind it.

When you go out to test a network-wide change or perform an upgrade, pick one device as your guinea pig. It shouldn’t be something pushing massive production traffic like a core switch. Something isolated in the corner of the network usually works just fine. Test your change there outside production hours and with a fully documented blackout plan. Once you’ve implemented it, don’t touch anything else for at least a day. This gives the routing tables time to settle down completely and all of the aging timers a chance to expire and tables to recompile. Only then can you be sure that you’re not dealing with cached information.

Now that you’ve done it once, you’re ready to make it live to 10,000 devices, right? Abosolutely not. Now that you’ve proven that the change doesn’t cause your system to implode and take the network with it, you pick another single device and do it again. This time, you either pick a neighbor device to the first one or something on the other side of the network. The other side of the network ensures that changes don’t ripple across between devices over the 24-hour watch period. On the other hand, if the change involves direct connectivity changes between two devices you should test them to be sure that the links stay up. Much easier to recover one failed device than 40 or 400.

Once you follow the same procedure with the second upgrade and get clean results, it’s time to move to doing two devices at once. If you have a fancy automation system like Ansible or Puppet this is where you will be determining that the system is capable of handling two devices at once. If you’re still using scripts, this is where you make sure you’re pasting the right info into each window. Some networks don’t like two devices changing information at the same time. Your routing table shouldn’t be so unstable that a change like this would cause problems but you never know. You will know when you’re done with this.

Now that you’ve proven that you can make changes without cratering a switch, a link, or the entire network all at once, you can continue. Move on to three devices, then five, then eight. You’ll notice that these rollout plans are following the Fibonacci Sequence. This is no accident. Just like the appearance of these numbers in nature and math, having a Fibonacci rollout plan helps you evaluate changes and rollback problems before they grow out of hand. Just because you have the power to change the entire network at once doesn’t mean that you should.


Tom’s Take

Automation isn’t the bad guy here. We are. We are fallible creatures that make mistakes. Before, those mistakes were limited to how fast we could log into a switch. Today, computers allow us to make those mistakes much faster on a larger scale. The long-term solution to the problem is to test every change ahead of time completely and hope that your test catches all the problems. If it doesn’t then you had better hope your blackout plan works. But by introducing a rolling system similar to the Fibonacci sequence above. I think you’ll find that mistakes will be caught sooner and problems will be rectified before they are amplified. And if nothing else, you’ll have lots of time to explain to your junior admin all about the wonders of Fibonacci in nature.