Is Patching And Tech Support Bad? A Response

Hopefully, you’ve had a chance to watch this 7-minute video from Greg Ferro about why better patching systems can lead to insecure software. If you haven’t, you should:

Greg is right that moral hazard is introduced because, by definition, the party providing the software is “insured” against the risks of the party using the software. But, I also have a couple of issues with some of the things he said about tech support.

Are You Ready For The Enterprise?

I’ve been working with some Ubiquiti access points recently. So far, I really enjoy them and I’m interested to see where their product is going. After doing some research, the most common issue with them seems to be their tech support offerings. A couple of Reddit users even posted in a thread that the lack of true enterprise tech support is the key thing keeping Ubiquiti from reaching real enterprise status.

Think about all the products that you’ve used over the last couple of years that offered some kind of support other than phone or rapid response. Maybe it was a chat window on the site. Maybe it was an asynchronous email system. Hell, if you’ve ever installed Linux on your own, you know the joys of heading to Google to research an issue and digging into post after post on a long-abandoned discussion forum. DenverCoder9 would agree.

By Greg’s definition, tech support creates a negative incentive to ship good code because the support personnel are there to pick up the pieces for all your mistakes when writing code. It even incentivizes people to set unrealistic goals and push code out the door to hit milestones. On the flip side, try telling people your software is so good that you don’t need tech support. It will run great on anything every time and never mess up once. If you don’t get laughed out of the building, you should be in sales.

IT professionals are conditioned to need tech support. Even if they never call the number once, they’re going to want the option. In the VAR world, this is the “throat to choke” mentality. When the CEO comes down to find out why the network is down or their applications aren’t working, they are usually not interested in the solution. They want to yell at someone to feel like they’ve managed the problem. Yes, tech support exists to find the cause of the problem in the long run. But it’s also a safety blanket for IT professionals, someone to yell at after the CEO gets to them. Tech support is necessary. Not because of buggy code, but because of peace of mind.

How Complex Are You?

The other negative aspect Greg raises is that tech support encourages companies to create solutions more complex than they need to be because someone will always be there to help you through them. Having done nationwide tech support for a long time, I can tell you that most of the time that complexity is there for a reason. Either you hide it behind a pretty veneer or you risk having people tell you that your product is “too simple”.

A good example case here is Meraki. Meraki is a well-known SMB/SME networking solution. Many businesses run on Meraki and do it well. Consultants sell Meraki by the boatload. You know who doesn’t like Meraki? Tinkerers. People that like to understand systems. The people that stay logged into configuration dashboards all day long.

Meraki made the decision when they started building their solution that they weren’t going to have a way to expose advanced functionality in the system to the tinkerers. They either buried it and set it to do the job it was designed to do or they left it out completely. That works for the majority of their customer base. But for those that feel they need to be in more control of the solution, it doesn’t work at all. Try proposing a Meraki solution to a group of die-hard traditionalist IT professionals in a large enterprise scenario. Odds are good you’re going to get slammed and then shown out of the building.

I too have my issues with Meraki and their ability to be a large enterprise solution. It comes down to choosing to leave out features for the sake of simplicity. When I was testing my Ubiquiti solution, I quickly found the Advanced Configuration setting and enabled it. I knew I would never need to touch any of those things, yet it made me feel more comfortable knowing I could if I needed to.

Making a solution overly complex isn’t always a design decision. Sometimes the technology dictates that the solution needs to be complex. Yet we knock solutions that require more than 5 minutes to deploy. We laud companies that hide complexity and claim their solutions are simple and easy. And when they all break, we get angry because, as tinkerers, we can’t go in and do our best to fix them.


Tom’s Take

In a world where software is king, quality matters. But so do speed and reaction. You can either have a robust patching system that gives you the ability to find and repair unforeseen bugs quickly, or you can have what I call the “Blizzard Model”. Blizzard is a gaming company that has been infamous for shipping things “when they’re done” and not a moment before. If that means a project is 2-3 years late, then so be it. When you’re a game company and this is your line of work, you want it to be perfect out of the box and you’re willing to forego revenue to get it right. When you’re a software company that is expected to introduce new features on a regular cadence and keep the customers happy, it’s a bit more difficult.

Perhaps creating a patching system incentivizes people to do bad things. Or, perhaps it incentivizes them to try new things and figure out new ideas without the fear that one misstep will doom the product. I guess it comes down to whether or not you believe that people are inherently going to game the system or not.


When Redundancy Strikes

Networking and systems professionals preach the value of redundancy. When we tell people to buy something, we really mean “buy two”. And when we say to buy two, we really mean buy four of them. We try to create backup routes, redundant failover paths, and we keep things from being used in a way that creates a single point of disaster. But, what happens when something we’ve worked hard to set up causes us grief?

Built To Survive

The first problem I ran into was one I knew how to solve. I was installing a new Ubiquiti Security Gateway. I knew that as soon as I pulled my old edge router out that I was going to need to reset my cable modem in order to clear the ARP cache. That’s always a thing that needs to happen when you’re installing new equipment. Having done this many times, I knew the shortcut method was to unplug my cable modem for a minute and plug it back in.

What I didn’t know this time was that the little redundant gremlin living in my cable modem was going to give me fits. After fifteen minutes of not getting the system to come back up the way that I wanted, I decided to unplug my modem from the wall instead of the back of the unit. That meant the lights on the front were visible to me. And that’s when I saw that the lights never went out when the modem was unplugged.

Turns out that my modem has a battery pack installed since it’s a VoIP router for my home phone system as well. That battery pack was designed to run the phones in the house for a few minutes in a failover scenario. But it also meant that the modem wasn’t letting go of the cached ARP entries either. So, all my efforts to make my modem take the new firewall were being stymied by the battery designed to keep my phone system redundant in case of a power outage.

The second issue came when I went to turn up a new Ubiquiti access point. I disconnected the old Meraki AP in my office and started mounting the bracket for the new AP. I had already warned my daughter that the Internet was going to go down. I also thought I might have to reprogram her device to use the new SSID I was creating. Imagine my surprise when both my laptop and her iPad were working just fine while I was hooking the new AP up.

Turns out, both devices did exactly what they were supposed to do. They connected to the other Meraki AP in the house and used it while the old one was offline. Once the new Ubiquiti AP came up, I had to go upstairs and unplug the Meraki to fail everything back to the new AP. It took some more programming to get everything running the way that I wanted, but my wireless card had done the job it was supposed to do. It failed to the SSID it could see and kept on running until that SSID failed as well.

Finding Failure Fast

When you’re trying to troubleshoot around a problem, you need to make sure that you’re taking redundancy into account as well. I’ve faced a few problems in my life when trying to induce failure or remove a configuration issue was met with difficulty because of some other part of the network or system “replacing” my hard work with a backup copy. Or, I was trying to figure out why packets were flowing around a trouble spot or not being inspected by a security device only to find out that the path they were taking was through a redundant device somewhere else in the network.

Redundancy is a good thing. Until it causes issues. Or until it makes your network behave in such a way as to be unpredictable. Most of the time, this can all be mitigated by good documentation practices. Being able to figure out quickly where the redundant paths in a network are going is critical to diagnosing intermittent failures.

It’s not always as easy as pulling up a routing table, either. If the entire core is down, you could be seeing traffic routing happening at the edge with no way of knowing whether the redundant supervisors in the chassis are doing their job. You need to write everything down and know what hardware you’re dealing with. You need to document redundant power supplies, redundant management modules, and redundant switches so you can isolate problems and fix them without pulling your hair out.
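Even a crude, machine-readable record helps here. As a minimal sketch (the hostnames and fields below are invented for illustration, not any real inventory standard), something as simple as a dictionary per chassis captures what should take over when a component dies:

    # Hypothetical inventory entry: list every redundant component so that during
    # an outage you know what *should* have taken over.
    core_1 = {
        "hostname": "core-1",
        "power_supplies": ["PS-A", "PS-B"],
        "supervisors": {"active": "Sup-1", "standby": "Sup-2"},
        "uplinks": {"primary": "core-2 Te1/1/1", "backup": "edge-1 Te1/1/2"},
    }

    for component, detail in core_1.items():
        print(f"{component}: {detail}")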


Tom’s Take

I rarely got to work with redundant equipment when I was installing it through E-Rate. The government doesn’t believe in buying two things to do the job of one. So, when I did get the opportunity to work with redundant configurations, I usually found myself trying to figure out why things were failing in ways I couldn’t predict. After a while, I realized that I needed to start making my own notes and doing some investigation before I actually started troubleshooting. And even then, like my cable modem’s battery, I ran into issues. Redundancy keeps you from shooting yourself in the foot. But it can also make you stab yourself in the eye in frustration.

How Should We Handle Failure?

I had an interesting conversation this week with Greg Ferro about the network and how we’re constantly proving whether a problem is or is not the fault of the network. I postulated that the network gets blamed when old software has a hiccup. Greg’s response was:

Which led me to think about why we have such a hard time proving the innocence of the network. And I think it’s because we have a problem with applications.

Snappy Apps

Writing applications is hard. I base this on the fact that I am a smart person and I can’t do it. Therefore it must be hard, like quantum mechanics and figuring out how to load the dishwasher. The few people I know that do write applications are very good at turning gibberish into usable radio buttons. But they have a world of issues they have to deal with.

Error handling in applications is a mess at best. When I took C Programming in college, my professor was an actual coder during the day. She told us during the error handling portion of the class that most modern programs are a bunch of statements wrapped in if clauses that jump to some error condition if they fail. All of the power of your phone apps is likely a series of “if this, then that, otherwise this failure condition” statements.
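To picture what she meant, here’s a minimal Python sketch of my own (the function and the error strings are made up for illustration, not from her class):

    # A toy example of the "statements wrapped in if clauses" style of error handling.
    def save_record(db, cert, record):
        if db is None:
            return "error: could not reach the database"        # specific error
        if cert is None or cert.get("expired", True):
            return "error: SSL certificate invalid or expired"  # specific error
        if record is None or "id" not in record:
            return "error: something went wrong"                # the vague catch-all
        db[record["id"]] = record
        return "ok"

    print(save_record({}, {"expired": False}, {"id": 42}))  # ok
    print(save_record({}, None, {"id": 42}))                # certificate error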

Some of these errors are pretty specific. If an application calls a database, it will respond with a database error. If it needs an SSL certificate, there’s probably a certificate specific error message. But what about those errors that aren’t quite as cut-and-dried? What if the error could actually be caused by a lot of different things?

My first real troubleshooting on a computer came courtesy of LucasArts’ X-Wing. I loved the game and played the training missions constantly until I was able to beat them perfectly just a few days after installing it. However, when I started playing the campaign, the program would crash to DOS when I started the second mission with an out-of-memory error message. In the days before the Internet, it took a lot of research to figure out what was going on. I had more than enough RAM. I met all the specs on the side of the box. What I didn’t know and had to learn is that X-Wing required the use of Expanded Memory (EMS) to run properly. Once I decoded enough of the message to find that out, I was able to change things and make the game run properly. But I had to know that the memory X-Wing was complaining about was EMS, not the XMS RAM that I needed for Windows 3.1.

Generic error messages are the bane of the IT professional’s existence. If you don’t believe me, ask yourself how many times you’ve spent hours troubleshooting a network connectivity issue for a server only to find that DNS was misconfigured or down. In fact, Dr. House has a diagnosis for you:

Unintuitive Networking

If the error messages are so vague when it comes to network resources and applications, how are we supposed to figure out what went wrong and how to fix it? I mean, humans have been doing that for years. We take the little bits of information that we get from various sources. Then we compile, cross-reference, and correlate it all until we have enough to make a wild guess about what might be wrong. And when that doesn’t work, we keep trying even more outlandish things until something sticks. That’s what humans do. We fill in the gaps and come up with crazy ideas that occasionally work.

But computers can’t do that. Even the best machine learning algorithms can’t make those leaps beyond the data they’re given. They need precise inputs to find a solution. Think of it like a Google search for a resolution. If you don’t have a specific enough error message to find the problem, you aren’t going to have enough data to provide a specific fix. You will need to do some work on your own to find the real answer.

Intent-based networking does little to fix this right now. All intent-based networking products are designed to create a steady state from a starting point. No matter how cool it looks during the demo or how powerful it claims to be, it will always fall over when it fails. And how easily it falls over is up to the people that programmed it. If the system is a black box with no error reporting capabilities, it’s going to fail spectacularly with no hope of repair beyond an expensive support contract.

It could be argued that intent-based networking is about making networking easier. It’s about setting things up right the first time and making them work in the future. Yet no system in a steady state works forever. With the possible exception of the pitch drop experiment, everything will fail eventually. And if the control system in place doesn’t have the capability to handle the errors being seen and propose a solution, then your fancy provisioning system is worth about as much as a car with no seat belts.

Fixing The Issues

So, how do we fix this mess? Well, the first thing that we need to do is scold the application developers. They’re easy targets and usually the cause behind all of these issues. Instead of giving us a vague error message about network connectivity, we need more messages like lp0 Is On Fire. We need to know what was going on when the function call failed. We need a trace. We need context around the process to be able to figure out why something happened.
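As a rough illustration of the difference (a sketch of my own, not anyone’s production code), compare a bare “network error” with an exception that carries enough context to actually troubleshoot:

    import socket

    def fetch_config(host, port=443, timeout=3):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return "connected"
        except OSError as exc:
            # The lazy version: raise RuntimeError("network error")
            # The useful version: say what failed, where, and why.
            raise RuntimeError(
                f"could not reach {host}:{port} within {timeout}s "
                f"({exc.__class__.__name__}: {exc}); check name resolution and "
                f"any firewall in the path before blaming the network"
            ) from exc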

Now, once we get all these messages, we need a way to correlate them. Is that an NMS or some other kind of platform? Is that an incident response system? I don’t know the answer there, but you and your staff need to know what’s going on. They need to be able to see the problems as they occur and deal with the errors. If it’s DNS (see above), you need to know that fast so you can resolve the problem. If it’s a BGP error or a path problem upstream, you need to know that as soon as possible so you can assess the impact.
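For the DNS case in particular, even a tiny first-pass check can save an hour of chasing phantom path problems. A minimal sketch, assuming all you want to know up front is whether the name resolves at all:

    import socket

    def is_it_dns(hostname):
        # Returns (True, reason) if the name does not resolve, which points the
        # investigation at DNS instead of the network path.
        try:
            return False, socket.gethostbyname(hostname)
        except socket.gaierror as exc:
            return True, str(exc)

    print(is_it_dns("example.com"))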

And lastly, we’ve got to do a better job of documenting these things. We need to take charge of keeping track of what works and what doesn’t, and of error messages and their meanings, as we work on getting app developers to give us better ones. Because no one wants to be in the same boat as DenverCoder9.


Tom’s Take

I made a career out of taking vague error messages and making things work again. And I hated every second of it. Because as soon as I became the Montgomery Scott of the Network, people started coming to me with ever more bizarre messages. Becoming the Rosetta Stone for error messages typed into Google is not how you want to spend your professional career. Yet you need to understand that even the best things fail, and you need to be prepared for it. Don’t spend money building a very pretty facade for your house if it’s built on quicksand. Demand that your orchestration systems have the capabilities to help you diagnose errors when everything goes wrong.

 

Is It Really Always The Network?

[Image: “Keep Calm and Blame the Network”, from Thomas LaRock]

I had a great time over the last month writing a series of posts with my friend John Herbert (@MrTugs) over on the SolarWinds Geek Speak Blog. You can find the first post here. John and I explored the idea that people are always blaming the network for a variety of issues that are often completely unrelated to the actual operation of the network. It was fun writing some narrative prose for once, and the feedback we got was actually pretty awesome. But I wanted to take some time to explain the rationale behind my madness. Why is it that we are always blaming the network?!?

Visibility Is Vital

Think about all the times you’ve been working on an application and things start slowing down. What’s the first thing you think of? If it’s a standalone app, it’s probably some kind of processing lag or memory issues. But if that app connects to any other thing, whether it be a local network or a remote network via the Internet, the first culprit is the connection between systems.

It’s not a large logical leap to make. We have to start by assuming the people that made the application knew what they were doing. If hundreds of other people aren’t having this problem, it must not be the application, right? We’ve already started eliminating the application as the source of the issues even before we start figuring out what went wrong.

People will blame the most visible part of the system for issues. If that’s a standalone system sealed off from the rest of the world, it obviously must be the application. However, we don’t usually build these kinds of walled-off systems any longer. Almost every application in existence today requires a network connection of some kind. Whether it’s to get updates or to interact with other data or people, the application needs to talk to someone or maybe even everyone.

I’ve talked before about the need to make the network more of a “utility”. Part of the reason for this is that it lowers the visibility of the network to the rest of the IT organization. Lower visibility means fewer issues being incorrectly blamed on the network. It also means that the network is going to be doing more to move packets and less to fix broken application issues.

Blame What You Know

If your network isn’t stable to begin with, it will soon become the source of all issues in IT, even if the network has nothing to do with the app. That’s because people tend to identify problem sources based on their own experience. If you are unsure of that, work on a consumer helpdesk sometime and try to keep track of the number of calls blamed on “viruses” even when there’s no indication that a virus is involved. It’s staggering.

The same thing happens in networking and other enterprise IT. People only start troubleshooting problems from areas of known expertise. This usually breaks down by people shouting out solutions like, “I saw this once so it must be that! I mean, the symptoms aren’t similar and the output is totally different, but it must be that because I know how to fix it!”

People get uncomfortable when they are faced with troubleshooting something unknown to them. That’s why they fall back on familiar things. And if they constantly hear how the network is the source of all issues, guess what the first thing to get blamed is going to be?

Network admins and engineers have to fight a constant battle to disprove the network as the source of issues. And for every win they get, it can all come crashing down when the network is actually the problem. Validating the fears of the users is the fastest way to be seen as the issue every time.

Mean Time To Innocence

As John and I wrote the pieces for SolarWinds, what we wanted to show is that a variety of issues can look like network problems up front but have vastly different causes behind the scenes. What I felt was very important for the piece was having the main character, Amanda, go beyond the infamous Mean Time To Innocence (MTTI) metric. In networking, we all too often find ourselves going just far enough to prove that it’s not the network and then leaving it there. As soon as we’re innocent, it’s done.

Cross-functional teams and other DevOps organizations don’t believe in that kind of boundary. Lines between silos blur or are totally absent. That means that instead of giving up once you prove it’s not your problem, you need to work toward fixing what’s at issue. Fix the problem, not the blame. If you concentrate on fixing the problems, it will soon become noticeable that the networking team may not always be the problem. And even when the network is at fault, the network team will work to fix it and any other issues that you see.


Tom’s Take

I love the feedback that John and I have gotten so far on the series we wrote. Some said it feels like a situation they’ve been in before. Others have said that they applaud the way things were handled. I know that the narrative allows us to bypass some of the unsavory things that often happen, like arguments and political posturing inside an organization to make other department heads look bad when a problem strikes. But what we really wanted to show is that the network is usually the first to get blamed and the last to be proven innocent in a situation like this.

We wanted to show that it’s not always the network. And the best way for you to prove that in your own organization is to make sure the network isn’t just innocent, but helpful in solving as many problems as possible.

Repeat After Me


Thanks to Tech Field Day, I fly a lot. As Southwest is my airline of choice and I have status, I tend to find myself sitting in the slightly more comfortable exit row seating. One of the things that any air passenger sitting in the exit row knows by heart is the exit row briefing. You must listen to the flight attendant brief you on the exit door operation and the evacuation plan. You are also required to answer with a verbal acknowledgment.

I know that verbal acknowledgment is a federal requirement. I’ve also seen some people blatantly disregard the need to verbally accept responsibility for their seating choice, leading to some hilarious stories. But it also made me think about why making people talk to you is the best way to make them understand what you’re saying.

Sotto Voce

Today’s society, full of distractions from auto-play videos on Facebook to Pokemon Go parks at midnight, is designed to capture the attention of a human for a few fleeting seconds. Even playing a mini-trailer before a movie trailer is designed to capture someone’s attention for a moment. That’s fine in a world where distraction is assumed and people try to multitask several different streams of information at once.

People are competing for audio attention as well. Pleasant voices are everywhere trying to sell us things. High-volume voices are trying to cut through the din to sell us even more things. People walk around with headphones in most of the day. Those that don’t pollute what’s left with cute videos that sound like they were recorded in an echo chamber.

This has also led to a larger amount of non-verbal behavior being misinterpreted. I’ve experienced this myself on many occasions. People distracted by a song on their phone or thinking about lyrics in their mind may nod or shake their head in rhythm. If you ask them a question just before the “good part” and they don’t hear you clearly, they may appear to agree or disagree even though they don’t know what they just heard.

Even worse is when you ask someone to do something for you and they agree, only to turn around and ask, “What was it you wanted again?” or “Sorry, I didn’t catch that.” It’s become acceptable in society to agree to things without understanding their meaning. This leads to breakdowns in communication and pieces of the job left undone, because you assume someone was going to do something when they agreed, even though they never understood what they were supposed to do.

Fortississimo!

I’ve found that the most effective way to get someone to understand what you’ve told them is to ask them to repeat it back to you in their own words. It may sound a bit silly to hear back what you just told them, but think about the steps that they must go through:

  • They have to stop for a moment and think about what you said.
  • They then have to internalize the concepts so they understand them.
  • They then must repeat back to you those concepts in their own phrasing.

Those three steps mean that the ideas behind what you are stating or asking must be considered for a period of time. It means that the ideas will register and be remembered because they were considered when repeating them back to the speaker.

Think about this in a troubleshooting example. A junior admin is supposed to go down the hall and tell you when a link light comes on for port 38. If the admin just nods and doesn’t seem to be paying attention, ask them to repeat those instructions back. The admin will need to remember that port 38 is the right port and that they need to wait until the link light is on before saying something. It’s only two pieces of information, but it does require thought and timing. By making the admin repeat the instructions, you make sure they have them down right.


Tom’s Take

Think about all the times recently when someone has repeated something back to you. A food order or an amount of money given to you to pay for something. Perhaps it was a long list of items to accomplish for an event or a task. Repetition is important to internalize things. It builds neural pathways that force the information into longer-term memory. That’s why a couple of seconds of repetition are time well invested.

Automating Change With Help From Fibonacci


A few recent conversations that I’ve seen and had with professionals about automation have been very enlightening. It all started with a post on StackExchange about an unsuspecting user that tried to automate a cleanup process with Ansible and accidentally erased the entire server farm at a service provider. The post was later determined to be a viral marketing hoax but was quite believable to the community because of the power of automation to make bad ideas spread very quickly.

Better The Devil You Know

Everyone in networking has been in a place where they’ve typed in something they shouldn’t have. Maybe you removed the management network you were using to access the switch, or created an access list that locked you out of something. Or perhaps you typed an errant debug command that forced you to drive an hour to reboot a switch that was no longer responding. All of these things seem to happen to people as part of the learning process.

But how many times have we typed something in to create a change and found that it broke more than we expected? Like changing a native VLAN on a trunk and bringing down a link we didn’t intend to affect? These unforeseen accidents are the kinds of problems that can easily be magnified by scripts or automation.

I wrote a post for SolarWinds a couple of months ago about people blaming their tools. In it, there was a story about a person that uploaded the wrong switch firmware to a server and used an automated tool to kick off an upgrade of his entire network. Only after the first switch failed to return to normal did he realize that he had downloaded an incorrect firmware image to the server. And the command he used to kick off the upgrade was not the safe version of the command that checks for compatibility. Instead, it was the quick version of the command that copied the firmware directly into flash and rebooted the switch without confirmation.

While people are quick to blame tools for making mistakes race through the network quickly, it should also be recognized that those issues would be mistakes no matter how they were carried out. Just because a system is capable of being automated doesn’t mean that your commands are exempt from being checked and rechecked. Too often a typo or an added word somewhere in the mix causes unintended chaos because we didn’t take the time to make sure there were no problems ahead of time.
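One way to build that checking into the process is a pre-flight script that refuses to kick off the push until the staged image matches what you meant to stage. This is only a sketch with invented filenames and a placeholder checksum, not any vendor’s workflow:

    import hashlib
    import os

    EXPECTED_IMAGE = "switch-os-15.2.bin"   # hypothetical image name
    EXPECTED_SHA256 = "0" * 64              # placeholder; use the vendor-published hash
    MIN_FREE_FLASH_MB = 128

    def preflight(path, free_flash_mb):
        if os.path.basename(path) != EXPECTED_IMAGE:
            return False, "wrong image staged on the server"
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest != EXPECTED_SHA256:
            return False, "checksum mismatch; do not push this image"
        if free_flash_mb < MIN_FREE_FLASH_MB:
            return False, "not enough free flash for a safe install"
        return True, "ok to schedule the upgrade on the guinea pig device"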

Fibonacci Style

I’ve always tried to do testing and regression in a controlled manner. For places that have simulators and test networks to try out changes, this method might still work. But for those of us that tend to fly by the seat of our pants on production devices, it’s best to artificially limit the damage before it becomes too great. Note that this method works with automation systems too, provided you are controlling the logic behind them.

When you go out to test a network-wide change or perform an upgrade, pick one device as your guinea pig. It shouldn’t be something pushing massive production traffic like a core switch. Something isolated in the corner of the network usually works just fine. Test your change there outside production hours and with a fully documented backout plan. Once you’ve implemented it, don’t touch anything else for at least a day. This gives the routing tables time to settle down completely and all of the aging timers a chance to expire and tables to recompile. Only then can you be sure that you’re not dealing with cached information.

Now that you’ve done it once, you’re ready to make it live to 10,000 devices, right? Absolutely not. Now that you’ve proven that the change doesn’t cause your system to implode and take the network with it, you pick another single device and do it again. This time, you either pick a neighbor device to the first one or something on the other side of the network. The other side of the network ensures that changes don’t ripple across between devices over the 24-hour watch period. On the other hand, if the change involves direct connectivity changes between two devices, you should test them together to be sure that the links stay up. It’s much easier to recover one failed device than 40 or 400.

Once you follow the same procedure with the second upgrade and get clean results, it’s time to move to doing two devices at once. If you have a fancy automation system like Ansible or Puppet this is where you will be determining that the system is capable of handling two devices at once. If you’re still using scripts, this is where you make sure you’re pasting the right info into each window. Some networks don’t like two devices changing information at the same time. Your routing table shouldn’t be so unstable that a change like this would cause problems but you never know. You will know when you’re done with this.

Now that you’ve proven that you can make changes without cratering a switch, a link, or the entire network all at once, you can continue. Move on to three devices, then five, then eight. You’ll notice that these rollout plans are following the Fibonacci sequence. This is no accident. Just like the appearance of these numbers in nature and math, having a Fibonacci rollout plan helps you evaluate changes and roll back problems before they grow out of hand. Just because you have the power to change the entire network at once doesn’t mean that you should.
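If you want to hand that plan to a script or an automation tool, the batching logic is tiny. Here’s a minimal sketch (the device names and the “upgrade” step are placeholders for whatever your tooling actually does):

    def fibonacci_batches(devices):
        # Yield waves of devices sized 1, 1, 2, 3, 5, 8, ... until none remain.
        a, b = 1, 1
        remaining = list(devices)
        while remaining:
            yield remaining[:a]
            remaining = remaining[a:]
            a, b = b, a + b

    devices = [f"switch-{n:02d}" for n in range(1, 21)]
    for wave, batch in enumerate(fibonacci_batches(devices), start=1):
        print(f"Wave {wave}: upgrade {batch}, then watch for 24 hours before continuing")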


Tom’s Take

Automation isn’t the bad guy here. We are. We are fallible creatures that make mistakes. Before, those mistakes were limited to how fast we could log into a switch. Today, computers allow us to make those mistakes much faster and on a larger scale. The long-term solution to the problem is to test every change ahead of time completely and hope that your test catches all the problems. If it doesn’t, then you had better hope your backout plan works. But by introducing a rolling system similar to the Fibonacci sequence above, I think you’ll find that mistakes will be caught sooner and problems will be rectified before they are amplified. And if nothing else, you’ll have lots of time to explain to your junior admin all about the wonders of Fibonacci in nature.

Network Firefighters or Fire Marshals?


Throughout my career as a network engineer, I’ve heard lots of comparisons to emergency responders thrown around to describe what the networking team does. Sometimes we’re the network police that bust offenders of bandwidth policies. Other times we’re the Network SWAT Team that fixes things that get broken when no one else can get the job done. But over and over again I hear network admins and engineers called “firefighters”. I think it’s time to change how we look at the job of fighting fires on the network.

Fight The Network

The president of my old company used to try to motivate us to think beyond our current job roles by saying, “We need to stop being firefighters.” It was absolutely true. However, the sentiment lacked some of the important details of what exactly a modern network professional actually does.

Think about your job. You spend most of your time implementing change requests and trying to fix things that don’t go according to plan. Or figuring out why a change six months ago suddenly decided today to create a routing loop. And every problem you encounter is a huge one that requires an “all hands on deck” mentality to fix it immediately.

People look at the network as a closed system that should never break or require maintenance of any kind, right up until the moment it breaks. There’s a reason why a popular job description (and Twitter handle) involves describing your networking job as a plumber or janitor. We’re the response personnel that get called out on holidays to fix a problem that creates a mess for other people that don’t know how to fix it.

And so we come to the firefighter analogy: the idea of a group of highly trained individuals sitting around the NOC (or firehouse) waiting for the alarm to go off. We scramble to get the problem fixed and prevent disaster. Then we go back to our NOC, clean up, and get ready to do it all over again. People think about networking professionals this way because they only really talk to us when there’s a crisis that we need to deal with.

The catch with networking is that we’re very rarely sitting around playing ping pong or cleaning our gear. In the networking world we find ourselves being pushed from one crisis to the next. A change here creates a problem there. A new application needs changes to run. An old device creates a catastrophe when a new one is brought online. Someone did something without approval that prevented the CEO from watching YouTube. The list is long and distinguished. But it all requires us to fix things at the end of the day.

The problem isn’t with the fire fighting mentality. Our society wouldn’t exist without firefighters. We still have to have protection against disaster. So how is it that we can have an ordered society with relatively few firefighters versus the kinds of chaos we see in networking?

Marshal, Marshal, Marshal

The key missing piece is the fire marshal. Firefighters put out fires. Fire marshals prevent them from happening in the first place. They do this by enforcing the local fire codes and investigating what caused a fire in the first place.

The most visible reminder of the job of a fire marshal is found in any bar or meeting room. There is almost always a sign by the door stating the maximum occupancy according to the local fire code. Occupancy limits prevent larger disasters should a fire occur. Yes, it’s a bit of a bummer when you are told that no one else can enter the bar until some people leave. But the fire marshal would rather deal with unhappy patrons than with injuries or deaths in case of fire.

Other duties of the fire marshal include investigating the root cause of a fire and trying to find out how to prevent it in the future. Was it deliberate? Did it involve a faulty piece of equipment? These are all important questions to ask in order to find out how to keep things from happening all over again.

Let’s apply these ideas to the image of network professionals as firefighters. Instead of spending the majority of time dealing with the fallout from bad decisions there should be someone designated to enforce network policy and explain why those policies exist. Rather than caving to the whims of application developers that believe their program needs elevated QoS treatment, the network fire marshal can explain why QoS is tiered the way that it is and why including this new app in the wrong tier will create chaos down the road.

How about designating a network fire marshal to figure out why there was a routing table meltdown? Too often we find ourselves fixing the problem and not caring about the root cause so long as we can reassure ourselves that the problem won’t come back. A network fire marshal can do a post mortem investigation and find out which change caused the issue and how to prevent it in the future with proper documentation or policy changes.

It sounds simple and straightforward to many folks. Yet these are the things that tend to be neglected when the fires flare up and time becomes a premium resource. By having someone to fill the role of investigator and educator, we can only hope that network fires can be prevented before they start.


Tom’s Take

Networking can’t evolve if we spend the majority of our time dealing with disasters we could have prevented with even a few minutes of education or caution. SDN seeks to prevent us from shooting off our own foot. But we can also help by looking to the fire marshal as an example of how to prevent these network fires before they start. With some education and a bit of foresight, we can reduce our workload of disaster response through disaster prevention. Imagine how much more time we would have to sit around the NOC and play ping pong.