In Defense of Support

We’re all in IT. We’ve done our time in the trenches. We’ve…seen things, as Roy Batty might say. Things you wouldn’t believe. But in the end we all know the pain of trying to get support for something that we’re working on. And we know how painful that whole process can be. Yet, how is it that support is universally “bad” in our eyes?

One Of Us

Before we launch into this discussion, I’ll give you a bit of background on me. I did inbound tech support for Gateway Computers for about six months at the start of my career. So I wasn’t supporting enterprises to start with but I’ve been about as far down in the trenches as you can go. And that taught me a lot about the landscape of support.

The first thing you have to realize is that most Tier 1 support people are, in fact, not IT nerds. They don’t have a degree in troubleshooting OSPF, nor are they signatories to the Fibre Channel standards. They are generally regular people. They get a week or two of training and off they go. In general, the people on the other end of the support phone number are better phone people than they are technical people. The trainers at my old job told me, “We can teach someone to be technical. But it’s hard to find someone who is pleasant on the phone.”

I don’t necessarily agree with that statement but it seems to be the way that most companies staff their support lines. Why? Ask yourself a quick question: How many times has Tier 1 support solved your issue? For most of us the answer is probably “never” or “extremely rarely”. That’s because we, as IT professionals, have spent a large amount of our time and energy doing the job of a Tier 1 troubleshooter already. We pull device logs and look at errors and Google every message we can find until we hit a roadblock. Then it’s time to call. By that point, though, we’re already well past what Tier 1 can offer.

Look at it from the perspective of the company offering the support. If the majority of people calling me are already past the point of where my Tier 1 people can help them, why should I invest in those people? If all they’re going to do is take information down and relay the call to Tier 2 support how much knowledge do they really need? Why hire a rock star for a level of support that will never solve a problem?

In fact, this is basically the job of Tier 1 support. If it’s not a known issue, like an outage or a massive bug that is plastered up everywhere, they’re going to collect the basic information and get ready to pass the call to the next tier of support. Yes, it sucks to have to confirm serial numbers and warranty status with someone that knows less about VLANs than you do. But if you were in the shoes of the Tier 2 technician would you want to waste 10 minutes of the call verifying all that info yourself? Or would you rather put your talent to use to get the problem solved?

Think about all the channel partners and certification benefits that let you bypass Tier 1 support. They’re all focused on the idea that you know enough about the product to get to the next level to help you troubleshoot. But they’re also implying that you know the customer’s device is in warranty and that you need to pull log files and other data before you open a ticket so there’s no wasted time. But you still need to take the time to get the information to the right people, right? Would you want to start troubleshooting a problem with no log files and only a very basic description of the issue? And consider that half the people out there just put “Doesn’t Work” or “Acting Weird” for the problem instead of something more specific.

Sight Beyond Sight

This whole mess of gathering info ahead of time is one of the reasons why I’m excited about the coming of network telemetry and network automation. That’s because you now have access to all the data you need to send along to support and you don’t need to remember to collect it. If you’ve ever logged into a switch and run the show tech-support command you probably recall with horror the console spam that was spit out for minutes upon minutes. Worse yet, you probably remember that you forgot to redirect that spam to a file on the system to capture it all to send along with your ticket request.
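If you do need that output today, there are ways to make sure it lands in a file instead of scrolling past you. Here’s a minimal sketch of the idea, not any vendor’s official tooling: pull the command over SSH and write it straight to a timestamped file. It assumes the Netmiko library is installed, and the address, credentials, and timeout value are all placeholders.

from datetime import datetime
from netmiko import ConnectHandler

device = {
    "device_type": "cisco_ios",
    "host": "192.0.2.10",        # placeholder management address
    "username": "admin",
    "password": "changeme",
}

with ConnectHandler(**device) as conn:
    # show tech-support can run for several minutes, so allow a long read.
    # (read_timeout is a Netmiko 4.x argument; older releases use delay_factor.)
    output = conn.send_command("show tech-support", read_timeout=900)

filename = f"tech-support-{datetime.now():%Y%m%d-%H%M%S}.txt"
with open(filename, "w") as f:
    f.write(output)
print(f"Saved {len(output)} bytes to {filename}")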

Commands like show tech-support are the kinds of things that network telemetry is going to solve. They’re all-in-one monsters that provide way too much data to Tier 2 support techs. If the data they need is in there somewhere it’s better to capture it all and spend hours poring through a text file than talk to the customer about the specific issue, right?

Now, imagine the alternative. A system that notices a ticket has been created and pulls the appropriate log files and telemetry. The system adds the important data to the ticket and gets it ready to send to the vendor. Now, instead of needing to provide all the data to the Tier 1 folks you can instead open a ticket at the right level to get someone working on the problem. Mean Time to Resolution goes down and everyone is happy. Yes, that does mean that there are going to be fewer Tier 1 techs taking calls from people that don’t have a way to collect the data ahead of time, but that’s the issue with automating a task away from someone. Truth be told, they would be able to find a different job doing the same thing in a different industry. Or they would see the value of learning how to troubleshoot and move up the ladder to Tier 2.
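As a rough illustration of that hand-off, here’s a hypothetical sketch of the shape of it: the ticket gets opened, the diagnostics get gathered and attached automatically, and the request lands at the right tier with data already in hand. The collect_diagnostics() helper and the ticketing endpoint are stand-ins, not any real vendor API.

import requests

TICKET_API = "https://ticketing.example.com/api/tickets"  # hypothetical endpoint

def collect_diagnostics(host: str) -> dict:
    # Stand-in for whatever log and telemetry collection you already run.
    return {
        "host": host,
        "logs": "show logging output would go here",
        "tech_support": "show tech-support output would go here",
    }

def open_enriched_ticket(host: str, summary: str) -> str:
    payload = {
        "summary": summary,
        "severity": "P2",
        "attachments": collect_diagnostics(host),
    }
    resp = requests.post(TICKET_API, json=payload)
    resp.raise_for_status()
    return resp.json().get("ticket_id", "unknown")

print(open_enriched_ticket("192.0.2.10", "Intermittent packet loss at the branch office"))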

However, having telemetry gathered automatically still isn’t enough for some. The future is removing the people from the chain completely. If the system can gather telemetry and open tickets with the vendor what’s stopping it from doing the same proactively? We already have systems designed to do things like mail hard disk drives to customers when the storage array is degraded and a drive needs to be replaced. Why can’t we extend things like that to other parts of IT? What if Tier 2 support called you for a change and said, “Hey, we noticed your routing protocol has a big convergence issue in the Wyoming office. We think we have a fix so let’s remote in and fix it together.” Imagine that!


Tom’s Take

The future isn’t about making jobs go away. It’s about making everything more efficient. I don’t want to see Tier 1 support people lose their jobs, but I also know that most people already consider those jobs pointless. No one wants to deal with the info takers and call routers. So we need to find a way to get the right information to the people that need it with a minimum of effort. That’s how you make support work for the IT people again. And when that day comes and technology has caught up to the way that we use support, perhaps we won’t need to defend them anymore.

It’s A Wireless Problem, Right?

How many times have your users come to your office and told you the wireless was down? Or maybe you get a phone call or a text message sent from their phone. If there’s a way for people to figure out that the wireless isn’t working they will not hesitate to tell you about it. But is it always the wireless?

Path of Destruction

During CWNP Wi-Fi Trek 2019, Keith Parsons (@KeithRParsons) gave a great talk about Tips, Techniques, and Tools for Troubleshooting Wireless LAN. It went into a lot of detail about how many things you have to look at when you start troubleshooting wireless issues. It makes your head spin when you try and figure out exactly where the issues all lie.

However, I did have to bring up one point where I didn’t necessarily agree with Keith.

I spent a lot of time in the past working with irate customers in schools. And you’d better believe that every time there was an issue it was the network’s fault. Or the wireless. Or something. But no matter what the issue ended up being, someone always made sure to remind me the next time, “You know, the wireless is flaky.”

I spent a lot of time trying to educate users about the differences between signal strength, throughput, Internet uplinks, application performance (in the days before cloud!), and all the various things that could go wrong along the path between the user’s workstation and wherever the data they were trying to access happened to live. Wanna know how many times it really worked?

Very few.

Because your users don’t really care. It’s not all that dissimilar to the utility companies that you rely on in your daily life. If there is a water main break halfway across town that reduces your water pressure or causes you to not have water at all, you likely don’t spend your time troubleshooting the water system. Instead, you complain that the water is out and you move on with your day until it somehow magically gets fixed. Because you don’t have any visibility into the way the water system works. And, honestly, even if you did it might not make that much difference.

A Little Q&A

But educating your users about issues like this isn’t a lost cause. Because instead of trying to walk them through every possible problem that you could be dealing with, you need to help them start to understand the initial steps of the troubleshooting process. Honestly, if you can train them to answer the following two questions you’ll likely get a head start on the process and make them very happy.

  1. When Did The Problem Start? One of the biggest issues with troubleshooting is figuring out when the problem started in the first place. How many times have you started troubleshooting something only to learn that the real issue has been going on for days or weeks and they don’t really remember when it started in the first place? Especially with wireless you have to know when things started. Because of the ephemeral nature of things like RSSI, your issue could be caused by something crazy that you have no idea about. So you have to get your users to start writing down when things are going wrong. That way you have a place to start.
  2. What Were You Doing When It Happened? This one usually takes a little more coaching. People don’t think about what they were doing when a problem happened, beyond whatever is keeping them from getting anything done. Rarely do they even think that something they did caused the issue. And if they do, they’re going to try and hide it from you every time. So you need to get them to start detailing what was going on. Maybe they were up walking around and roamed to a different AP in a corner of the office. Maybe they were trying to work from the break room during lunch and the microwave was giving them fits. You have to figure out what they were doing as well as what they were trying to accomplish before you can really be sure that you are on the right track to solving a problem.

When you think about the insane number of things that you can start troubleshooting in any scenario, you might just be tempted to give up. But, as I mentioned almost a decade ago, you have to start isolating factors before you can really fix issues. As Keith mentioned in his talk, it’s way too easy to pick a rabbit hole issue and start troubleshooting to make it fit instead of actually fixing what’s ailing the user. Just like the scientific method, you have to make your conclusions fit the data instead of making the data fit your hypothesis. If you are dead set on moving an AP or turning off a feature you’re going to be disappointed when that doesn’t actually fix what the users are trying to tell you is wrong.


Tom’s Take

Troubleshooting is the magic that makes technical people look like wizards to regular folks. But it’s not because we have magical powers. Instead, it’s because we know instinctively what to start looking for and what to discard when it becomes apparent that we need to move on to a new issue. That’s what really separates the best of the best: being able to focus on the issue and not the next solution that may not work. Keith Parsons is one such wizard. Having him speak at Wi-Fi Trek 2019 was a huge success and should really help people understand how to look at these issues and keep them on the right troubleshooting track.

IT And The Exception Mentality

If you work in IT, you probably have a lot in common with other IT people. You work long hours. You have the attitude that every problem can be fixed. You understand technology well enough to know how processes and systems work. It’s fairly common in our line of work because the best IT people tend to think logically and want to solve issues. But there’s something else that I see a lot in IT people. We tend to focus on the exceptions to the rules.

Odd Thing Out

A perfectly good example of this is automation. We’ve slowly been building toward a future where software and scripting do the menial work of network administration and engineering. We’ve invested dollars and hours into building interfaces into systems that allow us to repeat tasks over and over again without intervention. We see it in other areas, like paperwork processing and auto manufacturing. There are those in IT, especially in networking, that resist that change.

If you pin them down on it, sometimes the answers are cut and dried. Loss of job, immaturity of software, and even unfamiliarity with programming are common replies. However, I’ve also heard a more common response growing from people: What happens when the automation screws up? What happens when a system accidentally upgrades things when it shouldn’t? Or a system disappears because it was left out of the scripts?

The exception position is a very interesting one to me. In part, it stems from the fact that IT people are problem solvers. This is especially true if you’ve ever worked in support or troubleshooting. You only see things that don’t work correctly. Whether it’s software misconfiguration, hardware failure, or cosmic rays, there is always something that is acting screwy and it’s your job to fix it. So the systems that you see on a regular basis aren’t working right. They aren’t following the perfect order that they should be. Those issues are the exceptions.

And when you take someone that sees the exceptions all day long and tries to fix them, you start looking for the exceptions everywhere in everything. Instead of wondering how a queue moves properly at the grocery store you instead start looking at why people aren’t putting things on the conveyor belt properly. You start asking why someone brought 17 items into the 15-items-or-less line. You see all the problems. And you start trying to solve the issues instead of letting the process work.

Let’s step back to our automation example. What happens when you put the wrong upgrade image on a device? Was there a control in place to prevent you from doing that? Did you go around it because you didn’t think it was the right way to do something? I’ve heard stories of the upgrade process not validating images because it couldn’t handle the errors if the image didn’t match the platform. So the process relies on a person to validate the image is correct in the first place. And that much human interaction in a system will still cause issues. Because people make mistakes.
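A guardrail for that particular mistake doesn’t have to be complicated. The sketch below shows the idea under a few stated assumptions: the expected checksum, the platform string, and the file naming convention are all invented for illustration. The point is simply that the process refuses to stage an image unless it matches what the vendor published and what the device expects.

import hashlib
import sys

EXPECTED_MD5 = "0123456789abcdef0123456789abcdef"  # value from the vendor download page
EXPECTED_PLATFORM = "c9300"                        # device family being upgraded

def validate_image(path: str) -> None:
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    if md5.hexdigest() != EXPECTED_MD5:
        sys.exit(f"Checksum mismatch for {path}; refusing to stage this image.")
    if EXPECTED_PLATFORM not in path.lower():
        sys.exit(f"{path} does not look like a {EXPECTED_PLATFORM} image; aborting.")
    print(f"{path} passed validation and is safe to stage.")

if __name__ == "__main__":
    validate_image(sys.argv[1])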

Outlook on Failure

But whether or not a script is actually going to make an error doesn’t change the way we look at it and worry. For IT people that are too focused on the details, the errors are the important part. They tend to forget the process. They don’t see the 99% of devices that were properly upgraded by the program. Instead, they only focus on the 1% that weren’t. They’re only interested in having their suspicions proven right. Because they only see failed systems, they need the validation that something didn’t go right.

This isn’t to say that focusing on the details is a bad thing. It’s almost necessary for a troubleshooter. You can’t figure out where a process breaks if you don’t understand every piece of it. But IT people need to understand the big picture too. We’re not automating a process because we want to create exceptions. We’re doing it because we want to reduce stress and prevent common errors. That doesn’t mean that all errors are going to go away overnight. But it does mean that we can prevent common ones from occurring.

That also means that the need for expert IT people to handle the exceptions is even more important. Because if we can handle the easy problems, like VLAN typos or putting in the wrong subnet range, you’d better believe that the real problems are going to take a really smart person to figure out! That means the real value of an expert troubleshooter is using the details to figure out those exceptions. So instead of worrying about why something might go wrong, you can instead turn your attention to figuring out this new puzzle.


Tom’s Take

I know how hard it is to avoid focusing on exceptions because I’m one of the people that is constantly looking for them. What happens if this redistribution crashes everything? What happens if this switch isn’t ready for the upgrade? What happens if this one iPad can’t connect to the wireless? But I realize that the Brave New World of IT automation needs people that can focus on the exceptions. However, we need to focus on them after they happen instead of worrying about them before they occur.

Outing Your Outages

How are you supposed to handle outages? What happens when everything around you goes upside down in an instant? How much communication is “too much”? Or “not enough”? And is all of this written down ahead of time instead of being figured out when the world is on fire?

Team Players

You might have noticed this week that Webex Teams spent most of the week down. Hard. Well, you might have noticed if you used Microsoft Teams, Slack, or any other messaging service that wasn’t offline. Webex Teams went offline about 8:00pm EDT Monday night. At first, most people just thought it was a momentary outage and things would be back up. However, as the hours wore on and Cisco started updating the incident page with more info it soon became apparent that Teams was not coming back soon. In fact, it took until Thursday for most of the functions to be restored from whatever knocked them offline.

What happened? Well, most companies don’t like to admit what exactly went wrong. For every CloudFlare or other provider that posts full outage disclosures on their site, there are many more companies that will eventually release a statement with the least amount of technical detail possible to avoid any embarrassment. Cisco is currently in the latter category, with most guesses landing on some sort of errant patch that mucked things up big time behind the scenes.

It’s easy to see when big services go offline. If Netflix or Facebook are down hard then it can impact the way we go about our lives. On the occasions when our work tools like Slack or Google Docs are inoperable it impacts our productivity more than our personal lives. But each and every outage does have some lessons that we can take away and learn for our own IT infrastructure or software operations. Don’t think that companies that big, with redundancy everywhere, can’t be affected by outages regularly.

Stepping Through The Minefield

How do you handle your own outage? Well, sometimes it does involve eating some humble pie.

  1. Communicate – This one is both easy and hard. You need to tell people what’s up. You need to let everyone know things aren’t working right and that you’re working to make them right. Sometimes that means telling people exactly what’s affected. Maybe you can log into Facebook but not Chat or Messages. Tell people what they’re going to see. If you don’t communicate, you’re going to have people guessing. That’s not good.
  2. Triage – Figure out what’s wrong. Make educated guesses if nothing stands out. Usually, the culprits are big changes that were just made or perhaps there is something external that is affecting your performance. The key is to isolate and get things back as soon as possible. That’s why big upgrades always have a backout plan. In the event that things go sideways, you need to get back to functional as soon as you can. New features that are offline aren’t as good as tried-and-true stuff that’s reachable.
  3. Honest Post-Mortem – This is the hardest part. Once you have things back in place, you have to figure out why things broke. This is usually where people start running for the hills and getting evasive. Did someone apply a patch at the wrong time? Did a microcode update get loaded to the wrong system? How can this be prevented in the future? The answers to these questions are often hard to get because the people that were affected and innocent often want to find the guilty parties and blame someone so they don’t look suspect. The guilty parties want to avoid blame and hide in public with the rest of the team. You won’t be able to get to the bottom of things unless you find out what went wrong and correct it. If it’s a process, fix it. If it’s a person, help them. If it’s a strange confluence of unrelated events that created the perfect storm, make sure that can never happen again.
  4. Communicate (Again) – This is usually where things fall over for most companies. Even the best ones get really good at figuring out how to prevent problems. However, most of them rarely tell anyone else what happened. They hide it all and hope that no one ever asks about anything. Yet, transparency is key in today’s world. Services that bounce up and down for no reason are seen as unstable. Communicating about why they were unstable is the only way you can make sure that people have faith that they’re going to stay available. Once you’ve figured out what went wrong and who did it, you need to tell someone what happened. Because the alternative is going to be second guessing and theories that don’t help anyone.

Tom’s Take

I don’t envy the people at Cisco that spent their entire week working to get Webex Teams back up and running. I do appreciate their work. But I want to figure out where they went wrong. I want to learn. I want to say to myself, “Never do that thing that they did.” Or maybe it’s a strange situation that can be avoided down the road. The key is communication. We have to know what happened and how to avoid it. That’s the real learning experience when failure comes around. Not the fix, but the future of never letting it happen again.

Tale From The Trenches: The Debug Of Damocles

My good friend and colleague Rich Stroffolino (@MrAnthropology) is collecting Tales from the Trenches about times when we did things that we didn’t expect to cause problems. I wanted to share one of my own here about the time I knocked a school offline with a debug command.

I Got Your Number

The setup for this is pretty simple. I was deploying a CallManager setup for a multi-site school system deployment. I was using local gateways at every site to hook up fax lines and fire alarms with FXS/FXO ports for those systems to dial out. Everything else got backhauled to a voice gateway at the high school with a PRI running MGCP.

I was trying to figure out why the station IDs that were being sent by the sites weren’t going out over caller ID. Everything was showing up as the high school number. I needed to figure out what was being sent. I was at the middle school location across town and trying to debug via telnet. I logged into the router and figured I would make a change, dial my cell phone from the VoIP phone next to me, and see what happened. Simple troubleshooting, right?

I did just that. My cell phone got the wrong caller ID. And my debug command didn’t show me anything. So I figured I must be debugging the wrong thing. Not debug isdn q931 after all. Maybe it’s a problem with MGCP. I’ll check that. But I just need to make sure I’m getting everything. I’ll just debug it all.

debug mgcp packet detail

Can You Hear Me Now?

Veterans of voice are probably screaming at me right now. For those who don’t know, debug anything detail generates a ton of messages. And I’m not consoled into the router. I’m remote. And I didn’t realize how big of a problem that was until my telnet window started scrolling at 100 miles an hour. And then froze.

Turns out, when you overwhelm a router CPU with debug messages, it shuts off the telnet window. It shuts off the console as well, but I wouldn’t have known that because I was far away from that port. But I did start hearing people down the hall saying, “Hello? Hey, are you still there? Weird, it just went dead.”

Guess what else a router isn’t doing when it’s processing a tidal wave of debug messages? It’s not processing calls. At all. For five school sites. I looked down at my watch. It was 2:00pm. That’s bad. Elementary schools get a ton of phone calls within the last hour of being in session. Parents calling to tell kids to wait in a pickup line or ride a certain bus home. Parents wanting to check kids out early. All kinds of things. That need phones.

I raced out of my back room. I acknowledged the receptionist’s comment about the phones not working. I jumped in my car and raced across town to the high school. I managed not to break any speed limits, but I also didn’t loiter one bit. I jumped out of my car and raced into the building. The look on my face must have warded off any comments about phone system issues because no one stopped me before I got to the physical location of the voice gateway.

I knew things were bad. I didn’t have time to console in and remove the debug command. I did what every good CCIE has been taught to do since the beginning of time when they need to remove a bad configuration that broke their entire lab.

I pulled the power cable and cycled the whole thing.

I was already neck deep in it. It would have taken me at least five minutes to get my laptop ready and consoled in. In hindsight, that would have been five wasted minutes since the MGCP debugger would have locked out the console anyway. As the router was coming back up, I nervously looked at the terminal screen for a login prompt. Now that the debugger wasn’t running, everything looked normal. I waited impatiently for the MGCP process to register with CallManager once more.

I kept repeating the same status CLI command while I refreshed the gateway page in CallManager over and over. After a few more tense minutes, everything was back to normal. I picked up a phone next to the rack and dialed my cell phone. It rang. I was happy. I walked back to the main high school office and told them that everything was back to normal.

Tom’s Take

My post-mortem was simple. I did dumb things. I shouldn’t have debugged remotely. I shouldn’t have used the detail keyword for something so simple. In fact, watching my screen fill up with five sites worth of phone calls in a fraction of a second told me there was too much going on behind the scenes for me to comprehend anyway.

That was the last time I ever debugged anything in detail. I made sure from that point forward to start out small and then build from there to find my answers. I also made sure that I did all my debugging from the console and not a remote access window. And the next couple of times I did it were always outside of production hours with a hand ready to yank the power cable just in case.
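For what it’s worth, that “start small” habit can even be scripted so the guardrails are built in. This is only a sketch under a few assumptions (placeholder device details, a login that can reach enable mode): keep debug output in the logging buffer instead of the console or a vty session, run a narrow debug, then shut it off and read the buffer afterward.

import time
from netmiko import ConnectHandler

device = {
    "device_type": "cisco_ios",
    "host": "192.0.2.20",      # placeholder voice gateway address
    "username": "admin",
    "password": "changeme",
    "secret": "changeme",
}

with ConnectHandler(**device) as conn:
    conn.enable()
    # Keep debug output out of the console and vty sessions; buffer it instead.
    conn.send_config_set([
        "no logging console",
        "no logging monitor",
        "logging buffered 512000 debugging",
    ])
    conn.send_command("debug isdn q931")   # a narrow debug, not "packet detail"
    time.sleep(30)                         # place the test call during this window
    conn.send_command("undebug all")       # always turn debugging back off
    capture = conn.send_command("show logging")

with open("q931-debug.txt", "w") as f:
    f.write(capture)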

I didn’t save the day. At best, all I did was cover for my mistake. If it had been a support call center or a hospital I probably would have been fired for it. I made a bad decision and managed to get back to operational without costing money or safety.

Remember when you’re doing your job that you need to keep an eye on how your job will affect everything else. We don’t troubleshoot in a vacuum after all.

Is Patching And Tech Support Bad? A Response

Hopefully, you’ve had a chance to watch this 7-minute video from Greg Ferro about why better patching systems can lead to insecure software. If you haven’t, you should.

Greg is right that moral hazard is introduced because, by definition, the party providing the software is “insured” against the risks of the party using the software. But, I also have a couple of issues with some of the things he said about tech support.

Are You Ready For The Enterprise

I’ve been working with some Ubiquiti access points recently. So far, I really enjoy them and I’m interested to see where their product is going. After doing some research, the most common issue with them seems to be their tech support offerings. A couple of Reddit users even posted in a thread that the lack of true enterprise tech support is the key that is keeping Ubiquiti from reaching real enterprise status.

Think about all the products that you’ve used over the last couple of years that offered some other kind of support aside from phone or rapid response. Maybe it was a chat window on the site. Maybe it was an asynchronous email system. Hell, if you’ve ever installed Linux in the past on your own you know the joys of heading to Google to research an issue and start digging into post after post on a long-abandoned discussion forum. DenverCoder9 would agree.

By Greg’s definition, tech support creates negative incentive to ship good code because the support personnel are there to pick up the pieces for all your mistakes when writing code. It even incentivizes people to set unrealistic goals and push code out the door to hit milestones. On the flip side, try telling people your software is so good that you don’t need tech support. It will run great on anything every time and never mess up once. If you don’t get laughed out of the building you should be in sales.

IT professionals are conditioned to need tech support. Even if they never call the number once they’re going to want the option. In the VAR world, this is the “throat to choke” mentality. When the CEO is coming down to find out why the network is down or his applications aren’t working, they are usually not interested in the solution. They want to yell at someone to feel like they’ve managed the problem. Likewise, tech support exists to find the cause of the problem in the long run. But it’s also a safety blanket that exists for IT professionals to have someone to yell at after the CEO gets to them. Tech Support is necessary. Not because of buggy code, but because of peace of mind.

How Complex Are You?

Greg’s other negative aspect to tech support is that it encourages companies to create solutions more complex than they need to be because someone will always be there to help you through them. Having done nationwide tech support for a long time, I can tell you that most of the time that complexity is there for a reason. Either you hide it behind a pretty veneer or you risk having people tell you that your product is “too simple”.

A good example case here is Meraki. Meraki is a well-known SMB/SME networking solution. Many businesses run on Meraki and do it well. Consultants sell Meraki by the boatload. You know who doesn’t like Meraki? Tinkerers. People that like to understand systems. The people that stay logged into configuration dashboards all day long.

Meraki made the decision when they started building their solution that they weren’t going to have a way to expose advanced functionality in the system to the tinkerers. They either buried it and set it to do the job it was designed to do or they left it out completely. That works for the majority of their customer base. But for those that feel they need to be in more control of the solution, it doesn’t work at all. Try proposing a Meraki solution to a group of die-hard traditionalist IT professionals in a large enterprise scenario. Odds are good you’re going to get slammed and then shown out of the building.

I too have my issues with Meraki and their ability to be a large enterprise solution. It comes down to choosing to leave out features for the sake of simplicity. When I was testing my Ubiquiti solution, I quickly found the Advanced Configuration setting and enabled it. I knew I would never need to touch any of those things, yet it made me feel more comfortable knowing I could if I needed to.

Making a solution overly complex isn’t a design decision. Sometimes the technology dictates that the solution needs to be complex. Yet, we knock solutions that require more than 5 minutes to deploy. We laud companies that hide complexity and claim their solutions are simple and easy. And when they all break we get angry because, as tinkerers, we can’t go in and do our best to fix them.


Tom’s Take

In a world where software is king, quality matters. But so do speed and reaction. You can either have a robust patching system that gives you the ability to find and repair unforeseen bugs quickly or you can have what I call the “Blizzard Model”. Blizzard is a gaming company that has been infamous for shipping things “when they’re done” and not a moment before. If that means a project is 2-3 years late then so be it. But, when you’re a game company and this is your line of work you want it to be perfect out of the box and you’re willing to forego revenue to get it right. When you’re a software company that is expected to introduce new features on a regular cadence and keep the customers happy it’s a bit more difficult.

Perhaps creating a patching system incentivizes people to do bad things. Or, perhaps it incentivizes them to try new things and figure out new ideas without the fear that one misstep will doom the product. I guess it comes down to whether or not you believe that people are inherently going to game the system or not.

When Redundancy Strikes

Networking and systems professionals preach the value of redundancy. When we tell people to buy something, we really mean “buy two”. And when we say to buy two, we really mean buy four of them. We try to create backup routes, redundant failover paths, and we keep things from being used in a way that creates a single point of disaster. But, what happens when something we’ve worked hard to set up causes us grief?

Built To Survive

The first problem I ran into was one I knew how to solve. I was installing a new Ubiquiti Security Gateway. I knew that as soon as I pulled my old edge router out, I was going to need to reset my cable modem in order to clear the ARP cache. That’s always a thing that needs to happen when you’re installing new equipment. Having done this many times, I knew the shortcut method was to unplug my cable modem for a minute and plug it back in.

What I didn’t know this time was that the little redundant gremlin living in my cable modem was going to give me fits. After fifteen minutes of not getting the system to come back up the way that I wanted, I decided to unplug my modem from the wall instead of the back of the unit. That meant the lights on the front were visible to me. And that’s when I saw that the lights never went out when the modem was unplugged.

Turns out that my modem has a battery pack installed since it’s a VoIP router for my home phone system as well. That battery pack was designed to run the phones in the house for a few minutes in a failover scenario. But it also meant that the modem wasn’t letting go of the cached ARP entries either. So, all my efforts to make my modem take the new firewall were being stymied by the battery designed to keep my phone system redundant in case of a power outage.

The second issue came when I went to turn up a new Ubiquiti access point. I disconnected the old Meraki AP in my office and started mounting the bracket for the new AP. I had already warned my daughter that the Internet was going to go down. I also thought I might have to reprogram her device to use the new SSID I was creating. Imagine my surprise when both my laptop and her iPad were working just fine while I was hooking the new AP up.

Turns out, both devices did exactly what they were supposed to do. They connected to the other Meraki AP in the house and used it while the old one was offline. Once the new Ubiquiti AP came up, I had to go upstairs and unplug the Meraki to fail everything back to the new AP. It took some more programming to get everything running the way that I wanted, but my wireless clients had done the job they were supposed to do. They failed over to the SSID they could see and kept on running until that SSID failed as well.

Finding Failure Fast

When you’re trying to troubleshoot around a problem, you need to make sure that you’re taking redundancy into account as well. I’ve faced a few problems in my life when trying to induce failure or remove a configuration issue was met with difficulty because of some other part of the network or system “replacing” my hard work with a backup copy. Or, I was trying to figure out why packets were flowing around a trouble spot or not being inspected by a security device only to find out that the path they were taking was through a redundant device somewhere else in the network.

Redundancy is a good thing. Until it causes issues. Or until it makes your network behave in such a way as to be unpredictable. Most of the time, this can all be mitigated by good documentation practices. Being able to figure out quickly where the redundant paths in a network are going is critical to diagnosing intermittent failures.

It’s not always as easy as pulling up a routing table either. If the entire core is down you could be seeing traffic routing happening at the edge with no way of knowing the redundant supervisors in the chassis are doing their job. You need to write everything down and know what hardware you’re dealing with. You need to document redundant power supplies, redundant management modules, and redundant switches so you can isolate problems and fix them without pulling your hair out.
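One way to keep that documentation current is to collect the redundancy state on a schedule instead of from memory. The sketch below assumes a Cisco IOS chassis reachable with Netmiko and placeholder credentials; the command list is just an example of the kind of state worth filing away before anything ever fails over.

from datetime import date
from netmiko import ConnectHandler

REDUNDANCY_COMMANDS = [
    "show redundancy",
    "show module",
    "show environment power",
    "show standby brief",
]

device = {
    "device_type": "cisco_ios",
    "host": "192.0.2.30",      # placeholder core switch address
    "username": "admin",
    "password": "changeme",
}

with ConnectHandler(**device) as conn:
    with open(f"redundancy-{device['host']}-{date.today()}.txt", "w") as f:
        for cmd in REDUNDANCY_COMMANDS:
            f.write(f"===== {cmd} =====\n")
            f.write(conn.send_command(cmd) + "\n\n")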


Tom’s Take

I rarely got to work with redundant equipment when I was installing it through E-Rate. The government doesn’t believe in buying two things to do the job of one. So, when I did get the opportunity to work with redundant configurations I usually found myself trying to figure out why things were failing in ways I couldn’t predict. After a while, I realized that I needed to start making my own notes and doing some investigation before I actually started troubleshooting. And even then, like my cable modem’s battery, I ran into issues. Redundancy keeps you from shooting yourself in the foot. But it can also make you stab yourself in the eye in frustration.