The Mystery of Known Issues

I’ve spent the better part of the last month fighting a transient issue with my home ISP. I thought I had it figured out after a hardware failure at the connection point, but it crept back up after I got back from my Philmont trip. I spent a lot of energy upgrading my home equipment firmware and charting the seemingly random timing of the issue. I also called the technical support line and carefully explained what I was seeing and what had already been done to work on the problem.

The responses usually ranged from confused reactions to attempts to reset my cable modem, which never worked. It took several phone calls and lots of repeated explanations before I finally got a different answer from a technician. It turns out there was a known issue with the modem hardware! It’s something they’ve been working on for a few weeks and they’re not entirely sure what the ultimate fix is going to be. So for now I’m going to have to endure the daily resets. But at least I know I’m not going crazy!

Issues for Days

Known issues are a way of life in technology. If you’ve worked with any system for any length of time you’ve seen the list of things that aren’t working or have weird interactions with other things. Given the increasing number of interactions we have with systems that are more and more dependent on each other, it’s no wonder those known issue lists are miles long by now.

Whether it’s a bug or an advisory or a listing of an incompatibility on a site, the nature of all known issues is the same. They are things that don’t work that we can’t fix yet. They could be on a list of issues slated to be resolved or something that may never be fixable. The key is that we know all about them so we can plan around them. Maybe it’s something like a bug in a floating point unit that causes long division calculations to be inaccurate past a certain number of decimal places. If you know what the issue is you know how to either plan around it or use something different. Maybe you don’t calculate to that level of precision. Maybe you do that on a different system with another chip. Whatever the case, you need to know about the issue before you can work around it.

Not all known issues are publicly known. They could involve sensitive information about a system. Perhaps the issue itself is a potential security risk. Most advisories about remote exploits are known issues internally at companies before they are patched. While they aren’t immediately disclosed, they are eventually found out when the patch is released or when someone outside the company discovers the same issue. Putting these kinds of things under an embargo of sorts isn’t always bad if it limits the potential for wider exploitation. However, the truth must eventually come out or things can’t get resolved.

Knowing the Unknown

What happens when the reasons for not disclosing known problems are less than noble? What if the reasoning behind hiding an issue has more to do with covering up bad decision making or saving face or even keeping investors or customers from fleeing? Welcome to the dark side of disclosure.

When I worked for Gateway 2000 back in the early part of the millennium, we had a particularly nasty known issue in the system. The ultimate root cause was that the capacitors on a series of motherboards were made with poor quality controls or bad components and would swell and eventually explode, causing the system to come to a halt. The symptoms manifested themselves in all manner of strange ways, like race conditions or random software errors. We would sometimes spend hours troubleshooting an unrelated issue only to find out the motherboard was afflicted with “bad caps”.

The issue was well documented in the tech support database for the affected boards. Once we could determine that it was a capacitor issue it was very easy to get the parts replaced. Getting to that point was the trick, though. Because at the top of the article describing the problem was a big, bold statement:

Do Not Tell The Customer This Is A Known Issue!!!

What? I can’t tell them that their system has an issue that we need to correct before everything pops and shuts it down for good? I can’t even tell them what to look for specifically when we open the case? Have you ever tried to tell a 75-year-old grandmother to look for something “strange” in a computer case? You get all kinds of fun answers!

We ended up getting creative in finding ways to look for those issues and getting them replaced where we could. When I moved on to my next job working for a VAR, I found out some of those same machines had been sold to a customer. I opened the case and found bad capacitors right away. I told my manager and explained the issue and we started getting boards replaced under warranty at the first sign of problems. After the warranty expired we kept ordering good boards from suppliers until we were able to retire all of those bad machines. If I hadn’t known about the bad cap issue from my help desk time I never would have known what to look for.

Known issues like these are exactly the kind of thing you need to tell your customers about. It’s something that impacts their computer. It needs to be fixed. Maybe the company didn’t want to have to replace thousands of boards at once. Maybe they didn’t want to have to admit they cut corners when they were buying the parts and now the money they saved is going to haunt them in increased support costs. Whatever the reason, it’s not the fault of the customer that the issue is present. They should have the option to get things fixed properly. Hiding what happened only strains the relationship between consumer and provider.

Which brings me back to my issue from above. Maybe it wasn’t “known” when I called the first time. But by the third or fourth time I called about the same thing they should have been able to tell me it’s a known problem with this specific behavior and that a fix is coming soon. The solution wasn’t to keep using the first-tier support fixes of resets or transfers to another department. I would have appreciated knowing it was an issue so I didn’t have to spend as much time upgrading and isolating and documenting the hell out of everything just to exclude other issues. After all, my troubleshooting skills haven’t disappeared completely!

Vendors and providers, if you have a known issue you should admit it. Be up front. Honesty will get you far in this world. Tell everyone there’s a problem and you’re working on a fix that you don’t have just yet. It may not make the customer happy at first, but they’ll appreciate it a lot more than having it hidden for days or weeks while you scramble to fix it without telling anyone. If that customer has more than a basic level of knowledge about systems they’ll probably figure it out anyway, and then you’re going to be the one with egg on your face when they tell you all about the problem you don’t want to admit you have.


Tom’s Take

I’ve been on both sides of this fence in a number of situations. Do we admit we have a known problem and try to get it fixed? Or do we get creative and try to hide it so we don’t have to own up to the uncomfortable questions that get asked about bad development or cutting corners? The answer should always be to own up to things. Make everyone aware of what’s going on and make it right. I’d rather deal with an honest company working hard to make things better than a dishonest vendor that miraculously manages to fix things out of nowhere. An ounce of honesty prevents a pound of bad reputation.

Should We Embrace Points of Failure?

There was a tweet making the rounds this last week that gave me pause. Max Clark said that we should embrace single points of failure in the network. His post was an impassioned plea to networking rock stars out there to drop redundancy out of their networks and instead embrace these Single Points of Failure (SPoF). The main points Mr. Clark made boil down to a couple of major statements:

  1. Single-device networks are less complex and easier to manage and troubleshoot. Don’t have multiple devices when an all-in-one works better.
  2. Consumer-grade hardware is cheaper and easier to understand, therefore it’s better. Plus, if you need a backup you can just buy a second one and keep it on the shelf.

I’m sure most networking pros out there are practically bristling at these suggestions. Others may read through the original tweet and think it was a tongue-in-cheek post. Let’s look at the argument logically and understand why it has some merit but is ultimately flawed.

Missing Minutes Matter

I’m going to tackle the second point first. The idea that you can use cheaper gear and have cold standby equipment just sitting on the shelf is one that I’ve heard of many times in the past. Why pay more money to have a hot spare or a redundant device when you can just stock a spare part and swap it out when necessary? If your network design decisions are driven completely by cost then this is the most appealing thing you could probably do.

I once worked with a technology director who insisted that we forgo our usual configuration of RAID-5 with a hot spare drive in the servers we deployed. His logic was that the hot spare drive was spinning without actually doing anything. If something did go wrong it was just as easy to take the spare drive off the shelf and slip it in then, instead of running the risk that the drive might fail while sitting in the server. His logic seemed reasonable enough, but there was one variable he wasn’t thinking about.

Time is always the deciding factor in redundancy planning. In the world of backup and disaster recovery they use the acronym RTO, which stands for Recovery Time Objective. Essentially, how long can your systems be offline before the data must be restored? Can you go days without getting your data back? Or do you need it back up and running within hours or even minutes? For some organizations the RTO could even be measured in mere seconds. Every tightening of the RTO adds complexity and cost.

If you can go days without your data then a less expensive tape solution is best because it is the cheapest per byte stored and lasts forever. If your RTO is minutes or less you need to add hardware that replicates changes or mirrors the data between sites to ensure there is always an available copy somewhere out there. Time is the deciding factor here, just as it is in the redundancy example above.
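As a rough sketch of that time-driven decision process, here’s how the logic might look in code. The thresholds and strategy names are purely illustrative assumptions, not a standard:

```python
from datetime import timedelta

def recovery_strategy(rto: timedelta) -> str:
    """Map a Recovery Time Objective to a plausible class of solution.

    The cutoffs below are invented for illustration -- every organization
    has to set its own based on cost and tolerance for downtime.
    """
    if rto >= timedelta(days=1):
        return "offline tape backups restored on demand"
    if rto >= timedelta(hours=1):
        return "disk-based backups with standby hardware to restore onto"
    if rto >= timedelta(minutes=5):
        return "replication of changes to a secondary site"
    return "synchronous mirroring with automatic failover"

print(recovery_strategy(timedelta(hours=4)))
# -> disk-based backups with standby hardware to restore onto
```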

Can your network tolerate hours of downtime while you swap in a part from off the shelf? Remember that you’re going to need to copy the configuration over to it and ensure it’s back up and running. If it is a consumer-grade device there probably isn’t an easy way to console in and paste the config. Maybe you can upload a file from the web GUI but the odds are pretty good that you’re looking at downtime at least in the half-hour range if not more. If your office can deal with that then Max’s suggestions should work just fine.

For organizations that need to be back up and running in less than hours, you need to have fault tolerance in your network. Redundant paths for traffic or multiple devices to eliminate single points of failure are the only way to ensure that traffic keeps flowing in the event of a hardware failure. Sure, it’s more complicated to troubleshoot. But the time you spend making it work correctly is not time you’re going to spend copying configurations to a cold device while users and stakeholders are yelling at you to get things back online.

All-In-Wonder

Let’s look at the first point here. Single box solutions are better because they are simple to manage and give you everything you could need. Why buy a separate switch, firewall, and access point when you can get them all in one package? This is the small office / branch office model. SD-WAN has even started moving down this path for smaller deployments by pushing all the devices you need into one footprint.

It’s not unlike the TVs you can buy in the big box stores that have DVD players, VHS players, and even some streaming services built in. They’re easy to use because there are no extra wires to plug in and no additional remote controls to lose. Everything works from one central location and it’s simple to manage. The package is a great solution if you only occasionally need to watch old VHS tapes or DVDs from your collection.

Of course, most people understand the drawbacks of this model. Those devices can break. They are much harder to repair when they’re all combined. Worse yet, if the DVD player breaks and you need to get it repaired you lose the TV completely during the process instead of just the DVD player. You also can’t upgrade the components individually. Want to trade out that DVD player for a Blu-Ray player? You can’t unless you install one on its own. Want to keep those streaming apps up-to-date? Better hope the TV has enough memory to stay current. Even state-of-the-art streaming boxes will eventually be incapable of running the latest version of popular software.

All-in-one devices are best left to the edges of the network. They function well in offices with a dozen or so workers. If something goes bad on the device it’s easier to just swap the whole thing instead of trying to repair the individual parts. That same kind of mentality doesn’t work quite so well in a larger data center. The fact that most of these unified devices don’t take rack mounting ears or fit into a standard data center rack should be a big hint that they aren’t designed for use in a place that keeps the networking pieces off of someone’s desk.


Tom’s Take

I smiled a bit when I read the tweet that started this whole post. I’m sure that the networks Max has worked on run much better with consumer all-in-one devices. Simple configurations and cold spares are a perfectly acceptable solution for law offices or tag agencies or other places that don’t measure their downtime in thousands of dollars per second. I’m not saying he’s wrong. I’m saying that his solution doesn’t work everywhere. You can’t run the core of an ISP with some SMB switches. You shouldn’t run your three-person law office with a Cat6500. You need to decide what factors are the most important for you. Don’t embrace failure without thought. Figure out how tolerant you or your customers are of failure and design around it as best you can. Once you can do that you’ll have a much better idea of how to build your network with the fewest points of failure.

Document The First Time, Every Time

Imagine you’re deep into a massive issue. You’ve been troubleshooting for hours trying to figure out why something isn’t working. You’ve pulled in resources to help and you’re on the line with the TAC to try and get a resolution. You know this has to be related to something recent because you just got notified about it yesterday. You’re working through logs and configuration settings trying to gain insights into what went wrong. That’s when the TAC engineer hits you with an armor-piercing question:

When did this start happening?

Now you’re sunk. When did you first start seeing it? Was it happening before and no one noticed? Did a tree fall in the forest and no one was around to hear the sound? What is the meaning of life now?

It’s not too hard to imagine the above scenario because we’ve found ourselves in it more times than we can count. We’ve started working on a problem and traced it back to a root cause only to find out that the actual inciting incident goes back even further than that. Maybe the symptoms just took a while to show up. Perhaps someone unknowingly “fixed” the issue with a reboot or a process reload over and over again until it couldn’t work any longer. How do we find ourselves in this mess? And how do we keep it from happening?

Quirky Worky

Glitches happen. Weird bugs crop up temporarily. It happens every day. I had to reboot my laptop the other day after it had been up for about two months because of a series of weird errors that I couldn’t resolve. Little things that weren’t super important eventually snowballed into every service shutting down and forcing me to restart. But what was the cause? I can’t say for sure and I can’t tell you when it started. Because I just ignored the little glitches until the major ones forced me to do something.

Unless you work in security you probably ignore little strange things that happen. Maybe an application takes twice as long to load one morning. Perhaps you click on a link and it pops up a weird page before going through to the actual website. You could even see a storage notification in the logs for a router that randomly rebooted itself for no reason in the middle of the night. Occasional weirdness gets dismissed by us because we’ve come to expect that things are just going to act strangely. I once saw a quote from a developer that said, “If you think building a steady-state machine is easy, just look at how many issues are solved with a reboot.”

We tend to ignore weirdness unless it presents itself as a more frequent issue. If a router reboots itself once we don’t think much about it. The fourth time it reboots itself in a day we know we’ve got a problem we need to fix. Could we have solved it when the first reboot happened? Maybe. But we also didn’t know we were looking at a pattern of behavior either. Human brains are wired to look for patterns of things and pick them out. It’s likely an old survival trait of years gone by that we apply to current technology.

Notice that I said “unless you work in security”. That’s because the security folks have learned over many years and countless incidents that nothing is truly random or strange. They look for odd remote access requests or strange configuration changes on devices. They wonder why random IP addresses from outside the company are trying to access protected systems. Security professionals treat every random thing as a potential problem. However, that kind of behavior also demonstrates the downside of analyzing every little piece of information for potential threats. You quickly become paranoid about everything and spend a lot of time and energy trying to make sense out of potential nonsense. Is it any wonder that many security pros find themselves jumping at every little shadow in case it’s hiding a beast?

Middle Ground of Mentions

On the one hand, we have systems people that dismiss weirdness until it’s a pattern. On the other we have security pros that are trying to make patterns out of the noise. I’m sure you’re probably wondering if there has to be some kind of middle ground to ensure we’re keeping track of issues without driving ourselves insane.

In fact, there is a good habit you need to get into. You need to write it all down somewhere. Yes, I’m talking about the dreaded documentation monster. The kind of thing that no one outside of developers likes to do. The mean, nasty, boring process of taking the stuff in your brain and putting it down somewhere so someone can figure out what you’re thinking without the need to read your mind.

You have to write it down because you need to have a record to work from if something goes wrong later. One of the greatest features I’ve ever worked with that seems to be ignored by just about everyone is the Windows Shutdown Reason dialog box in Windows Server 2003 and above. Rebooting a box? You need to write in why and give a good justification. That way if someone wants to know why the server was shut off at 11:15am on a Tuesday they can examine the logs. Unfortunately in my experience the usual reason for these shutdowns was either “a;lkjsdfl;kajsdf” or “because I am doing it”. Which aren’t great justifications for later.

You don’t have to be overly specific with your documentation but you need to give enough detail so that later you can figure out if this is part of a larger issue. Did an application stop responding and need to be restarted? Jot that down. Did you need to kill a process to get another thing running again? Write down that exact sentence. If you needed to restart a router and you ended up needing to restore a configuration you need to jot that part down too. Because you may not even realize you have an issue until you have documentation to point it out.
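If it helps to have something concrete, here’s a minimal sketch of that kind of record keeping: an append-only log with just enough structure to search later. The field names and file path are my own illustration, not a standard:

```python
import json
from datetime import datetime, timezone

def log_event(path: str, device: str, symptom: str, action: str) -> None:
    """Append one line of JSON per incident so the log stays greppable."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "device": device,
        "symptom": symptom,
        "action_taken": action,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

log_event("weirdness.jsonl", "edge-rtr-01",
          "rebooted itself overnight", "restored saved configuration")
```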

I can remember doing a support call years ago with a customer and in conversation he asked me if I knew much about Cisco routers. I chuckled and said I knew a bit. He said that he had one that he kept having to copy the configuration files to every time it restarted because it came up blank. He even kept a console cable plugged into it for just that reason. Any CCNA out there knows that’s probably a config register issue so I asked when it started happening. The person told me at least a year ago. I asked if anyone had to get into the router because of a forgotten password or some other lockout. He said that he did have someone come out a year and a half prior to reset a messed up password. Ever since then he had to keep putting the configuration back in. Sure enough, the previous tech hadn’t reset the config register. One quick fix and the customer was very happy to not have to worry about power outages any longer.

Documenting when things happen means you can build a timeline per device or per service to understand when things are acting up. You don’t even need to figure it out yourself. The magic of modern systems lies in machine learning. You may think to yourself that machine learning is just fancy linear regression at this point and you would be right more often than not. But one thing linear regression is great at doing is surfacing patterns of behavior for specific data points. If your router reboots on the third Wednesday of every month precisely at 3:33am the ML algorithms will pick up on that and tell you about it. But that’s only if your system catches the reboot through logs or some other record keeping. That’s why you have to document all your weirdness. Because the ML systems can’t analyze what they don’t know about.
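To make that concrete, here’s a toy version of the kind of pattern surfacing a real ML system does far better. It just counts which weekday and hour the logged reboots fall on; the timestamps are invented:

```python
from collections import Counter
from datetime import datetime

# Reboot timestamps pulled from whatever record keeping you have --
# syslog, tickets, or the log sketched above. These values are made up.
reboots = [
    "2024-01-17 03:33", "2024-02-21 03:33", "2024-03-20 03:33",
    "2024-03-04 14:10",  # one unrelated, manually triggered reboot
]

# Bucket each reboot by (weekday, hour) and see if one bucket dominates.
buckets = Counter(
    (datetime.strptime(ts, "%Y-%m-%d %H:%M").strftime("%A"),
     datetime.strptime(ts, "%Y-%m-%d %H:%M").hour)
    for ts in reboots
)
(day, hour), count = buckets.most_common(1)[0]
if count / len(reboots) > 0.5:
    print(f"Possible pattern: reboots cluster on {day}s around {hour:02d}:00")
```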


Tom’s Take

I love writing. And I hate documentation. Documentation is boring, stuffy, and super direct. It’s like writing book reports over and over again. I’d rather write a fun blog post or imagine an exciting short story. However, documentation of issues is critical to modern organizations because these issues can spiral out of hand before you know it. If you don’t write it down it didn’t happen. And you need to know when it happened if you hope to prevent it in the future.

Basics First and Basics Last

This week I found my tech life colliding with my normal life in an unintended and somewhat enlightening way. I went to a store to pick up something that was out of stock and while I was there making small talk the person behind the counter asked me what I did for a living. I mentioned technology and he said that he was going to college for a degree in MIS, which just happens to be the thing I have my degree in. We chatted about that for a few more minutes before he asked me something I get asked all the time.

“What is the one thing I need to make sure I pay attention to in my courses?”

It’s simple enough, right? You’ve done this before and you have the benefit of hindsight. What is the one thing that is most important to know and not screw up? The possible answers floating through my head were all about programming or analytical methods or even the dreaded infrastructure class I slept through and then made a career out of. But what I said was the most boring and most critical answer one could give.

“You need to know the basics backwards and forwards.”

Basics Training

Why do we teach the basics? Why do we even call them that? And why are people so keen on skipping over all of them so fast to get to the cool stuff? You have to understand the basics before you can move on, and yet so many want to get the “easy” stuff out of the way because memorizing the OSI model or learning how an array works in programming is mind-numbing.

The basics exist because we all need to know how things work at their most atomic level. We memorize the OSI model in networking because it tells us how things should behave. Sure, TCP/IP blows it away. However, if you know how packets are supposed to work with that model it informs you how you need to approach troubleshooting and software design and even data center layouts.

I’ll admit that I really didn’t pay much attention when I took my Infrastructure class twenty years ago. I was hell-bent on being a consultant or a database admin, and who needed to know how a CPU register worked? What was this stupid OSI model they wanted me to know? I’ll just memorize it for the test and be done with it. Needless to say, the intervening years have shown me the folly of not paying attention in that class. If I went back today I’d ace that OSI test with my eyes closed.

The basics seem useless because we can’t do much with them right now. They’re just like Lego bricks. We need uniform pieces with predictable characteristics to help us understand how things are supposed to work together. Without that knowledge of how things work you can’t build on it. If you don’t understand the difference between RAM and a hard disk you won’t be able to build systems that rely on both. Better yet, when technology changes to incorporate solid state disks and persistent memory storage you need the basics to understand how they are different and where you need to apply that knowledge.

I once picked up a Cisco Press CCIE study guide to brush up on my knowledge before retaking the written exam. The knowledge in the book seemed easy to me. It was all about spanning tree configurations and OSPF area types and what BGP keepalives were. I felt like it was a remedial text that didn’t give me any new knowledge. That’s when I realized that the knowledge in the book wasn’t supposed to be new. It was supposed to be a reminder of what I already learned in my CCNA and CCNP courses. If anything in the text was truly new, was it something I should have already known?

It’s also part of the reason the CCIE lab is such a fun exam. You should already know the basics of how things like RIP and OSPF work. So let’s test those basics in new ways. Any of the training labs you can take from companies like INE or Micronics are filled with tricky little scenarios that make you take the basics and apply them outside the box. That’s because the instructors don’t need to spend time teaching you how RIP forms neighbor relationships or how OSPF builds adjacencies. They want to see if you remember how that happens so you can apply it to a question designed to stretch your knowledge. You can only do that when you know the basics.

Graduation Day

Basics aren’t just for learning at the beginning. You should also brush up on them when you’re at the top of your game. Why? Because they will answer questions you might not know you had or explain strange behaviors that rely on architecture we forgot about long ago because it seemed basic.

A fun example comes from years ago in the online game City of Heroes. Players can earn in-game currency to buy and sell things. Eventually the game economy got to the point where players were hitting the maximum amount of currency a character could hold. What was that number? It was just over two billion. Pretty odd place to stop, right? What made the developers think that was a good stopping point? Random chance? A desire to keep the amount of currency in circulation low? Or was there a different reason?

That’s when I asked a simple question: How would you store the currency value in the game’s code? The answer for every programmer out there is an integer. And what’s the maximum value for an integer? For a 32-bit value it’s around four billion. But what if you use a signed integer for some reason? The maximum value is just over two billion in each direction. So the developers used a 32-bit signed integer and that’s why the currency value was capped where it was.
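The arithmetic is quick to check. A sketch in Python (the mapping to the game’s actual data types is, of course, an inference):

```python
# An unsigned 32-bit integer tops out near four billion.
unsigned_max = 2**32 - 1      # 4,294,967,295
# A signed 32-bit integer gives up one bit for the sign, so the range
# is "just over two billion in each direction."
signed_max = 2**31 - 1        # 2,147,483,647
signed_min = -(2**31)         # -2,147,483,648

print(f"{unsigned_max:,}")    # 4,294,967,295
print(f"{signed_max:,}")      # 2,147,483,647
```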

Over and over again in my career I find myself turning back to the basics to answer questions about things I need to understand or solve. We really want the solutions to be complex and hard to find because that shows off our critical thinking skills. However, if you start with the basics you’ll find that the solutions to problems, or the root causes, are often defined by something very basic that has far-reaching consequences. And if you forget how those basics work you’re going to spend a lot of time chasing your tail looking for a complex solution to a simple problem.


Tom’s Take

I don’t think my conversation partner was hoping for the answer I gave him. I’m sure he wanted me to say that some high-level course was super important because it taught all the secrets you needed to know in order to succeed in life. Everyone wants to hear that the most important things are exciting and advanced. Finding out that the real key to everything is the basics you learn at the beginning of your journey is disappointing. However, for those who master the basics and remember them at every step of their journey, the end of the road is just as advanced and exciting as it was when they stepped onto it in the first place. And they get there with a better understanding of how everything works.

Building Snowflakes On Purpose

We all know that building snowflake networks is bad, right? If it’s not a repeatable process it’s going to end up being a problem down the road. If we can’t refer back to documentation that shows why we did something we’re going to end up causing issues and reducing reliability. But what happens when a snowflake process is required to fix a bigger problem? It’s a fun story that highlights where process can break down sometimes.

Reloaded

I’ve mentioned before that I spent about six months doing telephone tech support for Gateway computers. This was back in 2003 so Windows XP was the hottest operating system out there. The nature of support means that you’re going to be spending more time working on older things. In my case this was Windows 95 and 98. Windows 98 was a pain but it was easy to work on.

One of the most common processes we had for Windows 98 was a system reload. It was the last line of defense to fix massive issues or remove viruses. It was something that was second nature to any of the technicians on the help desk:

  1. Boot from the Gateway tools CD and use GWSCAN to write zeros to the hard drive.
  2. Reboot from the CD and use FDISK to partition the hard disk.
  3. Format the drive.
  4. Insert the Windows 98 OS CD and copy the CAB installation files to a folder on the hard drive.
  5. Run setup from the hard drive and let it complete.
  6. When Windows comes back up, insert the driver CD and let it install all the drivers.

The whole process took two or three phone calls to complete. Any time you got to something that would take more than fifteen minutes to complete you logged the steps in the customer trouble ticket and had them call back when it was completed. The process was so standard that it had its own acronym in the documentation – FFR, which stood for “FDISK, Format, Reload”. If you told someone where you were in the process they could finish it no problem.

Me, Me, ME

The whole process was manual with lots of steps and could intimidate customers. At some point in the development process the Gateway folks came up with a solution for Windows ME that they thought worked better. Instead of the manual steps of copying files and drivers and such, Gateway loaded the OS CD with a copy of the image they wanted for their specific model type. The image was installed using ImageCast, an imaging program that just dropped the image down on the drive without the need to do all the other steps. In theory, it was simple and reduced call times for the help desk.

In practice, Windows ME was a disaster to work on. The ImageCast program worked about half the time. If you didn’t pick the right options in the reload process it would partition the hard drive and add a second clean copy of WinME without removing the first one. It would change the MBR to have two installations to choose from with the same identifiers so users would get confused as to which was which. And the image itself seemed to be missing programs and drivers. The fact that there was a Driver CD that shipped with the system made us all wonder what the real idea behind this “improved” process was.

Because Windows ME was such a nightmare to reload, our call center got creative. We had a process that worked for Windows 98. We had all the files we needed on the disks. Why not do it the “right” way? So we did. Informally, the reload process for Windows ME was the same as Windows 98. We would FFR Windows ME boxes and make sure they looked right when they came back. No crazy ImageCasting programs or broken software loads.

The only issue? It was an informal snowflake process. It worked much better but if someone called the main help desk number and got another call center they would be in the middle of an unsupported process. The other call center tech would simply start the regular process and screw up the work we’d done already. To counter that, we would tell the customer to call our special callback voicemail box and not do anything until we called them back. That meant the reload process took many more hours for an already unhappy customer. The end result was better but could lead to frustrations.

Let It Snow

Was our informal snowflake process good or bad? It’s tough to say. It led to happier customers. It meant the likelihood of future support calls was lower because the system was properly reloaded instead of relying on a broken image. However, it was stressful for the tech that worked on the ticket because they had to own the process the whole way through. It also meant that you had to ensure the customer wouldn’t call back and disrupt the process with someone else on the phone.

The process was broken and it needed to be fixed. However, the way to fix the broken process wasn’t easy to figure out. The national line had their process and they were going to stick to it. We came up with an alternative but it wasn’t going to be adopted. Yet, we still kept using our process as often as possible because we felt we were right.

In your enterprise, you need to understand process. Small companies need processes that are repeatable for the ease of their employees. Large companies need processes to keep things consistent. Many companies that face regulatory oversight need processes followed exactly to ensure compliance. The ability to go rogue and just do it the way you want isn’t always desired.

If you have users that are going around the process you need to find out why. We see it all the time. Security rules that get ignored. Documentation requirements that are given the bare minimum effort. Remember the shutdown dialog box in Windows Server 2003? Most people did a token entry before rebooting. And having eighteen entries of “;lkj;kljl;k” doesn’t help figure out what’s going on.

Get your users to tell you why they need a snowflake process. Especially if there’s more than one person relying on it. The odds are very good that they’ve found hiccups that you need to address. Maybe it’s data entry for old systems. Perhaps it’s a problem with the requirements or the order of the steps. Whatever it is, you have to understand it and fix it. The snowflake may be a better way to do things. You have to investigate and figure it out, otherwise your processes will be forever broken.


Tom’s Take

Inbound tech support is full of stories like these. We find a broken process and we go around it. It’s not always the best solution but it’s the one that works for us. However, building a toolbox full of snowflake processes and solutions that no one else knows about is a path to headaches. You need to model your process around what people need to accomplish and how they do it instead of just assuming they’re going to shoehorn their workflow into your checklist. If your process doesn’t work for your people, you’d better get a handle on the situation before you’re buried in a blizzard of snowflakes.

Friction Finders

Do you have a door that sticks in your house? If it’s made out of wood the odds are good that you do. The kind that doesn’t shut properly or sticks out just a touch too far and doesn’t glide open like it used to. I’ve dealt with these kinds of things for years and YouTube is full of useful tricks to fix them. But all those videos start with the same tip: you have to find the place where the door is rubbing before you can fix it.

Enterprise IT is no different. We have to find the source of friction before we can hope to repair it. Whether it’s friction between people and hardware, users and software, or teams going at each other, we have to know what’s causing the commotion before we can repair it. Just like with the sticking door, adding more force without understanding the friction points isn’t a long-term solution.

Sticky Wickets

Friction comes from a variety of sources. People don’t understand how to use a device or a program. Perhaps it’s a struggle to understand who is supposed to be in charge of a change control or a provisioning process. It could even be as simple as someone carrying outside issues into their regular day and causing problems because their interactions with previously stable systems become strained.

Whatever the reasons for the friction, we need to understand what’s going on before we can start fixing. If we don’t know what we’re trying to solve we’re just going to throw solutions at the wall until something sticks. Then we hope that was the fix and we move on. Shotgun troubleshooting is never a permanent solution to anything. Instead, we need to take repeatable, documented steps to resolve the points of friction.

Ironically enough, the easiest issue to solve is the interpersonal kind. When teams argue about roles or permissions or even who should have the desks at the front of the data center it’s almost always a problem of a person against another person. You can’t patch people. You can’t upgrade people. You can’t even remove people and replace them with a new model (unless it’s a much bigger issue and HR needs to solve it). Instead, it’s a chance to get people talking. Be productive and make sure that everyone knows the outcome is the resolution of the problem. It’s not name calling or posturing. Lay out what the friction point is and make people talk about that. If you can keep them focused on the problem at hand and not at each others’ throats you should be able to get everyone to recognize what needs to change.

Mankind Versus Machines

People fighting people is a people problem. But people against the enterprise system isn’t always a cut-and-dried situation. That’s because machines are predictable in so many different kinds of ways. You can be sure that what you’re going to get is the right answer every time but you may not be able to ask the right questions to find that answer the way you want to.

Think back to how many times you’ve diagnosed a problem only to hit a wall. Maybe it’s that the CPU utilization on a device is higher than it should be. What next? Is it some software application? A bug in the system? Maybe a piece of malware running rampant? If it’s a networking device is it because of a packet flow causing issues? Or a failed table lookup causing a process switch to happen? There are a multitude of things that need to be investigated before you can decide on a course of action.

This is why shotgun troubleshooting is so detrimental to reducing friction. How do we know that the solution we tried isn’t making the problem worse? I wrote over ten years ago about removing fixes that don’t address the problem and it’s still very true today. If you don’t back out the things that didn’t address the issue you’re not only leaving bad fixes in place but you are potentially causing problems down the line when those changes impact other things.

Finding the sources of friction in your systems takes in-depth troubleshooting. Your users just want the problem to go away. They don’t want to answer forty questions about when it started or what’s been installed. They don’t want the perfect solution that ensures the problem never comes back. They just want to be able to work right now and then keep moving forward. That means you need a two-step process of triage and investigation. Get the problem under control and then investigate the friction points. Figure out how it happened and fix it after you get the users back up and running.

Lastly, document the issue and what resolved it. Write it all down somewhere, even if it’s just in your own notes. But if you do that, make sure you have a way of indexing everything so you can refer back to it at some point in the future. Part of the reason why I started this blog was to write down solutions to problems I discovered or changes I made along the way to ensure that I could always look them up. If you can’t publish them on the Internet or don’t feel comfortable writing it all up, at least use a personal database system like Notion to keep it all in one searchable place. That way you don’t forget how clever you are and go back to reinventing the wheel over and over again every time the problem comes up.


Tom’s Take

Friction exists everywhere. Sometimes it’s necessary to keep us from sliding all over the road or to keep a rug in place in the living room. In enterprise IT, friction can create issues with systems and teams. Reducing it as much as possible keeps teams working together and being as productive as possible. You can’t eliminate it completely and you shouldn’t remove it just for the sake of getting rid of something. Instead, analyze what you need to improve or fix and document how you did it. Like any door repair, the results should be immediate and satisfying.

Opening Up Remote Access with Opengear

The Opengear OM2200

If you had told me last year at this time that remote management of devices would be a huge thing in 2020 I might have agreed but laughed quietly. We were traveling down the path of simultaneously removing hardware from our organizations and deploying IoT devices that could be managed easily from the cloud. We didn’t need to access stuff like we did in the past. Even if we did, it was easy to just SSH or console into the system from a jump box inside the corporate firewall. After all, who wants to work on something when you’re not in the office?

Um, yeah. Surprise, surprise.

Turns out 2020 is the Year of Having Our Hair Lit On Fire. Which is a catchy song someone should record. But it’s also the year where we have learned how to stand up 100% Work From Home VPN setups within a week, deploy architecture to the cloud and refactor on the fly to help employees stay productive, and institute massive change freezes in the corporate data center because no one can drive in to do a reboot if someone forgets to do commit confirmed or reload in 5.

Remote management has always been something that was nice to have. Now it’s something that you can’t live without. If you didn’t have a strategy for doing it before or you’re still working with technology that requires octal cables to work, it’s time you jumped into the 21st Century.

High Gear

Opengear is a company that has presented a lot at Tech Field Day. I remember seeing them for the first time when I was a delegate many, many years ago. As I have grown with Tech Field Day, so too have they. I’ve seen them embrace new technologies like cloud management and 4G/LTE connectivity. I’ve heard the crazy stories about fish farms and Australian emergency call boxes and even some stuff that’s too crazy to repeat. But the theme remains the same throughout it all. Opengear is trying to help admins keep their boxes running even if they can’t be there to touch them.

Flash forward to the Hair On Fire year, and Opengear is still coming right along. During the recent Tech Field Day Virtual Cisco Live Experience in June, they showed off their latest offerings for sweet, sweet hardware. Rob Waldie did a great job talking about their new line of NetOps Console servers here in this video:

Now, I know what you’re thinking. NetOps? Really? Trying to cash in on the marketing hype? I would have gone down that road if they hadn’t shown off some of the cool things these new devices can do.

How about support for Docker containerized apps? Pretty sure that qualifies as NetOps, doesn’t it? Now your remote console appliance is capable of doing things like running automation scripts and triggering complex logic when something happens. And, because containers are the way the cloud runs now, you can deploy any number of applications to the console server with ease. It’s about as close to an App Store model as you’re going to find, with some nerd knobs for good measure.

That’s not all though. The new line of console appliances also comes with an embedded trusted platform module (TPM) chip. You’ve probably seen these on laptops or other mobile devices. They do a great job of securing the integrity of the device. It’s super important to have if you’re going to deploy console servers into insecure locations. That way, no one can grab your device and do things they shouldn’t like tapping traffic or trying to do other nefarious things to compromise security.

Last but not least, there’s an option for 64GB of flash storage on the device. I like this because it means I can do creative things like back up configurations to the storage device on a regular basis just in case of an outage. If and when something happens I can just remote to the Opengear server, console to the device, and put the config back where it needs to be. Pretty handy if you have a device with a dying flash card or something that is subject to power issues on a regular basis. And with an LTE-A global cellular modem, you don’t have to worry about shipping the box to a country where it won’t work.
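To sketch what that could look like in practice, here’s the kind of small script you might run (perhaps as one of those containerized apps) to pull configs onto local storage. The library, device details, and file paths are my own assumptions for illustration, not anything Opengear ships:

```python
from datetime import date
from netmiko import ConnectHandler  # assumes netmiko is available in the container

# Hypothetical devices reachable from the console server's management network.
devices = [
    {"device_type": "cisco_ios", "host": "10.0.0.1",
     "username": "backup", "password": "example-only"},
]

for dev in devices:
    conn = ConnectHandler(**dev)
    config = conn.send_command("show running-config")
    conn.disconnect()
    # Write the config to the appliance's local flash (path is illustrative).
    path = f"/mnt/flash/backups/{dev['host']}-{date.today()}.cfg"
    with open(path, "w") as f:
        f.write(config)
```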


Tom’s Take

I realize that we’re not going to be quarantined forever. But this is a chance for us to see how much we can get done without being in the office. Remember all those budgets for fancy office chairs and the coffee service? They could go to buying Opengear console servers so we can manage devices without truck rolls. Money well spent on reducing the need for human intervention also means a healthier workforce. I trust my family to stay safe with our interactions. But if I have to show up at a customer site to reboot a box? That’s taking chances even under the best of circumstances. And the fewer chances we take in the short term, the healthier the long-term outlook becomes.

We may never get back to the world we had before. And we may never even find ourselves in a 100% Remote Work environment. But Opengear gives us options that we need in order to find a compromise somewhere in the middle.

If you’d like more information about Opengear’s remote access solutions, make sure you check out their website at http://Opengear.com

Disclaimer: As a staff member of Tech Field Day, I was present during Opengear’s virtual presentation. This post represents my own thoughts and opinions of their presentation. Opengear did not provide any compensation for this post, nor did they request any special consideration when writing it. The conclusions contained herein are mine alone and do not represent the views of my employer.

Failure Is Fine, Learning Is Mandatory

“Failure is a harsh teacher because it gives the test first and the lesson afterward.” — Vernon Law

I’m seeing a thread going around on Twitter today that is encouraging people to share their stories of failure in their career. Maybe it was a time they created a security hole in a huge application. Perhaps it was creating a routing loop in a global corporation. Or maybe it was something as simple as getting confused about two mailboxes and deleting the wrong one and realizing your mail platform doesn’t have undelete functionality.

We fail all the time. We try our hardest and whatever happens isn’t what we want. Some of those that fail just give up and assume that juggling isn’t for them or that they can never do a handstand. Others keep persevering through the pain and challenge and eventually succeed because they learn what they need to know in order to complete their tasks. Failure is common.

What is different is how we process the learning. Some people repeat the same mistakes over and over again because they never learn from them. In a professional setting, toggling the wrong switch when you create someone’s new account has a very low learning potential because it doesn’t affect you down the road. If you accidentally check a box that requires them to change their password every week you’re not going to care because it’s not on your account. However, if the person you do that to has some kind of power to make you switch it back or if the option puts your job in jeopardy you’re going to learn very quickly to change your behavior.

Object Failure

Here’s a quick one that illustrates how the motivation to learn from failure sometimes needs to be more than just “oops, I screwed up”. I’ll make it a bullet point list to save time:

  • Installed new phone system for school district
  • Used MGCP as the control protocol
  • Need to solve a PRI caller ID issue at the middle school
  • Gateway is at the high school
  • Need to see all the calls in the system
  • Type debug mgcp packet detail in a telnet session
  • A. Telnet. Session.
  • Router locks up tight and crashes
  • Hear receptionist from the other room say, “Did you just hang up on me?”
  • Panic
  • Panic some more
  • Jump in my car and break a couple of laws getting across town to restart router that I’m locked out of
  • Panic a lot in the five minutes it takes to reboot and reassociate with CallManager
  • Swear I will never do that again

Yes, I did the noob CCIE thing of debugging packets on a processing device in production because I underestimated the power of phone calls as well as my own stupidity. I got better!

But I promise that if I’d done this and it had only shut down one phone call or caused an issue for one small remote site I wouldn’t have learned a lesson. I might even still be doing that today to look at issues. The key here is that I shut down call processing for the entire school district for 20 minutes at the end of the school day. You have no idea how many elementary school parents call the front office at the end of the day. I know now.

Lessons have more impact with stress. It’s something we see any time we train people how to behave in high-pressure situations. I once witnessed a car accident right in front of me on a busy highway and it took my brain almost ten seconds to process that I needed to call emergency services (911 in the US) even though I had spent the last four years programming that dial peer into phone systems and dialing it for Calling Line ID verification. I’d practiced calling 911 for years and when I had to do it for real I almost forgot what to do. We have to know how people are going to react under stress. Or at least anticipate how people are going to behave. Which is why I always configured 9.911 as a dial peer.

Lessons Learned

The other important thing about failure is that you have to take stock of what you learn in the post-mortem. Even if it’s just an exercise you do for yourself. As soon as you realize you made a mistake you need to figure out how to learn from it and prevent that problem again. And don’t just say to yourself, “I’m never doing that again!” You need to think about what caused the issue and how you can ingrain the learning process into your brain.

Maybe it’s something simple like creating a command alias to prevent you from making the wrong typo again and deleting a file system. Maybe it’s forcing yourself to read popup dialog boxes as you click through the system to make sure you’re deleting the right file or formatting the right disk. Someone I used to work with would write down the name of the thing he was deleting and hold it up to the screen before he committed the command to be sure they matched. I never asked what brought that about but I’m sure it was a ton of stress.


Tom’s Take

I screw up. More often than even I realize. I try to learn as much as I can when I’m sifting through the ashes. Maybe it’s figuring out how I went wrong. Perhaps it’s learning why the thing I wanted to do didn’t work the way I wanted it to. It could even be as simple as writing down the steps I took to know where I went wrong and sharing that info with a bunch of strangers on the Internet to keep me from making the same mistake again. As long as you learn something you haven’t failed completely. And if you manage to avoid making the exact same mistake again then you haven’t failed at all.

AI and Trivia

I didn’t get a chance to attend Networking Field Day Exclusive at Juniper NXTWORK 2019 this year but I did get to catch some of the great live videos that were recorded and posted here. Mist, now a Juniper Company, did a great job of talking about how they’re going to be extending their AI-driven networking into the realm of wired networking. They’ve been using their AI virtual assistant, named “Marvis”, for quite a while now to solve basic wireless issues for admins and engineers. With the technology moving toward the copper side of the house, I wanted to talk a bit about why this is important for the sanity of people everywhere.

Finding the Answer

Network and wireless engineers are walking storehouses of useless trivia knowledge. I know this because I am one. I remember the hello and dead timers for OSPF on NBMA networks. I remember how long it takes BGP to converge or what the default spanning tree bridge priority is for a switch. Where some of my friends can remember the batting average for all first basemen in the league in 1971, I can instead tell you all about LSA types and the magical EIGRP equation.

Why do we memorize this stuff? We live in a world with instant search at our fingertips. We can find anything we might need thanks to the omnipotent Google Search Box. As long as we can avoid sponsored results and ads we can find the answer to our question relatively quickly. So why do we require people to memorize esoteric trivia? Is it so we can win free drinks at the bar after we’re done troubleshooting?

The problem isn’t that we have to know the answer. It’s that we need to know the answer in order to ask the right question. More often than not we find ourselves stuck in the initial phase of figuring out the problem. The results are almost always the same – things aren’t working. Finding the cause isn’t always easy though. We have to find some nugget of information to latch onto in order to start the process.

One of my old favorites was trying to figure out why a network I was working with had a segmented spanning tree. One side of the network was working just fine but there were three switches daisy chained together that didn’t. Investigations turned up very little. Google searches were failing me. It wasn’t until I keyed in on a couple of differences that I found out that I had improperly used a BPDU filtering command because of a scoping issue. Sure, it only took me two hours of searching to find it after I discovered the problem. But if I hadn’t memorized the BPDU filtering and guard commands and their behavior I wouldn’t have even known to ask about them. So it’s super important to know how every minutia of every protocol works, right?

Presenting the Right Questions

Not exactly. We, as human computers, memorize the answers to more efficiently search through our database to find the right answers. If the problem takes 5 minutes to present we can eliminate a bunch of causes. If it’s happening in layer 3 and not layer 2 we can toss out a bunch of other stuff. Our knowledge is allowing us to discard useless possibilities and focus on the right result.

And it’s horribly inefficient. I can attest to that given my various attempts to learn OSPF hello and dead timers through osmosis of falling asleep in my big CCNP Routing book. The answers don’t crawl off the page and into your brain no matter how loudly you snore into it. So I spent hours learning something that I might use two or three times in my career. There has to be a better way.

Not coincidentally, that’s where the AI-driven systems from Mist, and now Juniper, come into play. Marvis is wonderful at looking at symptoms and finding potential causes. It’s what we do as humans. Except Marvis has no inherent biases. It also doesn’t misremember the values for a given protocol or get confused about whether OSPF point-to-point networks are broadcast or not. Marvis just knows what it was programmed with. But it does learn.

Learning is the key to how these AI and machine learning (ML) driven systems have to operate. People tend to discount solutions because they think there’s no way it could be that solution this time. For example, a haiku:

It’s not DNS.
Could it be DNS?
It was DNS.

DNS is often the cause of our problems even if we usually discount it out of hand in the first five minutes of troubleshooting. Even if it were the culprit 50% of the time, we would still toss out DNS as a root cause early on because we’ve “trained” our brains to know what a DNS problem looks like without realizing how many things DNS can really affect.

But AI and ML don’t make these false correlations. Instead, they learn every time what the cause was. They can look at the network and see the failure state, present options based on the symptoms, and even if you don’t check in your changes they can analyze the network and figure out what change caused everything to start working again. Now, the next time the problem crops up, a system like Marvis can present you with a list of potential solutions with confidence levels. If DNS is at the top of the list, you might want to look into DNS first.
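To make that learning loop concrete, here’s a toy sketch of the idea. It’s my own illustration, not Marvis or any vendor’s actual API: it simply keeps score of which root cause actually resolved a given symptom and ranks future suggestions by that history instead of by gut feeling.

```python
from collections import Counter, defaultdict

# Toy illustration only, not any vendor's product. It records which root cause
# actually fixed a given symptom and ranks future suggestions by that history.

class CauseRanker:
    def __init__(self):
        # symptom -> Counter of root causes that actually resolved it
        self.history = defaultdict(Counter)

    def record_resolution(self, symptom, root_cause):
        """Learn from every incident, even the 'it was DNS again' ones."""
        self.history[symptom][root_cause] += 1

    def suggest(self, symptom):
        """Return (cause, confidence) pairs, highest confidence first."""
        causes = self.history[symptom]
        total = sum(causes.values())
        if total == 0:
            return []
        return [(cause, count / total) for cause, count in causes.most_common()]

ranker = CauseRanker()
for _ in range(5):
    ranker.record_resolution("name resolution slow", "DNS misconfiguration")
ranker.record_resolution("name resolution slow", "duplex mismatch")

print(ranker.suggest("name resolution slow"))
# [('DNS misconfiguration', 0.833...), ('duplex mismatch', 0.166...)]
```

If DNS keeps ending up at the top of that list, the system will keep telling you so, no matter how sure you are that it couldn’t possibly be DNS this time.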

AI is going to make us all better troubleshooters because it’s going to make us all less reliant on poor memory. Instead of misremembering how a protocol should be configured, AI and ML will tell us how it should look. If something is causing routing loops, or if layer 2 issues are happening because of duplex mismatches, we’ll be able to see that quickly and have confidence it’s the right answer instead of just guessing and throwing things at the wall until they stick. Just like Google has supplanted the Cliff Clavins at the bar who are storehouses of useless knowledge, so too will AI and ML reduce our dependence on know-it-alls that may not have all the answers.


Tom’s Take

I’m ready to be forgetful. I’m tired of playing “stump the chump” in troubleshooting, with the network playing the part of the stumper and me playing the chump. I’ve memorized more useless knowledge than I ever care to recall in my life. But I don’t want to have to do the work any longer. Instead, I want to apply my gifts to training algorithms with more processing power than I have to do all the heavy lifting. I’m more than happy to look at DNS and EIGRP timers rather than try to remember whether MTU and reliability are part of the K-values for this network.

In Defense of Support

We’re all in IT. We’ve done our time in the trenches. We’ve…seen things, as Roy Batty might say. Things you wouldn’t believe. But in the end we all know the pain of trying to get support for something that we’re working on. And we know how painful that whole process can be. Yet, how is it that support is universally “bad” in our eyes?

One Of Us

Before we launch into this discussion, I’ll give you a bit of background on me. I did inbound tech support for Gateway Computers for about six months at the start of my career. So I wasn’t supporting enterprises to start with but I’ve been about as far down in the trenches as you can go. And that taught me a lot about the landscape of support.

The first thing you have to realize is that most Tier 1 support people are, in fact, not IT nerds. They don’t have a degree in troubleshooting OSPF, nor are they signatories to the Fibre Channel standards. They are generally regular people. They get a week or two of training and off they go. In general, the people on the other end of the support phone number are better phone people than they are technical people. The trainers at my old job told me, “We can teach someone to be technical. But it’s hard to find someone who is pleasant on the phone.”

I don’t necessarily agree with that statement but it seems to be the way that most companies staff their support lines. Why? Ask yourself a quick question: How many times has Tier 1 support solved your issue? For most of us the answer is probably “never” or “extremely rarely”. That’s because we, as IT professionals, have already spent a large amount of our time and energy doing the job of a Tier 1 troubleshooter. We pull device logs and look at errors and Google every message we can find until we hit a roadblock. Then it’s time to call. By that point, we’re already well past what Tier 1 can offer.

Look at it from the perspective of the company offering the support. If the majority of people calling me are already past the point of where my Tier 1 people can help them, why should I invest in those people? If all they’re going to do is take information down and relay the call to Tier 2 support how much knowledge do they really need? Why hire a rock star for a level of support that will never solve a problem?

In fact, this is basically the job of Tier 1 support. If it’s not a known issue, like an outage or a massive bug that is plastered up everywhere, they’re going to collect the basic information and get ready to pass the call to the next tier of support. Yes, it sucks to have to confirm serial numbers and warranty status with someone that knows less about VLANs than you do. But if you were in the shoes of the Tier 2 technician would you want to waste 10 minutes of the call verifying all that info yourself? Or would you rather put your talent to use to get the problem solved?

Think about all the channel partners and certification benefits that let you bypass Tier 1 support. They’re all focused on the idea that you know enough about the product to get to the next level to help you troubleshoot. But they’re also implying that you know the customer’s device is in warranty and that you need to pull log files and other data before you open a ticket so there’s no wasted time. But you still need to take the time to get the information to the right people, right? Would you want to start troubleshooting a problem with no log files and only a very basic description of the issue? And consider that half the people out there just put “Doesn’t Work” or “Acting Weird” for the problem instead of something more specific.

Sight Beyond Sight

This whole mess of gathering info ahead of time is one of the reasons why I’m excited about the coming of network telemetry and network automation. That’s because you now have access to all the data you need to send along to support and you don’t need to remember to collect it. If you’ve ever logged into a switch and run the show tech-support command you probably recall with horror the console spam that was spit out for minutes upon minutes. Worse yet, you probably remember that you forgot to redirect that spam to a file on the system to capture it all to send along with your ticket request.
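Until telemetry handles this for us, the workaround is to capture the output programmatically so there’s nothing to forget. Here’s a short sketch that pulls show tech-support straight into a file instead of letting it scroll past the console. Netmiko is just my assumed tooling here, and the device details are placeholders.

```python
# Sketch: capture "show tech-support" to a file for attachment to a ticket.
# Netmiko is an assumption on my part; swap in whatever automation you use.
from netmiko import ConnectHandler

device = {
    "device_type": "cisco_ios",   # placeholder platform
    "host": "192.0.2.10",         # documentation address, not a real switch
    "username": "admin",
    "password": "change-me",
}

conn = ConnectHandler(**device)
# show tech-support can run for a long time, so give it a generous timeout
# (read_timeout is the Netmiko 4.x name for this knob).
output = conn.send_command("show tech-support", read_timeout=600)
conn.disconnect()

with open("switch1-show-tech.txt", "w") as f:
    f.write(output)

print(f"Captured {len(output)} bytes of show tech output for the ticket")
```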

Commands like show tech-support are the kinds of things that network telemetry is going to solve. They’re all-in-one monsters that provide way too much data to Tier 2 support techs. If the data they need is in there somewhere it’s better to capture it all and spend hours poring through a text file than talk to the customer about the specific issue, right?

Now, imagine the alternative. A system that notices a ticket has been created and pulls the appropriate log files and telemetry. The system adds the important data to the ticket and gets it ready to send to the vendor. Now, instead of needing to provide all the data to the Tier 1 folks, you can open a ticket at the right level to get someone working on the problem. Mean Time to Resolution goes down and everyone is happy. Yes, that does mean there are going to be fewer Tier 1 techs taking calls from people who don’t have a way to collect the data ahead of time, but that’s the issue with automating a task away from someone. Truth be told, they would be able to find a different job doing the same thing in a different industry. Or they would see the value of learning how to troubleshoot and move up the ladder to Tier 2.
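Here’s a rough sketch of what that glue could look like. Every function below is hypothetical: collect_telemetry(), attach_to_ticket(), and open_vendor_case() stand in for whatever your monitoring stack and the vendor’s ticketing system actually expose. The point is the shape of the workflow, where the ticket triggers the data gathering instead of a human remembering to do it.

```python
# Hypothetical glue code: every function here stands in for whatever your
# monitoring stack and the vendor's ticketing API actually provide.

def collect_telemetry(device):
    """Pretend stand-in for pulling logs, counters, and config off a device."""
    return {
        "logs": f"<syslog excerpt from {device}>",
        "interface_counters": f"<counters from {device}>",
        "running_config": f"<sanitized config from {device}>",
    }

def attach_to_ticket(ticket_id, artifacts):
    """Pretend stand-in for your ticketing system's attachment API."""
    print(f"ticket {ticket_id}: attached {', '.join(artifacts)}")

def open_vendor_case(ticket_id, severity, summary):
    """Pretend stand-in for the vendor's case-creation API."""
    print(f"ticket {ticket_id}: opened {severity} vendor case: {summary}")

def on_ticket_created(ticket_id, device, summary, severity="P3"):
    """When a ticket is opened, gather the data a Tier 2 tech would want."""
    artifacts = collect_telemetry(device)
    attach_to_ticket(ticket_id, artifacts)
    # With the evidence already attached, the case can start at the right tier.
    open_vendor_case(ticket_id, severity, summary)

on_ticket_created(
    ticket_id="INC-1042",
    device="branch-rtr1",
    summary="OSPF adjacency flapping between branch and core",
)
```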

However, having telemetry gathered automatically still isn’t enough for some. The future is removing the people from the chain completely. If the system can gather telemetry and open tickets with the vendor what’s stopping it from doing the same proactively? We already have systems designed to do things like mail hard disk drives to customers when the storage array is degraded and a drive needs to be replaced. Why can’t we extend things like that to other parts of IT? What if Tier 2 support called you for a change and said, “Hey, we noticed your routing protocol has a big convergence issue in the Wyoming office. We think we have a fix so let’s remote in and fix it together.” Imagine that!


Tom’s Take

The future isn’t about making jobs go away. It’s about making everything more efficient. I don’t want to see Tier 1 support people lose their jobs, but I also know that most people already consider those jobs pointless. No one wants to deal with the info takers and call routers. So we need to find a way to get the right information to the people that need it with a minimum of effort. That’s how you make support work for the IT people again. And when that day comes and technology has caught up to the way that we use support, perhaps we won’t need to defend them anymore.