About networkingnerd

Tom Hollingsworth, CCIE #29213, is a former network engineer and current organizer for Tech Field Day. Tom has been in the IT industry since 2002, and has been a nerd since he first drew breath.

Magical Mechanics

If you’re a fan of this blog, you’ve probably read my last post about the new SD-WAN magic quadrant that’s been making the rounds and generating discussion. Some people are smiling that this report places Cisco in an area other than leadership in the SD-WAN space. Others are decrying the report as being unfair and contradictory. I wanted to take another look at it given some new information and some additional thoughts on the results.

Fair and Square

The first thing I wanted to do is make sure that I was completely transparent with the way the Gartner Magic Quadrant (MQ) works. I have a very good idea thanks to a conversation with Andrew Lerner (@Fast_Lerner), who is the Research VP of Networking at Gartner. Andrew was nice enough to clarify my understanding of the MQ and accompanying documentation. I’ll quote him here to make sure I don’t get anything wrong:

In an MQ, we assess the overall vendors’ behavior and offering in the market. Product, service/support sales, marketing, innovation, etc. if a vendor has multiple products in a market and sells them regularly to the enterprise, they are part of the MQ assessment. Viable products are not “excluded”.

As you can see from Andrew’s explanation, the MQ takes into account all the aspects of a company. It’s not just a product. It’s the sales, marketing, and other aspects of the company that give the overall score for a company. So how does Gartner figure out how products and services? That’s where their Critical Capabilities documents come into play. They are focused exclusively on products and services. They don’t take marketing or sales or anything else into account.

According to Andrew, when Gartner did their Critical Capabilities document on Cisco, they looked at Meraki MX and IOS-XE only. Viptela vEdge was not examined. So, the CC documents give us the Gartner overview of technology behind the MQ analysis. While the CC documents are focused solely on the Meraki MX and IOS-XD SD-WAN technology, they are components of the overall analysis in the MQ that was published.

What does that all mean in the long run?

Axis and Allies

In order to break this down a bit further, let’s ignore the actual quadrants in the MQ for the moment. Instead, let’s think about this picture as a graph. One axis of the graph is “Ability to Execute,” or in other words can the company do what they say they’re going to do? The other axis is “Completeness of Vision.” This is a question of how the company has rounded out their understanding of the market forces and direction. I want to use this sample MQ with some labels thrown in to help my readers understand how each axis can affect the ranking a company gets:

So, let’s look at where Cisco was situated on those graphs. They are not in the upper right part of the graph, which is the “good” part according to what most people will tell you when they glance at it. Cisco was ranked almost directly in the middle of the graph. Why would that be?

Let’s look at the Execution axis. Why would Cisco have some issues with execution on SD-WAN? Well, the biggest one is probably the shift to IOS-XE and the issues with code quality. Almost everyone involved deploying IOS-XE has told me that Cisco had significant issues in the early releases. Daniel Dib (@DanielDibSwe) had a great conversation with me about the breakdown between the router code and the controller code a week ago. Here’s the first tweet in the chain:

So, there were issues that have been addressed. But is the code completely stable right now? I can’t say, since I haven’t deployed it. But ask around and see what the common wisdom is. I would be genuinely interested to hear how much better things have gotten in the last six months. But, those code quality issues from months ago are going to be a concern in the report. And you can’t guarantee that every box is going to be running on the latest code. Having issues with stable code is going to impact your ability to execute on your vision.

Now, let’s look at the Completeness of Vision axis. If you reference the above picture you’ll see that the Challengers square represents companies that don’t yet understand the market direction. Given that this was the location that Cisco was placed in (barely), let’s examine why that might be. Let’s start by asking “which Cisco product is best for my SD-WAN solution?” Do you know for sure? Which one are you going to be offered?

In my last post, I said that Cisco had been deemphasizing vEdge deployments in favor of IOS-XE. But I completely forgot about Meraki as an additional offering for SMBs and smaller deployments. Where does Meraki make the most sense? And how big does your network need to be before it outgrows a Meraki deployment? Are you chasing a feature? Or maybe you need a service that isn’t offered natively on the Meraki platform? All of these questions need to be answered when you look at what you’re going to do with SD-WAN.

The other companies that Cisco will tell you are their biggest competitors are probably VMware and Silver Peak. How many SD-WAN platforms do they sell? How about companies ranked closer to Cisco in the MQ like Citrix, CloudGenix, or Versa Networks? How many SD-WAN solutions do they offer? How about HPE, Juniper and Aryaka?

In almost every case, the answer is “one”. Each of these companies have settled on a single solution for SD-WAN or SD-Branch. They don’t split those deployments across different product lines. They may have different sized boxes for things, but they all run the same common software. Can you integrate an IOS-XE SD-WAN appliance into a Meraki deployment? Can you take a Meraki MX and make it work with vEdge?

You may be starting to see that the completeness of Cisco’s vision isn’t lacking in SD-WAN but instead how they’re going to accomplish it. Rather than having one solution that can be scaled to fit all needs, Cisco is choosing to offer two different solutions for SMBs and enterprises. And if you count vEdge as a separate product from IOS-XE, as Cisco has suggested in some of their internal reports, then you have three products! I’m not saying that Cisco doesn’t have a vision. But it really looks like that vision is hazier than it should be.

If Cisco had a unified vision with stable code that was integrated up and down the stack I have no doubts they would have been rated higher. When Cisco was deploying Viptela vEdge as the sole SD-WAN offering they had it was much easier to figure out how everything was going to integrate together. But, just like all transitions, the devil is in the details here as Cisco tries to move to IOS-XE. Code quality is going to be a potential source of problems no matter what. But if you are staking your reputation on moving everyone to a single code base from a different more stable one you had better get it right quickly. Otherwise you’re going to get it counted against you.


Tom’s Take

I really appreciate Andrew Lerner for reaching out regarding my analysis of the MQ. I will admit I wasn’t exactly spot on with the differences between the MQ and the Critical Capabilities documents. But the results are close. The CC analyzes Cisco’s big platforms and tells people how they work. The MQ takes everything as a whole and gives Cisco a ranking of where they stand in the market. Sales numbers aside, do you think Cisco should be a leader in the SD-WAN space? Do you feel that where they are today with IOS-XE and Meraki MX is a complete vision? If you do then you likely won’t care about the MQ either way. But for those that have questions about execution and vision, maybe it’s time to figure out what might have caused Cisco to be right in the middle this time around.

SD-WAN Squares and Perplexing Planes

The latest arcane polygon is out in the SD-WAN space. Normally, my fortune telling skills don’t involve geometry. I like to talk to real people about their concerns and their successes. Yes, I know that the gardening people do that too. It’s just that no one really bothers to read their reports and instead makes all their decisions based on boring wall art.

Speaking of which, I’m going to summarize that particular piece of art here. Note this isn’t the same for copyright reasons but close enough for you to get the point:

4D8DB810-3618-44EA-8AA2-99EB7EAA3E45

So, if you can’t tell by the colors here, the big news is that Cisco has slipped out of the top Good part of the polygon and is now in the bottom Bad part (denoted by the red) and is in danger of going out of business and being the laughing stock of the networking community. Well, no, not so much that last part. But their implementation has slipped into the lower part of the quadrant where first-stage startups and cash-strapped companies live and wish they could build something.

Cisco released a report rebutting those claims and it talks about how Viptela is a huge part of their deployments and that they have mindshare and marketshare. But after talking with some colleagues in the networking industry and looking a bit deeper into the issue, I think Cisco has the same problem that Boeing had this year. Their assurances are inconsistent with reality.

A WANderful World

When you deploy Cisco SD-WAN, what exactly are you deploying? In the past, the answer was IWAN. IWAN was very much an initial attempt to create SD-WAN. It wasn’t perfect and it was soon surpassed by other technologies, many of which were built by IWAN team members that left to found their own startups. But Cisco quickly realized they needed to jump in the fray and get some help. That’s when the bought Viptela.

Here’s where things start getting murky. The integration of Viptela SD-WAN has been problematic. The initial sales model after the acquisition was to continue to sell Viptela vEdge to the customer. The plan was then to integrate the software on a Cisco ISR, then to integrate the Viptela software into IOS-XE. They’ve been selling vEdge boxes for a long while and are supporting them still. But the transition plan to IOS-XE is in full effect. The handwriting on the wall is that soon Cisco will only offer IOS-XE SD-WAN for sale, not vEdge.

Flash forward to 2019. A report is released about the impact and forward outlook for SD-WAN. It has a big analyst name attached to it. In that report, they leave out the references to vEdge and instead grade Cisco on their IOS-XE offering which isn’t as mature as vEdge. Or as deployed as vEdge. Or, as stated by even Cisco, as stable as vEdge, at least at first. That means that Cisco is getting graded on their newest offering. Which, for Cisco, means they can’t talk about how widely deployed and stable vEdge is. Why would this particular analyst firm do that?

Same is Same

The common wisdom is that Gartner graded Cisco on the curve that the sales message is that IOS-XE is the way and the future of SD-WAN at Cisco. Why talk about what came before when the new hot thing is going to be IOS-XE? Don’t buy this old vEdge when this new ISR is the best thing ever.

Don’t believe me? Go call your Cisco account team and try to buy a vEdge. I bet you get “upsold” to an ISR. The path forward is to let vEdge fade away. So it only makes sense that you should be grading Cisco on what their plans are, not on what they’ve already sold. I’ll admit that I don’t put together analyst reports with graphics very often, so maybe my thinking is flawed. But if I call your company and they won’t sell me the product that you want me to grade you on, I’m going to guess I shouldn’t grade you on it.

So Cisco is mad that Gartner isn’t grading them on their old solution and is instead looking at their new solution in a vacuum. And since they can’t draw on the sales numbers or the stability of their existing solution that you aren’t going to be able to buy without a fight, they slipped down in the square to a place that doesn’t show them as the 800-lb gorilla. Where have a heard something like this before?

Plane to See

The situation that immediately sprung to mind was the issue with Boeing’s 737-MAX airliner. In a nutshell, Boeing introduced a new airliner with a different engine and configuration that changed its flight profile. Rather than try to get the airframe recertified, which could take months or years, they claimed it was the same as the old one and just updated training manuals. They also didn’t disclose there was a new software program that tried to override the new flight characteristics and that particular “flaw” caused the tragic crashes of two flights full of people.

I’m not trying to compare analyst reports to tragedies in any way. But I do find it curious that companies want their new stuff to be treated just like their old stuff with regards to approvals and regulation and analysis. They know that the new product is going to have issues and concerns but they don’t want you to count those because remember how awesome our old thing was?

Likewise, they don’t want you to count the old problems against them. Like the Pinto or the Edsel from Ford, they don’t want you to think their old issues should matter. Just look at how far we’ve come! But that’s not how these things work. We can’t take the old product into account when trying to figure out how the new one works. We can’t assume the new solution is the same as the old one without testing, no matter how much people would like us to do that. It’s like claiming the Space Shuttle was as good as the Saturn V rocket because it went into space and came from NASA.

If your platform has bugs in the first version, those count against you too. You have to work it all out and realize people are going to remember that instability when they grade your product going forward. Remember that common wisdom says not to install an operating system update until the first service patch. You have to consider that reputation every time you release a patch or an update.


Tom’s Take

The SD-WAN MQ tells me that Cisco isn’t getting any more favors from the analysts. The marketing message not lining up with the install base is the heart of a transition away from hardware and software that Cisco doesn’t own. The hope that they could just smile and shrug their shoulders and hope that no one caught on has been dashed. Instead, Cisco now has to realize their going to have to earn that spot back through good code releases and consistent support and licensing. No one is going to give them any slack with SD-WAN like they have with switches and unified communications. If Cisco thinks that they’re just going to be able to bluff their way through this product transition, that idea just won’t fly.

Fast Friday- Perry Mason Moments

It’s the Thanksgiving holiday weekend in the US which means lots of people discussing things with their relatives. And, as is often the case, lots of arguments. It’s the nature of people to have a point of view and then to want to defend it. And it’s not just politics or other divisive topics. We see it all the time in networking too.

EIGRP vs OSPF. Cisco vs Juniper. ACI vs NSX. You name it and we’ve argued about it. Every viewpoint has a corresponding counterpart. Yes, there are good points for using one versus the other. But there are also times when every piece of factual information doesn’t matter because we “know” the right answer.

It’s those times when we run into what I call the “Perry Mason Problem”. It’s a reminder of the old Perry Mason TV show when the lawyer in the title would win a case with a carefully crafted statement that just ends any arguments. It’s often called a Wham Line or an Armor-Piercing Question. Basically, Mr. Mason would ask a question or make a statement that let all the air out of the argument. And often it would result in him winning the case without any further discussion.

Most people that argue keep searching for that magic Perry Mason moment. They want to win and they want to do it decisively. They want to leave their opponent speechless. They want to drop a microphone and walk away victorious.

And yet, that almost never happens in real life. No one is swayed by a single statement. No one changes their mind because of a question no matter how carefully crafted. Sure, it can add to them deciding to change their mind in the long run. But that process happens in the person. It doesn’t happen because of someone else.

So, if you find yourself in the middle of heated discussion don’t start looking for Perry Mason moments. Instead, you need to think about why you’re trying to change someone’s mind. Instead of trying to win an argument it’s better to consider your position and understand why you’re trying to win. I bet a little introspection will do a lot more good than looking for that wham line. You may not necessarily agree with their perspective but I bet you’ll have an easier time living with that than trying to change the mind of someone that’s set against it.

Five Minutes To Magic Time

Have you ever worked with someone that has the most valuable time in the world? Someone that counts each precious minute in their presence as if you’re keeping them from something very, very important that they could use to solve world hunger or cure cancer? If you haven’t then you’re a very lucky person indeed. Sadly, almost everyone, especially those in IT, has had the misfortune to be involved with someone whose time is more precious than platinum-plated saffron.

That’s not to say that we should be wasting the time of those we work with. Simple things like being late to meetings or not having your materials prepared are easy ways to help reduce the time of meetings or to make things run smoothly. Those items are common courtesies that should be extended to all the people you meet, from the cashier that takes your order at a fast food establishment to the most powerful people on the planet. No, this is about something deeper and more insidious.

No Time For Hugs

I’ve seen the kind of behavior I’ve described very often in the higher echelons of companies. People that live at the CxO level often have very little time to devote to anything that resembles thought. They’re busy strategizing and figuring out ways to keep the company profitable. They don’t have time to listen to people talk. Talking interrupts their brain functions. They need time to think.

If you think I’m being hyperbolic, ask yourself how many times you’ve been told to “simplify” something for a CEO when you present to them (if you’re even given the opportunity). I think this strip from Dilbert explains it succinctly:

https://dilbert.com/strip/1998-01-10

The higher up the food chain you go, the simpler it needs to be. But if the CEO is the most important person in the company, how is it that you need to make things easy for them to understand? They aren’t morons, right? They got this job somehow?

The insinuation is that the reason why you need to make it simple for them is because their time is too valuable. Needless talking and discussion takes them away from things that are more important. Like thinking and strategizing. Or something. So, if their time is the most value in the room, what does that say about your time? How does it feel to know that your efforts and research and theorizing are essentially wasted work because your time isn’t as important as the person you’re talking to?

This is even more egregious when you realize that your efforts to summarize something down to the most basic level is often met with a lot of questions about how you determined that conclusion. In essence, all the hard work you did to simplify your statements is undone because someone wants you to justify why you go to that conclusion. You know, the kinds of details you would have given in a presentation if you’d been given the time to explain!

Solution: Five Minute Meetings

Okay, so I know I’m going to get flack for this one. Everyone has the solution the meeting overload problem. Standup meetings, team catch ups, some other kind of crazy treadmill conference calls. But the real way to reduce your meeting stress is to show people how valuable time is for everyone. Not just them.

My solution: All meetings with CxO level people are now five minutes long. Period. End of story. You get to walk in, give your statement, and you walk out. No questions. No long Q&A. Just your conclusions. You say what you have to say and move on.

Sounds stupid, doesn’t it? And that’s kind of the point. When you are forced to boil your premise down to something like the Dilbert smiley face above, you’re doing yourself a disservice. All the detail and nuance goes right out the window. The only way you get to bring it back out is if someone in the room starts asking questions. And if you don’t give enough detail they almost always will. Which defeats the purpose of boiling it down in the first place!

Instead, push it back on the CxOs with the most valuable time. Make them see how hard you work. By refusing to answer any of their follow up questions. You see, if their time is so valuable, you need to show them how much you respect it. If they have follow up questions or require more details, they need to write all those interrogatories down in an email or an action item list and send it to you so you can get it done on your time. Make them wait for the answers. Because then they’ll see that this idea that their time is valuable is just an illusion.

It sounds awfully presumptuous of me to say that we need to waste the time the C-level suite. But a little bit of pushback goes a long way. Imagine how furious they’ll be when you walk out of the meeting after five minutes and don’t answer a single question. How dare this knowledge worker not bend their calendar to my desire to learn more?!? It’s ridiculous!

How about wondering how ridiculous it is for this person to limit your time? Or to not know anything ahead of time about the topic of discussion? Imagine telling someone to wait until you’re ready to talk to them after a meeting starts because you are more important than they are! The nerve!

However, once you stick to your plan a few times the people in the room will understand that meetings about topics should be as long as they need to be. And you should be given enough time to explain up front instead of talking for five minutes and getting interrupted with a thousand questions that you were prepared to answer anyway if they’d just given you the chance to present!

Watch how your meetings transform from interrogation scenes to actual presentations with discussions. Instead of only getting five minutes to talk you’ll be accorded all the time you need to fill in the details. Maybe you only needed ten minutes in the first place. But the idea is that now your time and expertise is just as valuable as everyone else on the team, from the bottom all the way to the top.


Tom’s Take

There needs to be an obligatory “no everyone is like this” disclaimer. I’ve met some very accommodating executives. And I’ve also met some knowledge workers that can’t present their way out of a paper bag. But the way to fix those issues is to make them get better at giving info and at listening to presentations. The way is not to artificially limit time to make yourself seem more important. When you give your people the time they need to get you the info you need, you’ll find that they are going to answer the questions you have a lot quicker than waiting with dread as the CEO takes the time to think about what they were going to be told anyway.

AI and Trivia

questions answers signage

Photo by Pixabay on Pexels.com

I didn’t get a chance to attend Networking Field Day Exclusive at Juniper NXTWORK 2019 this year but I did get to catch some of the great live videos that were recorded and posted here. Mist, now a Juniper Company, did a great job of talking about how they’re going to be extending their AI-driven networking into the realm of wired networking. They’ve been using their AI virtual assistant, named “Marvis”, for quite a while now to solve basic wireless issues for admins and engineers. With the technology moving toward the copper side of the house, I wanted to talk a bit about why this is important for the sanity of people everywhere.

Finding the Answer

Network and wireless engineers are walking storehouses of useless trivia knowledge. I know this because I am one. I remember the hello and dead timers for OSPF on NBMA networks. I remember how long it takes BGP to converge or what the default spanning tree bridge priority is for a switch. Where some of my friends can remember the batting average for all first basemen in the league in 1971, I can instead tell you all about LSA types and the magical EIGRP equation.

Why do we memorize this stuff? We live in a world with instant search at our fingertips. We can find anything we might need thanks to the omnipotent Google Search Box. As long as we can avoid sponsored results and ads we can find the answer to our question relatively quickly. So why do we require people to memorize esoteric trivia? Is it so we can win free drinks at the bar after we’re done troubleshooting?

The problem isn’t that we have to know the answer. It’s that we need to know the answer in order to ask the right question. More often than now we find ourselves stuck in the initial phase of figuring out the problem. The results are almost always the same – things aren’t working. Finding the cause isn’t always easy though. We have to find some nugget of information to latch onto in order to start the process.

One of my old favorites was trying to figure out why a network I was working with had a segmented spanning tree. One side of the network was working just fine but there were three switches daisy chained together that didn’t. Investigations turned up very little. Google searches were failing me. It wasn’t until I keyed in on a couple of differences that I found out that I had improperly used a BPDU filtering command because of a scoping issue. Sure, it only took me two hours of searching to find it after I discovered the problem. But if I hadn’t memorized the BDPU filtering and guard commands and their behavior I wouldn’t have even known to ask about them. So it’s super important to know how every minutia of every protocol works, right?

Presenting the Right Questions

Not exactly. We, as human computers, memorize the answers to more efficiently search through our database to find the right answers. If the problem takes 5 minutes to present we can eliminate a bunch of causes. If it’s happening in layer 3 and not layer 2 we can toss out a bunch of other stuff. Our knowledge is allowing us to discard useless possibilities and focus on the right result.

And it’s horribly inefficient. I can attest to that given my various attempts to learn OSPF hello and dead timers through osmosis of falling asleep in my big CCNP Routing book. The answers don’t crawl off the page and into your brain no matter how loudly you snore into it. So I spent hours learning something that I might use two or three times in my career. There has to be a better way.

Not coincidentally, that’s where the AI-driven systems from Mist, and now Juniper, come into play. Marvis is wonderful at looking at symptoms and finding potential causes. It’s what we do as humans. Except Marvis has no inherent biases. It also doesn’t misremember the values for a given protocol or get confused about whether or not OSPF point-to-point networks are broadcast or not. Marvis just knows what it was programmed with. But it does learn.

Learning is the key to how these AI and machine learning (ML) driven systems have to operate. People tend to discount solutions because they think there’s no way it could be that solution this time. For example, a haiku:

It’s not DNS.
Could it be DNS?
It was DNS.

DNS is often the cause of our problems even if we usually discount it out of hand in the first five minutes of troubleshooting. Even if it was only DNS 50% of the time we would still toss DNS as the root cause within the first five minutes because we’ve “trained” our brains to know what a DNS problem looks like without realizing how many things DNS can really affect.

But AI and ML don’t make these false correlations. Instead, they learn every time what the cause was. They can look at the network and see the failure state, present options based on the symptoms, and even if you don’t check in your changes they can analyze the network and figure out what change caused everything to start working again. Now, the next time the problem crops up, a system like Marvis can present you with a list of potential solutions with confidence levels. If DNS is at the top of the list, you might want to look into DNS first.

AI is going to make us all better troubleshooters because it’s going to make us all less reliant on poor memory. Instead of misremembering how a protocol should be configure, AI and ML will tell us how it should look. If something is causing routing loops or if layer 2 issues are happening because of duplex mismatches we’ll be able to see that quickly and have confidence it’s the right answer instead of just guessing and throwing things at the wall until they stick. Just like Google has supplanted the Cliff Claven people at the bar that are storehouses of useless knowledge, so too will AI and ML reduce our dependence on know-it-alls that may not have all the answers.


Tom’s Take

I’m ready to be forgetful. I’m tired of playing “stump the chump” in troubleshooting with the network playing the part of the stumper and me playing the chump. I’ve memorized more useless knowledge than I ever care to recall in my life. But I don’t want to have to do the work any longer. Instead, I want to apply my gifts to training algorithms with more processing power than me to do all the heavy lifting. I’m more than happy to look at DNS and EIGRP timers than try to remember if MTU and reliability are part of the K-values for this network.

The Future of Hidden Features

 

948AD2EC-79D1-4828-AF55-C71EA8715771You may have noticed last week that Ubiquiti added a new “feature” to their devices in a firmware updated. According to this YouTube video from @TomLawrenceTech, Ubiquiti built an new service that contacts a URL to “phone home” and check in with their servers. It got some heavy discussion going, especially on Reddit.

The consensus is that Ubiquiti screwed up here by not informing people they were adding the feature up front and also not allowing users to opt-out initially. The support people at Ubiquiti even posted a quick workaround of blocking the URL at a perimeter firewall to prevent the communications until they could patch in the option to opt-out. If this was an isolated incident I could see some manner of outcry about it, but the fact of the matter is that companies are adding these hidden features more and more every day.

The first issue comes from the fact that most release notes for apps any more are nothing aside from platitudes. “Hey, we fixed some bugs and stuff so turn on automatic updates so you get the best version of our stuff!” is somewhat common now when it comes to a list of changes. That has a lot to do with applications developers doing unannounced A/B testing with people. It’s not uncommon to have two identical version numbers of an app running two wildly different code releases or having dissimilar UIs just because some bean counter wants to know how well Feature X polls with a certain demographic.

Slipstreaming Stuff

While that’s all well and good for consumer applications, the trend is starting to seep into enterprise software as well. While we still get an exhaustive list of things that have been fixed in a release, we’re getting more than we bargained for on occasion as well. Doing a scrub of code to ensure that it actually fixes the bugs that you have in your environment is important for reliability purposes. When the quality of the code you’re trying to publish is of less-than-stellar quality, you have have to spend more time ensuring that the stated fixes aren’t going to cause issues during implementation.

But what about when the stated features aren’t the only things included? Could you imagine the nightmare of installing a piece of software to fix an issue only to find out that something that was hidden in the code, like a completely undocumented feature, caused some other issue elsewhere? And when I say “undocumented feature” I’m not using it as a euphemism for another bug. I’m talking about a service that got installed that no one knows about. Like a piece of an app that will be enabled at a later time for instance.

Remember Microsoft back in the early 90s? They were invincible. They had everything they wanted in the palm of their hand. And then their world exploded. They got sued by the US government in an anti-trust case. One of the things that came out of this lawsuit was a focus on undocumented features in Microsoft software. The government said that Microsoft was adding features to their application software that gave them performance advantages over other vendors. And only they knew about them.

Microsoft agreed to get rid of all undocumented features, including some of the Easter eggs that were hidden in their software. That irritated some people that enjoyed finding these fun little gifts. But in 2005, there was a great blog post that discussed why Microsoft has a strict “no Easter eggs” policy now. And reading it almost 15 years later makes it sound so prescient.

Security Sounds Simple, Right?

Undocumented features are a security risk. If there is code in your app that isn’t executing or hasn’t been completely checked against the interactions in your environment you’re adding a huge amount of risk. There can be interactions that someone doesn’t understand or that you have no way of seeing. And you can better believe that if the regulators of your industry find them before you do you’re going to have a lot of explaining to do.

That’s even if you notice it in the first place. How many times have you been told to whitelist a specific URL or IP stack to make something work? I can remember being told to whitelist 17.0.0.0/8 for Apple to make their push notification service work. That’s an awful lot of IP addresses to enable push notifications! And that’s a lot of data that could be corrupted or misused by a smart threat actor.

The solution we need to fix the problem is actually pretty simple. We need to push back against the idea that we can slip undocumented features into updates. Release notes need to list everyting that comes in the software package. We need to tell our vendors and companies that we have to have a full listing of the software features. And regulatory bodies need to be ready to share the blame when someone breaks those rules instead of punishing the people that had no idea there was something in the code that was misbehaving.


Tom’s Take

I’m not a fan of finding things I wasn’t expecting. At one point my wife and I had the latest Facebook app updates and somehow her UI looked radically different than mine. But if it’s on my phone or my home computer I don’t really have much to complain about. Finding undocumented apps and features in my enterprise software is a huge issue though. Security is paramount and undocumented code is an entry point to disaster. We are the only ones that can stem the tide by pushing back against this practice now before it becomes commonplace in the future.

In Defense of Support

We’re all in IT. We’ve done our time in the trenches. We’ve…seen things, as Roy Batty might say. Things you wouldn’t believe. But in the end we all know the pain of trying to get support for something that we’re working on. And we know how painful that whole process can be. Yet, how is it that support is universally “bad” in our eyes?

One Of Us

Before we launch into this discussion, I’ll give you a bit of background on me. I did inbound tech support for Gateway Computers for about six months at the start of my career. So I wasn’t supporting enterprises to start with but I’ve been about as far down in the trenches as you can go. And that taught me a lot about the landscape of support.

The first thing you have to realize is that most Tier 1 support people are, in fact, not IT nerds. They don’t have a degree in troubleshooting OSPF or are signatories to the fibre channel standards. They are generally regular people. They get a week or two of training and off they go. In general the people on the other end of the support phone number are better phone people than they are technical people. The trainers at my old job told me, “We can teach someone to be technical. But it’s hard to find someone who is pleasant on the phone.”

I don’t necessarily agree with that statement but it seems to be the way that most companies staff their support lines. Why? Ask yourself a quick question: How many times has Tier 1 support solved your issue? For most of us the answer is probably “never” or “extremely rarely”. That’s because we, as IT professionals, have spent a large amount of our time and energy doing the job of a Tier 1 troubleshooter already. We pull device logs and look at errors and Google every message we can find until we hit a roadblock. Then it’s time to call. However, you’re already well past what Tier 1 can offer.

Look at it from the perspective of the company offering the support. If the majority of people calling me are already past the point of where my Tier 1 people can help them, why should I invest in those people? If all they’re going to do is take information down and relay the call to Tier 2 support how much knowledge do they really need? Why hire a rock star for a level of support that will never solve a problem?

In fact, this is basically the job of Tier 1 support. If it’s not a known issue, like an outage or a massive bug that is plastered up everywhere, they’re going to collect the basic information and get ready to pass the call to the next tier of support. Yes, it sucks to have to confirm serial numbers and warranty status with someone that knows less about VLANs than you do. But if you were in the shoes of the Tier 2 technician would you want to waste 10 minutes of the call verifying all that info yourself? Or would you rather put your talent to use to get the problem solved?

Think about all the channel partners and certification benefits that let you bypass Tier 1 support. They’re all focused on the idea that you know enough about the product to get to the next level to help you troubleshoot. But they’re also implying that you know the customer’s device is in warranty and that you need to pull log files and other data before you open a ticket so there’s no wasted time. But you still need to take the time to get the information to the right people, right? Would you want to start troubleshooting a problem with no log files and only a very basic description of the issue? And consider that half the people out there just put “Doesn’t Work” or “Acting Weird” for the problem instead of something more specific.

Sight Beyond Sight

This whole mess of gathering info ahead of time is one of the reasons why I’m excited about the coming of network telemetry and network automation. That’s because you now have access to all the data you need to send along to support and you don’t need to remember to collect it. If you’ve ever logged into a switch and run the show tech-support command you probably recall with horror the console spam that was spit out for minutes upon minutes. Worse yet, you probably remember that you forgot to redirect that spam to a file on the system to capture it all to send along with your ticket request.

Commands like show tech-support are the kinds of things that network telemetry is going to solve. They’re all-in-one monsters that provide way too much data to Tier 2 support techs. If the data they need is in there somewhere it’s better to capture it all and spend hours poring through a text file than talk to the customer about the specific issue, right?

Now, imagine the alternative. A system that notices a ticket has been created and pulls the appropriate log files and telemetry. The system adds the important data to the ticket and gets it ready to send to the vendor. Now, instead of needing to provide all the data to the Tier 1 folks you can instead open a ticket at the right level to get someone working on the problem. Mean Time to Resolution goes down and everyone is happy. Yes, that does mean that there are going to be fewer Tier 1 techs taking calls from people that don’t have a way to collect the data ahead of time but that’s the issue with automating a task away from someone. If the truth be told they would be able to find a different job doing the same thing in a different industry. Or they would see the value of learning how to troubleshoot and move up the ladder to Tier 2.

However, having telemetry gathered automatically still isn’t enough for some. The future is removing the people from the chain completely. If the system can gather telemetry and open tickets with the vendor what’s stopping it from doing the same proactively? We already have systems designed to do things like mail hard disk drives to customers when the storage array is degraded and a drive needs to be replaced. Why can’t we extend things like that to other parts of IT? What if Tier 2 support called you for a change and said, “Hey, we noticed your routing protocol has a big convergence issue in the Wyoming office. We think we have a fix so let’s remote in and fix it together.” Imagine that!


Tom’s Take

The future isn’t about making jobs go away. It’s about making everything more efficient. I don’t want to see Tier 1 support people lose their jobs but I also know that most people consider their jobs pointless already. No one wants to deal with the info takers and call routers. So we need to find a way to get the right information to the people that need it with a minimum of effort. That’s how you make support work for the IT people again. And when that day comes and technology has caught up to the way that we use support, perhaps we won’t need to defend them anymore.