About networkingnerd

Tom Hollingsworth, CCIE #29213, is a former network engineer and current organizer for Tech Field Day. Tom has been in the IT industry since 2002, and has been a nerd since he first drew breath.

Fast Friday Thoughts From Security Field Day

It’s a busy week for me thanks to Security Field Day but I didn’t want to leave you without some thoughts that have popped up this week from the discussions we’ve been having. Security is one of those topics that creates a lot of thought-provoking ideas and makes you seriously wonder if you’re doing it right all the time.

  • Never underestimate the value of having plumbing that connects all your systems. You may look at a solution and think to yourself “All this does is aggregate data from other sources”. Which raises the question: How do you do it now? Sure, antivirus fires alerts like a car alarm. But when you get breached and find out that those alerts caught it weeks ago you’re going to wish you had a better idea of what was going on. You need a way to send that data somewhere to be dealt with and cataloged properly. This is one of the biggest reasons why machine learning is being applied to the massive amount of data we gather in security. Having an algorithm sift out the important pieces means you don’t miss the things that matter to you. A minimal sketch of what that kind of plumbing might look like follows this list.
  • Not every solution is going to solve every problem you have. My dishwasher does a horrible job of washing my clothes or vacuuming my carpets. Is it the fault of the dishwasher? Or is it my issue with defining the problem? We need to scope our issues and our solutions appropriately. Just because my kitchen knives can open a package in a pinch doesn’t mean that the makers need to include package-opening features in a future release because I use them exclusively for that purpose. Once we start wanting the vendors to build a one-stop-shop kind of solution we’re going to create the kind of technical debt that we need to avoid. We also need to remember to scope problems so that they’re solvable. Postulating that there are corner cases with no clear answers is important for threat hunting or policy creation. Not so great when shopping through a catalog of software.
  • Every term in every industry is going to have a different definition based on who is using it. A knife to me is either a tool used on a campout or a tool used in a kitchen. Others see a knife as a tool for spreading butter or even doing surgery. It’s a matter of perspective. You need to make sure people know the perspective you’re coming from before you decide that the tool isn’t going to work properly. I try my best to put myself in the shoes of others when I’m evaluating solutions or use cases. Just because I don’t use something in a certain way doesn’t mean it can’t be used that way. And my environment is different from everyone else’s. Which means best practices are really just recommended suggestions.
  • Whatever acronym you’ve tied yourself to this week is going to change next week because there’s a new definition of what you should be doing according to some expert out there. Don’t build your practice on whatever is hot in the market. Build it on what you need to accomplish and incorporate elements of new things into what you’re doing. The story of people ripping and replacing working platforms because of an analyst suggestion sounds horrible but happens more often than we’d like to admit. Trust your people, not the brochures.
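To make that first bullet a little more concrete, here’s a minimal Python sketch of what that plumbing might look like: alerts from two different tools normalized into one record format so something downstream (a SIEM, an analyst, or a model) can actually work with them. The sources, field names, and values are all invented for the example.

```python
# A toy "plumbing" layer: pull alerts from two hypothetical tools into one shape.
# Field names and sources are invented for illustration.

def normalize_av_alert(raw):
    # Hypothetical antivirus alert format.
    return {
        "source": "antivirus",
        "timestamp": raw["detected_at"],
        "host": raw["machine"],
        "severity": raw["threat_level"],
        "summary": raw["threat_name"],
    }

def normalize_firewall_alert(raw):
    # Hypothetical firewall log entry.
    return {
        "source": "firewall",
        "timestamp": raw["time"],
        "host": raw["src_ip"],
        "severity": "high" if raw["action"] == "blocked" else "low",
        "summary": f"{raw['action']} {raw['dst_ip']}:{raw['dst_port']}",
    }

def aggregate(alerts):
    # One common shape means one place to search, sort, and eventually score.
    return sorted(alerts, key=lambda a: a["timestamp"])

if __name__ == "__main__":
    av = {"detected_at": "2021-11-05T14:02:00Z", "machine": "laptop-42",
          "threat_level": "medium", "threat_name": "EICAR test file"}
    fw = {"time": "2021-11-05T14:03:10Z", "src_ip": "10.0.0.12",
          "dst_ip": "203.0.113.7", "dst_port": 443, "action": "blocked"}
    for alert in aggregate([normalize_av_alert(av), normalize_firewall_alert(fw)]):
        print(alert)
```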

Tom’s Take

Security changes faster than any area that I’ve seen. Cloud is practically a glacier compared to EPP, XDR, and SOPV. I could even make up an acronym and throw it on that list and you might not even notice. You have to stay current but you also have to trust that you’re doing all you can. Breaches are going to happen no matter what you do. You have to hope you’ve done your best and that you can contain the damage. Remember that good security comes from asking the right questions instead of just plugging tools into the mix to solve issues you don’t have.

Choosing the Least Incorrect Answer

My son was complaining to me the other day that he missed one question on a multiple choice quiz in his class and he got a low B grade instead of getting a perfect score. When I asked him why he was frustrated he told me, “Because it was easy and I missed it. But I think the question was wrong.” As usual, I pressed him further to explain his reasoning and found out that the question was indeed ambiguous but the answer choices were pretty obviously wrong all over. He asked me why someone would write a test like that. Which is how he got a big lesson on writing test questions.

Spin the Wheel

When you write a multiple choice test question for any reputable exam you are supposed to pick “wrong” answers, known as distractors, that ensure that the candidate doesn’t have a better than 25% chance of guessing the correct answer. You’ve probably seen this before because you took some kind of simple quiz that had answers that were completely wrong to the point of being easy to pick out. Those quizzes are usually designed to be passed with the minimum amount of effort.

This also extends to a question that includes answer choices that are paired. If you write a question that says “pick the three best answers” with six options that are binary pairs you’re basically saying to the candidate “Pick between these two three times and you’re probably going to get it right”. I’ve seen a number of these kinds of questions over the years and it feels like a shortcut to getting one on the house.

The most devious questions come from the math side of the house. Some of my friends have been known to write questions for their math tests and purposely work the problem wrong at a critical point to get a distractor that looks very plausible. Make the same mistake and you’re going to see your result sitting right there in the choices and get it wrong. The extra effort here matters because if you see too many students getting the same wrong distractor as the answer you know that there may be confusion about the process at that critical point. Also, the effort to make math question distractors look plausible is impressive and way too time consuming.

Why Is It Wrong?

Compelling distractors are a requirement for any sufficiently advanced testing platform. The professionals that write the tests understand that guessing your way through a multiple choice exam is a bad precedent and the whole format needs to be fair. The secret to getting the leg up on these exams is more than just knowing the right answer. It’s about knowing why things are wrong.

Take an easy example: OSPF LSAs. A question may point you at a particular router in a diagram and ask which LSAs it sees. If the answer choices are fairly configured you’re going to be faced with some plausible looking answers. Say the question is about a not-so-stubby area (NSSA). If you know the specifics of what makes this area unique you can start eliminating choices from the question. What if it’s asking about which LSAs are not allowed? Well, if you forgot the answer to that you can start by reading the answer choices and applying logic.

You can usually improve your chances of getting a question right by figuring out why the answers given are wrong for the question. In the above example, if LSA Type 1 is listed as an answer choice ask yourself “Why is this the wrong answer?” For the question about disallowed LSA types you can eliminate this choice because LSA Type 1 is always present inside an area. For a question about visibility of that LSA outside of an area you’d be asking a different question. But if you know that Type 1 LSAs are local and always visible you can cross off that as a potential answer. That means you boosted your chances of guessing the answer to 33%!

The question itself is easy if you know that NSSAs use Type 7 LSAs to convey information because Type 5 LSAs aren’t allowed. But if you understand why the other answers are wrong for the question asked you can also check your work. Why would you want to do that? Because the wording of the question can trip you up. How many times have you skimmed the question looking for keywords and missing things like “not” or “except”? If you work the question backwards looking for why answers are wrong and you keep coming up with them being right you may have read the question incorrectly in the first place. Likewise, if every answer is wrong somehow you may have a bad question on your hands.
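If it helps to see that elimination process written out, here’s a small Python sketch that encodes which LSA types an NSSA permits and uses it to strike answer choices. It’s a study aid built around an invented question format, not anything a router runs.

```python
# LSA types permitted inside an OSPF not-so-stubby area (NSSA):
# Type 1 (Router), Type 2 (Network), Type 3 (Summary), and Type 7 (NSSA External).
# Type 4 (ASBR Summary) and Type 5 (AS External) are not allowed in an NSSA.
NSSA_ALLOWED = {1, 2, 3, 7}

def eliminate(question_wants_disallowed, answer_choices):
    """Strike the choices we can rule out and return the survivors."""
    survivors = []
    for lsa_type in answer_choices:
        allowed = lsa_type in NSSA_ALLOWED
        # If the question asks for disallowed types, anything allowed gets struck,
        # and vice versa.
        if question_wants_disallowed != allowed:
            survivors.append(lsa_type)
    return survivors

# Invented exam question: "Which of these LSA types is NOT allowed in an NSSA?"
choices = [1, 3, 5, 7]
print(eliminate(True, choices))   # [5] -- Types 1, 3, and 7 are all legal, so strike them
```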

What happens if the question is poorly worded and all the answer choices are wrong? Well, that’s when you get to pick the least incorrect answer and leave feedback. It’s not about picking the perfect answer in these situations. You have to know that a lot of hands touch test questions and there are times when things are rewritten and the intent can be changed somehow. If you know that you are dealing with a question that is ambiguous or flat-out wrong you should leave feedback in the question comments so it can be corrected. But you still have to answer the question. So, use the above method to find the piece that is the least incorrect and go with that choice. It may not be “right” according to the test question writer, but if enough people pick that answer you’re going to see someone taking a hard look at the question.


Tom’s Take

We are going to take a lot of tests in our lives. Multiple choice tests are easier but require lots of work, both on the part of the writer and the taker. It’s not enough to just memorize what the correct answers are going to be. If you study hard and understand why the distractors are incorrect you’ll have a more complete understanding of the material and you’ll be able to check your work as you go along. Given that most certification exams don’t allow you to go back and change answers once you’ve moved past the question the ability to check yourself in real time gives you an advantage that can mean the difference between passing and retaking the exam. And that same approach can help you when everything on the page looks wrong.

What Can You Learn From Facebook’s Meltdown?

I wanted to wait to put out a hot take on the Facebook issues from earlier this week because failures of this magnitude always have details that come out well after the actual excitement is done. A company like Facebook isn’t going to do the kind of in-depth post-mortem that we might like to see but the amount of information coming out from other areas does point to some interesting circumstances causing this situation.

Let me start off the whole thing by reiterating something important: Your network looks absolutely nothing like Facebook. The scale of what goes on there is unimaginable to the normal person. The average person has no conception of what one billion looks like. Likewise, the scale of the networking that goes on at Facebook is beyond the ken of most networking professionals. I’m not saying this to make your network feel inferior. More that I’m trying to help you understand that your network operations resemble those at Facebook in the same way that a model airplane resembles a space shuttle. They’re alike on the surface only.

Facebook has unique challenges that they have to face in their own way. Network automation there isn’t a bonus. It’s a necessity. The way they deploy changes and analyze results doesn’t look anything like any software we’ve ever used. I remember moderating a panel that had a Facebook networking person talking about some of the challenges they faced all the way back in 2013:

That technology that Najam Ahmad is talking about is two or three generations removed from what is being used today. They don’t manage switches. They manage racks and rows. They don’t buy off-the-shelf software to do things. They write their own tools to scale the way they need them to scale. It’s not unlike a blacksmith making a tool for a very specific use case that would never be useful to any non-blacksmith.

Ludicrous Case Scenarios

One of the things that compounded the problems at Facebook was the inability to see what the worst case scenario could bring. The little clever things that Facebook has done to make their lives easier and improve reaction times ended up harming them in the end. I’ve talked before about how Facebook writes things from a standpoint of unlimited resources. They build their data centers as if the network will always be available and bandwidth is an unlimited resource that never has contention. The average Facebook programmer likely never lived in a world where a dial-up modem was high-speed Internet connectivity.

To that end, the way they build the rest of their architecture around those presumptions creates the possibility of absurd failure conditions. Take the report of the door entry system. According to reports part of the reason why things were slow to come back up was because the door entry system for the Facebook data centers wouldn’t allow access to the people that knew how to revert the changes that caused the issue. Usually, the card readers will retain their last good configuration in the event of a power outage to ensure that people with badges can access the system. It could be that the ones at Facebook work differently or just went down with the rest of their network. But whatever the case the card readers weren’t allowing people into the data center. Another report says that the doors didn’t even have the ability to be opened by a key. That’s the kind of planning you do when you’ve never had to break open a locked door.

Likewise, I find the situation with the DNS servers to be equally crazy. Per other reports the DNS servers at Facebook are constantly monitoring connectivity to the internal network. If that goes down for some reason the DNS servers withdraw the BGP routes being advertised for the Facebook AS until the issue is resolved. That’s what caused the outage from the outside world. Why would you do this? Sure, it’s clever to basically have your infrastructure withdraw the routing info in case you’re offline to ensure that users aren’t hammering your system with massive amounts of retries. But why put that decision in the hands of your DNS servers? Why not have some other more reliable system do it instead?
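Based on those reports, the control loop boils down to “if the health check fails, withdraw the prefixes.” A toy Python version of that logic might look like the sketch below. The probe target, prefixes, and announce/withdraw functions are stand-ins I made up; the real thing would be driving an actual BGP speaker such as ExaBGP or BIRD.

```python
import socket
import time

PREFIXES = ["192.0.2.0/24"]          # placeholder anycast DNS prefix (documentation range)
BACKBONE_PROBE = ("10.255.0.1", 53)  # hypothetical internal target that proves the backbone is up

def backbone_reachable(timeout=2):
    """Crude reachability probe; the real health check would be far more involved."""
    try:
        with socket.create_connection(BACKBONE_PROBE, timeout=timeout):
            return True
    except OSError:
        return False

def announce(prefixes):
    # Stand-in for telling the local BGP speaker to advertise the routes.
    print(f"announce {prefixes}")

def withdraw(prefixes):
    # Stand-in for withdrawing the routes -- the step that took Facebook off the Internet.
    print(f"withdraw {prefixes}")

if __name__ == "__main__":
    advertised = True
    while True:
        healthy = backbone_reachable()
        if advertised and not healthy:
            withdraw(PREFIXES)       # one failed check and the whole site disappears
            advertised = False
        elif healthy and not advertised:
            announce(PREFIXES)
            advertised = True
        time.sleep(10)
```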

I get that the mantra at Facebook has always been “fail fast” and that their architecture is built in such a way as to encourage individual systems to go down independently of others. That’s why Messenger can be down but the feed stays up or why WhatsApp can have issues but you can still use Instagram. However, why was there no test of “what happens when it all goes down?” It could be that the idea of the entire network going offline is unthinkable to the average engineer. It could also be that the response to the whole network going down all at once was to just shut everything down anyway. But what about the plan for getting back online? Or, worse yet, what about all the things that impacted the ability to get back online?

Fruits of the Poisoned Tree

That’s where the other part of my rant comes into play. It’s not enough that Facebook didn’t think ahead to plan on a failure of this magnitude. It’s also that their teams didn’t think of what would be impacted when it happened. The door entry system. The remote tools used to maintain the networking equipment. The ability for anyone inside the building to do anything. There was no plan for what could happen when every system went down all at once. Whether that was because no one knew how interdependent those services were or because no one could think of a time when everything would go down all at once is immaterial. You need to plan for the worst and figure out what dependencies look like.

Amazon learned this the hard way a few years ago when US-East-1 went offline. No one believed it at the time because the status dashboard still showed green lights. The problem? The dashboard was hosted in the zone that went down and the lights couldn’t change! That problem was remedied soon afterwards but it was a chuckle-worthy issue for sure.

Perhaps it’s because I work in an area where disasters are a bit more common but I’ve always tried to think ahead to where the issues could crop up and how to combat them. What if you lose power completely? What if your network connection is offline for an extended period? What if the same tornado that takes out your main data center also wipes out your backup tapes? It might seem a bit crazy to consider these things but the alternative is not having an answer in the off chance it happens.

In the case of Facebook, the question should have been “what happens if a rogue configuration deployment takes us down?” The answer better not be “roll it back” because you’re not thinking far enough ahead. With the scale of their systems it isn’t hard to create a change to knock a bunch of it offline quickly. Most of the controls that are put in place are designed to prevent that from happening but you need to have a plan for what to do if it does. No one expects a disaster. But you still need to know what to do if one happens.

Thus Endeth The Lesson

What we need to take away from this is that our best intentions can’t defeat the unexpected. Most major providers were silent on the schadenfreude of the situation because they know they could have been the one to suffer from it. You may not have a network like Facebook but you can absolutely take away some lessons from this situation.

You need to have a plan. You need to have a printed copy of that plan. It needs to be stored in a place where people can find it. It needs to be written in a way that people that find it can implement it step-by-step. You need to keep it updated to reflect changes. You need to practice for disaster and quit assuming that everything will keep working correctly 100% of the time. And you need to have a backup plan for everything in your environment. What if the doors seal shut? What if the person with the keys to unlock the racks is missing? How do we ensure the systems don’t come back up in a degraded state before they’re ready? The list is endless but that’s only because you haven’t started writing it yet.


Tom’s Take

There is going to be a ton of digital ink spilled on this outage. People are going to ask questions that don’t have answers and pontificate about how it could have been avoided. Hell, I’m doing it right now. However, I think the issues that compounded the problems are ones that can be addressed no matter what technology you’re using. Backup plans are important for everything you do, from campouts to dishwasher installations to social media websites. You need to plan for the worst and make sure that the people you work with know where to find the answers when everything fails. This is the best kind of learning experience because so many eyes are on it. Take what you can from this and apply it where needed in your enterprise. Your network may not look anything like Facebook, but with some planning now you don’t have to worry about it crashing like theirs did either.

Chip Shortages Aren’t Sweet for Networking

Have you tried to order networking gear recently? You’re probably cursing because the lead times on most everything are getting long. It’s not uncommon to see lead times on wireless access points or switch gear reaching 180 days or more. Reports from the Internet say that some people are still waiting to get things they ordered this spring. The prospect of rapid delivery of equipment is fading like the summer sun.

Why are we here? What happened? And can we do anything about it?

Fewer Chips, More Air

The pandemic has obviously had the biggest impact for a number of reasons. When a fabrication facility shuts down it doesn’t just ramp back up. Even when all the workers are healthy and the city where it is located is open for business it takes weeks to bring everything back online to full capacity. Just like any manufacturing facility you can’t just snap your fingers and get back to churning out the widgets.

The pandemic has also strained supply chains around the world. Even if the fabs had stayed open this entire time you’d be looking at a shortage of materials to make the equipment. Global supply chains were running extremely lean in 2019 and exposing one aspect of them has created a cascade effect that has caused stress everywhere. The lack of toilet paper or lunchmeat in your grocery store shows that. Even when the supply is available the ability to deliver it is impacted.

The supply chain problem also extends to the other side of the shipping container. Even if the fabs had enough chips to sell to anyone that wanted them it’s hard to get those parts delivered to the companies that make things. If this were simply an issue of a company not getting the materials it needed to make a widget in a reasonable time there wouldn’t be as much of an issue. But because these companies make things that other companies use to make things the hiccups in the chain are exacerbated. If TSMC is delayed by a month getting a run of chips out, that month-long delay only increases for those down the line.

We’ve got issues getting facilities back online. We’ve got supply chains causing problems all over the place. Simple economics says we should just build more facilities, right? The opportunity cost of not having enough production capacity means there’s plenty of room to build more of the things we need and profit from it. And you’re right. Companies like Intel are bringing new fabs online as fast as they can. Sadly, that is a process that is measured in months or even years. The capacity we need to offset the disruption to the chip market should have been built two years ago if we wanted it ready now.

All of these factors are mixed into one simple truth. Without the materials, manufacturing, or supply chain to deliver the equipment we’re going to be left out in the cold if we want something delivered today. Just in Time inventory is about to become Somewhere in Time inventory. We’re powerless to change the supply chain. Does that mean we’re powerless to prevent disruption to our planning process?

Proactive Processes

We may not be able to assemble networking gear ourselves to speed up the process but we are far from helpless. The process and the planning around gear acquisition and deployment has to change to reflect the current state of the world. We can have an impact provided we’re ready to lead by example.

  • Procure NOW: Purchasing departments are notorious for waiting until the last minute to buy things. Part of that reasoning is that expenditures are worth less in the future than right now because those assets are more valuable today gaining interest or something. You need to go to the purchasing department and educate them about how things are working right now. Instead of them sitting on the project for another few months you need to tell them that the parts have to be ordered right now in order for them to be delivered in six or seven months. They’re going to fight you and tell you that they can just wait. However, we all know this isn’t going to clear up any time soon. If they persist in trying to tell you that you need to wait just have them try to go car shopping to illustrate the issue. If you want stuff by the end of Q1 2022, you need to get that order in NOW.
  • Preconfigure Things However You Can: If you’re stuck waiting six months to get switches and access points, are you going to be stuck waiting another month after they come in to configure them? I hope that answer is a resounding “NO”. There are resources available to make sure you can get things configured now so you’re not waiting when the equipment is sitting on a loading dock somewhere. You need to reach out to your VAR or your vendor and get some time on lab gear in the interim. If you ordered a wireless controller or a data center switch you can probably get some rack time on a very similar device or even the exact same one in a lab somewhere. That means you can work on a basic configuration or even provision things like VLANs or SSIDs so you’re not reinventing the wheel when things come in. Even if all you have is a skeleton config you’re hours ahead of where you would be otherwise (see the sketch after this list for what that might look like). And if the VAR or vendor gives you a hard time about lab gear you can always remind them that there are other options available for the next product refresh.
  • Minimum Viable Functionality: All this advice is great for a new pod or an addition to an existing network that isn’t critical. What if the gear you ordered is needed right now? What if this project can’t wait? How can you make things work today with nothing in hand? This is a bit trickier because it will require duplicate work. If you need to get things operational today you need to work with what you have today. That means you may have to salvage an old lab switch or pull something out of production and reduce available ports until the gear can arrive. It also means you’re going to have to back up the old configs, erase them completely (don’t forget about the VLAN database and VTP server configurations), and then put on the new info. When the new equipment comes in you’re going to have to do it all over again in reverse. It’s more work but it leads to things being operational today instead of constantly telling someone that it’s going to be a while. If you’re a VAR that’s doing this for a customer, you’d better make it very clear this is temporary and just a loan. Otherwise you might find your equipment being a permanent addition even after everything comes in.
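Here’s the kind of skeleton config generator I have in mind, as a small Python sketch. The syntax is generic IOS-style and every value is invented; the point is that the configuration file can exist months before the box does.

```python
# Hypothetical site data -- swap in whatever your design actually calls for.
SITE = {
    "hostname": "idf1-sw01",
    "vlans": {10: "USERS", 20: "VOICE", 30: "MGMT"},
    "mgmt_vlan": 30,
    "mgmt_ip": "10.30.0.11 255.255.255.0",
}

def skeleton_config(site):
    """Render a bare-bones, IOS-style config that's ready to paste when the gear arrives."""
    lines = [f"hostname {site['hostname']}", "!"]
    for vlan_id, name in sorted(site["vlans"].items()):
        lines += [f"vlan {vlan_id}", f" name {name}", "!"]
    lines += [
        f"interface Vlan{site['mgmt_vlan']}",
        f" ip address {site['mgmt_ip']}",
        " no shutdown",
        "!",
    ]
    return "\n".join(lines)

if __name__ == "__main__":
    print(skeleton_config(SITE))
```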

Tom’s Take

The chip shortage is one of those things that’s going to linger under the best of circumstances. We’re going to be pressed to get gear in well into 2022. That means delayed projects and lots of arguing about what’s critical and what’s not. We can’t fix the semiconductor sector of the market but we can work to make sure that the impact felt there is the only one that impacts us right now. The more we do ahead of time to make things smooth the better it will be when it’s finally time to make things happen. Don’t let the lack of planning on the part of the supply chain sour your outlook on doing your role in networking.

Private 5G Needs Complexity To Thrive

I know we talk about the subject of private 5G a lot in the industry but there are more players coming out every day looking to add their voice to the growing supporters of these solutions. And despite the fact that we tend to see 5G and Wi-Fi technologies as ships in the night this discussion isn’t going to go away any time soon. In part it’s because decision makers aren’t quite savvy enough to distinguish between the bands, thinking all wireless communications are pretty much the same.

I think we’re not going to see much overlap between these two technologies. But the reasons why aren’t quite what you might think.

Walking Workforces

Working from anywhere other than the traditional office is here to stay. Every major Silicon Valley company has looked at the cost benefit analysis and decided to let workers do their thing from where they live. How can I tell it’s permanent? Because they’re reducing salaries for those that choose to stay away from the Bay Area. That carrot is pretty enticing and for the companies to say that it’s not on the table for remote work going forward means they have no incentive to make people want to move to work from an office.

Mobile workers don’t care about how they connect. As long as they can get online they are able to get things done. They are the prime use case for 5G and Private 5G deployments. Who cares about the Wi-Fi at a coffee shop if you’ve got fast connectivity built in to your mobile phone or tablet? Moreover, I can also see a few of the more heavily regulated companies requiring you to use a 5G uplink to connect to sensitive data through a VPN or other technology. It eliminates some of the issues with wireless protection methods and ensures that no one can easily snoop on what you’re sending.

Mobile workers will start to demand 5G in their devices. It’s a no-brainer for it to be in the phone and the tablet. As laptops go it’s a smart decision at some point, provided enough people have swapped over to using tablets by then. I use my laptop every day when I work but I’m finding myself turning to my iPad more and more. Not for any magical reason but because it’s convenient if I want to work from somewhere other than my desk. I think that when laptops hit a wall from a performance standpoint you’re going to see a lot of manufacturers start to include 5G as a connection option to lure people back to them instead of abandoning them to the big tablet competition.

However, 5G is really only a killer technology for these more complex devices. The cost of a 5G radio isn’t inconsequential to the overall cost of a device. After all, Apple raised the price of their iPad when they included a 5G radio, didn’t they? You could argue that they didn’t when they upgraded the iPhone to a 5G chipset but the cellular technology is much more integral to the iPhone than the iPad. As companies examine how they are going to move forward with their radio technology it only makes sense to put the 5G radios in things that have ample space, appropriate power, and the ability to recover the costs of including the chips. It’s going to be much more powerful but it’s also going to be a bigger portion of the bill of materials for the device. Higher selling prices and higher margins are the order of the day in that market.

Reassuringly Expensive IoT

One of the drivers for private 5G that I’ve heard of recently is the drive to have IoT sensors connected over the protocol. The thinking goes that the number of devices that are going to be deployed is going to create a significant amount of traffic in a dense area that is going to require the controls present in 5G to ensure they aren’t creating issues. I would tend to agree but with a huge caveat.

The IoT sensors that people are talking about here aren’t the ones that you might think of in the consumer space. For whatever reason people tend to assume IoT is a thermostat or a small device that does simple work. That’s not the case here. These IoT devices aren’t things that you’re going to be buying one or two at a time. They are sensors connected to a larger system. Think HVAC relays and probes. Think lighting sensors or other environmental tech. You know what comes along with that kind of hardware? Monitoring. Maintenance. Subscription costs.

The IoT that is going to take advantage of private 5G isn’t something you’re going to be deploying yourself. Instead, it’s going to be something that you partner with another organization to deploy. You might “own” the tech in the sense that you control the data but you aren’t going to be the one going out to Best Buy or Tech Data to order a spare. Instead, you’re going to pay someone to deploy it and fix it when it goes wrong. So how does that differ from the IoT thermostat that comes to mind? Price. Those sensors are several hundred dollars each. You’re paying for the technology included in them with that monthly fee to monitor and maintain them. They will talk to the base station in the building or somewhere nearby and relay that data back to your dashboard. Perhaps it’s on-site or, more likely, in a cloud instance somewhere. All those fees mean that the devices become more complex and can absorb the cost of more complicated radio technology.

What About Wireless?

Remember when wireless was something cool that you had to show off to people that bought a brand new laptop? Or the thrill of seeing your first iPhone connect to attwifi at Starbucks instead of using that data plan you paid so dearly to get? Wireless isn’t cool any more. Yes, it’s faster. Yes, it is the new edge of our world. But it’s not cool. In the same way that Ethernet isn’t cool. Or web browsers aren’t cool. Or the internal combustion engine isn’t cool. Wi-Fi isn’t cool any more because it is necessary. You couldn’t open an office today without having some form of wireless communications. Even if you tried I’m sure that someone would hop over to the nearest big box store and buy a consumer-grade router to get wireless working before the paint was even dry on the walls.

We shouldn’t think about private 5G replacing Wi-Fi because it never will. There will be use cases where 5G makes much more sense, like in high-density deployments or in areas where the contention in the wireless spectrum is just too great to make effective use of it. However, not deploying Wi-Fi in favor of deploying private 5G is a mistake. Wireless is the perfect “set it and forget it” technology. Provide an SSID for people to connect to and then let them go crazy. Public venues are going to rely on Wi-Fi for the rest of time. These places don’t have the kind of staff necessary to make private 5G economical in the long run.

Instead, think of private 5G deployments more like the way that Wi-Fi used to be. It’s an option for devices that need to be managed and controlled by the organization. They need to be provisioned. They need to consume cycles to operate properly. They need to be owned by the company and not the employee. Private 5G is more of a play for infrastructure. Wi-Fi is the default medium given the wide adoption it has today. It may not be the coolest way to connect to the network but it’s the one you can be sure is up and running without the need for the IT department to come down and make it work for you.


Tom’s Take

I’ll admit that the idea of private 5G makes me smile some days. I wish I had some kind of base station here at my house to counteract the horrible reception that I get. However, as long as my Internet connection is stable I have enough wireless coverage in the house to make the devices I have work properly. Private 5G isn’t something that is going to displace the installed base of Wi-Fi devices out there. With the amount of management that 5G requires in devices you’re not going to see a cheap or simple method of deploying it appear any time soon. The pie-in-the-sky vision of having pervasive low power deployments in cheap devices is not going to be realistic in the near future. Instead, think of private 5G as something that you need to use when your other methods won’t work or when someone you are partnering with to deploy new technology requires it. That way you won’t be caught off-guard when the complexity of the technology comes into play.

APIs and Department Stores

This week I tweeted something from a discussion we had during Networking Field Day that summed up my feelings about the state of documentation of application programming interfaces (APIs):

I laughed a bit as I wrote it because I’ve worked in department stores like Walmart in the past and I know the reasons why they tend to move things around. Comparing that to the way that APIs are documented is an interesting exercise in how people think about things like new capabilities and notification of changes.

Branding Exercises

In case you weren’t aware, everything in your average department store is carefully planned out. The things placed in the main aisles are decided on weeks in advance due to high traffic. The items placed at the ends of the aisles, or endcaps, are placed there to highlight high margin items or things that are popular enough to be sought out by customers. The makeup of the rest of the store is determined by a lot of metrics.

There are a few restrictions that have to be taken into account. In department stores with grocery departments, the locations of the refrigerated sections must be around the outside because of power requirements. Within those restrictions, the plans put the high traffic items in the back of the store to require everyone to walk past all the other stuff in hopes they might buy it. That’s why the milk and bread and electronics areas are always the furthest away from the front of the store. You’re likely headed there anyway so why not make you work for it?

Every few months the store employees receive new floor plans that move items to different locations. Why would they do that? Well, those metrics help them understand where people are more likely to purchase certain items. Those metrics also tell the planners which items should be located together, which is how the whole aisle is planned out. Once everything gets moved they start gathering new metrics and find out how well their planning works. That is, aside from the inevitable grumbles. Even with some fair warning, no one is happy when they find out something has moved.

Who Needs Documentation?

You might think that, on the surface, there’s not much similarity between a department store aisle and an API. One is a fixture. The other is code. Yet, think about how APIs are typically changed and you might find some of the parallels. Change is a constant in the world of software development, after all.

The APIs that we used a decade ago are almost assuredly different from the ones we program for today. Every year brings updated methods, new functions, and even changes in programming languages or access methods. How can you be sure that developers are accessing the latest and greatest technology that you’ve put into place? You can’t just ask them. Instead, you have to deprecate the methods that you don’t want them to use any longer.
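Deprecation doesn’t have to be a silent rug-pull, though. A minimal Python sketch of the gentler approach, with invented function names, keeps the old call working while loudly pointing callers at its replacement:

```python
import functools
import warnings

def deprecated(replacement):
    """Mark an API function as deprecated and point callers at its replacement."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__}() is deprecated; use {replacement}() instead",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator

@deprecated(replacement="get_device_inventory")
def get_devices():
    # Old behavior stays intact so existing integrations keep running.
    return ["switch01", "switch02"]

def get_device_inventory():
    # The new call the documentation should be steering everyone toward.
    return [{"name": "switch01"}, {"name": "switch02"}]

if __name__ == "__main__":
    warnings.simplefilter("always", DeprecationWarning)
    print(get_devices())   # still works, but emits a DeprecationWarning
```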

Ask any developer writing for an API about deprecation and you’re probably going to hear a string of profanity. Spending time to write a perfectly good piece of software only to have it wrecked by someone’s decision to do things differently is infuriating to say the least. Trying to solve a hard problem with a novel concept is one thing. Having to do it all over again a month later when a new update is released is even more infuriating.

It’s the same fury that you feel when the peanut butter is moved from aisle four to aisle eight. How dare you! It took me a week last time to remember where it was and now you’ve gone and moved it. Just like when I spent all that time learning which methods to query to pull the data I needed for my applications.

No matter how much notice you give or how much you warn people that change is coming they’re always going to be irritated at you for making those changes. It feels like a waste of effort to need to rewrite an interface or to walk a little further in the store to locate the item you wanted. Humans aren’t fond of wasted effort or of needing to learn new things without good reason.

Poor API documentation is only partly to blame for this. Even the most poorly documented API will eventually be mapped out by someone that needs the info. It’s also the fact that the constant change in methods and protocols forces people to spend a significant amount of time learning the same things over and over again for very little gain.

The Light at the End of the Aisle

Ironically enough, both of these kinds of issues are likely to be solved in a similar way. Thanks to the large explosion of people doing their shopping online or with pickup and delivery services there is a huge need to have things more strictly documented and updated very frequently. It’s not enough to move the peanut butter to a better location. Now you need to update your online ordering system so the customers as well as the staff members pulling it for a pickup order can find it quickly and get more orders done in a shorter time.

Likewise, the vast number of programs relying on API calls today necessitates that older versions of functionality are supported for longer or newer functions are more rigorously tested before implementation. You don’t want to break a huge section of your user base because you deprecated something you didn’t want to maintain any longer. Unless you are the only application in the market you will find that creating chaos will just lead to users fleeing for someone that doesn’t upset their apple cart on a regular basis.


Tom’s Take

Documentation is key for us to understand change. We can’t just say we changed something. We have to give warning, ensure that people have seen the warning, tell them we’ve changed it, and then give them some way to transform the old way of things into the new one. And even that might not be enough. However, the pace of change that we’re seeing also means that rapid changes may not even be required for much longer. With people choosing to order online and never step foot inside the store the need to change the shelves frequently may be a thing of the past. With new methods and languages being developed so rapidly today it may be much faster to move everyone to a new API and leave the old one intact instead of forcing developers to look at technology that is years old at this point. The delicious irony of the people forcing change on us needing to accept change themselves is something I’d happily shop for.

Fast Friday – Podcasts Galore!

It’s been a hectic week and I realized that I haven’t had a chance to share some of the latest stuff that I’ve been working on outside of Tech Field Day. I’ve been a guest on a couple of recent podcasts that I loved.

Art of Network Engineering

I was happy to be a guest on Episode 57 of the Art of Network Engineering podcast. AJ Murray invited me to take part with all the amazing co-hosts. We talked about some fun stuff including my CCIE study attempts, my journey through technology, and my role at Tech Field Day and how it came to be that I went from being a network engineer to an event lead.

The interplay between the hosts and me during the discussion was great. I felt like we probably could have gone another hour if we really wanted to. You should definitely take a listen and learn how I kept getting my butt kicked by the CCIE open-ended questions or what it’s like to be a technical person on a non-technical briefing.

IPv6, Wireless, and the Buzz

I love being able to record episodes of Tomversations on YouTube. One of my latest was all about IPv6 and Wi-Fi 6E. As soon as I hit the button to publish the episode I knew I was going to get a call from my friends over at the IPv6 Buzz podcast. Sure enough, I was able to record an episode talking to them all about the parallels I see between the two technologies.

What I love about this podcast is that these are the experts when it comes to IPv6. Ed and Tom and Scott are the people that I would talk to about IPv6 any day of the week. And having them challenge my assertions about what I’m seeing helps me understand the other side of the coin. Maybe the two aren’t as close as I might have thought at first but I promise you that the discussion is well worth your time.


Tom’s Take

I don’t have a regular podcast aside from Tomversations so I’m not as practiced in the art of discussion as the people above. Make sure you check out those episodes but also make sure to subscribe to the whole thing because you’re going to love all the episodes they record.

Getting Blasted by Backdoors


I wanted to take a minute to talk about a story I’ve been following that’s had some new developments this week. You may have seen an article talking about a backdoor in Juniper equipment that caused some issues. The issue at hand is complicated and the linked article does a good job of explaining some of the nuance. Here’s the short version:

  • The NSA develops a version of Dual EC random number generation that includes a pretty substantial flaw.
  • That flaw? If you know the seed value used to start the process you can predict the values it produces, which means you can decrypt any traffic that uses the algorithm. (A toy illustration of why that’s fatal follows this list.)
  • NIST proposes the use of Dual EC and makes supporting it a requirement for vendors that want to be included in future work. Don’t support this one? You don’t even get to be considered.
  • Vendors adopt the standard per the requirement but don’t make it the default for some pretty obvious reasons.
  • Netscreen, a part of Juniper, does use Dual EC as part of their default setup.
  • The Chinese APT 5 hacking group figures out the vulnerability and breaks into Juniper to add code to Netscreen’s OS.
  • They use their own seed value, which allows them to decrypt packets being encrypted through the firewall.
  • Hilarity does not ensue and we spend the better part of a decade figuring out what has happened.
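To see why a predictable generator is fatal, here’s a toy Python illustration. This is not Dual EC, and Python’s built-in generator isn’t cryptographic anyway; the only point is that anyone who knows (or chooses) the seed can regenerate the exact same keystream and undo the encryption.

```python
import random

def keystream(seed, length):
    """Generate a byte stream that is completely determined by the seed."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(length))

def xor_cipher(data, stream):
    return bytes(a ^ b for a, b in zip(data, stream))

message = b"routing table update"
seed = 29213                     # imagine this value is baked into the firmware

ciphertext = xor_cipher(message, keystream(seed, len(message)))

# An attacker who learns (or plants) the seed recovers the plaintext trivially.
recovered = xor_cipher(ciphertext, keystream(seed, len(message)))
print(recovered)                 # b'routing table update'
```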

That any of this even came to light is impressive considering the government agencies involved have stonewalled reporters and it took a probe from a US Senator, Ron Wyden, to get as far as we have in the investigation.

Protecting Your Platform

My readers know that I’m a pretty staunch advocate for not weakening encryption. Backdoors and “special” keys for organizations that claim they need them are a horrible idea. The safest lock is one that can’t be bypassed. The best secret is one that no one knows about. Likewise, the best encryption algorithms are ones that can’t be reversed or calculated by anyone other than the people using them to send messages.

I get that the flood of encrypted communications today is making life difficult for law enforcement agencies all over the world. It’s tempting to make it a requirement to allow them a special code that will decrypt messages to keep us safe and secure. That’s the messaging I see every time a politician wants to compel a security company to create a vulnerability in their software just for them. It’s all about being safe.

Once you create that special key you’ve already lost. As we saw above, the intentions of creating a backdoor into an OS so that we could spy on other people using it backfired spectacularly. Once someone else figured out that you could guess the values and decrypt the traffic they set about doing it for themselves. I can only imagine the surprise at the NSA when they realized that someone had changed the values in the OS and that, while they themselves were no longer able to spy with impunity, someone else could be decrypting their communications at that very moment. If you make a key for a lock someone will figure out how to make a copy. It’s that simple.

We focus so much on the responsible use of these backdoors that we miss the bigger picture. Sure, maybe we can make it extremely difficult for someone in law enforcement to get the information needed to access the backdoor in the name of national security. But what about other nations? What about actors not tied to a political process or bound by oversight from the populace? I’m more scared that someone that actively wishes to do me harm could find a way to exploit something that I was told had to be there for my own safety.

The Juniper story gets worse the more we read into it but they were just the unlucky player left standing when the music stopped. Any one of the other companies that were compelled to include Dual EC by government order could have gotten the short straw here. It’s one thing to create a known-bad version of software and hope that someone installs it. It’s an entirely different matter to force people to include it. I’m honestly shocked the government didn’t try to mandate that it be used to the exclusion of other algorithms. In some other timeline Cisco or Palo Alto or even Fortinet are having very bad days unwinding what happened.


Tom’s Take

The easiest way to avoid having your software exploited is not to create your own exploit for it. Bugs happen. Strange things occur in development. Even the most powerful algorithms must eventually yield to Moore’s Law or Shor’s Algorithm. Why accelerate the process by cutting a master key? Why weaken yourself on purpose by repeating over and over again that this is “for the greater good”? Remember that the greater good may not include people that want the best for you. If you’re willing to hand them a key to unlock the chaos that we’re seeing in this case then you have overestimated your value to the process and become the very bad actor you hoped to stop.

Sharing Failure as a Learning Model

Earlier this week there was a great tweet from my friends over at Juniper Networks about mistakes we’ve made in networking:

It got some interactions with the community, which is always nice, but it got me to thinking about how we solve problems and learn from our mistakes. I feel that we’ve reached a point where we’re learning from the things we’ve screwed up but we’re not passing it along like we used to.

Write It Down For the Future

Part of the reason why I started my blog was to capture ideas that had been floating in my head for a while. Troubleshooting steps or perhaps even ideas that I wanted to make sure I didn’t forget down the line. All of it was important to capture for the sake of posterity. After all, if you didn’t write it down did it even happen?

Along the way I found that the posts that got significant traction on my site were the ones that involved mistakes. Something I’d done that caused an issue or something I needed to look up through a lot of sources that I distilled down into an easy reference. These kinds of posts are the ones that fly right up to the top of the Google search results. They are how people know you. It could be a terminology post like defining trunks. Or perhaps it’s a question about why your SFP modules are working in a switch.

Once I realized that people loved finding posts that solved problems I made sure to write more of them down. If I found a weird error message I made sure to figure out what it was and then put it up for everyone to find. When I documented weird behaviors of BPDUGuard and BPDUFilter that didn’t match the documentation I wrote it all down, including how I’d made a mistake in the way that I interpreted things. It was just part of the experience for me. Documenting my failures and my learning process could help someone in the future. My hope was that someone in the future would find my post and learn from it like I had.

Chit Chat Channels

It used to be that when you Googled error messages you got lots of results from forum sites or Reddit or other blogs detailing what went wrong and how you fixed it. I assume that is because, just like me, people were doing their research and figuring out what went wrong and then documenting the process. Today I feel like a lot of that type of conversation is missing. I know it can’t have gone away permanently because all network engineers make mistakes and solve problems and that knowledge has to end up somewhere, right?

The answer came to me when I read a Reddit post about networking message boards. The suggestions in the comments weren’t about places to go to learn more. Instead, they linked to Slack channels or Discord servers where people talk about networking. That answer made me realize why the discourse around problem solving and learning from mistakes seems to have vanished.

Slack and Discord are great tools for communication. They’re also very private. I’m not talking about gatekeeping or restrictions on joining. I’m talking about the fact that the conversations that happen there don’t get posted anywhere else. You can join, ask about a problem, get advice, try it, see it fail, try something else, and succeed all without ever documenting a thing. Once you solve the problem you don’t have a paper trail of all the things you tried that didn’t work. You just have the best solution that you did and that’s that.

You know what you can’t do with Slack and Discord? Search them through Google. The logs are private. The free tiers remove messages after a while. All that knowledge disappears into thin air. Unlike the Wisdom of the Ancients the issues we solve in Slack are gone as soon as you hit your message limit. No one learns from the mistakes because it looks like no one has made them before.

Going the Extra Mile

I’m not advocating for removing Slack and Discord from our daily conversations. Instead, I’m proposing that when we do solve a hard problem or we make a mistake that others might learn from we say something about it somewhere that people can find it. It could be a blog post or a Reddit thread or some kind of indexable site somewhere.

Even the process of taking what you’ve done and consolidating it down into something that makes sense can be helpful. I saw X, tried Y and Z, and ended up doing B because it worked the best of all. Just the process of how you got to B through the other things that didn’t work will go a long way to help others. Yes, it can be a bit humbling and embarrassing to publish something that admits you made a mistake. But it’s also part of the way that we learn as humans. If others can see where we went and understand why that path doesn’t lead to a solution then we’ve effectively taught others too.


Tom’s Take

It may be a bit self-serving for me to say that more people need to be blogging about solutions and problems and such, but I feel that we don’t really learn from it unless we internalize it. That means figuring it out and writing it down. Whether it’s a discussion on a podcast or a back-and-forth conversation in Discord we need to find ways to get the words out into the world so that others can build on what we’ve accomplished. Google can’t search archives that aren’t on the web. If we want to leave a legacy for the DenverCoder10s of the future that means we do the work now of sharing our failures as well as our successes and letting the next generation learn from us.

The Mystery of Known Issues

I’ve spent the better part of the last month fighting a transient issue with my home ISP. I thought I had it figured out after a hardware failure at the connection point but it crept back up after I got back from my Philmont trip. I spent a lot of energy upgrading my home equipment firmware and charting the seemingly random timing of the issue. I also called the technical support line and carefully explained what I was seeing and what had been done to work on the problem already.

The responses usually ranged from confused reactions to attempts to reset my cable modem, which never worked. It took several phone calls and lots of repeated explanations before I finally got a different answer from a technician. It turns out there was a known issue with the modem hardware! It’s something they’ve been working on for a few weeks and they’re not entirely sure what the ultimate fix is going to be. So for now I’m going to have to endure the daily resets. But at least I know I’m not going crazy!

Issues for Days

Known issues are a way of life in technology. If you’ve worked with any system for any length of time you’ve seen the list of things that aren’t working or have weird interactions with other things. Given the increasing number of interactions we have with systems that are more and more dependent on each other, it’s no wonder those known issue lists are miles long by now.

Whether it’s a bug or an advisory or a listing of an incompatibility on a site, the nature of all known issues is the same. They are things that don’t work that we can’t fix yet. They could be on a list of issues to resolve or something that may never be able to be fixed. The key is that we know all about them so we can plan around them. Maybe it’s something like a bug in a floating point unit that causes long division calculations to be inaccurate to a certain number of decimal places. If you know what the issue is you know how to either plan around it or use something different. Maybe you don’t calculate to that level of precision. Maybe you do that on a different system with another chip. Whatever the case, you need to know about the issue before you can work around it.
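Sticking with that floating point example, the workaround might be as simple as not asking the suspect hardware to do the sensitive math. Here’s a tiny Python sketch of the “plan around it” option, with the known issue itself being hypothetical:

```python
from decimal import Decimal, getcontext

# Hypothetical known issue: hardware float division can't be trusted past ~10 digits.
# The "plan around it" option: do the sensitive division in software at a precision you control.
getcontext().prec = 28

hardware_style = 4195835 / 3145727                    # the path the (made-up) errata says to avoid
planned_around = Decimal(4195835) / Decimal(3145727)  # software decimal math instead

print(hardware_style)
print(planned_around)
```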

Not all known issues are publicly known. They could involve sensitive information about a system. Perhaps the issue itself is a potential security risk. Most advisories about remote exploits are known issues internally at companies before they are patched. While they aren’t immediately disclosed they are eventually found out when the patch is released or when someone discovers the same issue outside of the company researchers. Putting these kinds of things under an embargo of sorts isn’t always bad if it protects from a wider potential to exploit them. However, the truth must eventually come out or things can’t get resolved.

Knowing the Unknown

What happens when the reasons for not disclosing known problems are less than noble? What if the reasoning behind hiding an issue has more to do with covering up bad decision making or saving face or even keeping investors or customers from fleeing? Welcome to the dark side of disclosure.

When I worked for Gateway 2000 back in the early part of the millennium, we had a particularly nasty known issue in the system. The ultimate root cause was that the capacitors on a series of motherboards were made with poor quality controls or bad components and would swell and eventually explode, causing the system to come to a halt. The symptoms manifested themselves in all manner of strange ways, like race conditions or random software errors. We would sometimes spend hours troubleshooting an unrelated issue only to find out the motherboard was affected with “bad caps”.

The issue was well documented in the tech support database for the affected boards. Once we could determine that it was a capacitor issue it was very easy to get the parts replaced. Getting to that point was the trick, though. Because at the top of the article describing the problem was a big, bold statement:

Do Not Tell The Customer This Is A Known Issue!!!

What? I can’t tell them that their system has an issue that we need to correct before everything pops and shuts it down for good? I can’t even tell them what to look for specifically when we open the case? Have you ever tried to tell a 75-year-old grandmother to look for something “strange” in a computer case? You get all kinds of fun answers!

We ended up getting creative in finding ways to look for those issues and getting them replaced where we could. When I moved on to my next job working for a VAR, I found out some of those same machines had been sold to a customer. I opened the case and found bad capacitors right away. I told my manager and explained the issue and we started getting them replaced under warranty as soon as the first sign of problems happened. After the warranty expired we kept ordering good boards from suppliers until we were able to retire all of those bad machines. If I hadn’t known about the bad cap issue from my help desk time I never would have known what to look for.

Known issues like these are exactly the kind of thing you need to tell your customers about. It’s something that impacts their computer. It needs to be fixed. Maybe the company didn’t want to have to replace thousands of boards at once. Maybe they didn’t want to have to admit they cut corners when they were buying the parts and now the money they saved is going to haunt them in increased support costs. Whatever the reason it’s not the fault of the customer that the issue is present. They should have the option to get things fixed properly. Hiding what has happened is only going to strain the relationship between customer and provider.

Which brings me back to my issue from above. Maybe it wasn’t “known” when I called the first time. But by the third or fourth time I called about the same thing they should have been able to tell me it’s a known problem with this specific behavior and that a fix is coming soon. The solution wasn’t to keep using the first-tier support fixes of resets or transfers to another department. I would have appreciated knowing it was an issue so I didn’t have to spend as much time upgrading and isolating and documenting the hell out of everything just to exclude other issues. After all, my troubleshooting skills haven’t disappeared completely!

Vendors and providers, if you have a known issue you should admit it. Be up front. Honesty will get you far in this world. Tell everyone there’s a problem and you’re working on a fix that you don’t have just yet. It may not make the customer happy at first but they’ll understand a lot more than hiding it for days or weeks while you scramble to fix it without telling anyone. If that customer has more than a basic level of knowledge about systems they’ll probably be able to figure it out anyway and then you’re going to be the one with egg on your face when they tell you all about the problem you don’t want to admit you have.


Tom’s Take

I’ve been on both sides of this fence before in a number of situations. Do we admit we have a known problem and try to get it fixed? Or do we get creative and try to hide it so we don’t have to own up to the uncomfortable questions that get asked about bad development or cutting corners? The answer should always be to own up to things. Make everyone aware of what’s going on and make it right. I’d rather deal with an honest company working hard to make things better than a dishonest vendor that miraculously manages to fix things out of nowhere. An ounce of honesty prevents a pound of bad reputation.