Getting In Front of Future Regret

Yesterday I sat in on the keynote from Commvault Connections21 and participated in a live blog of it on Gestalt IT. There was a lot of interesting info around security, especially related to how backup and disaster recovery companies are trying to add value in the fight against the growing ransomware threat to global commerce. One thing I took away from the conversation wasn’t specifically related to security, though, and I wanted to dive into it a bit more.

Reza Morakabati, CIO for Commvault, was asked what he thought teams needed to do to advance their data strategy. And his response was very insightful:

Ask your team to imagine waking up to hear some major incident has happened. What would their biggest regret be? Now, go to work tomorrow and fix it.

It’s a short, sweet, and powerful piece of advice. Technology professionals are usually focused on implementing new things to improve productivity or introduce new features to users and customers. We focus on moving fast and making people happy. Security is often seen as running counter to this ideal. Security wants to keep people safe and secure. It’s not unlike the parents who hold on to their child’s bicycle after the training wheels come off just to make sure the kids are safe. The kids want to ride and be free. The parents aren’t quite sure how safe they’re going to be just yet.

Regret Storming

Thought exercises make for entertaining ways to scare yourself to death some days. If you spend too much time thinking about all the ways that things can go wrong you’re going to waste far too much energy focused on the negative aspects of your work. However, you do need to occasionally open yourself up to the likelihood that things are going to go wrong at some point.

For the thought exercise above, it’s not crucial to think about how things could go wrong. It’s more important to think about the worst thing that could happen as a result and how much you’ll regret it. You need to identify those areas and figure out how they can be mitigated.

Let me give you a specific example from my area. In May 2013 a massive tornado ripped through Moore, OK just north of where I live. It was a tragic event that cost lives. People were displaced and homes and businesses were destroyed. One of the places that was damaged severely was the Moore Public Schools administration building. In the aftermath of trying to clean up the debris and find survivors, one of my friends who worked for an IT vendor told me he spent hours helping sift through the rubble of the building looking for hard disk drives from the district’s servers. Why? Because the tornado had struck just before the payroll for the district’s teachers and staff was due to be run. Without the drives they couldn’t run payroll or print paychecks for those employees. With an even greater need for funds to pay for food or start repairs on their homes, you can imagine that not getting paid was going to be a big deal for those educators and staff.

There are a lot of regrets that came out of the May 2013 tornado. Loss of life and loss of property are always at the top of the list. The psychological damage of enduring something like that also takes a huge toll. But for the school district one of the biggest regrets they faced was not having a contingency plan for what to do about paying their employees to help them deal with the disaster. It sounds small in comparison to the millions of dollars of damage that happened but it also represents something important that can be controlled. The school system can’t upgrade the warning system or construct buildings that can withstand the most powerful storms imaginable. But they can fix their systems to prevent teachers from going without resources in the event of an emergency.

In this case, the regret is not being able to pay teachers if the district data center goes down. How could we fix that regret today if we had imagined it beforehand? We could have migrated the data center to the cloud so no single weather event could take it out. Likewise, we could have moved to a service that provides payroll entry and check printing that could be accessed from anywhere. We could also have encouraged our teachers and employees to use direct deposit functions to reduce the need to physically print checks. Technology today provides us with a number of solutions to the regret we face. We can put together plans to implement any one of them quickly. We just need to identify the problem and build a resolution for it.

Building Your Future

It’s not easy to foresee every possible outcome. Nor should it be. But if you focus on the feelings those unknown outcomes could bring you’ll have a much better sense for what’s important to protect and how to go about doing it. Are you worried your customer data is going to be stolen and shared on the Internet? Then you need to focus your efforts on protecting it. Are you concerned your AWS bill is going to skyrocket if someone steals your credentials and starts borrowing your resource pool? Then you need to have governance in place to prevent unauthorized users from doing exactly that.

You don’t have to have a solution for every possible regret. You may even find that some of the things you thought you might end up regretting are actually pretty mild. If you’re not concerned about what would happen to your testing environment because you can just clone it from a repository then you can put that to bed and not worry about it any longer. Likewise, you may discover some regrets you didn’t anticipate. For example, if you’re using Active Directory credentials to back up your server data, you need to make sure you’re backing up Active Directory as well. You’re going to find yourself infuriated if you have the data you need to get back to business but it’s locked behind cryptographic locks that you can’t open because someone forgot to back up a domain controller.


Tom’s Take

I’ve been told that I’m somewhat negative because I’m always worried about what could go wrong with a project or an event. It’s not that I’m a pessimist as much as I’ve got a track record for seeing how things can go off the rails. Thanks to Commvault I’m going to spend more time thinking of my regrets and trying to plan for them to be mitigated ahead of time so all the possible ways things could fail won’t consume my thoughts. I don’t have to have a plan for everything. I just need to get in front of the regrets before I feel them for real.

Fast Friday Thoughts From Security Field Day

It’s a busy week for me thanks to Security Field Day but I didn’t want to leave you without some thoughts that have popped up this week from the discussions we’ve been having. Security is one of those topics that creates a lot of thought-provoking ideas and makes you seriously wonder if you’re doing it right all the time.

  • Never underestimate the value of having plumbing that connects all your systems. You may look at a solution and think to yourself “All this does is aggregate data from other sources”. Which raises the question: how do you do it now? Sure, antivirus fires alerts like a car alarm. But when you get breached and find out that those alerts caught it weeks ago you’re going to wish you had a better idea of what was going on. You need a way to send that data somewhere to be dealt with and cataloged properly. This is one of the biggest reasons why machine learning is being applied to the massive amount of data we gather in security. Having an algorithm working to find the important pieces means you don’t miss the things that matter most to you.
  • Not every solution is going to solve every problem you have. My dishwasher does a horrible job of washing my clothes or vacuuming my carpets. Is it the fault of the dishwasher? Or is it my issue with defining the problem? We need to scope our issues and our solutions appropriately. Just because my kitchen knives can open a package in a pinch doesn’t mean that the makers need to include package-opening features in a future release because I use them exclusively for that purpose. Once we start wanting the vendors to build a one-stop-shop kind of solution we’re going to create the kind of technical debt that we need to avoid. We also need to remember to scope problems so that they’re solvable. Postulating that there are corner cases with no clear answers is important for threat hunting or policy creation. It’s not so great when shopping through a catalog of software.
  • Every term in every industry is going to have a different definition based on who is using it. A knife to me is either a tool used on a campout or a tool used in a kitchen. Others see a knife as a tool for spreading butter or even doing surgery. It’s a matter of perspective. You need to make sure people know the perspective you’re coming from before you decide that the tool isn’t going to work properly. I try my best to put myself in the shoes of others when I’m evaluating solutions or use cases. Just because I don’t use something in a certain way doesn’t mean it can’t be used that way. And my environment is different from everyone else’s. Which means best practices are really just recommended suggestions.
  • Whatever acronym you’ve tied yourself to this week is going to change next week because there’s a new definition of what you should be doing according to some expert out there. Don’t build your practice on whatever is hot in the market. Build it on what you need to accomplish and incorporate elements of new things into what you’re doing. The story of people ripping and replacing working platforms because of an analyst suggestion sounds horrible but happens more often than we’d like to admit. Trust your people, not the brochures.
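
The plumbing point in the first bullet can be illustrated with a toy aggregator. This is a sketch of the concept, not any vendor's product: it just funnels alerts from several tools into one severity-ordered queue so the quiet indicator of a breach doesn't get lost behind the car alarm.

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Alert:
    priority: int                      # lower value pops first
    source: str = field(compare=False)
    message: str = field(compare=False)

class AlertAggregator:
    """Funnels alerts from many tools into one prioritized queue."""
    def __init__(self):
        self._queue = []

    def ingest(self, source, message, severity):
        # Negate severity so the most severe alert surfaces first.
        heapq.heappush(self._queue, Alert(-severity, source, message))

    def triage(self, limit):
        """Pop the top `limit` alerts for an analyst (or an algorithm) to review."""
        count = min(limit, len(self._queue))
        return [heapq.heappop(self._queue) for _ in range(count)]
```

A real pipeline would persist and enrich the data rather than hold it in memory, and the machine learning mentioned above would replace the simple severity number with something learned from the data. The point is the plumbing: everything lands in one place where it can be cataloged and ranked.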

Tom’s Take

Security changes faster than any area that I’ve seen. Cloud is practically a glacier compared to EPP, XDR, and SOPV. I could even make up an acronym and throw it on that list and you might not even notice. You have to stay current but you also have to trust that you’re doing all you can. Breaches are going to happen no matter what you do. You have to hope you’ve done your best and that you can contain the damage. Remember that good security comes from asking the right questions instead of just plugging tools into the mix to solve issues you don’t have.

Choosing the Least Incorrect Answer

My son was complaining to me the other day that he missed one question on a multiple choice quiz in his class and got a low B grade instead of a perfect score. When I asked him why he was frustrated he told me, “Because it was easy and I missed it. But I think the question was wrong.” As usual, I pressed him further to explain his reasoning and found out that the question was indeed ambiguous but the answer choices were pretty obviously wrong all over. He asked me why someone would write a test like that. Which is how he got a big lesson on writing test questions.

Spin the Wheel

When you write a multiple choice test question for any reputable exam you are supposed to pick “wrong” answers, known as distractors, that ensure that the candidate doesn’t have a better than 25% chance of guessing the correct answer. You’ve probably seen this before because you took some kind of simple quiz that had answers that were completely wrong to the point of being easy to pick out. Those quizzes are usually designed to be passed with the minimum amount of effort.

This also extends to a question that includes answer choices that are paired. If you write a question that says “pick the three best answers” with six options that are binary pairs you’re basically saying to the candidate “Pick between these two three times and you’re probably going to get it right”. I’ve seen a number of these kinds of questions over the years and it feels like a shortcut to getting one on the house.

The most devious questions come from the math side of the house. Some of my friends have been known to write questions for their math tests and purposely work the problem wrong at a critical point to get a distractor that looks very plausible. Make the same mistake and you’re going to find your incorrect result sitting right there in the answer choices, and you’ll get the question wrong. The extra effort here matters because if you see too many students picking the same distractor you know that there may be confusion about the process at that critical point. Also, the effort to make math question distractors look plausible is impressive and way too time consuming.

Why Is It Wrong?

Compelling distractors are a requirement for any sufficiently advanced testing platform. The professionals that write the tests understand that guessing your way through a multiple choice exam is a bad precedent and the whole format needs to be fair. The secret to getting the leg up on these exams is more than just knowing the right answer. It’s about knowing why things are wrong.

Take an easy example: OSPF LSAs. A question may ask you about a particular router in a diagram and ask you which LSAs it sees. If the answer choices are fairly configured you’re going to be faced with some plausible looking answers. Say the question is about a not-so-stubby area (NSSA). If you know the specifics of what makes this area unique you can start eliminating choices from the question. What if it’s asking about which LSAs are not allowed? Well, if you forgot the answer to that you can start by reading the answer choices and applying logic.

You can usually improve your chances of getting a question right by figuring out why the answers given are wrong for the question. In the above example, if LSA Type 1 is listed as an answer choice ask yourself “Why is this the wrong answer?” For the question about disallowed LSA types you can eliminate this choice because LSA Type 1 is always present inside an area. For a question about visibility of that LSA outside of an area you’d be asking a different question. But if you know that Type 1 LSAs are local and always visible you can cross off that as a potential answer. That means you boosted your chances of guessing the answer to 33%!

The question itself is easy if you know that NSSAs use Type 7 LSAs to convey information because Type 5 LSAs aren’t allowed. But if you understand why the other answers are wrong for the question asked you can also check your work. Why would you want to do that? Because the wording of the question can trip you up. How many times have you skimmed the question looking for keywords and missing things like “not” or “except”? If you work the question backwards looking for why answers are wrong and you keep coming up with them being right you may have read the question incorrectly in the first place. Likewise, if every answer is wrong somehow you may have a bad question on your hands.
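
To see how mechanical this elimination process really is, here's a quick sketch of the logic. The table is simplified (it ignores totally stubby variants and other corner cases), but it captures the facts you'd use to cross off distractors:

```python
# LSA types flooded inside each OSPF area type (simplified).
ALLOWED_LSAS = {
    "standard": {1, 2, 3, 4, 5},
    "stub":     {1, 2, 3},      # Types 4 and 5 are blocked at the ABR
    "nssa":     {1, 2, 3, 7},   # Type 7 carries external routes instead of Type 5
}

def eliminate_distractors(area_type, choices):
    """For a 'which LSAs are NOT allowed in this area?' question,
    cross off every choice that is actually permitted."""
    allowed = ALLOWED_LSAS[area_type]
    return [lsa for lsa in choices if lsa not in allowed]
```

Asking `eliminate_distractors("nssa", [1, 3, 5, 7])` leaves only Type 5 standing, which is the same reasoning the paragraph above walks through with the Type 1 LSA.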

What happens if the question is poorly worded and all the answer choices are wrong? Well, that’s when you get to pick the least incorrect answer and leave feedback. It’s not about picking the perfect answer in these situations. You have to know that a lot of hands touch test questions and there are times when things are rewritten and the intent can be changed somehow. If you know that you are dealing with a question that is ambiguous or flat-out wrong you should leave feedback in the question comments so it can be corrected. But you still have to answer the question. So, use the above method to find the piece that is the least incorrect and go with that choice. It may not be “right” according to the test question writer, but if enough people pick that answer you’re going to see someone taking a hard look at the question.


Tom’s Take

We are going to take a lot of tests in our lives. Multiple choice tests are easier but require lots of work, both on the part of the writer and the taker. It’s not enough to just memorize what the correct answers are going to be. If you study hard and understand why the distractors are incorrect you’ll have a more complete understanding of the material and you’ll be able to check your work as you go along. Given that most certification exams don’t allow you to go back and change answers once you’ve moved past the question, the ability to check yourself in real time gives you an advantage that can mean the difference between passing and retaking the exam. And that same approach can help you when everything on the page looks wrong.

What Can You Learn From Facebook’s Meltdown?

I wanted to wait to put out a hot take on the Facebook issues from earlier this week because failures of this magnitude always have details that come out well after the actual excitement is done. A company like Facebook isn’t going to do the kind of in-depth post-mortem that we might like to see but the amount of information coming out from other areas does point to some interesting circumstances causing this situation.

Let me start off the whole thing by reiterating something important: Your network looks absolutely nothing like Facebook. The scale of what goes on there is unimaginable to the normal person. The average person has no conception of what one billion looks like. Likewise, the scale of the networking that goes on at Facebook is beyond the ken of most networking professionals. I’m not saying this to make your network feel inferior. More that I’m trying to help you understand that your network operations resemble those at Facebook in the same way that a model airplane resembles a space shuttle. They’re alike on the surface only.

Facebook has unique challenges that they have to face in their own way. Network automation there isn’t a bonus. It’s a necessity. The way they deploy changes and analyze results doesn’t look anything like any software we’ve ever used. I remember moderating a panel that had a Facebook networking person talking about some of the challenges they faced all the way back in 2013.

That technology that Najam Ahmad is talking about is two or three generations removed from what is being used today. They don’t manage switches. They manage racks and rows. They don’t buy off-the-shelf software to do things. They write their own tools to scale the way they need them to scale. It’s not unlike a blacksmith making a tool for a very specific use case that would never be useful to any non-blacksmith.

Ludicrous Case Scenarios

One of the things that compounded the problems at Facebook was the inability to see what the worst case scenario could bring. The little clever things that Facebook has done to make their lives easier and improve reaction times ended up harming them in the end. I’ve talked before about how Facebook writes things from a standpoint of unlimited resources. They build their data centers as if the network will always be available and bandwidth is an unlimited resource that never has contention. The average Facebook programmer likely never lived in a world where a dial-up modem was high-speed Internet connectivity.

To that end, the way they build the rest of their architecture around those presumptions creates the possibility of absurd failure conditions. Take the report of the door entry system. According to reports part of the reason why things were slow to come back up was because the door entry system for the Facebook data centers wouldn’t allow access to the people that knew how to revert the changes that caused the issue. Usually, the card readers will retain their last good configuration in the event of a power outage to ensure that people with badges can access the system. It could be that the ones at Facebook work differently or just went down with the rest of their network. But whatever the case the card readers weren’t allowing people into the data center. Another report says that the doors didn’t even have the ability to be opened by a key. That’s the kind of planning you do when you’ve never had to break open a locked door.

Likewise, I find the situation with the DNS servers to be equally crazy. Per other reports the DNS servers at Facebook are constantly monitoring connectivity to the internal network. If that goes down for some reason the DNS servers withdraw the BGP routes being advertised for the Facebook AS until the issue is resolved. That’s what caused the outage from the outside world. Why would you do this? Sure, it’s clever to basically have your infrastructure withdraw the routing info in case you’re offline to ensure that users aren’t hammering your system with massive amounts of retries. But why put that decision in the hands of your DNS servers? Why not have some other more reliable system do it instead?
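
From the outside we can only guess at the implementation, but the pattern described above can be sketched as a small health-check process that drives route announcements, in the style of an ExaBGP companion script. Everything here is a hypothetical stand-in: the prefix, the probe target, and the single-probe logic are illustrations, not Facebook's actual tooling.

```python
import subprocess

PREFIX = "192.0.2.0/24"  # hypothetical anycast DNS prefix (TEST-NET-1)

def backbone_healthy():
    """Stand-in probe: can we reach an internal backbone target?
    A real check would consult several independent signals, not one ping."""
    result = subprocess.run(["ping", "-c", "1", "-W", "1", "10.0.0.1"],
                            capture_output=True)
    return result.returncode == 0

def decide(healthy, announced):
    """Return the ExaBGP-style command for the current health state,
    or None if nothing should change."""
    if healthy and not announced:
        return f"announce route {PREFIX} next-hop self"
    if not healthy and announced:
        return f"withdraw route {PREFIX} next-hop self"
    return None
```

Written this way the fragility is obvious: one failed probe is enough to pull the prefix for the whole site. A safer design would demand multiple independent failures, a hold-down timer, or a human-controlled override before withdrawing anything.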

I get that the mantra at Facebook has always been “fail fast” and that their architecture is built in such a way as to encourage individual systems to go down independently of others. That’s why Messenger can be down but the feed stays up or why WhatsApp can have issues but you can still use Instagram. However, why was there no test of “what happens when it all goes down?” It could be that the idea of the entire network going offline is unthinkable to the average engineer. It could also be that the response to the whole network going down all at once was to just shut everything down anyway. But what about the plan for getting back online? Or, worse yet, what about all the things that impacted the ability to get back online?

Fruits of the Poisoned Tree

That’s where the other part of my rant comes into play. It’s not enough that Facebook didn’t think ahead to plan on a failure of this magnitude. It’s also that their teams didn’t think of what would be impacted when it happened. The door entry system. The remote tools used to maintain the networking equipment. The ability for anyone inside the building to do anything. There was no plan for what could happen when every system went down all at once. Whether that was because no one knew how interdependent those services were or because no one could think of a time when everything would go down all at once is immaterial. You need to plan for the worst and figure out what dependencies look like.

Amazon learned this the hard way a few years ago when US-East-1 went offline. No one believed it at the time because the status dashboard still showed green lights. The problem? The board was hosted on the zone that went down and the lights couldn’t change! That problem was remedied soon afterwards but it was a chuckle-worthy issue for sure.

Perhaps it’s because I work in an area where disasters are a bit more common but I’ve always tried to think ahead to where the issues could crop up and how to combat them. What if you lose power completely? What if your network connection is offline for an extended period? What if the same tornado that takes out your main data center also wipes out your backup tapes? It might seem a bit crazy to consider these things but the alternative is not having an answer in the off chance it happens.

In the case of Facebook, the question should have been “what happens if a rogue configuration deployment takes us down?” The answer better not be “roll it back” because you’re not thinking far enough ahead. With the scale of their systems it isn’t hard to create a change to knock a bunch of it offline quickly. Most of the controls that are put in place are designed to prevent that from happening but you need to have a plan for what to do if it does. No one expects a disaster. But you still need to know what to do if one happens.

Thus Endeth The Lesson

What we need to take away from this is that our best intentions can’t defeat the unexpected. Most major providers were silent on the schadenfreude of the situation because they know they could have been the one to suffer from it. You may not have a network like Facebook but you can absolutely take away some lessons from this situation.

You need to have a plan. You need to have a printed copy of that plan. It needs to be stored in a place where people can find it. It needs to be written in a way that people who find it can implement it step-by-step. You need to keep it updated to reflect changes. You need to practice for disaster and quit assuming that everything will keep working correctly 100% of the time. And you need to have a backup plan for everything in your environment. What if the doors seal shut? What if the person with the keys to unlock the racks is missing? How do we ensure the systems don’t come back up in a degraded state before they’re ready? The list is endless but that’s only because you haven’t started writing it yet.


Tom’s Take

There is going to be a ton of digital ink spilled on this outage. People are going to ask questions that don’t have answers and pontificate about how it could have been avoided. Hell, I’m doing it right now. However, I think the issues that compounded the problems are ones that can be addressed no matter what technology you’re using. Backup plans are important for everything you do, from campouts to dishwasher installations to social media websites. You need to plan for the worst and make sure that the people you work with know where to find the answers when everything fails. This is the best kind of learning experience because so many eyes are on it. Take what you can from this and apply it where needed in your enterprise. Your network may not look anything like Facebook, but with some planning now you don’t have to worry about it crashing like theirs did either.

Chip Shortages Aren’t Sweet for Networking

Have you tried to order networking gear recently? You’re probably cursing because the lead times on most everything are getting long. It’s not uncommon to see lead times on wireless access points or switch gear reaching 180 days or more. Reports from the Internet say that some people are still waiting to get things they ordered this spring. The prospect of rapid delivery of equipment is fading like the summer sun.

Why are we here? What happened? And can we do anything about it?

Fewer Chips, More Air

The pandemic has obviously had the biggest impact for a number of reasons. When a fabrication facility shuts down it doesn’t just ramp back up. Even when all the workers are healthy and the city where it is located is open for business it takes weeks to bring everything back online to full capacity. Just like any manufacturing facility you can’t just snap your fingers and get back to churning out the widgets.

The pandemic has also strained supply chains around the world. Even if the fabs had stayed open this entire time you’d be looking at a shortage of materials to make the equipment. Global supply chains were running extremely lean in 2019, and disrupting one link of them created a cascade effect that has caused stress everywhere. The lack of toilet paper or lunchmeat in your grocery store shows that. Even when the supply is available the ability to deliver it is impacted.

The supply chain problem also reveals the issue on the other side of the shipping container. Even if the fabs had enough chips to sell to anyone that wanted them it’s hard to get those parts delivered to the companies that make things. If this were simply an issue of a company not getting the materials it needed to make a widget in a reasonable time there wouldn’t be as much of an issue. But because these companies make things that other companies use to make things the hiccups in the chain are exacerbated. If TSMC is delayed by a month getting a run of chips out, that month-long delay only increases for those down the line.

We’ve got issues getting facilities back online. We’ve got supply chains causing problems all over the place. Simple economics says we should just build more facilities, right? The opportunity cost of not having enough production capacity means there’s ample room to make more of the things we need and profit. You’re right. Companies like Intel are bringing new fabs online as fast as they can. Sadly, that is a process measured in months or even years. The capacity we need to offset the disruption to the chip market should have been built two years ago if we wanted it ready now.

All of these factors are mixed into one simple truth. Without the materials, manufacturing, or supply chain to deliver the equipment we’re going to be left out in the cold if we want something delivered today. Just in Time inventory is about to become Somewhere in Time inventory. We’re powerless to change the supply chain. Does that mean we’re powerless to prevent disruption to our planning process?

Proactive Processes

We may not be able to assemble networking gear ourselves to speed up the process but we are far from helpless. The process and the planning around gear acquisition and deployment has to change to reflect the current state of the world. We can have an impact provided we’re ready to lead by example.

  • Procure NOW: Purchasing departments are notorious for waiting until the last minute to buy things. Part of that reasoning is that expenditures are worth less in the future than right now because those assets are more valuable today gaining interest or something. You need to go to the purchasing department and educate them about how things are working right now. Instead of them sitting on the project for another few months you need to tell them that the parts have to be ordered right now in order for them to be delivered in six or seven months. They’re going to fight you and tell you that they can just wait. However, we all know this isn’t going to clear up any time soon. If they persist in trying to tell you that you need to wait just have them try to go car shopping to illustrate the issue. If you want stuff by the end of Q1 2022, you need to get that order in NOW.
  • Preconfigure Things However You Can: If you’re stuck waiting six months to get switches and access points, are you going to be stuck waiting another month after they come in to configure them? I hope that answer is a resounding “NO”. There are resources available to make sure you can get things configured now so you’re not waiting when the equipment is sitting on a loading dock somewhere. You need to reach out to your VAR or your vendor and get some time on lab gear in the interim. If you ordered a wireless controller or a data center switch you can probably get some rack time on a very similar device or even the exact same one in a lab somewhere. That means you can work on a basic configuration or even provision things like VLANs or SSIDs so you’re not reinventing the wheel when things come in. Even if all you have is a skeleton config you’re hours ahead of where you would be otherwise. And if the VAR or vendor gives you a hard time about lab gear you can always remind them that there are other options available for the next product refresh.
  • Minimum Viable Functionality: All this advice is great for a new pod or an addition to an existing network that isn’t critical. What if the gear you ordered is needed right now? What if this project can’t wait? How can you make things work today with nothing in hand? This is a bit trickier because it will require duplicate work. If you need to get things operational today you need to work with what you have today. That means you may have to salvage an old lab switch or pull something out of production and reduce available ports until the gear can arrive. It also means you’re going to have to backup the old configs, erase them completely (don’t forget about the VLAN database and VTP server configurations), and then put on the new info. When the new equipment comes in you’re going to have to do it all over again in reverse. It’s more work but it leads to things being operational today instead of constantly telling someone that it’s going to be a while. If you’re a VAR that’s doing this for a customer, you’d better make it very clear this is temporary and just a loan. Otherwise you might find your equipment being a permanent addition even after everything comes in.
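
To make the preconfiguration advice concrete, even a few lines of scripting can render a skeleton config while the hardware is still on a boat. The hostname and VLAN names below are made up, and the IOS-style syntax is an assumption about your platform; treat it as a starting template, not a site plan.

```python
def skeleton_config(hostname, vlans):
    """Render a minimal IOS-style skeleton config.
    `vlans` maps VLAN IDs to names."""
    lines = [f"hostname {hostname}"]
    for vid, name in sorted(vlans.items()):
        lines.append(f"vlan {vid}")
        lines.append(f" name {name}")
    return "\n".join(lines)

print(skeleton_config("idf1-sw1", {10: "USERS", 20: "VOICE"}))
```

Run against your real VLAN plan, this gives you something to paste in during that lab time with your VAR, so the day the gear lands you're validating instead of typing.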

Tom’s Take

The chip shortage is one of those things that’s going to linger under the best of circumstances. We’re going to be pressed to get gear in well into 2022. That means delayed projects and lots of arguing about what’s critical and what’s not. We can’t fix the semiconductor sector of the market but we can work to make sure that the impact felt there is the only one that impacts us right now. The more we do ahead of time to make things smooth the better it will be when it’s finally time to make things happen. Don’t let the lack of planning on the part of the supply chain sour your outlook on doing your role in networking.