Problem Replication or Why Do We Need to Break It Again?

There was a tweet the other day that posited that we don’t “need” to replicate problems to solve them. Ultimately the reason for the tweet was that a helpdesk refused to troubleshoot the problem until they could replicate the issue and the tweeter thought that wasn’t right. It made me start thinking about why troubleshooters are so bent on trying to make something happen again before we actually start trying to fix an issue.

The Definition of Insanity

Everyone by now has heard that the definition of insanity is doing the same thing over and over again and expecting a different result. While funny and a bit oversimplified the reality of troubleshooting is that you are trying to make it do something different with the same inputs. Because if you can make it do the same thing over and over again you’re closer to the root cause of the issue.

Root cause is the key to problem solving. If you don’t fix what’s actually wrong you are only dealing with symptoms and not issues. However, you can’t know what’s actually wrong until you can make it happen more than once. That’s because you have to narrow the actual issue down from all of the symptoms. If you do something and get the same output every time then you’re on the right track. However, if you change the inputs and get the same output you haven’t isolated the issue yet. It’s far too common in troubleshooting to change a bunch of things all at once and not realize what actually fixed the problem.

Without proper isolation you’re just throwing darts. As the above comic illustrates sometimes the victory is causing a different error message. That means you’re manipulating the right variables or inputs. It also means you know what knobs need to be turned in order to get closer to the root cause and the solution. So repeatability in this case is key because it means you’re not trying to fix things that aren’t really broken. You’re also narrowing the scope of the fixes by eliminating things that don’t need to be monitored.
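To make the isolation idea concrete, here is a minimal sketch in Python. It assumes you already have a baseline that reproduces the failure, and it changes exactly one input per trial so you can see which knobs matter. The `isolate` helper, the `reproduces` check, and the MTU, VPN, and DNS settings are all hypothetical stand-ins rather than anything from a real ticket.

```python
def isolate(baseline, candidates, reproduces):
    """Apply each candidate change on its own and record whether the failure still occurs."""
    results = {}
    for key, value in candidates.items():
        trial = dict(baseline)   # start from the known-bad baseline every time
        trial[key] = value       # change exactly one input
        results[key] = reproduces(trial)
    return results


def reproduces(config):
    """Hypothetical failure: only happens with jumbo frames over the VPN."""
    return config["mtu"] > 1500 and config["vpn"]


baseline = {"mtu": 9000, "vpn": True, "dns": "10.0.0.53"}
candidates = {"mtu": 1500, "vpn": False, "dns": "8.8.8.8"}

outcome = isolate(baseline, candidates, reproduces)
# Inputs whose change made the failure go away are the ones worth chasing.
suspects = [key for key, still_fails in outcome.items() if not still_fails]
print(suspects)
```

None of this works unless the baseline fails every time. If the failure only shows up now and then, a trial that passes tells you nothing about the change you just made.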

Resource Utilization

How long does it take to write code? A couple of hours? A day? Does it take longer if you’re trying to do tests and make sure your changes don’t break anything else? What about shipping that code as part of the DevOps workflow? Or an out-of-band patch? Do you need to wait until the next maintenance window to implement the changes? These are all extremely valuable questions to ask to figure out the impact of changes on your environment.

Now multiply all of those factors by the number of things you tried that didn’t work. The number of times you thought you had the issue solved only to get the same error output. You can see how wasting hours of a programming team working on things can add up to the company’s bottom line quickly. Nothing is truly free or a sunk cost. You’re still paying people to work on something, whether it’s fixing buggy code or writing a new feature for your application. If they’re spending more of their time tracking down bugs that happen infrequently with no clear root cause are they being used to their highest potential?

What happens when that team can’t fix the issue because it’s too hard to cause it again? Do they bring in another team? Pay for support calls? Do you throw up your hands and just live with it? I’ve been involved in all of these situations in the past. I’ve tried to replicate issues to no avail only to be told that they’ll just live with the bug because they can’t afford to have me work on it any longer. I’ve also spent hours trying to replicate issues only to find out later that the message has only ever appeared once and no one knows what was going on when it happened. I might have been working for a VAR at the time but my time was something the company tracked closely. Knowing up front the issue had only ever happened once might have changed the way I worked on the issue.

I can totally understand why people would be upset that a help desk closed a ticket because they couldn’t replicate an issue. However, I also think that prudence should be involved in any structured troubleshooting practice. Having been on the other side of the Severity One all-hands-on-deck emergencies only to find a simple issue or a non-repeatable problem in the past I can say that having that many resources committed to a non-issue is also maddening as a tech and a manager of technical people.

Tom’s Take

Do you need to replicate a problem to troubleshoot it? No. But you also don’t need to follow a recipe to bake a cake. However, it does help immensely if you do because you’re going to get a consistent output that helps you figure out issues if they arise. The purpose of replicating issues when troubleshooting isn’t nefarious or an attempt to close tickets faster. Instead it’s designed to help speed problem isolation and ensure that resources are tasked with the right fixes to the issues instead of just spraying things and hoping one of them gets the thing taken care of.

Making It Work in 2023

We’re back to the first of the year once again. January 1, 2023 is a Sunday which feels somewhat subdued. That stands in contrast to the rest of the year that felt like a rollercoaster always one heartbeat away from careening out of control. As is the tradition, I’ll look at the things I wanted to spend more time working on in 2022:

  • More Analytical Content: I have to honestly give myself a no on this one, at least from a technical perspective. I did spend some time making analytical content for my Tomversations series. However, the real difference in analytical content came from my posts about leadership and more “soft skill” focused ideas. I’ve gotten more comments about those posts than anything in 2022 and I couldn’t be more proud.
  • Saying No to More Things: This is the part where I would insert an animated GIF of someone laughing maniacally. While I did make strides in telling people that I have way too much going on to take care of one extra thing, the reality is that I took on more things than I probably should have. That’s something that I definitely do need to change but the real hard part isn’t saying No. It’s making it stick.
  • Getting In Front of Things: This one actually was one that I had the hardest time with. I was able to work on some of this in my Tech Field Day job but it was my blog that suffered the most. I wanted to start spending more time thinking about post topics earlier in the week so I wasn’t always posting on Fridays. The irony is that toward the end of the year I did manage to get some posts out on other days. I just did it because I was having severe writer’s block and I was late on several posts. In fact, I technically missed a post in 2022 for the first time in twelve years. I think it’s more a reflection on how important it is to keep your eye on the prize as you create content. I don’t have metrics like those on YouTube channels driving me to put out 2-3 videos a week. Instead I really have to make sure I’m aware of what needs to happen to keep the content coming out.

2022 was a year of getting things back to a version of normal but it also was a year of me trying to implement things that didn’t go the way I wanted. That seems to be the challenge for all good intentions. So let’s look at what I want to do in 2023 to keep myself from struggling the way that I did before.

  • Keeping Track of Things: I knew I was in for trouble when I stopped writing things down two weeks into the year. My plans for bullet journaling seemed to evaporate because I tried to make changes that didn’t stick. So I’m resetting it back to Square Zero. I’m writing everything down somewhere and I’m not going to forget it this time. I know where my notes are and I know what I need to do to keep myself on track. The difference between writing it down when I do it and just trying to remember to do it is very, very stark. So rather than thinking I have a good memory and forgetting that I don’t I’m not leaving anything up to chance.
  • Creating Evergreen Content: During my workouts this year I’ve spent more time listening to podcasts like Hidden Brain and Huberman Lab which focus on behavior and the brain. Why? Because I’m fascinated to learn why we think and do the things we do. When you couple that with my outside work in the Wood Badge leadership program I think I’m starting to see the value of creating content more around skills and leadership instead of just talking about the next iteration of wireless or SD-WAN. That doesn’t mean my technical writing is going to go away. It does mean that I’m going to try to sprinkle in more posts about mentoring, leadership, and creating a culture that will help you pay dividends unlike ATM, HD-DVD, and several other forgotten technologies.
  • Ensuring Intentionality: This one feels a bit nebulous but that’s because it’s hard to pin down what being intentional really means to everyone. I’ve used the word a bit more in the latter half of 2022 when discussing certain aspects of my job and it seems to have stuck with others. Intentionality to me means making sure that I’m focused on ensuring outcomes happen. A lack of intentionality is like gathering the ingredients for a cake, assembling them on a table, and then hoping that a cake somehow magically appears. You have to combine the ingredients in the proper amounts and make things happen to create a cake. Even with all the right conditions you still need to do the work to make things happen. And if you think that’s an easy thing to do, take a look around your office and tell me just how many of the people you work with aren’t intentional in what they do.

Tom’s Take

One of the things that I can look back on over the past few years is that my first of the year posts have reflected the challenges that I’ve faced. The shifting landscape of content has forced me to look at what I create. The challenges of a world that went into hibernation for a while have changed the way I look at how I get my work done. Even the growth that I’ve experienced over the years has shifted my thinking. I won’t be the next big YouTube star. My gift is in writing. And focusing on that as my starting point for the year to come is going to help me make things work.

Testing Your Weakest Links as a Chain

You may have heard in the news this week that there was a big issue with Southwest Airlines this holiday season. The issues are myriad and this is going to make for some great case studies for students in the future. However, one thing I wanted to touch on briefly in this whole debacle was the issue of a cascade failure.

The short version is that a weather disruption in the flight schedule became a much bigger problem when the process for rescheduling the flight crews was overwhelmed. Turns out that even after the big computer system upgrades and all the IT work that has gone into putting together a modern airfare booking system, one process was still very manual. The air crew rescheduling department was relatively small and couldn’t keep up with the demands placed on it by the disruptions. It got to the point where Southwest had to reduce their number of flights in order to get the system back to normal.

Worst Case Scenario

I’m not an expert at airline scheduling but I have spent a lot of time planning for disaster recovery. One of the things that we focus on more than anything else is the recovery aspect. The whole purpose of the plan is to get things up and running, right? That requires a look at the big picture of what data needs to be saved and where it needs to be stored and how people are going to do it. In the above example the focus of the airline was getting passengers booked on flights as soon as possible.

However, the details matter just as much as the big picture. If you don’t know the process at every step of the way you’re going to find that the weakest link in the chain is the one that breaks. All the upgrades in the world for remote storage or immutable snapshots won’t matter if someone doesn’t have a key to the data center to turn everything on. Just ask the engineers at Facebook that didn’t realize the door controls for the data center relied on the internal systems that were unreachable during their 2021 outage.

How can you catch these little details? How can you be so sure that everything isn’t going to fall apart because someone forgot the keys to the closet with the disaster recovery binder? The answer, of course, is testing. You’re going to have to test every aspect of the plan from top to bottom. Most everyone will agree that you have to test everything properly to make sure no one problem overwhelms the system. However, that’s where this whole thing falls apart.

Forest for the Trees

If you’re just looking at the individual aspects of your disaster recovery plan in a vacuum you’re going to have a miserable time of it when things go wrong. Unit testing is a popular way to look at the components of the plan to ensure they work without incurring too much cost or too many resources for the testing. However, unit testing alone doesn’t look at the way the details integrate.

That’s where integration testing comes into play. It’s not enough to check the individual pieces. Maybe the computer system is good at rescheduling passengers and balancing the gate assignments. However, if they can’t get on the plane because the system doesn’t think there is a crew due to the way the process interacts with a different area then you don’t have a functional system no matter how great one part of it is.
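Here is a rough sketch of that gap in Python. Two steps each pass their own checks, yet the combined result still strands a passenger. The function names, flight labels, and crew names are invented for illustration and say nothing about how any airline system actually works.

```python
def rebook_passengers(passengers, seats_per_flight, flights):
    """Hypothetical booking step: fill flights in order, return {flight: [passengers]}."""
    bookings = {}
    remaining = list(passengers)
    for flight in flights:
        bookings[flight] = remaining[:seats_per_flight]
        remaining = remaining[seats_per_flight:]
    return bookings


def assign_crews(flights, crews):
    """Hypothetical crew step: pair flights with crews in order; extra flights get None."""
    return {
        flight: (crews[i] if i < len(crews) else None)
        for i, flight in enumerate(flights)
    }


def stranded_passengers(bookings, crew_plan):
    """Integration check: anyone booked on a flight that has no crew."""
    return [
        person
        for flight, people in bookings.items()
        if crew_plan.get(flight) is None
        for person in people
    ]


flights = ["F1", "F2"]
bookings = rebook_passengers(["ann", "bob", "cy"], 2, flights)
crew_plan = assign_crews(flights, ["crew-a"])

# Unit-level checks: each piece looks healthy on its own.
assert sum(len(people) for people in bookings.values()) == 3
assert crew_plan["F1"] == "crew-a"

# Integration-level check: the pieces together leave someone at the gate.
stranded = stranded_passengers(bookings, crew_plan)
print(stranded)
```

Both unit checks pass. Only the last call, which looks at the two outputs together, finds the passenger booked on a flight with no crew.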

Disaster recovery tests can be done at the unit level to make sure new modules or processes are solid but you have to make sure you have integrated full tests at least once every six months or so. You have to find the holes in the system caused by the interactions of the details. Sure, the generators might fire up on cue but what if someone is parked in front of the fuel delivery area? What if the keys to the backup cages are on someone’s keychain instead of in a central area? These are questions you want to answer before everyone is running around with their hair on fire.

More importantly, when this happens you need to document all of it. If there is a particular integration that fails you need to write it down and discuss it with your teams. Understand why it happened and put process and procedure in place to cover it. Then make sure that everyone is aware the plan was updated. If people think that something has to be done a certain way because that’s the way it’s always done they’re going to keep doing it that way until they are told differently. Communication is key in any kind of adverse situation.

Lastly, be honest when you’re evaluating these process failures. Don’t try to explain it away or minimize the impact. If someone didn’t do their job then make sure everyone knows what needed to happen and how it failed. If a system doesn’t work properly then analyze the system and fix it. Don’t throw blame where it’s not warranted but don’t explain it away to salve an ego. You need to make the process work and make it work every time so that you don’t run into issues again.

Tom’s Take

Disasters happen. If you’re really lucky you will have something in place to keep the disaster from spiraling out of control. The plural of lucky is good and in order to be good you need to analyze how the process works in concert with every component to make sure there are no weak links. If the chain breaks like it did for Southwest you’ll be very lucky to lose only money and customers. If you’re not lucky you’ll lose a lot more.

The Power of Complaining Properly

Recently I’ve started listening to a new podcast all about the brain and behaviors called Hidden Brain. It’s got a lot of great content and you should totally check it out. One of the latest episodes deals with complaining and how it can make us less productive and more likely to repeat patterns or shut people out.

Complaining is as old as language. I’m sure as soon as the first person to create communications around spoken words was able to teach another person one of the first things they did was complain about the weather or something they hated. Our mind is built to express itself about things we don’t like, such as bad drivers or silly behaviors at work.

The episode explores the ways that our brain can trap us in cycles of complaining simply for the sake of complaining. It also discusses how we should try to spend more time trying to be productive in how we address complaints. I’ve experienced this a lot in IT as well as in my career after being directly involved in IT and there’s a lot of merit in changing the way we complain about things.

Airing Grievances

Complaining without a suggested solution is just whining.

I’ve always found complaining just for the sake of complaining to be counterproductive. Sure, it might feel awesome to just let it all out and pick apart someone’s decision making process or their personality but that’s not sustainable long term. As in the episode above when you spend your time complaining just for the sake of complaints you eventually fall into a pattern and you can’t break out of it. We all have that one friend or coworker that comes to us and complains about stuff no matter what, right?

In part this happens because we create an agreeable environment for it. That’s not always a bad thing. People sometimes just want to complain. If you’ve ever had to deal with someone getting upset because you didn’t just agree with their complaints you know how that can go. There are those in society that would rather just let it all out without disagreement or challenge.

The opposite side of that situation is when someone is challenging our assumptions or forcing us to see things in a different light. I’m sure that everyone reading this can think of someone they know that will show them another side of the argument or help them understand the path to solving issues instead of just whining about them. This person is someone you may not go to all the time because you realize they’re going to make you confront what’s going on instead of just agreeing with it.

We cultivate both of these kinds of people in our circles. We have those we will commiserate with and those we will seek out for help. So how do we manage to spend more time on fixing issues instead of just falling into the patterns of whining and regressive behavior?

Outcomes Over Opining

The first key to figuring out how to break out of the cycle and focus on making this better is a trick I use with others that only want to complain about things in my presence. They want to tell me everything that’s wrong, or more accurately what I’ve done wrong. So I ask a simple question:

How would you like this situation resolved?

It sounds almost too simple. However, if you think about the above examples you realize there are those that simply want to complain. They may not have a solution in mind. Think of those on social media that just want to air their grievances about a company or a person on a perpetual basis. Are they looking to change the situation? Or would they just prefer to complain? Once you ask the person, or ask yourself, how they want the situation resolved then you’ve moved past complaining to a solution.

Once you’re able to break out of the complaining loop you need to keep the conversation focused on the outcome. It’s easy to slip back into complaining and whining mode when you lose sight of the goal. If the solution is to recognize that things need to improve work on the plan of improvement. Have a goal in mind. Is the solution to have better service in a restaurant? Or to not have something cost so much if it is of inferior quality? By making the outcome the focus you channel the negativity into something that can be positive. One other side effect of the focus on the outcome is that continued complaining will fall on deaf ears and usually shorten the conversation. Even if the person has a solid outcome in mind they’ll lose interest if the sole purpose of the conversation is venting instead of productive work.

Lastly, understand that this is really focused on complaining on a non-personal level. Personal discussions are often not going to have an outcome in mind. Maybe the goal is to just vent. That’s why I usually ask now if I’m serving as emotional support or problem solving. However, in a business environment the goal should be the outcome. Especially if it’s a conflict or a complaint from a team member. The goal should be reducing friction and not just being a sounding board for those that would rather expend energy on the problem and not the solution.

Tom’s Take

I complain, just like any other normal person does. Sometimes my complaints are just ways to get my emotional weight off my shoulders. However, I have always subscribed to the idea that I need to have an outcome in mind to fix what is causing my issues. Sometimes that outcome is far outside of my control, such as fixing someone’s personality. Other times it is very much in my control but will require work on my part to make it happen. That’s where I always ask myself how much I want this issue resolved. I make sure I’m ready to invest the energy to make it better before I even start. Odds are good that if I’m complaining I’m talking myself into making it better. To me, that’s the power of a proper complaint.

The Power of Continuing Education on Certifications

I’m about six months away from recertifying my CCIE and even though I could just go Emeritus now I’m working on completing some continuing education at the end of the year to push it out another three years. I am once again very thankful that Cisco has this as an option instead of taking a test over and over again as the only option to renew my certifications.

As I embark on another journey to keep myself current in the networking community, I realize that the flexibility that education credits offer is more important than just passing a test or learning a new skill. Employers should also be thrilled that knowledge workers have the ability to work on other skills and be recognized for them. That’s because there are two different paths that this can lead to.

To Be The Best

One of the things that most professionals recognize with continuing education is that you can leverage your skills to race through things. If you’re already an expert at something like BGP or spanning tree why not take courses to improve the depth of your knowledge? This is part of the reason why there are a number of double CCIEs that have Routing and Switching and Service Provider. The skill sets have a big overlap which makes the additional study to pass the other relatively painless.

Taking pride in practicing the same skill set over and over again is something we traditionally associate with athletes and other skill positions. It is a very valid way of showing everyone that you truly are an expert at your craft. Knowing every nuance of the protocol or understanding it to a degree not possessed by anyone else is a real accomplishment. The value you gain in troubleshooting situations is unmatched. It’s easy to become the authoritative source on something because you’ve literally studied every piece of material on it and you know it inside out.

The downside of this kind of approach is that you naturally gravitate toward being an expert on exactly one or two things. Like a cake baker you are great at making one specific kind of thing. You may have more than enough work to keep you occupied for years but if the market shifts you may find yourself in trouble. The deep learning method works with technology that doesn’t get superseded quickly. IP routing is here to stay but we also said the same thing about traditional telephony and FORTRAN. Those may still exist in some form today and the experts are still needed to make them work but they aren’t nearly as big as they used to be.

Covering the Rest

The opposite of a deep expert is one that has a wide breadth of knowledge. This is the area where I feel a continuous learning program really shines. That’s because access to knowledge outside of your specific discipline can be hard to come by without help. Having a list of approved courses for a CE portal steers you in a good direction to take advantage of these offerings.

I remember telling people that I knew I was starting to gain on my knowledge and certification journey when I stopped finding the books I needed at the local book store. That’s absolutely true for those that are trying to reach the pinnacle of their specific skill set. However, those basic books are great to jump into an area you may not be familiar with.

You may think that you can spend your time studying and practicing and getting expert skill levels in a few key areas but you also need to realize that things can shift. Networking professionals today also need to understand programming and cloud and many other aspects of enterprise IT. It’s not even a case that knowing how to use those things is just easier. Instead it’s a case of requiring knowledge in those areas to understand how they interact so you can build more complete systems. You might be able to work on technology with a specific skill set but you won’t be able to work on anything new if you don’t know how all the parts work together.

You may not like the idea of studying lots of different areas of knowledge and that’s totally fine. But if you don’t at least understand that some knowledge of other areas is needed you’re going to find yourself opting out of many opportunities to work on things that are going to be important later.

Tom’s Take

You can choose to be the deep expert or the designer with breadth. The important thing is that the choice is yours thanks to the foresight of companies that embrace a model of learning over regurgitation. If you want to pick up new skills and get credit for them you can. If you’d prefer to be the best at a given discipline then the world is your oyster. No matter what you have the ability to make a choice that isn’t studying for a test every couple of years that doesn’t expand your knowledge. To me, the real value of a CE program is how it makes us all better.

ChatGPT and Creating For Yourself

I’m sure you’ve been inundated by posts about ChatGPT over the past couple of weeks. If you managed to avoid it the short version is that there is a new model from OpenAI that can write articles, create poetry, and basically answer your homework. Lots of people are testing it out for things as mundane as writing Amazon reviews or creating configurations for routers.

It’s not a universal hit though. Stack Overflow banned ChatGPT code answers because they’re almost always wrong. My own limited tests show that it can create a lot of words from a prompt that seem to sound correct but feel hollow. Many others have accused the algorithm of scraping content from others on the Internet and sampling it into answers to make it sound accurate but not the best answer to the question.

Are we ready for AI to do our writing for us? Is the era of the novelist or technical writer finished? Should we just hang up our keyboards and call it a day?

Byte-Sized Content

When I was deciding what I wanted to do with my life after college I took the GMAT to see if I could get into grad school for an MBA. I scored well on the exam but not quite to the magical level to get a scholarship. However, one area that I did do surprisingly well in for myself was the essay writing section. I bought a prep book that had advice for the major sections but spent a lot of time with the writing portion because it was relatively new at the time and many people were having issues with how to write an essay. The real secret is that the essay was graded by a computer, so you just had to follow a formula to succeed:

  1. Write an opening paragraph covering what you’re going to say with three points of discussion.
  2. Write a paragraph about point 1 and provide details to support it.
  3. Repeat for points 2 & 3.
  4. Write a summary paragraph restating what you said in the opening.

That’s it. That’s the formula to win the GMAT writing portion. The computer isn’t looking for insightful poetry or groundbreaking sci-fi world building. It’s been trained to look for structure. Main idea statements, supporting evidence, and conclusions all tick boxes that provide points to pass the section.
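To show how little a structure-only grader needs to understand, here is a toy scorer in Python. It is purely an illustration of the formula above and not a description of how the actual GMAT grader works. The `structure_score` function, the keyword matching, and the sample essay are all made up for this sketch.

```python
def structure_score(essay, points):
    """Award one point per structural box ticked; the ideas themselves are never evaluated."""
    paragraphs = [p.strip().lower() for p in essay.split("\n\n") if p.strip()]
    keywords = [point.lower() for point in points]
    score = 0

    # Box 1: an opening, one paragraph per point, and a summary.
    if len(paragraphs) == len(keywords) + 2:
        score += 1
    # Box 2: the opening names every point of discussion.
    if paragraphs and all(word in paragraphs[0] for word in keywords):
        score += 1
    # Box 3: each body paragraph covers its own point, in order.
    body = paragraphs[1:-1]
    if len(body) == len(keywords) and all(w in p for w, p in zip(keywords, body)):
        score += 1
    # Box 4: the summary restates every point.
    if len(paragraphs) > 1 and all(word in paragraphs[-1] for word in keywords):
        score += 1
    return score


points = ["cost", "speed", "safety"]
formulaic = "\n\n".join([
    "Remote work matters because of cost, speed, and safety.",
    "Cost drops when offices shrink.",
    "Speed improves without a commute.",
    "Safety rises when fewer people travel.",
    "In short, cost, speed, and safety all favor remote work.",
])
freeform = "I like cost savings. Anyway."

print(structure_score(formulaic, points), structure_score(freeform, points))
```

The formulaic essay ticks every box without saying anything interesting. The freeform one gets nothing, whatever its merits.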

If all that sounds terribly boring and formulaic you’re absolutely right. Passing a test of competence isn’t the same as pushing the boundaries of the craft. A poet like e e cummings would have failed because his work has no structure and contains capitalization errors compared to the standards of grammar. Yet no one would deny that he is a master of his craft. Likewise, always following the standards is only important when you want to create things that already exist.

Free Thinking

Tech writing is structured but often involves new ideas that aren’t commonplace. How can you train an algorithm to write about Zero Trust Network Architecture or VR surgery if no examples of that exist yet? Can you successfully tell ChatGPT to write about space exploration through augmented reality if no one has built it yet? Even if you asked, would you know what sounded correct from the reply?

Part of the issue comes from content consumption. We read things and assume they are correct. Words were written so they must have been researched and confirmed before being committed to the screen. Therefore we tend to read content in a passive form. We’re not reacting to what we’re seeing but instead internalizing it for future use. That’s fine if we’re reading for fun or not thinking critically about a subject. But for technical skills it is imperative that we’re constantly challenging what’s written to ensure that it’s accurate and useful.

If we only consumed content passively we’d never explore new ideas or create new ways to achieve outcomes. Likewise, if the only content we have is created by algorithm based on existing training and thought patterns we will never evolve past the point we are today. We can’t hope that a machine will have the insight to look beyond the limitations imposed upon it by the bounds of the program. I talked about this over six years ago where I said that machine learning would always give you great answers but true AI would be able to find them where they don’t exist.

That’s my real issue with ChatGPT. It’s great at producing content that is well within the standard deviation of what is expected. It can find answers. It can’t create them. If you ask it how to enter lunar orbit it can tell you. But if you ask it how to create a spacecraft to get to a moon in a different star system it’s going to be stumped. Because that hasn’t been created yet. It can only tell you what it’s seen. We won’t evolve as a species unless we remember that our machines are only as good as the programming we impart to them.

Tom’s Take

ChatGPT and programs like Stable Diffusion are fun. They show how far our technology has come. But they also illustrate the importance that we as creative beings can still have. Programs can only create within their bounds. Real intelligence can break out of the mold and go places that machines can’t dream of. We’ve spent billions of dollars and millions of hours trying to train software to think like a human and we’ve barely scratched the surface. What we need to realize is that while we can write software that can approximate how a human can think we can never replace the ability to create something from nothing.

DPUs Could Change The Network Forever

You wouldn’t think that AWS re:Invent would be a big week for networking, would you? Most of the announcements are focused on everything related to the data center but teasing out the networking specific pieces isn’t as easy. That’s why I found mention of a new-ish protocol in an unrelated article to be fascinating.

In this Register piece about CPUs there’s a mention of the Nitro DPU. More importantly there’s also a reference to something that Amazon has apparently been working on for the last couple of years. It turns out that the world’s largest online bookstore and data center company is looking to get rid of TCP.

Rebuilding Transport

The new protocol was developed in 2020. Referred to as Scalable Reliable Datagram (SRD), it was built to solve specific challenges Amazon was seeing related to performance in their cloud. Amazon decided that TCP had bigger issues for them that they needed to address.

The first was that dropped packets required retransmission. In an environment like the Internet that makes sense. You want to get the data you lost. However, when TCP was developed fifty years ago the amount of data that was lost in transit was tiny compared to the flows of today. Likewise, TCP doesn’t really know how to take advantage of multi path flows. That’s because any packet that arrives out-of-order requires the whole thing to be reassembled in order to be read by the operating system. That makes total sense when you think about the relative power of a computer back in the 70s and 80s. If the CPU is already super busy trying to do other things you don’t want it to have to try to deal with a messy stream of packets too. Halting the flow until it’s reassembled makes sense.

Today, Amazon would rather have the packets arrive on multiple different paths and be reassembled higher up the stack. Instead of forcing the system to stop transmitting the entire flow until the lost or out-of-order pieces are fixed they’d rather get the whole thing delivered and start putting the puzzle together while the I/O interface is processing the next flow. That’s because the size of the flows is creating communications issues between systems. They would rather have slightly higher latency to increase performance.
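The trade-off described above can be sketched with a toy model. This is not TCP or SRD itself, and none of it comes from Amazon; the function names and the (sequence number, payload) tuples are assumptions made purely for illustration. The first receiver only hands data to the application in strict order, so one late packet stalls everything behind it. The second accepts whatever arrives and sorts it out once at the end.

```python
def in_order_delivery(arrivals):
    """Toy TCP-like receiver (illustrative only).

    Releases a packet to the application only after every earlier
    sequence number has been released. Returns, for each arrival,
    the list of sequence numbers released at that moment.
    """
    buffered = {}
    next_seq = 0
    released_per_step = []
    for seq, payload in arrivals:
        buffered[seq] = payload
        released = []
        # Drain the buffer for as long as the next expected packet is present.
        while next_seq in buffered:
            released.append(next_seq)
            del buffered[next_seq]
            next_seq += 1
        released_per_step.append(released)
    return released_per_step


def reassemble_later(arrivals):
    """Toy out-of-order receiver (illustrative only).

    Accepts packets in any order and reassembles the payload once,
    sorted by sequence number.
    """
    received = dict(arrivals)
    return b"".join(received[seq] for seq in sorted(received))


# Packet 1 took a slower path and shows up last.
arrivals = [(0, b"AB"), (2, b"EF"), (3, b"GH"), (1, b"CD")]
print(in_order_delivery(arrivals))  # [[0], [], [], [1, 2, 3]]
print(reassemble_later(arrivals))   # b'ABCDEFGH'
```

In the first receiver packets 2 and 3 sit in the buffer doing nothing until packet 1 shows up. In the second the sorting still has to happen somewhere, which is exactly the work that has to land on something other than the main CPU.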

DPUs Or Bust

If this is so obvious, why has no one rewritten TCP before? Well, we hinted at it when we discussed the fact that TCP is stopping the flow to sort out the mess. The CPU will be fully tasked with reassembling flows with something like SRD because networking communications are constant. If you’re hoping that whatever TCP’s successor is will just magically sort things out you’re going to find a system that is quite busy with things that shouldn’t be taking so much time. The perceived gains in performance are going to evaporate.

The key to how SRD will actually work lies not in the protocol but in the way it’s implemented in hardware. SRD only works if you’re using an AWS Nitro DPU. When Amazon says they want the packets to be reassembled “up the stack” they’re really saying they want the DPU to do the work to put the pieces back together before presenting the reassembled packet back to the system. The system itself doesn’t know the packets came in out-of-order. The system doesn’t even know how the packets arrived. It just knows it sent data somewhere else and that it arrived with no errors.

The magic here is the DPU. TCP works for the purpose it was designed for. If you’re transmitting packets over a wide area in serial fashion and there’s packet loss on the link TCP is the way to go. Amazon SRD only works with Nitro-configured systems in AWS. It’s a given that many servers are now in AWS and more than a few are going to have this additional piece of hardware installed and configured. The value is that having this enabled is going to increase performance. But there’s a cost.

I think back to configuring jumbo Ethernet frames for a storage array that I was configuring back in 2011. Enabling 9000 byte frames between the switch and the array really increased performance. However, if you plugged the array in anywhere else or plugged anything into that port on the switch it broke. Why? Because 9000 byte Ethernet isn’t the standard. It’s a special configuration that only works when explicitly enabled.
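As a rough illustration of both halves of that story, here is a small sketch. It assumes plain IPv4 and TCP headers with no options and standard Ethernet framing, and the function names are made up for this example. Per-frame overhead is only part of why jumbo frames help (fewer frames to process matters too), so treat the numbers as a back-of-the-napkin estimate.

```python
# Bytes on the wire per Ethernet frame besides the payload:
# preamble + start delimiter (8), header (14), FCS (4), inter-frame gap (12).
ETHERNET_OVERHEAD = 38
# IPv4 header (20) + TCP header (20), assuming no options.
IP_TCP_HEADERS = 40


def payload_efficiency(mtu):
    """Fraction of wire bytes that carry application data at a given MTU."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


def frame_fits(packet_size, port_mtu):
    """True if a packet of this size can cross a port with this MTU."""
    return packet_size <= port_mtu


print(round(payload_efficiency(1500), 4))  # 0.9493
print(round(payload_efficiency(9000), 4))  # 0.9914
print(frame_fits(9000, 1500))              # False
```

The last line is the part that bites. A 9000 byte packet simply doesn’t fit through a port that is still set to the standard 1500.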

Likewise, SRD works well within Amazon. You need to specifically enable it on your servers and those performance gains aren’t going to happen if you need to talk to something that isn’t SRD-enabled or isn’t configured with a Nitro DPU. Because the hard work is being done by the DPU you don’t have to worry about a custom hardware configuration in your application ruining your ability to migrate. However, if you’re relying on a specific performance level with the app to make things happen, like database lookups or writing files to a SAN, you’re not going to be able to move that app to another cloud without incurring a performance penalty.

Tom’s Take

I get that companies like Amazon are heavily invested in Ethernet but want higher performance. I also get that DPUs are going to enable us to do things with systems that we’ve never even really considered before, like getting rid of TCP and optimizing how packets are sent through the network. It’s a grand idea but it not only breaks compatibility but creates a gap in performance expectations as well. We already see this in the wireless space. People want gigabit file transfers with Wi-Fi 6 and are disappointed because their WAN connection isn’t that fast. If Amazon is promising super fast communications between systems users are going to be disappointed when they don’t get that in other clouds or between non-DPU systems. It creates lock-in and sets expectations that can’t be met. The future of networking is going to involve DPUs. But in order to really change the way we do things we need to work with the standards we have and make sure everyone can benefit.

Time to Talk

It’s a holiday week here in the US so most people are working lighter days or just taking the whole week off. They’re looking forward to spending time with family and friends. Perhaps they’re already plotting their best strategy for shopping during Black Friday and snagging a new TV or watch. Whatever the case may be there’s lots of things going on all over.

One thing that I feel needs to happen is conversation. Not just the kind of idle conversation that we make when we don’t know what to talk about. I also don’t mean the kinds of deep conversations that we need to prepare ourselves to have. I’m talking about the ones where we learn. The ones we have with friends and family where we pick up tidbits of stories and preserve them for the future.

It sounds rather morbid but these conversations aren’t going to be available forever. Our older loved ones are getting older every year. Time marches on and we never know when that time is going to run out. I have several friends that have lost loved ones this year and still others that have realized the time is growing shorter. Mortality is something that reminds us how important those experiences can be.

This year, talk to your friends. Listen to the stories of your family. Make the time to really hear them. That might mean turning off the football game or skipping that post-turkey nap. But trust me when I say that you’ll appreciate that time more when you realize you won’t have it any more.

Play To Your Team Strengths

This past weekend I went to a training course for an event that I’m participating in next year. One of the quotes that came up during the course was about picking the team that will help you during the event. The quote sounded something like this:

Get the right people on the right bus in the right seats and figure out where you want to go.

Sounds simple, right? Right people, right bus, right seats. Not everyone is going to be a good fit for your team and even if they are they may not be in the right position to do their best work. But how do you know what they’re good at?

Not-So-Well Rounded

Last night, I listened to this excellent Art of Network Engineering episode. The guest was a friend of mine in the industry, Mike Bushong (@MBushong). He’s a very talented person and he knows how to lead people. He’s one of the people that I would love to work for given the opportunity. He’s also very astute and he has learned a lot of lessons about enabling people on a team.

One of the things he discussed in the episode was about people’s strengths. Not just things you’re good at but qualities you would say are your strongest assets. Aspects of you that you say are core to what you do. Maybe you’re good at writing emails but is that a strength? Or is concise communication your real strength? Are you good at seeing patterns? Or are you analytical? Every task you excel at isn’t a strength until you can see the underlying pieces that you really are good at doing.

One of the things that Mike brought up in the episode that really resonated with me was the idea that our performance evaluation system is really built around pointing out weaknesses and encouraging (or forcing) people to work on them. I can understand having people work on some core skills that are necessary in a business environment, such as time management or communications. You have to be good at those in order to succeed in your career.

However, having people pick up new skills or focus on aspects of what they do almost in a vacuum doesn’t really help as much as managers might think it does. Could you imagine telling someone like Steph Curry he needs to work on his dunking skills? Or perhaps telling someone like Steven Tyler that, while his singing is great, he really should spend more time playing the drums to be a more well-rounded band member?

The idea of telling professionals to concentrate on something other than their strength is ludicrous. Albert Pujols isn’t going to be the base stealing king for a baseball team. His strength is hitting home runs. Why try to make him into someone that runs instead of someone that hits? Yet when you ask a manager about reviewing someone’s performance they’ll tell you they need that team member to be well-rounded or they need to pick up this other skill that isn’t necessarily their strength but is needed.

I’m a decent writer. I’ve found over the past fifteen years that one of my strengths is in written communications. I can distill information and convey it to others through written words. I’ve managed to adapt that skill into being good at public speaking and giving presentations. What I would not be good at doing is being a full-time project manager with responsibilities for managing timelines for multiple teams. It’s because I realize that I need to be good enough at time management to get my work done but it is by no means a strength for me. For my manager to tell me to spend more time focusing on my weakness and less on my strength is doing me and the team a disservice.

Strong Bus

When you’re putting together your team, you need the right people on the bus. That part is pretty easy, right? But if you don’t know what they’re good at you don’t know if they’re the right people for your bus. If you have a team of people that are very detail oriented do you need to add another detail oriented person to do design work? Or would it be better to have someone with a strength in creativity? If you already have a group of introverts that work best in isolation do you want to add another? Or do you need an outgoing personality in a role that interfaces with other teams or customers?

When you’re putting the right people in the right seats on the bus, make sure you’re looking at playing to their strengths instead of forcing them into a role to help them grow. Most people love a good challenge or like to step outside of their comfort zone now and then. However not everyone is able to take on a role that is something they consider a weakness. Instead of pushing people to the point of being uncomfortable and making them the wrong person in the wrong seat we need to help them succeed. An example might be having someone with limited experience creating marketing material perhaps working to come up with written pieces of the deliverables instead of the entire design. Or maybe having them do a single flyer or handout instead of the entire job. Giving them the opportunity to stretch a little and succeed is a much better way to help them in their career instead of throwing them in the deep end of the pool and hoping they can swim.

Tom’s Take

You should listen to the entire podcast episode with Mike Bushong. It encapsulates why he’s such a well-liked and effective leader. He knows his limitations and he works to overcome them. He identifies the right people for his team and he puts them in the places they can make the most impact to get things accomplished. He knows that a good leader makes the people around them better by enabling them to succeed. Some of these lessons are things that I’ve learned over the past few years through Wood Badge and the way he’s phrased them helped me internalize them a bit better. Play to people’s strengths and you’ll be happily surprised at how far they can go with you.

Continuity is Not Recovery

It was a long weekend for me but it wasn’t quite as long as it could have been. The school district my son attends is in the middle of a ransomware attack. I got an email from them on Friday afternoon telling us to make sure that any district-owned assets are powered off until further notice to keep our home networks from being compromised. That’s pretty sound advice so we did it immediately.

I know that the folks working on the problem spent the whole weekend trying to clean it up and make sure there isn’t any chance of getting reinfected. However, I also wondered how that would impact school this week. The growing amount of coursework that happens online or is delivered via computer is large enough that going from that to a full stop of no devices is probably jarring. That got me to thinking once more about the difference between continuity and recovery.

Keeping The Lights On

We talk about disaster recovery a lot. Backups of any kind are designed to get back what was lost. Whether it’s a natural disaster or a security incident you want to be able to recover things back to the way they were before the disaster. We talk about making sure the data is protected and secured, whether from attackers or floods or accidental deletion. It’s a sound strategy but I feel it’s missing a key component.

Aside from how much of your data you get back, which is governed by the recovery point objective (RPO), you also need to consider how long it’s going to take to get you there. That’s called the recovery time objective (RTO). RTO tells you how long it will be until you can get your stuff back. For a few files the RTO could be minutes. For an entire data center it could be weeks. The RTO can even change based on the nature of the disaster. If you lose power to the building due to a natural disaster you may not even be able to start recovery for days, which will extend the RTO due to circumstances outside your control.
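To make the two numbers concrete, here is a minimal sketch. It assumes the usual definitions: RPO is the largest window of lost data you can tolerate, measured back from the incident to the last good backup, and RTO is the longest you can tolerate being down. The function name and every timestamp below are hypothetical, not details from any real incident.

```python
from datetime import datetime, timedelta


def recovery_report(last_backup, incident, restored, rpo, rto):
    """Compare an actual (hypothetical) recovery against RPO and RTO targets."""
    data_loss_window = incident - last_backup  # work created since the last backup
    downtime = restored - incident             # time until systems were usable again
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


report = recovery_report(
    last_backup=datetime(2024, 1, 5, 2, 0),   # nightly backup finished at 02:00
    incident=datetime(2024, 1, 5, 15, 0),     # systems went down at 15:00
    restored=datetime(2024, 1, 8, 8, 0),      # back online three days later at 08:00
    rpo=timedelta(hours=24),
    rto=timedelta(hours=48),
)
print(report["rpo_met"], report["rto_met"])  # True False
```

In this made-up case only 13 hours of data were at risk, so the RPO holds, but 65 hours of downtime blows past a 48 hour RTO. The data came back and the organization still spent almost three days unable to work.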

For a business or organization looking to stay up and running during a disaster, RTO is critical but so too is the need for business continuity. How critical is it? The category was renamed to “Disaster Recovery and Business Continuity” many years ago. It’s not enough to get your data back. You have to stay up and running as much as possible during the process. You’ve probably experienced this if you’ve ever been to a store that didn’t have working registers or the ability to process credit cards. How can you pay for something if you can’t ring it up or process a payment option?

Business continuity isn’t about the data. It’s about keeping the lights on while you recover it. In the case of my son’s school they’re going to teach the old fashioned way. Lectures and paper are going to replace videos and online quizzes. Teachers are thankfully very skilled in this manner. They’ve spent hundreds if not thousands of hours in a classroom instructing with a variety of techniques. Are your employees equally skilled when everything goes down? Could they get the job done if your Exchange Server goes down or they’re unable to log into Salesforce?

Back to Good, Eventually

In order to make sure you have a business left to recover you need to have some sort of a continuity plan. Especially in a world where cyberattacks are common you need to know what you have to do to keep things going while you work on fixing the damage. Most bad actors are counting on you not being able to conduct business as a driver to pay the ransom. If you’re losing thousands of dollars per minute you’re more likely to cave in and pay than try to spend days or weeks recovering.

Your continuity plan needs to exist separately from your backup RTO. It may sound pessimistic but you need to have a plan for what happens if the RTO is met but also one for what happens if you miss your RTO. You don’t want to count on a quick return to normal operations as your continuity plan only to find out you’re not going to get there.

The other important thing to keep in mind is that continuity plans need to be functional, not perfect. You use the systems you use for a reason. Credit card machines make processing payments quick and easy. If they’re down you’re not going to have the same functionality. Yes, using the old manual process with paper slips and carbon copies is a pain and takes time. It’s also the only way you’re going to be able to take those payments when you can’t use the computer.

You also need to plan around how to handle your continuity plan. If you’re suddenly using more paper, such as invoices or credit card slips, where do you store those? How will you process them once the systems come back online? Will you need to destroy anything after it’s entered? Does that need to happen in a special way? All of these questions should be asked now so there is time to debate them instead of waiting until you’re in the middle of a disaster to solve them.

Tom’s Take

Disasters are never fun and we never really want them to happen. However, we need to make sure we’re ready when they do. You need to have a plan for how to get everything back as well as how to keep doing everything you can until that happens. You may not be able to do 100% of the things you could before but if you don’t try to at least do some of them you’re going to lose a lot more in the long run. Have a plan and make sure everyone knows what to do when disaster strikes. Don’t count on getting everything back as the only way to recover.