About networkingnerd

Tom Hollingsworth, CCIE #29213, is a former network engineer and current organizer for Tech Field Day. Tom has been in the IT industry since 2002, and has been a nerd since he first drew breath.

Who Pays The Price of Redundancy?

No doubt by now you’ve seen the big fire that took out a portion of the OVHcloud data center earlier this week. These kinds of things are difficult to deal with on a good day. This is why data centers have redundant power feeds, fire suppression systems, and the ability to get back up to full capacity. Modern data centers are getting very good at ensuring they can stay up through most events that could impact an on-premises private data center.

One of the issues I saw that was ancillary to the OVHcloud outage was the small group of people that were frustrated that their systems went down when the fire knocked out the racks where their instances lived. More than a couple of comments mentioned that clouds should not go down like this or asked about credit for time spent being offline or some form of complaints about unavailability. By and large, most of those complaining were running non-critical systems or were using the cheapest possible instances for their hosts.

Aside from the myopia that “cloud shouldn’t go down”, how do we deal with this idea that cloud redundancy doesn’t always translate to single instance availability? I think we need to step back and educate people about their responsibilities to their own systems and who ultimately pays for redundancy.

Backup Plans

I mentioned earlier that big accidents or incidents are the reasons why public cloud data centers have redundant systems. They have separate power feeds, generator power backups, extra cabling to prevent cuts from taking down systems, redundant cooling systems to keep things cold, and even redundant network connectivity across a variety of providers.

What is the purpose of all of this redundancy? Is it for the customers or the provider? The reason why all this exists is because the provider needs to ensure that it will take a massive issue to interrupt service to the customer. In a private data center you can be knocked offline when the primary link goes down or a UPS decides to die on you. In a public data center you could knock out ten customers with a dead rack PDU. So it is in their best interests to ensure that they have the right redundancy built in to keep their customers happy.

Public data centers pay for all of this redundancy to keep their customers coming back each month. If your data center gets knocked offline for some simple issue you can better believe you’re going to be shopping for a new partner. You’re paying for their redundancy with your monthly billing cycle. Sure, that massive generator may be sitting there just in case they need it. But they’re recouping the cost of having it around by charging you a few extra cents each cycle.
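As a back-of-the-envelope illustration, with every number made up, amortizing a capital expense across a big customer base really does shrink it to pocket change:

\[
\frac{\$150{,}000\ \text{generator}}{10\ \text{yr} \times 12\ \tfrac{\text{mo}}{\text{yr}} \times 25{,}000\ \text{customers}} \approx \$0.05\ \text{per customer per month}
\]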

Extending the Backup Bubble

What about providing redundancy for your applications and servers, though? Like the OVHcloud issue above, why doesn’t the provider just back my stuff up or move it to a different server when everything goes down? I mean, vMotion and DRS do it in my private data center. Can’t they just check a box and make it happen?

There are two main reasons why this doesn’t happen in public cloud right now. The first is pretty easy. Having to back up, restore, replicate, and manage customer availability is going to take more than a few extra hands working on the customer infrastructure. Sure, they could configure vMotion (or something similar) to send your VMs to a different rack if one were to go offline. But who keeps tabs on that to make sure it happens? Who tests the failover to keep it consistent? What happens if there is a split brain scenario? Where is the password for that server stored?

You’re probably answering all of these questions off the top of your head because you’re the IT expert, right? So if the cloud provider is doing this for you, what are you going to be doing? Clicking “next” on the installation prompt? If your tasks are being done by some other engineer in the cloud, what are we paying you for again? Just like automation, having a cloud engineer do your job for you means we don’t need to pay you any longer.

The second reason is liability. Right now, if there is an incident that knocks a cloud provider offline they’re liable for the downtime for all their customers. Most of the contracts have a force majeure clause built into them that exempts liability for extraordinary circumstances, such as fire, weather, or even terrorist activity. That way the provider doesn’t need to pay you back for something there was no way to have foreseen. If there is some kind of outage caused by some technical issue then they will owe you for that one.

However, if the cloud provider starts managing your equipment and services for you then they are liable if there is an outage. If they screw up the vMotion settings or DRS starts getting too aggressive and migrates a VM to a bad host, who is responsible for the downtime? If it’s you then you get yelled at and the company loses money. If it’s the provider managing for you then the provider gets yelled at, threatened, and possibly litigated against to recover the lost income. See now why no provider wants to touch your stuff? We used to have a rule when I worked at a VAR that you better be very careful about which problems you decided to fix out of the project scope. Because as soon as you touch those they become your problems and you’re on the hook to fix them.

Lastly, the provider isn’t responsible for your redundancy for one other simple reason: you’re not paying them for it. If Amazon or Microsoft or Google offered a hosting package that included server replication and monitoring and 99.9999% uptime of your application data do you think it would cost the same as the basic instance pricing? You’d better believe it wouldn’t! These companies would be happy to sell you just what you’re looking for but you aren’t going to want to pay the price for it. It’s easy to build in the cost of a generator spread across hundreds or thousands of customers. But if you want someone backing your data up every day and validating it you’re going to be paying the lion’s share of the cost. And most of the people using low-cost providers for non-critical workloads aren’t going to want to pay extra anyway.
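For context on what six nines actually promises, the arithmetic is simple:

\[
(1 - 0.999999) \times 365 \times 24 \times 3600\ \text{s} \approx 31.5\ \text{seconds of downtime per year}
\]

Engineering your application down to barely half a minute of unavailability per year takes far more than a shared generator, and the bill would reflect it.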


Tom’s Take

I get it. Cloud is nice and it’s always there. Cloud is easier to build out than your on-prem data center. Cloud makes life easy. However, cloud is not magic. Just because AWS doesn’t have outages every month doesn’t mean your servers won’t go down if you’re not backing them up. Just because you have availability zones you can use doesn’t mean data is magically going to show up in Oregon because you want it to. You’re going to have to pay the cost to make your infrastructure redundant on top of the provider infrastructure. That means money invested in software, time invested in deployment, and workers invested in making sure it all checks out when you need it to be there to catch your outages. Don’t assume the cloud is going to solve every one of your challenges. Because if it did we’d be paying a lot more for it and IT workers wouldn’t be getting paid at all.

Tom’s Virtual Corner at Cisco Live Global 2021 – Anniversary

We made it through the year that was March 2020. Here we are on the other side trying to figure out what this new normal is supposed to look like. We’re not out of the woods yet but we do know that things aren’t going to be back to the way they were any time soon. That includes the events that we enjoyed traveling to and hanging out at.

Cisco Live has made the decision to go virtual again this year. One can’t blame them to be honest. Travel uncertainty and the potential liability of having a huge event just didn’t make sense. If you thought the old Conference Crud was bad you really don’t want this new-and-improved version! Cisco has also decided that one global event makes more sense than several events scattered across the calendar. That means that Cisco Live Europe and Cisco Live US are now global and happening at the end of March instead of January or June.

With the announcement that everything will be virtual again this year it also means that the social aspect of the event is going to be virtual as well. As much as we would have liked to hang out at Tom’s Corner in Las Vegas and catch up it just won’t be happening. I am a bit disappointed because this year was going to be the tenth anniversary of Tom’s Corner. It’s been a decade since I plopped down in an uncomfortable chair and set out on a quest to meet people. I was hoping to bring back the stool and the table and the corner live in person but it appears that we’ll have to do it some other time.

In the interim, we’re still going to be getting together in just a few weeks! We’re going to be hosting a Webex call just like we did last year so we can all hang out and chat throughout the day. Here are the details:

Date: March 29, 2021

Time: 7am PT – Whenever we get bored!

Where: Virtually via Webex (link will be in the calendar invite)

If you want to take part in the virtual corner, you need to either send an email to tom@networkingnerd.net or send a DM/PM/RRM/DMVPN to me on social media with the email address you want your calendar invitation sent to. If there is no email address I can’t send out the invitation! We’ll send them out as we can to get everyone signed up. You can pop in and out whenever you want. We’ll try to create breakout rooms for those that just want to chat on specific topics and such. It’ll be fun! Provided, that is, that we all keep to some basic ground rules:

  • Disruptive attendees may be removed at the discretion of the hosts.
  • Wheaton’s Law applies. If you think you’re violating it, you probably are.
  • Be respectful of the community. We want this to be positive for everyone and that means everyone is responsible for keeping it that way.

If you have any questions about Tom’s Corner or things in general just ask! I want to thank Gestalt IT for allowing us to host this meeting and continuing to support community engagement and involvement. A huge part of my job is doing things like this and I still continue to be thrilled to have the support of Stephen Foskett, Claire Chaplais, and the rest of the Gestalt IT team when it comes to taking part of my day to interact with you all, virtual or otherwise.

I wish things could have been different as far as Cisco Live this year but we’re going to make the most of what we have. That means virtual for now but rest assured that the next time we can all meet in person we’re going to have a birthday party for Tom’s Corner that will be sung of in legends to come!

Building Snowflakes On Purpose

We all know that building snowflake networks is bad, right? If it’s not a repeatable process it’s going to end up being a problem down the road. If we can’t refer back to documentation that shows why we did something we’re going to end up causing issues and reducing reliability. But what happens when a snowflake process is required to fix a bigger problem? It’s a fun story that highlights where process can break down sometimes.

Reloaded

I’ve mentioned before that I spent about six months doing telephone tech support for Gateway computers. This was back in 2003 so Windows XP was the hottest operating system out there. The nature of support means that you’re going to be spending more time working on older things. In my case this was Windows 95 and 98. Windows 98 was a pain but it was easy to work on.

One of the most common processes we had for Windows 98 was a system reload. It was the last line of defense to fix massive issues or remove viruses. It was something that was second nature to any of the technicians on the help desk:

  1. Boot from the Gateway tools CD and use GWSCAN to write zeros to the hard drive.
  2. Reboot from the CD and use FDISK to partition the hard disk.
  3. Format the drive.
  4. Insert the Windows 98 OS CD and copy the CAB installation files to a folder on the hard drive.
  5. Run setup from the hard drive and let it complete.
  6. When Windows comes back up, insert the driver CD and let it install all the drivers.

The whole process took two or three phone calls to complete. Any time you got to something that would take more than fifteen minutes to complete you logged the steps in the customer trouble ticket and had them call back when it was completed. The process was so standard that it had its own acronym in the documentation – FFR, which stood for “FDISK, Format, Reload”. If you told someone where you were in the process they could finish it no problem.

Me, Me, ME

The whole process was manual with lots of steps and could intimidate customers. At some point in the development process the Gateway folks came up with a solution for Windows ME that they thought worked better. Instead of the manual steps of copying files and drivers and such, Gateway loaded the OS CD with a copy of the image they wanted for their specific model type. The image was installed using ImageCast, an imaging program that just dropped the image down on the drive without the need to do all the other steps. In theory, it was simple and reduced call times for the help desk.

In practice, Windows ME was a disaster to work on. The ImageCast program worked about half the time. If you didn’t pick the right options in the reload process it would partition the hard drive and add a second clean copy of WinME without removing the first one. It would change the MBR to have two installations to choose from with the same identifiers so users would get confused as to which was which. And the image itself seemed to be missing programs and drivers. The fact that there was a Driver CD that shipped with the system made us all wonder what the real idea behind this “improved” process was.

Because Windows ME was such a nightmare to reload, our call center got creative. We had a process that worked for Windows 98. We had all the files we needed on the disks. Why not do it the “right” way? So we did. Informally, the reload process for Windows ME was the same as Windows 98. We would FFR Windows ME boxes and make sure they looked right when they came back. No crazy ImageCasting programs or broken software loads.

The only issue? It was an informal snowflake process. It worked much better but if someone called the main help desk number and got another call center they would be in the middle of an unsupported process. The other call center tech would simply start the regular process and screw up the work we’d done already. To counter that, we would tell the customer to call our special callback voicemail box and not do anything until we called them back. That meant the reload process took many more hours for an already unhappy customer. The end result was better but could lead to frustrations.

Let It Snow

Was our informal snowflake process good or bad? It’s tough to say. It led to happier customers. It meant the likelihood of future support calls was lower because the system was properly reloaded instead of relying on a broken image. However, it was stressful for the tech that worked on the ticket because they had to own the process the whole way through. It also meant that you had to ensure the customer wouldn’t call back and disrupt the process with someone else on the phone.

The process was broken and it needed to be fixed. However, the way to fix the broken process wasn’t easy to figure out. The national line had their process and they were going to stick to it. We came up with an alternative but it wasn’t going to be adopted. Yet, we still kept using our process as often as possible because we felt we were right.

In your enterprise, you need to understand process. Small companies need processes that are repeatable for the ease of their employees. Large companies need processes to keep things consistent. Many companies that face regulatory oversight need processes followed exactly to ensure compliance. The ability to go rogue and just do it the way you want isn’t always desired.

If you have users that are going around the process you need to find out why. We see it all the time. Security rules that get ignored. Documentation requirements that are given the bare minimum effort. Remember the shutdown dialog box in Windows Server 2003? Most people did a token entry before rebooting. And having eighteen entries of “;lkj;kljl;k” doesn’t help figure out what’s going on.

Get your users to tell you why they need a snowflake process. Especially if there’s more than one person relying on it. The odds are very good that they’ve found hiccups that you need to address. Maybe it’s data entry for old systems. Perhaps it’s a problem with the requirements or the order of the steps. Whatever it is you have to understand it and fix it. The snowflake may be a better way to do things. You have to investigate and figure it out, otherwise your processes will be forever broken.


Tom’s Take

Inbound tech support is full of stories like these. We find a broken process and we go around it. It’s not always the best solution but it’s the one that works for us. However, building a toolbox full of snowflake processes and solutions that no one else knows about is a path to headaches. You need to model your process around what people need to accomplish and how they do it instead of just assuming they’re going to shoehorn their workflow into your checklist. If your process doesn’t work for your people, you’d better get a handle on the situation before you’re buried in a blizzard of snowflakes.

Tech Field Day Changed My Life

It’s amazing to me that it’s been ten years since I attended my first Tech Field Day event. I remember being excited to be invited to Tech Field Day 5 and then having to rush out of town a day early to beat a blizzard to be able to attend. Given that we just went through another blizzard here I thought the timing was appropriate.

How did attending an industry event change my life? How could something with only a dozen people over a couple of days change the way I looked at my career? I know I’ve mentioned parts of this to people in the past but I feel like it’s important to talk about how each piece of the puzzle built on the rest to get me to where I am today.

Voices Carry

The first thing Tech Field Day did to change my life was to show me that I mattered. I grew up in a very small town and spent most of my formative school years being bored. The Internet didn’t exist in a usable form for me. I devoured information wherever I could find it. And I languished as I realized that I needed more to keep learning at the pace I wanted. When I finally got through college and started working in my career the same thing kept happening. I would learn about a subject and keep devouring that knowledge until I exhausted it. Yet I still wanted more.

Tech Field Day reinforced that my decision to start a blog to share what I was learning was the right one. It wasn’t as much about the learning as it was the explanation. Early on I thought a blog was just about finding some esoteric configuration stanza and writing about it. It wasn’t until later on that I figured out that my analysis and understanding and explanation was more important overall. Even my latest posts about more “soft skill” kinds of ideas are less about the ideas themselves and more about how I apply them.

Blogging and podcasting are just tools to share the ideas that we have. We all have our own perspectives and people enjoy listening to those. They may not always agree. They may have their own opinions that they want to share. However, the part that is super critical is that everyone is able to share in a place where they can be discussed and analyzed and understood. As long as we all learn and grow from what we share then the process works. It’s when we stop learning and sharing and instead insist that our way is the right and only way that we stop growing.

Tech Field Day gave me the platform to see that my voice mattered and that people listened. Not just read. Not just shared. That they listened and that they wanted to hear more. People started asking me to comment on things outside of my comfort zone. Maybe it was wireless networking. It could have been storage or virtualization or even AI. It encouraged me to learn more and more because who I was and what I said was interesting. The young kid that could never find someone to listen when I wanted to talk about Star Wars or BattleTech or Advanced Dungeons and Dragons was suddenly the adult that everyone wanted to ask questions to. It changed the way I looked at how I shared with people for the better.

Not Just a Member, But the President

The second way Tech Field Day changed my life was when I’d finally had enough of what I was doing. Because of all the things that I had seen in my events from 2011 to 2013, I realized that working as an engineer and operations person for a reseller had a ceiling I was quickly going to hit. The challenges were less fun and more frustrating. I could see technology on the horizon and I didn’t have a path to get to a place to implement it. It felt like watching something cool happening outside in the yard while I was stuck inside washing the dishes.

Thankfully, Stephen Foskett knew what I needed to hear. When I expressed frustration he encouraged me to look around for what I wanted. When I tried to find a different line of work and kept running into employers who didn’t understand why I blogged, it crystallized in me that I needed something very different from what I was doing. Changing who I was working for wasn’t enough. I needed something different.

Stephen recognized that and told me he wanted me to come on board with him. No joke: my job offer was “Do you want to be the Dread Pirate Roberts? I think you’d make an excellent Dread Pirate.” He told me that it was hard work and unlike anything I’d ever done. No more CLI. No more router installations. In place of that would be event planning and video editing and taking briefings from companies all over the place about what they were building. I laughed and told him I was in.

And for the past eight years I’ve been a part of the thing that showed me that my voice mattered. As I learned the ropes to support the events and eventually started running them myself, I also grew as a person in a different way. I stopped being shy and reserved and came out of my shell. When you’re the face of the event you don’t have time to be hiding in the corner. I learned how to talk to people. I also learned how to listen and not just wait for my turn to talk. I figured out how to get people to talk about themselves when they didn’t want to.

Now the person I am is different from the nerdy kid that started a blog over ten years ago. It’s not just that I know more. Or that I’m willing to share it with people. It has now changed into getting info and sharing it. It’s about finding great people and building them up like I was built up. Every time I see someone come to the event for the first time I’m reminded of me all those years ago trying to figure out what I’d gotten myself into. Watching people learn the same things I’ve learned all over again warms my heart and shows me that we can change people for the better by showing them what they’re capable of and that they matter.


Tom’s Take

Tech Field Day isn’t an event of thousands. It’s personal and important to those that attend and participate. It’s not going to stop global warming or save the whales. Instead, it’s about the people that come. It’s about showing them they matter and that they have a voice and that people listen. It’s about helping people grow and become something they may not even realize they’re capable of. I know I sound biased because they pay the bills but even if I didn’t work there right now I would still be thankful for my time as a delegate and for the way that I was able to grow from those early days into a better member of the community. My life was changed when I got on that airplane ten years ago and I couldn’t be happier.

Solutions In Search of a Problem

During a few recent chats with my friends in the industry, I’ve heard a common refrain coming up about technologies or products being offered for sale. Typically these are advanced ideas given form that are then positioned as products for sale in the market. Overwhelmingly the feedback comes down to one phrase:

This is a solution in search of a problem.

We’ve probably said this a number of times about a protocol or a piece of hardware. Something that seems to be built to solve a problem we don’t have and couldn’t conceive of. But why does this seem to happen? And what can we do to fix this kind of mentality?

Forward Looking Failures

If I told you today that I was creating software that would revolutionize the way your autonomous car delivers music to the occupants on their VR headsets you’d probably think I was crazy, right? Every one of the technologies I mentioned in the statement is a future thing that we expect may be big down the road. We love the idea of autonomous vehicles and VR headsets and such.

Now, let’s change the statement. I’m working on a new algorithm for HD-DVD players to produce better color accuracy on plasma TVs that use PowerPC CPUs. Hopefully that statement had you giggling a little no matter what your tech level. What’s the difference? Well, that statement was loaded with technology that no one uses any more. HD-DVD lost a format war against Blu-Ray. Plasma TVs are now supplanted by LCD, LED, and even more advanced things. PowerPC has been replaced by newer architectures like ARM and more modern takes on efficient CPUs in mobile devices.

If you’d have bet on the second combination of things back in the heyday of those technologies you might have made yourself a bit of money. You’d ultimately find yourself without a product to sell now, though. Because technology always changes. Even the dominant form of tech eventually goes away. Blu-Ray may have beat HD-DVD but it couldn’t stop streaming services. LCD replaced plasma but now we’re moving beyond that tech into OLED and even more advanced stuff. You can’t count on tech staying the same.

Which leads to the problem of trying to create solutions for problems that haven’t happened yet or are so far out on the horizon that you may not be able to create a proper solution for them. Maybe VR headsets will have great software that doesn’t need a new music match algorithm. Maybe the passengers in your autonomous vehicle won’t wear VR headsets. Perhaps music as we know it will change and not even be as relevant in the future. There’s no telling which butterfly effects will impact what you’re trying to accomplish.

Solve the Easy Things

Aside from the future problems you hope to be solving with your fancy new product you also have to take into account human behavior. Are people more likely to buy something to solve an issue they don’t currently have? Or are they more apt to buy something to solve a problem they have now? Startups that are looking five years into the future are going to stumble over the problems people have today on their way to the perfect answer to a question no one has asked yet.

I wanted a tablet because it was cool when they first came out. After using one for a few weeks I realized that it was a solution that didn’t address my pressing issues. I didn’t need what it offered at the time. Today a tablet solves many other issues that have come up since then, such as note taking or having quick access to information away from my desk. However, those problems needed to develop over time instead of hoping that my solution would work for something I couldn’t anticipate. I didn’t need a word processor for my tablet because I wouldn’t be typing much with an on-screen keyboard. Today I write a lot on my tablet because of the convenience factor. I also take notes because I have a pencil to write with instead of my fingers.

Solving problems people have right now is a surefire way to make your customers happy and give you the breathing room to look to the future. How many times have you seen a startup with a great idea that ends up building something mundane because they can’t build the first thing right or they realize the market isn’t quite there yet?

I can remember specifically talking to Guardicore when they were first out of stealth and discussing how their SDN-based offensive security systems worked. It was amazing stuff with very little market. When they looked around and realized they needed to switch it up they went full-on into zero trust security and microsegmentation. They took something that could be a great solution later on and pivoted to solving problems that people have right now. The result is a healthy company that makes things people want to buy instead of trying to sell them a solution for a problem they may never have.

If you are looking at the market and thinking to yourself, “I need to build X because it will revolutionize the way we do things” stop and ask yourself how we get there. What steps need to be taken? Who will buy it and when? Are there problems along the way? If the answer to the last question is anything other than “no” you need to focus on those problems first. You may find that you don’t need to build your fancy new vision of perfect future success because you solved all the other problems people needed fixed first. Your development efforts will be rewarded with customers and income instead of the perfect solution no one wants to buy.


Tom’s Take

Solutions without problems to solve are a lot like one-off kitchen gadgets. I may have a use for an avocado slicer twice a year. I also have a knife that does the exact same thing a little slower that I can use for many other problems around my house. I don’t need the perfect avocado slicing solution for the future when I’m making guacamole and avocado toast every day. I need a solution that gets my problems of slicing, chopping, dicing, and cutting done today. Technology is no different. Build what solves problems now and you’ll be a success. Build for the future if and only if you have the disposable time and income to get there.

Friction Finders

Do you have a door that sticks in your house? If it’s made out of wood the odds are good that you do. The kind that doesn’t shut properly or sticks out just a touch too far and doesn’t glide open like it used to. I’ve dealt with these kinds of things for years and YouTube is full of useful tricks to fix them. But all those videos start with the same tip: you have to find the place where the door is rubbing before you can fix it.

Enterprise IT is no different. We have to find the source of friction before we can hope to repair it. Whether it’s friction between people and hardware, users and software, or teams going at each other we have to know what’s causing the commotion before we can repair it. Just like with the sticking door, adding more force without understanding the friction points isn’t a long-term solution.

Sticky Wickets

Friction comes from a variety of sources. People don’t understand how to use a device or a program. Perhaps it’s a struggle to understand who is supposed to be in charge of a change control or a provisioning process. It could even be as simple as someone carrying outside issues into their regular day and causing problems because their interactions with previously stable systems become strained.

Whatever the reasons for the friction, we need to understand what’s going on before we can start fixing. If we don’t know what we’re trying to solve we’re just going to throw solutions at the wall until something sticks. Then we hope that was the fix and we move on. Shotgun troubleshooting is never a permanent solution to anything. Instead, we need to take repeatable, documented steps to resolve the points of friction.

Ironically enough, the easiest issue to solve is the interpersonal kind. When teams argue about roles or permissions or even who should have the desks at the front of the data center it’s almost always a problem of a person against another person. You can’t patch people. You can’t upgrade people. You can’t even remove people and replace them with a new model (unless it’s a much bigger issue and HR needs to solve it). Instead, it’s a chance to get people talking. Be productive and make sure that everyone knows the outcome is the resolution of the problem. It’s not name calling or posturing. Lay out what the friction point is and make people talk about that. If you can keep them focused on the problem at hand and not at each other’s throats you should be able to get everyone to recognize what needs to change.

Mankind Versus Machines

People fighting people is a people problem. But people against the enterprise system isn’t always a cut-and-dried situation. That’s because machines are predictable in so many different kinds of ways. You can be sure that what you’re going to get is the right answer every time but you may not be able to ask the right questions to find that answer the way you want to.

Think back to how many times you’ve diagnosed a problem only to hit a wall. Maybe it’s that the CPU utilization on a device is higher than it should be. What next? Is it some software application? A bug in the system? Maybe a piece of malware running rampant? If it’s a networking device is it because of a packet flow causing issues? Or a failed table lookup causing a process switch to happen? There are a multitude of things that need to be investigated before you can decide on a course of action.

This is why shotgun troubleshooting is so detrimental to reducing friction. How do we know that the solution we tried isn’t making the problem worse? I wrote over ten years ago about removing fixes that don’t address the problem and it’s still very true today. If you don’t back out the things that didn’t address the issue you’re not only leaving bad fixes in place but you are potentially causing problems down the line when those changes impact other things.

Finding the sources of friction in your systems takes in-depth troubleshooting. Your users just want the problem to go away. They don’t want to answer forty questions about when it started or what’s been installed. They don’t want the perfect solution that ensures the problem never comes back. They just want to be able to work right now and then keep moving forward. That means you need a two-step process of triage and investigation. Get the problem under control and then investigate the friction points. Figure out how it happened and fix it after you get the users back up and running.

Lastly, document the issue and what resolved it. Write it all down somewhere, even if it’s just in your own notes. But if you do that, make sure you have a way of indexing everything so you can refer back to it at some point in the future. Part of the reason why I started this blog was to write down solutions to problems I discovered or changes I made along the way to ensure that I could always look them up. If you can’t publish them on the Internet or don’t feel comfortable writing it all up, at least use a personal database system like Notion to keep it all in one searchable place. That way you don’t forget how clever you are and go back to reinventing the wheel over and over again every time the problem comes up.
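If even Notion feels like too much, a few lines of scripting will do the job. Here’s a minimal sketch in Python with a made-up schema; it assumes your Python build bundles SQLite with the FTS5 extension, which most do:

    import sqlite3

    # A tiny, searchable fix journal. The schema is invented for illustration.
    conn = sqlite3.connect("fixes.db")
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS fixes USING fts5(title, symptom, resolution)"
    )
    conn.execute(
        "INSERT INTO fixes VALUES (?, ?, ?)",
        (
            "High CPU on edge router",
            "CPU pegged at 99% after last code upgrade",
            "Backed out the earlier 'fix' and disabled the debug left running on the box",
        ),
    )
    conn.commit()

    # Months later, full-text search brings your clever solution back to you.
    for title, resolution in conn.execute(
        "SELECT title, resolution FROM fixes WHERE fixes MATCH ?", ("CPU",)
    ):
        print(title, "->", resolution)

Anything that gets the resolution out of your head and into something greppable counts.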


Tom’s Take

Friction exists everywhere. Sometimes it’s necessary to keep us from sliding all over the road or keeping a rug in place in the living room. In enterprise IT friction can create issues with systems and teams. Reducing it as much as possible keeps the teams working together and being as productive as possible. You can’t eliminate it completely and you shouldn’t remove it just for the sake of getting rid of something. Instead, analyze what you need to improve or fix and document how you did it. Like any door repair, the results should be immediate and satisfying.

The Double-Edged Grindstone

Are you doing okay out there? I hope that you’re well and not running yourself ragged with all the craziness still going on. Sometimes it seems like we can’t catch a break and that work and everything keep us going all the time. In fact, that specific feeling and the resulting drive around it is what I wanted to talk about today.

People have drive. We want to be better. We want to learn and grow and change. Whether it’s getting a faster time running a 5K or learning new skills to help our career along. Humans can do amazing things given the right motivation and resource availability. I know because I taught myself a semester of macroeconomics in a Waffle House the night before the final exam. Sure, I was groggy and crashed for a 10-hour nap after the final but I did pass!

It’s that kind of ability to push ourselves past our limits that both defines us and threatens to destroy us. I’m a huge fan of reading and fiction. Growing up I latched on to the BattleTech novels, especially those written by Michael A. Stackpole. In his book Lost Destiny there is a great discussion about honing your skills and what it may end up costing you:

Were we to proceed, it would be the battle of the knife against the grindstone. Yes, we would get sharper, we would win great victories, but in the end, we would be ground away to nothing.

Running in Place

That quote hit home for me during my CCIE lab attempt training. I spent my free time after work labbing everything I could get my hands on. I would log on about 8pm every night after my eldest went to sleep and lab until midnight. Every night was the same chore. No time for television or reading anything other than a Cisco Press book. Instead, I drilled until I could provision EtherChannels in my sleep and could redistribute routes without a second thought in four different ways. I felt like I was getting so much accomplished!

I also felt tired all the time. I had no outlet to relax. I spent every waking minute of my free time focused on sharpening my skills. As above, you could practically hear the edge of the knife on the grindstone. I was razor-focused on completing this task. As I learned later in life, the joys of hyperfocus in ADHD had a lot to do with that.

It wasn’t until I got through my lab that the true measure of this honing process hit home. I spent the weekend after my attempt hanging out with some friends away from technology and the whole time I felt unsettled. I watched horrible movies on the Sci-Fi channel and couldn’t get comfortable. I was antsy and wound up, even though I’d just completed a huge milestone! It wasn’t until about a week later that I was finally able to put a name to this anxiety. I listened to the little voice in my head repeating over and over again in my downtime: “You really should be studying something right now.”

I was in search of my next grindstone. My knife was sharp, but I knew it could be sharper still with the next certification or piece of knowledge. It took me a while before I could quiet that voice and focus on restoring some semblance of my life as I had known it before my lab attempts. Even now there are times when I feel like I should be studying or writing or creating something instead of unwinding with a book or a camping trip to the woods.

The rush of dopamine that we get from learning new things or performing skills to perfection cannot be overstated. It makes us feel good. It makes us want to keep doing it to continue that stream of good feelings. We could focus on it to the detriment of our other hobbies or our social lives. And that was in the time before when we could go out whenever we wanted! In the current state of the pandemic it’s easy to get wrapped up in something without the ability to force yourself to walk away from it, as anyone with a room half-filled with gear for a new hobby can tell you.

Setting Goals with Limits

What’s the solution then? Do we just keep grinding away until there’s nothing left? Do we become the best left-handed underwater basket weaver that has ever existed? Do we keep forcing ourselves to run the same video game level over and over again until we’re perfect, even if that means not getting out of our house for weeks at a time? Can we do something repeatedly until we are destroyed by it?

The key is to set goals but to make them in such a way as to set limits on them. We do this all the time in the opposite direction. When we have a task we don’t like to do we force ourselves to do it for a set amount of time. Maybe it’s an hour of reconciling the checkbook or thirty minutes of exercise. But if it’s something you enjoy you have to set limits on it as well to avoid that burnout.

Find something you really like to do, such as reading a book. However, instead of devouring that book until it’s finished in one sitting and staying up until 4am to finish it, set an alarm to cut yourself off after an hour or two. Be honest with yourself. When the alarm goes off, stop reading and do something else. You could even pair it with a task you like less to use the little dopamine boost to help you through your other activity.

It’s not fun to stop doing something we like doing even when it’s something that is going to make us a better person or better employee. However, you’ll soon see that having that extra time to reinforce what you’ve learned or collect your thoughts does more for you than just binging something until you’re completely exhausted by it.

Remember when TV shows came out weekly? We had to wait until next Thursday for the new episode? Remember how much you looked forward to that day? Or maybe even the return of the new season in the fall? That kind of anticipation helps motivate you. Being able to consume all the content in one sitting for 8-10 hours leaves you feeling great at first. Later, when your dopamine goes back to normal you’re going to feel down. You’ll also realize you can’t get the same hit again because you have consumed your current resource of it. By limiting what you can do at one time you’re going to find that you can keep that great feeling going without burning yourself out.

Yes, this does absolutely apply to studying for things. Your brain needs time to lock in the knowledge. If you’ve ever tried to memorize something you know that you need to spend time thinking of something else and come back to that item before you really learn it. The only thing that transfers knowledge from short-term memory to long-term memory is time. There are no shortcuts. And the more you press the edge of the knife against the stone the more you lose in the long run because the available resources are just gone.


Tom’s Take

I’m not qualified to delve into the psychoanalysis part of all this stuff. I just know how it works for me because that’s who I am. I can shut out the world and plow through something for hours at a time if I want. I’ve done it many times in my career with both work and personal tasks. But it’s taken a long time for me to finally realize that sprinting like that for extended periods of time will eventually wear you away to nothing. Even the best runners in the world need to rest. Even the smartest people in the industry need to not think about things for a while. You do too. Take some time today or tomorrow or even next week to set goals and limits for yourself. You’ll find that you enjoy the things you do and learn more with those limits in place and you’ll wind up a happier, healthier person. You’ll be sharp and ready instead of a pile of dust under the grindstone.

Planning For The Worst Case You Can’t Think Of

Remember that Slack outage earlier this month? The one that happened when we all got back from vacation and tried to jump on to share cat memes and emojis? We all chalked it up to gremlins and went on going through our pile of email until it came back up. The post-mortem came out yesterday and there were two things that were interesting to me. Both of them have implications on reliability planning and how we handle the worst-case scenarios we come up with.

It’s Out of Our Hands

The first thing that came up in the report was that the specific cause for the outage came from an AWS Transit Gateway not being able to scale fast enough to handle the demand spike that came when we all went back to work on the morning of January 4th. What, the cloud can’t scale?

The cloud is practically limitless when it comes to resources. We can create instances with massive CPU resources or storage allocations or even networking pipelines. However, we can’t create them instantly. No matter how much we need it takes time to do the basic provisioning to get it up and running. It’s the old story of eating an elephant. We do it one bite at a time. Normally we tell the story to talk about breaking a task down into smaller parts. In this case, it’s a reminder that even the biggest thing out there has to be dealt with in small pieces as well.

Slack learned this lesson the hard way. Why? Because they couldn’t foresee a time when their service was so popular that the amount of traffic rushing to their servers crushed the transit gateways. Other companies have had to learn this lesson the hard way too. Disney+ crawled on launch day because of demand. The release of a new game that requires downloading and patching on Day One also puts stress on servers. Even the lines outside of department stores on Black Friday (in less pandemic-driven years) are examples of what happens when capacity planning doesn’t meet demand.

When you plan for your worst case scenario, you have to think the unthinkable. Instead of asking yourself what might happen if everyone logs on at the same time you also need to ask what happens if they try and something goes wrong. I spent a lot of time in my former job thinking about simple little exercises like VDI boot storms, where office workers can push a storage system to the breaking point by simply turning all their machines on at the same time. It’s the equivalent of being on a shared network resource like a cable modem during the Super Bowl. There aren’t any resources available for you to use.

When we plan for capacity, we have to realize that even our most optimistic projections of usage are going to be conservative if we take off or go viral. Rather than guessing what that might be, take an hour every six months and readjust your projections. See how fast you’re growing. Plan for that crazy scenario where everyone decides to log on at the same time on a day where no one has had their coffee yet. And be ready for what happens when someone throws a wrench into the middle of the process.
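That six-month check doesn’t need to be fancy. A napkin-math sketch in Python, with every number invented, is enough to tell you how much runway you have:

    import math

    # Hypothetical headroom check: if peak usage grows by some percentage
    # each month, how many months until you hit your provisioned ceiling?
    current_peak = 40_000     # concurrent users today (made up)
    ceiling = 100_000         # what the infrastructure is sized for (made up)
    monthly_growth = 0.12     # 12% month-over-month growth (made up)

    months = math.log(ceiling / current_peak) / math.log(1 + monthly_growth)
    print(f"~{months:.1f} months of headroom at {monthly_growth:.0%} growth")

If that prints a number smaller than your procurement lead time, the crazy scenario is already on the calendar.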

What Happens When It Goes Wrong?

The second thing that came up in the Slack post-mortem that is just as worrisome was the behavior of the application when it realized there was a connection timeout. The app started waiting for the pathway to be open again. And guess what happened when AWS was able to scale the transit gateway? Slack clients started hammering the servers with connection requests. The result was something akin to a Distributed Denial-of-Service (DDoS) attack.

Why would the Slack client do this? I think it has something to do with the way the developers coded it and didn’t anticipate every client trying to reconnect all at once. It’s not entirely unsound thinking to be honest. How could we live in a world where every Slack user would be disconnected all at once? Yet we do, and it did, and look what happened.

Ethernet figured this part out a long time ago. The CSMA/CD method for detecting collisions on a layer 2 connection has an ingenious solution for what happens when a collision is detected. Once it realizes that there was a problem on the wire it stops what is going on and calculates a random backoff timer based on the number of detected collisions. Once that timer has expired it attempts to transmit again. Because there has to be another station involved in a collision incident both stations do this. The random element of the timer calculation ensures that the likelihood of both stations choosing to transmit again at the same time is very, very low.

If Ethernet behaved like the Slack client did we would never resolve collisions. If every station on a layer 2 network immediately tried to retransmit without a backoff timer the bus would be jammed constantly. The architects of the protocol figured out that every station needs a cool off period to clear the wire before trying again. And it needs to be different for every station so there is no overlap.
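The backoff calculation itself is almost trivially simple. Here’s a sketch of classic truncated binary exponential backoff in Python:

    import random

    SLOT_TIME_US = 51.2  # slot time for 10 Mbps Ethernet, in microseconds

    def backoff_slots(collisions: int) -> int:
        # Truncated binary exponential backoff: after the nth collision,
        # wait a random number of slot times in [0, 2^min(n, 10) - 1].
        # After 16 attempts the station gives up and drops the frame.
        if collisions > 16:
            raise RuntimeError("excessive collisions, frame dropped")
        return random.randint(0, 2 ** min(collisions, 10) - 1)

    # Both colliding stations roll their own timer, and the window doubles
    # with each collision, so the odds of picking the same slot again shrink fast.
    print(backoff_slots(1) * SLOT_TIME_US, "microseconds to wait")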

Slack really needs to take this idea into account. Rather than pouncing on a connection as soon as it’s available there needs to be a backoff timer that prevents the servers from being swamped. Even a few hundred milliseconds per client could have prevented this outage from getting as big as it did. Slack didn’t plan beyond the worst case scenario because they never conceived of their worst case scenario coming to pass. How could it get worse than something we couldn’t imagine happening?
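That isn’t how the Slack client is actually written, of course; here’s just a sketch of the idea in Python, where connect() stands in for whatever call establishes the session:

    import random
    import time

    def reconnect_with_backoff(connect, base=0.5, max_delay=60.0):
        # Politer client reconnect: randomized, exponentially growing waits
        # ("full jitter") keep a fleet of clients from stampeding the server
        # the moment the path comes back.
        attempt = 0
        while True:
            try:
                return connect()
            except ConnectionError:
                attempt += 1
                # Sleep a random time in [0, base * 2^attempt], capped at max_delay.
                time.sleep(random.uniform(0, min(max_delay, base * 2 ** attempt)))

The random spread is the critical part; a fixed delay just moves the stampede a few seconds later.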


Tom’s Take

If you design systems or call yourself a reliability engineer, you need to develop a hobby of coming up with disastrous scenarios. Think of the worst possible way for something to fail. Now, imagine it getting worse. Assume that nothing will work properly when there is a recovery attempt. Plan for things to be so bad that you’re in a room on fire trying to put everything out while you’re also on fire. It sounds very dramatic but that’s how bad it can get. If you’re ready for that then nothing will surprise you. You also need to make sure you’re going back and thinking of new things all the time. You never know which piece is going to fail and how it will impact what you’re working on. But thinking through it sometimes gives you an idea of where to go when it all goes wrong.

Managing Leaders, Or Why Pat Gelsinger Is Awesome

In case you missed it, Intel CEO Bob Swan is stepping down from his role effective February 15 and will be replaced by current VMware CEO Pat Gelsinger. Gelsinger was Intel’s CTO for a number of years before leaving to run EMC and VMware. His return is a bright spot in an otherwise dismal past few months for the chip giant.

Why is Gelsinger’s return such a cause for celebration? The analysts that have been interviewed say that Intel has been in need of a technical leader for a while now. Swan came from the office of the CFO to run Intel on an interim basis after the resignation of Brian Krzanich. The past year has been a rough one for Intel, with delays in their new smaller chip manufacturing process and competition heating up from long-time rival AMD but also from new threats like ARM being potentially sold to NVIDIA. It’s a challenging course for any company captain to sail. However, I think one key thing makes it nigh impossible for Swan.

Management Mentality

Swan is a manager. That’s not meant as a slight so much as an accurate label. Managers are people that have things and look after them. Swan came from the financial side of the house where you have piles of resources and you do your best to account for them and justify their use. It’s Management 101. Managers make good CEOs for a variety of companies. They make sure that the moves are small and logical and will pay off in the future for the investors and eventually the workers as well. They are stewards first and foremost. When their background comes from something with inherent risk they are especially stewardly.

You know who else was a manager? John Sculley, the man who replaced Steve Jobs at Apple back in 1983. Sculley was seen as a moderating force to Jobs’ driving vision and sometimes reckless decision making skills. Sculley piloted the ship into calm waters at first but was ultimately sent packing because his decisions were starting to make less and less sense, such as exploring options to split Apple into separate companies and taking on IBM head-to-head on their turf.

Sculley was ousted in 1993 and Jobs returned to Apple in 1997. It wasn’t easy at first but eventually the style of Jobs started producing results. Things like the iPod, iMac, and eventually the iPhone came from his vision. He’s a leader in that regard. Leaders are the ones that jump out and take risks to make big results. Leaders are people like John Kennedy that give a vision of going to the moon in a decade without the faintest idea how that might happen. Leadership is what drives companies.

Leaders, however, are a liability without managers. Leaders say “let’s go to the moon!” Managers sit down and figure out how to make that happen without breaking the budgets or losing too many people along the way. Managers are the grounded voices that guide leaders. Without someone telling a leader of the challenges to overcome they won’t see the roadblocks until they drive right into them.

Leaders without brakes on their vision have no reality to shape it. Every iMac has an Apple Lisa. Every iPod has the iPod Hi-Fi. Even the iPhone wasn’t the iPhone until the App Store came around against the original vision of Apple’s driving force. To put it another way, George Lucas is a visionary leader in filmmaking. However, when he was turned loose without management of his process we ended up with the messy prequel trilogy. Why was Empire Strikes Back such a good film? Because it had people like Lawrence Kasdan involved managing the process of Lucas creating art. They helped focus the drive of a leader and make the result something great.

Tech Leadership

Let’s bring this discussion back to Intel and Pat Gelsinger. I know he is the best person to lead Intel right now. I know that because Gelsinger is very much a tech leader. He has visions for how things need to be and he can see how to get there. He knows that reducing costs and reaving product lines at Intel isn’t going to make them a better company down the road no matter what the activist investors have to say on the matter. They may have wanted regime change when they petitioned the board back in December, but they may find the new king a bit harder to deal with.

Gelsinger is also a manager. Going from CTO to being COO at EMC and eventually CEO at VMware has tempered his technical chops. You can’t hope to run a company on crazy ideas and risky bets. Steve Jobs had people like Tim Cook in the background keeping him as grounded in reality as possible. Gelsinger picked up these skills in helming VMware and I think that’s going to pay off for him at Intel. Rather than running out to buy another company to augment capabilities that will never see the light of day, someone like him can see the direction that Intel needs to go and make it happen in a collected manner. No more FPGA acquisitions that never bear fruit. No more embarrassing sales of the mobile chip division because no one could capitalize on it.

Pat Gelsinger is the best kind of technical manager. I saw it in the one conversation I was involved in with him during an event. He stepped into a talk between myself and a couple of analysts. He listened to them and to me and when he was asked for his opinion, he stopped for a moment to think. He asked a question to clarify and then gave his answer. That’s a tempered leader approach to things. He listened. He thought. He clarified. And then he made a decision. That means there is steel behind the fire. That means the driving factors of the decision-making process aren’t just “cool stuff” or “save as much money as we can”. What will happen is the fusion of the two that the company needs to stay relevant in a world that seems bent on passing it by.


Tom’s Take

I’ve worked for managers and I’ve worked for leaders. I don’t have a preference for one or the other. I’ve seen leaders sell half their assets to save their company. I’ve also seen them buy ridiculous stuff in an effort to build something that no one would buy. I’ve seen managers keep things calm in the middle of a chaotic mess. I’ve also seen them so wracked with indecision that the opportunities they needed to capitalize on sailed off into the sunset. If you want to be the best person to run a company as the CEO, whether it’s a hundred people or a hundred thousand, you should look to someone like Pat Gelsinger. He’s the best combination of a manager and leader that I’ve seen in a long time. In five years we will be talking about how he was the one to bring Intel back to the top of the mountain, both through his leadership and his management skills.

Building Backdoors and Fixing Malfeasance

You might have seen the recent news this week that there is an exploitable backdoor in Zyxel hardware that has been discovered and is being exploited. The backdoor admin account with the clever name ‘zyfwp’ is not something that has been present in the devices forever. The account was put in during firmware version 4.60, which was released in Q4 2020.

Zyxel is rushing to patch the devices and remove the backdoor account. Users are being advised to disable remote administration until the accounts can be deactivated and proven to be removed. However, the bigger question in my mind relates to the addition of the user account in the first place. Why would you knowingly install a backdoor?

Hello, Joshua

Backdoors are nothing new in the computer world. I’d argue the most famous backdoor account in the history of computer hacking belongs to Joshua, the dormant login for the War Operation Plan Response (WOPR) computer system in the 1983 movie WarGames. Joshua was an old login for the creator to access the system outside of the military chain of command. When the developer was removed from the project the account was forgotten about until a kid discovered it and kicked off the plot of the movie.

Joshua tells us a lot about developers and their desire to have access to the system. I’ll admit I’ve been in the same boat before. I’ve created my own logins to systems with elevated access to get tasks accomplished. I’ve also notified the users and administrators of those systems about my account and let them deal with it as needed. Most were okay with it being there. Some were hesitant and required it to be disabled after my work was done. Either way, I was up front about what was going on.

Joshua and zyfwp are examples of what happens when those systems are installed outside of the knowledge of the operators. What would have happened if the team in the Netherlands hadn’t found the account? What if Zyxel devices were getting hacked and networks breached without anyone knowing the vector? I’m sure the account showed up in all the admin dashboards, right?

Easter Egg Hunts

Do you remember the Windows 3.1 Bear? It was a hidden reference in the credits to the development team’s mascot. You had to jump through a hoop to find it by holding down a keystroke combination and clicking a specific square in the Windows logo. People loved finding those little nuggets in the software all the way up to Windows 98.

What changed? Turns out, as part of Microsoft’s Trustworthy Computing Initiative in 2002 they removed all undocumented features and code that could cause these kinds of things. It also might have had something to do with the antitrust investigations into Microsoft in the 1990s and how undocumented features in Windows and Office might have given the company a competitive advantage. Whatever the reason, Microsoft has committed to removing undocumented code.

Easter eggs are fun to find but represent the bright side of the dark issue above. What happens when the easter egg in question isn’t a credit roll but an undocumented account? What if the keystroke doesn’t bring up a teddy bear but instead gives the current user account full admin access? You scoff at the possibility but there’s nothing stopping a developer from making that happen.

These issues are part of the reason why all code and features need to be documented. We need to know what’s going on in the program and how it could impact us. This means no backdoors. If there is a way to access the system aside from the controls built in already it needs to be known and be able to be disabled if necessary. If it can’t be disabled then the users need to be aware of that fact and make the choice to not use the software because of security issues.

If you’re following along closely, you should have picked up on the fact that this same logic applies to backdoors that have been mandated by the government too. The current slate of US Senators seem to believe that we need to create keys that allow end-to-end encryption to be weakened and readable by law enforcement. However, as stated by companies like Apple for years, if you create a key for a lock that should only ever be opened under special circumstances you have still created a weakness that can be unlocked. We’ve seen the tools used by intelligence agencies stolen and used to create malware unlike anything we’ve ever seen before. What do you think might happen if they get the backdoor keys to go through encrypted messaging systems?


Tom’s Take

I don’t run Zyxel equipment in my home or anywhere I used to work. But if I did there would be a pile of it in the dumpster after this mess. Having a backdoor is one thing. Purposely making one is another. And having that backdoor discovered and exploited by the Internet is an entirely different conversation. The only way to be sure that you’ve fixed your backdoor problem is to not have one in the first place. Joshua and zyfwp are what we need to get away from, not what we need to work toward. Malfeasance only stops when you don’t do it in the first place.