Building Snowflakes On Purpose

We all know that building snowflake networks is bad, right? If it’s not a repeatable process it’s going to end up being a problem down the road. If we can’t refer back to documentation that shows why we did something we’re going to end up causing issues and reducing reliability. But what happens when a snowflake process is required to fix a bigger problem? It’s a fun story that highlights where process can sometimes break down.

Reloaded

I’ve mentioned before that I spent about six months doing telephone tech support for Gateway computers. This was back in 2003 so Windows XP was the hottest operating system out there. The nature of support means that you’re going to be spending more time working on older things. In my case this was Windows 95 and 98. Windows 98 was a pain but it was easy to work on.

One of the most common processes we had for Windows 98 was a system reload. It was the last line of defense to fix massive issues or remove viruses. It was something that was second nature to any of the technicians on the help desk:

  1. Boot from the Gateway tools CD and use GWSCAN to write zeros to the hard drive.
  2. Reboot from the CD and use FDISK to partition the hard disk.
  3. Format the drive.
  4. Insert the Windows 98 OS CD and copy the CAB installation files to a folder on the hard drive.
  5. Run setup from the hard drive and let it complete.
  6. When Windows comes back up, insert the driver CD and let it install all the drivers.

The whole process took two or three phone calls to complete. Any time you got to something that would take more than fifteen minutes to complete you logged the steps in the customer trouble ticket and had them call back when it was completed. The process was so standard that it had its own acronym in the documentation – FFR, which stood for “FDISK, Format, Reload”. If you told someone where you were in the process they could finish it no problem.
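
That hand-off property is worth dwelling on for a second: because the steps were fixed and the ticket logged what had been completed, any tech could work out the next action from the ticket alone. Here’s a minimal sketch of that idea in Python. The step names and ticket format are illustrative stand-ins, not Gateway’s actual tooling:

  # A fixed, ordered step list plus a ticket log of completed steps is
  # all it takes to make the process resumable by any technician.
  # Step names and the ticket format here are illustrative.
  FFR_STEPS = [
      "zero the drive with GWSCAN",
      "partition with FDISK",
      "format the drive",
      "copy the CAB files to the hard drive",
      "run setup from the hard drive",
      "install drivers from the driver CD",
  ]

  def next_step(ticket_log):
      """Return the next step to perform, given the completed steps logged in the ticket."""
      done = len(ticket_log)  # the ticket holds completed steps, in order
      return FFR_STEPS[done] if done < len(FFR_STEPS) else None

  # Customer calls back mid-reload; whoever answers picks up where the last tech stopped.
  print(next_step(FFR_STEPS[:3]))  # -> "copy the CAB files to the hard drive"

The point is that every possible state of the ticket maps to exactly one next action, which is what made a reload spread across two or three calls workable.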

Me, Me, ME

The whole process was manual with lots of steps and could intimidate customers. At some point in the development process the Gateway folks came up with a solution for Windows ME that they thought worked better. Instead of the manual steps of copying files and drivers and such, Gateway loaded the OS CD with a copy of the image they wanted for their specific model type. The image was installed using ImageCast, an imaging program that just dropped the image down on the drive without the need to do all the other steps. In theory, it was simple and reduced call times for the help desk.

In practice, Windows ME was a disaster to work on. The ImageCast program worked about half the time. If you didn’t pick the right options in the reload process it would partition the hard drive and add a second clean copy of WinME without removing the first one. It would change the MBR to offer two installations with identical identifiers, so users got confused about which was which. And the image itself seemed to be missing programs and drivers. The fact that a driver CD shipped with the system made us all wonder what the real idea behind this “improved” process was.

Because Windows ME was such a nightmare to reload, our call center got creative. We had a process that worked for Windows 98. We had all the files we needed on the disks. Why not do it the “right” way? So we did. Informally, the reload process for Windows ME was the same as Windows 98. We would FFR Windows ME boxes and make sure they looked right when they came back. No crazy ImageCasting programs or broken software loads.

The only issue? It was an informal snowflake process. It worked much better, but if someone called the main help desk number and got routed to another call center, the customer would be in the middle of an unsupported process. The other call center’s tech would simply start the regular process and screw up the work we’d done already. To counter that, we would tell the customer to call our special callback voicemail box and not do anything until we called them back. That meant the reload process took many more hours for an already unhappy customer. The end result was better, but getting there could be frustrating.

Let It Snow

Was our informal snowflake process good or bad? It’s tough to say. It led to happier customers. It meant the likelihood of future support calls was lower because the system was properly reloaded instead of relying on a broken image. However, it was stressful for the tech who worked the ticket because they had to own the process the whole way through. It also meant you had to ensure the customer wouldn’t call back and wind up disrupting the process with someone else on the phone.

The process was broken and it needed to be fixed. However, the way to fix the broken process wasn’t easy to figure out. The national line had their process and they were going to stick to it. We came up with an alternative but it wasn’t going to be adopted. Yet, we still kept using our process as often as possible because we felt we were right.

In your enterprise, you need to understand process. Small companies need processes that are repeatable for the ease of their employees. Large companies need processes to keep things consistent. Many companies that face regulatory oversight need processes followed exactly to ensure compliance. The ability to go rogue and just do it the way you want isn’t always desirable.

If you have users who are going around the process, you need to find out why. We see it all the time. Security rules that get ignored. Documentation requirements that get the bare minimum effort. Remember the shutdown dialog box in Windows Server 2003 that asked why you were rebooting? Most people typed a token entry before rebooting. And eighteen entries of “;lkj;kljl;k” don’t help anyone figure out what’s going on.

Get your users to tell you why they need a snowflake process, especially if there’s more than one person relying on it. The odds are very good that they’ve found hiccups you need to address. Maybe it’s data entry for old systems. Perhaps it’s a problem with the requirements or the order of the steps. Whatever it is, you have to understand it and fix it. The snowflake may be a better way to do things. You have to investigate and figure it out; otherwise your processes will be forever broken.


Tom’s Take

Inbound tech support is full of stories like these. We find a broken process and we go around it. It’s not always the best solution but it’s the one that works for us. However, building a toolbox full of snowflake processes and solutions that no one else knows about is a path to headaches. You need to model your process around what people need to accomplish and how they do it instead of just assuming they’re going to shoehorn their workflow into your checklist. If your process doesn’t work for your people, you’d better get a handle on the situation before you’re buried in a blizzard of snowflakes.

Do You Do What You’ve Always Done?

When I was an intern at IBM twenty-something years ago, my job was deploying new laptops to people. The job was easy enough: transfer their few hundred megabytes of data to the new machine and ensure their email was all set up correctly. There was a checklist that needed to be followed to ensure the job was done right.

When I arrived for my internship, one of my friends was there finishing his. He was supposed to train me on how to do the job before he went back to school. He helped me through the first day of deploying laptops following the procedure. The next day he handed me a different sheet with some of the same information but in a different order. He said, “I realized we had too many reboots in the process and this way cuts about twenty minutes off the deployment time.” I’m all about saving time, so I jumped at the chance.

Everything went smashingly for the next month or so. My friend was back at school and I used his modified procedure to be as productive as possible. One day, my mentor wanted to shadow my deployment day to see how I was doing things. I invited him along and we did the first one. I pulled out my deployment docs and made sure to follow the procedure so I got a good grade. When we were done, my mentor pulled me aside and said, “I noticed you went pretty fast and did some things out of order. Why?” I mentioned that my friend and I had modified the deployment procedure a bit to make it easier and faster. That’s when my mentor hit me with a phrase that I’ve spent a lot of my career deconstructing:

“But this is the way we’ve always done it.”

You Do What You’ve Always Done

Process isn’t a bad thing. It makes a job easy to break down into steps, which helps with assessing timelines and makes the job repeatable. There’s nothing worse than a process that only lives inside the head of one person. If you can’t replicate your job you can’t ever stop doing it. Sure, it may be hard to write things down sometimes, but you have to have a way to capture that data.

However, all process needs to make sense. If the process for starting my computer involves pushing the buttons in a certain order, there should be a reason for it. If the process for starting my computer also includes extra steps, such as getting a cup of coffee or banging on the monitor twice, there needs to be a reason for those too. Maybe the employee really likes coffee? Or maybe the startup process takes long enough that you can get a cup of coffee by the time the login is complete, and if you try to use the machine earlier you’re going to run into slowness or indexing issues.

Documenting the steps is important, but you also have to document the reasons behind the steps. Why must we tackle the project in this specific order? If you are doing Task A and then Task B, why can’t you do the second task first? If there is no good justification, then you should be able to do them in any order. But if Task B requires something from Task A before it can be completed, you need to document that. Otherwise people are going to do them out of order and miss steps.

Going back to our IBM deployment guide, why did there need to be so many reboots in the original document? Well, some of them were necessary. We needed to change the machine name before we joined it to the domain. We needed to log in with the username locally to create the profile before we logged in with the domain account so everything was created properly (this was NT4 domain days). Now, the instructions had us rebooting after every change, which added 3-4 minutes to every step along the way. My friend and I knew we could make multiple changes per reboot as long as they didn’t depend on each other being done first. But the original process was the “way it had always been done” and we had to prove our way was better.
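
To make that concrete, here’s a sketch using Python’s standard graphlib module. The task names and dependencies are illustrative stand-ins for the real checklist, but the shape of the argument is the same: write down each step’s prerequisites explicitly, and every batch of “ready” tasks is a set of changes you can safely make before a single reboot:

  from graphlib import TopologicalSorter

  # Each task maps to the set of tasks that must finish first. These
  # names and dependencies are illustrative, not the actual IBM checklist.
  deps = {
      "rename machine": set(),
      "join domain": {"rename machine"},  # the name must be final before joining
      "create local profile": set(),
      "log in with domain account": {"join domain", "create local profile"},
      "restore user data": {"log in with domain account"},
      "configure email": {"log in with domain account"},
  }

  ts = TopologicalSorter(deps)
  ts.prepare()
  round_num = 0
  while ts.is_active():
      batch = ts.get_ready()  # every task whose prerequisites are complete
      round_num += 1
      print(f"Round {round_num} (then reboot once): {sorted(batch)}")
      ts.done(*batch)

Six steps collapse into four reboot rounds instead of six reboots, and anything that shares a round can be done in any order, which is exactly the case my friend’s reordered sheet was making.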

You’ll Always Get What You’ve Always Got

It’s not enough to challenge the process for its own sake. Maybe your way is better. Or faster. Or just easier. But you have to prove it. You have to be willing to examine the process and ensure that what you’re doing is objectively better. It’s like taking a shortcut to work. It may feel faster for you. However, if it takes two minutes longer on your drive, is it really worth it? Or are you doing it because you don’t like driving or walking the other way?

Back to IBM and the laptop deployments. In order to get the revised process approved, I had to prove it was faster and provided the same output. I had to go into the lab and deploy using the old method, capturing all the settings and data. Then I erased the machine, deployed using the new method, and repeated the data capture to make sure the results were the same. Once I could prove that the new process resulted in the same output, we could move on to the second step.

Step Two involved timing the deployment. Now, to make sure I wasn’t juicing the numbers in either direction, we had our oldest, slowest laptop deployer use the new instructions. He would regularly take over an hour to set up a new machine with the old directions. We handed him the new page and told him to use it instead. Same steps, just in a different order. He went through them all cold twice and managed to average around 45 minutes per deployment. He even remarked that our process was better.


Tom’s Take

Once we had proven that it was faster and easier to do things the new way, we updated the deployment procedures before the next group of interns arrived. I had left my mark and shown that “this is what we’ve always done” isn’t always the best justification. But you do have to justify why your way is better. Facts beat opinion every day of the week. And if you aren’t willing to do the work to prove why you can make things better, you’ll always be where you are right now.

Changing The Baby With The Bathwater In IT

If you’re sitting in a presentation about the “new IT”, there’s bound to be a guest speaker talking about the digital transformation or service provider shift in their organization. You can see this coming. It’s a polished speaker, usually a CIO or VP. They talk about how, with the help of the vendor on stage with them, they were able to rapidly transform their infrastructure into something modern while at the same time changing processes to enable faster IT response and more productive workers, and to increase revenue or turn IT from a cost center into a profit center. The key components are simple:

  1. Buy new infrastructure from $vendor
  2. Transform all processes to be more agile, productive, and better.

Why do those things always happen in concert?

Spring Cleaning

Infrastructure grows old. That’s a fact of life. Outside of some very specialized hardware, no one is using the same desktop they had ten years ago. No enterprise is still running Windows 2000 Server on an IBM Netfinity server. No one is still using 10Mbps Ethernet over Thinnet to connect their offices. Hardware marches on. So when we buy new things, we as technology professionals need to find a way to integrate them into our existing technology stack.

Processes, on the other hand, are very slow to change. I can remember dealing with process issues when I was an intern for IBM many, many years ago. The process we had for deploying a new workstation had many, many reboots involved. The deployment team worked out a new strategy to streamline deployments and make things run faster. We brought our plan to the head of deployments. From there, we had to:

  • Run tests to prove that it was faster
  • Verify that the process wasn’t compromised in any way
  • Type up new procedures in formal language to match the existing docs
  • Then submit them for ISO approval

And when all those conditions were met, we could finally start using our process. All in all, with aggressive testing, it still took two months.

Processes are things that are thought to be carved in stone, never to be modified or changed in any way for the rest of time, unless the stones break or something major forces a process change. Usually, that major change is a whole truckload of new equipment showing up on the back dock, attached to a consultant telling IT there is a better way (TM) to do things.

Ceteris Paribus

Ceteris paribus is a Latin phrase meaning “all else unchanged”. We use it when we talk about an equation with multiple variables and the need to hold all but one of them constant so that changes can be measured appropriately.

The funny thing about all these transformations is that it’s hard to track what actually made the improvements when you’re changing so many things at once. If the new hardware is three or four times faster than your old equipment, would it show that much improvement if you just ran your old software and processes on it? How much faster could your workloads execute with new CPUs and memory management techniques? What about collapsing your virtual infrastructure onto fewer physical servers because of advances there? Running old processes on new hardware can give you a very good idea of how good the hardware is. Does it meet the selection criteria you set when it was purchased? Or, better still, does it seem like you’re not getting the performance you paid for?

Likewise, how can you know for sure that the organizational and process changes you implemented actually did anything? If you’re implementing them on new hardware, how can you capture the impact? There’s no rule that says new processes can only be implemented on shiny new hardware. Take a look at what Walmart is doing with OpenStack. They most certainly aren’t rushing out to buy tons and tons of new servers just for OpenStack integration. Instead, they are taking streamlined processes and implementing them on existing infrastructure to see the benefits. Then it’s easy to measure and say how much hardware you need for expansion, instead of overbuying to cover the process changes you make.
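
You can see the attribution problem with a back-of-the-napkin model. Here’s a tiny Python sketch with made-up deployment times; the only real point is that without measuring the single-change cases, the combined improvement can’t be credited to either change:

  # Hypothetical deployment times in minutes for each combination of
  # hardware and process. The numbers are invented purely to illustrate
  # the attribution problem.
  times = {
      ("old hw", "old process"): 60,
      ("new hw", "old process"): 40,  # hardware change alone
      ("old hw", "new process"): 45,  # process change alone
      ("new hw", "new process"): 30,  # both changed at once
  }

  baseline = times[("old hw", "old process")]
  hw_gain = baseline - times[("new hw", "old process")]
  proc_gain = baseline - times[("old hw", "new process")]
  print(f"Hardware alone saves {hw_gain} minutes; process alone saves {proc_gain}.")
  # If you only ever measure the (new hw, new process) cell, the
  # 30-minute gain could be credited to either change, or to the
  # vendor's slide deck.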


Tom’s Take

So, why do these two changes always seem to track with each other? The optimist in me wants to believe that it’s people deciding to make positive changes all at once to pull their organization into the future. Since any installation is disruptive, it’s better to take the huge disruption and retrain for the massive benefits down the road. It’s a rosy picture indeed.

The pessimist in me wonders if all these massive changes aren’t somehow tied to the fact that they always come with massive new hardware purchases from vendors. I would hope there isn’t someone behind the scenes with the ear of the CIO pushing massive changes in organization and processes for the sake of numbers. I would also sincerely hope that the idea isn’t to make huge organizational disruptions for the sake of “reducing overhead” or “helping tell the world your story” or, worse yet, “making our product look good because you did such a great job with all these changes”.

The optimist in me is hoping for the best. But the pessimist in me wonders if reality is a bit less rosy.