How Should We Handle Failure?

I had an interesting conversation this week with Greg Ferro about the network and how we’re constantly proving whether a problem is or is not the fault of the network. I postulated that the network gets blamed when old software has a hiccup. Greg’s response was:

Which led me to think about why we have such a hard time proving the innocence of the network. And I think it’s because we have a problem with applications.

Snappy Apps

Writing applications is hard. I base this on the fact that I am a smart person and I can’t do it. Therefore it must be hard, like quantum mechanics and figuring out how to load the dishwasher. The few people I know that do write applications are very good at turning gibberish into usable radio buttons. But they have a world of issues they have to deal with.

Error handling in applications is a mess at best. When I took C Programming in college, my professor was an actual coder during the day. She told us during the error handling portion of the class that most modern programs are a bunch of statements wrapped in if clauses that jump to some error condition if they fail. All of the power of your phone apps is likely a series of if this, then that, otherwise this failure condition statements.
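To make that pattern concrete, here’s a minimal sketch in Python rather than the C from that class. The function names (parse_port, start_listener) and the message wording are made up for illustration. The only point is that every step gets checked and the first failure short-circuits to an error.

```python
def parse_port(text):
    """Return (port, error); exactly one of the two is None."""
    if not text.isdecimal():
        return None, f"port {text!r} is not a number"
    port = int(text)
    if port < 1 or port > 65535:
        return None, f"port {port} is outside 1-65535"
    return port, None


def start_listener(host, port_text):
    """Check every step and bail out to an error message as soon as one fails."""
    if not host:
        return "error: no host given"
    port, err = parse_port(port_text)
    if err is not None:
        return f"error: {err}"
    # Hypothetical success path: a real program would open the socket here.
    return f"listening on {host}:{port}"
```

Calling start_listener("example.test", "http") comes back with an error that names the bad port value, which is already more than a lot of applications tell you.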

Some of these errors are pretty specific. If an application calls a database, it will respond with a database error. If it needs an SSL certificate, there’s probably a certificate specific error message. But what about those errors that aren’t quite as cut-and-dried? What if the error could actually be caused by a lot of different things?

My first real troubleshooting on a computer came courtesy of LucasArts X-Wing. I loved the game and played the training missions constantly until I was able to beat them perfectly just a few days after installing it. However, when I started playing the campaign, the program would crash to DOS when I started the second mission with an out of memory error message. In the days before the Internet, it took a lot of research to figure out what was going on. I had more than enough RAM. I met all the specs on the side of the box. What I didn’t know and had to learn is that X-Wing required the use of Expanded Memory (EMS) to run properly. Once I decoded enough of the message to find that out, I was able to change things and make the game run properly. But I had to know that the memory X-Wing was complaining about was EMS, not the XMS RAM that I needed for Windows 3.1.

Generic error messages are the bane of the IT professional’s existence. If you don’t believe me, ask yourself how many hours you’ve spent troubleshooting a network connectivity issue for a server only to find that DNS was misconfigured or down. In fact, Dr. House has a diagnosis for you:

Unintuitive Networking

If the error messages are so vague when it comes to network resources and applications, how are we supposed to figure out what went wrong and how to fix it? I mean, humans have been doing that for years. We take the little amount of information that we get from various sources. Then we compile, cross-reference, and correlate it all until we have enough to make a wild guess about what might be wrong. And when that doesn’t work, we keep trying even more outlandish things until something sticks. That’s what humans do. We fill in the gaps and come up with crazy ideas that occasionally work.

But computers can’t do that. Even the best machine learning algorithms can’t extrapolate data from a given set. They need precise inputs to find a solution. Think of it like a Google search for a resolution. If you don’t have a specific enough error message to find the problem, you aren’t going to have enough data to provide a specific fix. You will need to do some work on your own to find the real answer.

Intent-based networking does little to fix this right now. All intent-based networking products are designed to create a steady state from a starting point. No matter how cool it looks during the demo or how powerful it claims to be, it will always fall over when it fails. And how easily it falls over is up to the people that programmed it. If the system is a black box with no error reporting capabilities, it’s going to fail spectacularly with no hope of repair beyond an expensive support contract.

It could be argued that intent-based networking is about making networking easier. It’s about setting things up right the first time and making them work in the future. Yet, no system in a steady state works forever. With the possible exception of the pitch drop experiment, everything will fail eventually. And if the control system in place doesn’t have the capability to handle the errors being seen and propose a solution, then your fancy provisioning system is worth about as much as a car with no seat belts.

Fixing The Issues

So, how do we fix this mess? Well, the first thing that we need to do is scold the application developers. They’re easy targets and usually the cause behind all of these issues. Instead of giving us a vague error message about network connectivity, we need something more like lp0 Is On Fire. We need to know what was going on when the function call failed. We need a trace. We need context around the process to be able to figure out why something happened.
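As a rough illustration of what more context could look like, here’s a small Python sketch that reports which step failed (name lookup, refused connection, or timeout) instead of a generic connectivity error. The function names, the message wording, and the db.example.test host are assumptions for the example, not anyone’s real API. Only the standard library socket calls are real.

```python
import socket


def describe_failure(host, port, exc):
    """Turn a low-level socket exception into a message that names the failing step."""
    if isinstance(exc, socket.gaierror):
        # Name resolution failed, so the problem is DNS, not the path to the server.
        return f"DNS: could not resolve {host!r} ({exc})"
    if isinstance(exc, ConnectionRefusedError):
        return f"TCP: {host}:{port} refused the connection"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return f"TCP: no answer from {host}:{port} before the timeout"
    # Anything else is still reported with the original exception attached.
    return f"Unclassified failure reaching {host}:{port}: {exc!r}"


def check_service(host, port, timeout=3.0):
    """Try a TCP connection and say what went wrong, not just that something did."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return f"OK: connected to {host}:{port}"
    except OSError as exc:
        return describe_failure(host, port, exc)
```

Something like check_service("db.example.test", 5432) would then tell you up front whether you’re chasing a DNS problem or a dead port, which is most of the battle.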

Now, once we get all these messages, we need a way to correlate them. Is that an NMS or some other kind of platform? Is that an incident response system? I don’t know the answer there, but you and your staff need to know what’s going on. They need to be able to see the problems as they occur and deal with the errors. If it’s DNS (see above), you need to know that fast so you can resolve the problem. If it’s a BGP error or a path problem upstream, you need to know that as soon as possible so you can assess the impact.

And lastly, we’ve got to do a better job of documenting these things. We need to take charge of keeping track of what works and what doesn’t. And keeping track of error messages and their meanings as we work on getting app developers to give us better ones. Because no one wants to be in the same boat as DenverCoder9.


Tom’s Take

I made a career out of taking vague error messages and making things work again. And I hated every second of it. Because as soon as I became the Montgomery Scott of the Network, people started coming to me with ever more bizarre messages. Becoming the Rosetta Stone for error messages typed into Google is not how you want to spend your professional career. Yet, you need to understand that even the best things fail and that you need to be prepared for it. Don’t spend money building a very pretty facade for your house if it’s built on quicksand. Demand that your orchestration systems have the capabilities to help you diagnose errors when everything goes wrong.

 


Legacy IT Sucks

In my last few blog posts, I’ve been looking back at some of the ideas that were presented at Future:Net at VMworld this year. While I’ve discussed resource contention, hardware longevity, and even open source usage, I’ve avoided one topic that I think dictates more of the way our networks are built and operated today. It has very little to do with software, merchant hardware, or even development. It’s about legacy.

They Don’t Make Them Like They Used To

Every system in production today is running some form of legacy equipment. It doesn’t have to be an old switch in a faraway branch office closet. It doesn’t have to be an old Internet router. Often, it’s a critical piece of equipment that can’t be changed or upgraded without massive complications. These legacy pieces of the organization do more to dictate IT policies than any future technology can hope to impact.

In my own career, I’ve seen this numerous times. It could be the inability to upgrade workstation operating systems because users relied on WordPerfect for document creation and legacy document storage. And new workstations wouldn’t run WordPerfect. Or perhaps it cost too much to upgrade. Here, legacy software is the roadblock.

Perhaps it’s legacy hardware causing issues. Most wireless professionals agree that disabling older 802.11b data rates will help with network capacity issues and make things run more smoothly. But those data rates can only be disabled if your wireless network clients are more modern. What if you’re still running 802.11b wireless inventory scanners? Or what if your old Cisco 7921 phones won’t run correctly without low data rates enabled? Legacy hardware dictates the configuration here.

In other cases, legacy software development is the limiting factor. I’ve run into a number of situations where legacy applications are dictating IT decisions. Not workstations and their productivity software. But enterprise applications like school grade book systems, time tracking applications, or even accounting programs. How about an accounting database that refuses to load if IPX/SPX isn’t enabled on a Novell server? Or an old AS/400 grade book that can’t be migrated to any computer system that runs software built in this century? Application development blocks newer systems from installation and operation.

Brownfields Forever

We’ve reached the point in IT where it’s safe to say that there are no more greenfield opportunities. The myth that there is an untapped area where no computer or network resources exist is ludicrous. Every organization that is going to be computerized is so right now. No opportunities exist to walk into a completely blank slate and do as you will.

Legacy is what makes a brownfield deployment difficult. Maybe it’s an old IP scheme. Or a server running RIP routing. Or maybe it’s a specialized dongle connected to a legacy server for licensing purposes that can’t be virtualized. These legacy systems and configurations have to be taken into account when planning for new deployments and new systems. You can’t just ignore a legacy system because you don’t like the requirements for operation.

This is part of the reason for the rise of modular-based pod deployments like those from Big Switch Networks. Big Switch realized early on that no one was going to scrap an entire networking setup just to deploy a new network based on BSN technology. And by developing a pod system to help rapidly insert BSN systems into existing operational models, Big Switch proved that you can non-disruptively bring new network areas online. This model has proven itself time and time again in the cloud deployment models that are favored by many today, including many of the Future:Net speakers.

Brownfields full of legacy equipment require careful planning and attention to detail. They also require that development and operations teams both understand the impact of the technical debt carried by an organization. By accounting for specialized configurations and needs, you can help bring portions of the network or systems architecture into the modern world while maintaining necessary services for legacy devices. Yes, it does sound a bit patronizing and contrived for most IT professionals. But given the talk of “burn it all to the ground and build fresh”, one must wonder if anyone truly does understand the impact of legacy technology investment.


Tom’s Take

You can’t walk away from debt. Whether it’s a credit card, a home loan, or the finance server that contains those records on an old DB2 database connected by a 10/100 FastEthernet switch. Part of my current frustration with the world of forward-looking proponents is that they often forget that you must support every part of the infrastructure. You can’t leave off systems in the same way you can’t leave off users. You can’t pretend that the AS/400 is gone when you scrap your entire network for new whitebox switches running open source OSes. You can’t hope that the old Oracle applications won’t have screwed up licensing the moment you migrate them all to AWS. You have to embrace legacy and work through it. You can plan for it to be upgraded and replaced at some point. But you can’t ignore it no matter how much it may suck.

Penny Pinching With Open Source

You might have seen this Register article this week which summarized a Future:Net talk from Peyton Koran. In the article and the talk, Peyton talks about how the network vendor and reseller market has trapped organizations into a needless cycle of bad hardware and buggy software. He suggests that organizations should focus on their new “core competency” of software development and run whitebox or merchant hardware on top of open source networking stacks. He says that developers can use code that has a lot of community contributions and shares useful functionality. It’s a high and mighty goal. However, I think the open source part of the equation is going to cause some issues.

A Penny For Your Thoughts

The idea behind open source isn’t that hard to comprehend. Everything available to see and build. Anyone can contribute and give back to the project and make the world a better place. At least, that’s the theory. Reality is sometimes a bit different.

Many times, I’ve had off-the-record conversations with organizations that are consuming open source resources and projects as a starting point for building something that will end up containing many proprietary resources. When I ask them about contributing back to those projects or finding ways to advance things, the response is usually silence. Very rarely, I hear that the organization sees their proprietary developments as a “competitive advantage” that they are going to use to either beat a competitor or build a product that saves them a significant amount of money.

The analogy I like to use for open source is the “Take A Penny” dish that many businesses have next to their cash register. The idea is that people can contribute something small to help out others. If someone needs a penny or two to help make payment for a good or service, they can take one. If they have a few left over they can give back. It’s a way to give back a little now and then.

However, there are a couple of types of people that skew the trend for the penny dish. The first is the person that gives back much more than the norm. That person might put a quarter in the dish or put in four or five pennies at every dish they find. They contribute above the norm often and give a lot. The second type of person takes quite a few pennies from the dish and never gives back. They may see the dish as “free money” to be used to augment their own. They don’t care if all the pennies are gone when they finish just as long as they got what they needed out of it.

Extend this metaphor into the open source community. There are quite a few contributors that put in significant time and effort in their projects. They may have found a way to do it full time or may be paid by their company to participate in projects. These folks are dedicated to the cause.

On the other side of the metaphor are the people and organizations that take what they want from open source and never give back. They never contribute to the project, even if their enhancements are needed and welcome. They take a free or inexpensive starting point and use it to build a product that could be used internally to give the organization an advantage. The key here is that it’s something used internally. The GPL’s obligations attach to distribution of software that is based on GPL code, and they generally don’t apply when that software is only consumed internally. An enterprising developer may say to themselves, “As long as I don’t sell it, I don’t have to give my code back.”

Networking For Pocket Change

There are quite a few open source networking projects out there. Quagga, BIRD, and Open vSwitch are great examples of projects that have significant reach and are used by a lot of companies to build great products. However, imagine what would happen if no one gave back to these projects. Imagine what would happen if contributors decided to make their own BGP daemons or OVS-like programs and use them without regard for helping others in the community.

Open source software needs developers willing to contribute back to the project. If networking is going to embrace open source projects as Peyton suggested in his talk, it’s going to take a lot more contribution than quiet consumption. Whether or not you agree with the premise that networking vendors are corrupt and evil, you do have to concede that they’ve given us mostly stable protocols like BGP and OSPF. These same vendors have contributed ideas back to the standardization process to improve protocols like spanning tree and power over Ethernet. Their contributions helped shape what networking is today.

If the next generation of software-based network developers wants to embrace and extend these contributions with open source, they’re going to need to be transparent and communicate with the project leads. They’re also going to need to push back when someone high up the food chain sees the development process as a way to gain an advantage and tries to keep it all secret. If developers aren’t going to give back to the community, it negates the advantages of open source and instead takes us back to the days of networks being one-off creations that have no interoperability beyond a few protocols. Islands in a sea of home-grown lava.


Tom’s Take

As anyone that attended Future:Net within earshot of me can attest, I wasn’t overly thrilled with Peyton’s take on the future of networking. I have some deep-seated reservations about the “screw the vendors, build it all yourself” mentality that is pervading organizations today. Not everyone is a development shop. Law firms and schools don’t employ software engineers regularly. If you want to transform those types of users into open source adherents, you need to lead the pack by giving back and talking about what you’re doing with open source. If you’re not willing to lead the way, stop telling people to take the fork in the road.

Network Longevity – Think Car, Not iPhone

One of the many takeaways I got from Future:Net last week was the desire for networks to do more. The presenters were talking about their hypothesized networks being able to make intelligent decisions based on intent and other factors. I say “hypothesized” because almost everyone admitted that we aren’t quite there. Yet. But the more I thought about it, the more I realized that perhaps the timeline for these mythical networks is a bit skewed in favor of refresh cycles that are shorter than we expect.

Software Eats The World

SDN has changed the way we look at things. Yes, it’s a lot of hype. Yes, it’s an overloaded term. But it’s also the promise of getting devices to do much more than we had ever dreamed. It’s about automation and programmability and, now, deriving intent from plain language. It’s everything we could ever want a simple box of ASICs to do for us and more.

But why are we asking so much? Why do we now believe that the network is capable of so much more than it was just five years ago? Is it because we’ve developed a revolutionary new method for making chips that are ten times smarter and a hundred times faster? No. In fact, the opposite is true. Chips are a little faster and a little cheaper, but they aren’t custom construction. Most companies are embracing merchant silicon from companies like Broadcom and Mellanox. So, where’s all this functionality coming from?

It’s the “S” in SDN. We’re seeing software driving this development. Yes, it’s the same thing we’ve been saying for years now. But it’s something more now. Admins and engineers aren’t just excited that network devices are more software now. They’re expecting it. They’re looking to the software to drive the development. Gone are the days of CLI-only access. Now, the interfaces are practically built on API-driven capabilities. The Aruba 8400 seems like it was built for the API first, with the GUI being a superset of API functions. People getting into the networking world are focusing on things like Python first and CLI syntax second. Software is winning again.

It’s like the iPhone revolution. The hardware is completely divorced from the software. I can upgrade to the latest version of Apple’s iOS and get most of the functionality I want in the OS. Aside from some hardware specific things the majority of the functions work no matter what. Every year, I get new toys to play with. Every 12 months I can enable new functions and have new avenues available to me. If you think back to the first iterations of the iPhone ten years ago, you can see how far software development has driven this device.

Hardware For The Long Haul

However, for all the grandeur and amazement that the iPhone (and Android, for that matter) inspires, it has also created a side effect that is causing problems in the mobile device world and bleeding over into other electronics. That is the rapid replacement cycle. I mentioned above how awesome it is that you can get new functions every year in your mobile device. But I didn’t mention how people are already looking forward to the newest hardware even though they may have purchased a new phone just 10 months ago.

The desire to get faster by leaps and bounds every year is almost comical at this point. When the new device is delivered and it’s just a bit faster than the last, people are disappointed. Even I was guilty of this when I moved from a 2013 MacBook to a 2016 model. It was only 20% faster. Even with everything else I got in the new package, I found myself chasing performance over every other feature.

The desire to get better performance isn’t just limited to phones and tablets and laptops. When network designers look to increase network performance, they want to do so in a way that makes their users happy. Gone are the days when a simple gigabit network connection would keep someone pleased. We now have to develop gigabit wireless connections to the backbone network, where data is passed to servers connected by 25Gig connections to spines and super spines running at 100Gig (for now). In some cases, that data doesn’t even move any more, and instead short-lived containers are spun up next to it to make things run faster.

Networks aren’t designed like iPhones. They aren’t built to be ripped out every year and replaced with something a little faster and a little shinier. They’re designed more like cars in that regard. They are large purchases that should last for years upon years, with their performance profile dictated by the design at the time. We’ve gone through the phases of running 10Mbit hubs. We upgraded to FastEthernet. We’re now living in a cycle where Gigabit and 10/40 Gigabit switches are ruling the roost. Some of the more forward-thinking folks are adopting 25/50Gigabit and even looking to faster technologies like 400Gig connections for spines.

However, the longevity of hardware remains. Capital expenditure isn’t the easiest thing in the world to accomplish. You need to budget for new devices in the networking world. You need to justify cost savings or new applications. You can’t just buy a fast new switch or two every year and claim that it makes everything better without some kind of justification. Even if you move to the cloud with your strategy, you’re not changing the purchasing model. You’re just trading capital expenditure for operational expenditure. You’re leasing a car instead of buying it. Because if you leave the cloud, you’re back to your old switching model with nothing to show for it.


Tom’s Take

I was pretty hard on the Future:Net presenters because I felt that their talk of ditching money grubbing vendors for the purity of whitebox switching running open source software was a bit heavy handed. Most organizations are in the middle of a refresh cycle and can’t afford to rip everything out and replace it today. Moreover, the value in whitebox switching isn’t realized when you install new hardware. It’s realized when you turn your switch into an iPhone. Where software development rules the day and your hardware fades away in the background. We’re not there yet. We’re still buying hardware for the sake of hardware as the software is catching up. Networking equipment is still built and bought like a car and will be for a few years to come. Maybe we can revisit this topic in 3 years and see how far software will have driven us by then.

Resource Contention In IT – Time Is Never Enough

I’m at Future:NET this week and there’s a lot of talk about the future of what networking is going to look like from the perspective of vendors like Apstra, Veriflow, and Forward Networks. There’s also a great deal of discussion from customers and end users as well. One of the things that I think is being missed in all the talk is resources.

Time Is Not On Your Side

Many of the presenters, like Truman Boyes of Bloomberg and Peyton Maynard-Koran of EA, discussed the idea of building boxes from existing components instead of buying them from established networking vendors like Cisco and Arista. The argument does hold some valid ideas. If you can get your hardware from someone like EdgeCore or Accton and get your software from someone else like Pluribus Networks or Pica8 it looks like a slam dunk. You get 90% to 95% of a solution that you could get from Cisco with much less cost to you overall.

Companies like Facebook and Google have really pioneered this solution. Facebook’s OCP movement is really helping networking professionals understand the development that goes into building their own switches. Facebook’s commitment is also helping reduce the price of the components when an eager person wants to go build an OCP switch from parts they find at Radio Shack or from Amazon.

But, for Facebook, the development of a switch like this or the development of a platform is a sunk cost. Because the important resource to Facebook isn’t time. Facebook has teams of engineers sitting around developing things. For them, time is the least important resource. Time is something they have in abundance. Why is that? Because their development is entirely focused on their product. Google can afford to have 500 people working on a product with an IT focus like Google Reader or Google Wave because that’s what Google hires people to do.

Contrast that with the typical IT department at an enterprise. Even with thousands of users in Marketing, Management, and Finance there are usually only a handful of IT professionals. And those people have to cover storage, compute, networking, wireless, and software. The focus of the average law firm is not using IT resources to create a product. The focus of the business is leveraging IT to provide a service. A finance firm doesn’t have the time resources to commit to developing in-house solutions or creating IT hardware from components and freely available software.

Money, Money, Money.

Let’s look at the other side of the coin. Facebook and Google have oodles and oodles of time to build and develop things. They can get their developers to work together to build the hardware and software to integrate at a deep level. And because they understand it at that level, they can easily debug it instead of asking someone to solve their problem. What Facebook and Google don’t have is money.

To large firms like these, money is more important than time. When you have to purchase networking or storage equipment by the thousands or tens of thousands of units money becomes a huge issue. If you can save a few dollars per switch that can translate to huge savings in the long run. Even Facebook is doing this with OCP. By creating a demand for specific components for these devices, they can drive costs down across the board and save money for them. For Facebook, money is what is tracked for creating their infrastructure. The more that is saved, the more they can do with it.

In the enterprise, money isn’t quite as important as time. Money is important to businesses for sure. You don’t keep the lights on if you aren’t making money. But because IT supports the business and isn’t the entire business, money can be more easily allocated to projects from budgets. There are pools of money that can be used to purchase office furniture, catering services, or IT hardware. These resources can be reallocated efficiently like Facebook allocates time for projects. If the storage array needs to be upgraded or the wireless needs to be refreshed there can be discussions about how to accomplish it. Maybe the CEO doesn’t get a new desk this quarter. Or maybe there needs to be a few new sales discussions to create capital. But money isn’t as valuable as time. If you think I’m crazy try to get 10 minutes on a CEO’s calendar. Versus getting him to sign off on a purchase.

That’s the real value of cloud computing for IT professionals. They aren’t paying for scale or for availability. They’re really paying for time. They’re paying for a process that reduces the amount of time that they spend configuring low level tasks that are menial and time consuming. Building systems takes time. Automation reduces the time it takes. Process reduces it even further. So organizations looking to move to the cloud are essentially trading one resource, money, for a more important resource, time. Likewise, the large cloud providers that are building these systems are trading their resource, time, for a more valuable resource, money.


Tom’s Take

I don’t believe that smaller enterprises will ever truly embrace the idea of building their own OCP switches running custom Linux distros and custom built routing processes. Because to them, time is way too important. Time to focus on the business. Time to focus on supporting the way that the money is made. Time to do more. Likewise, I expect that large enterprises and providers like Facebook will continue to push the envelope of development and create new solutions. Because they have the time to play and test and build. And use those skills to make money. But I never see a world where those two places meet. Because resource contention is different between these two groups and it causes different outcomes. And the value of those resources is unlikely to change without massive disruption.

The Complexity Of Choice

Russ White had an interesting post this week about the illusion of choices and how herd mentality is driving everything from cell phones to network engineering and design. I understand where Russ is coming from with his points, but I also think that Russ has some underlying assumptions in his article that ignore some of the complexity that we don’t always get to see in the world. Especially when it comes to the herd.

Collapse Into Now

Russ talks about needing to get a new mobile phone. He talks about how there are only really two choices left in the marketplace and how he really doesn’t want either of them. While I applaud Russ and his decision to stand up for his principles, there are more than two choices. He could easily purchase a used Windows mobile phone from eBay. He could choose to run a Palm Treo 650 or a Motorola RAZR from 2005. He could even choose not to carry a phone.

You’re probably saying, “That’s not a fair comparison. He needs feature X on his phone, so he can’t use phone Y.”

And you would be right! So right, in fact, that you’ve already missed one of the complexities behind making choices. We create artificial barriers to reduce the complexity of options because we have needs to meet. We eliminate chaos when making decisions by creating order to limit our available pool of resources.

Let’s return to the mobile phone argument for a moment. Obviously, not having a phone is not an option. But what does Russ need on his phone that pushed him to the iPhone? I’m assuming he needs more than just voice calling capability. That means he can’t use a Jitterbug even though it’s a very serviceable phone for the user base that wants that specific function set. I’m also sure he needs a web browser capable of supporting modern web designs. Perhaps it’s a real keyboard and not T9 predictive text for everything you could want to type. That eliminates the RAZR from contention.

What we’re left with when we develop criteria to limit choices is not two things we feel very ambivalent about. It’s actually the resulting set of operations on chaos to provide order. Android and iPhone don’t strongly resemble each other because of coincidence. It’s because the resulting set of decision making by consumers has led them to this point. Windows Mobile also suspiciously looks like Android/iOS for the same reason. If the mobile OS on Russ’s phone looked more like Windows Mobile 5.0 instead, I’m sure his reasons for not choosing iOS or Android would have been much stronger.

Automatic For The People

So, why is it in networking and mobile phones and even extra value meals that we find our choices “eliminated” so frequently? How is it that we have the illusion of choice without many real choices at all? As it turns out, the illusion is there not to reinforce our propensity to make choices, but instead to narrow our list of possible choices to something smaller than the set of Everything In The World.

Let’s take a hamburger for instance. If you want a hamburger, you will probably go to a place that makes them. You’ve already made a lot of choices in just that one decision, but let’s move past the basic choices. What is your hamburger going to look like? One meat patty? Three? Will it have a special kind of sauce? Will it be round? Square? Four inches thick? These are all unconscious choices that we make. Sometimes, we don’t make these choices by eliminating every option. Instead, we make them by choosing a location and analyzing the options available there. If I pick McDonald’s as my hamburger location because of another factor, like location or time, then I’ve artificially limited my choices to what’s available at McDonald’s at that moment. I can’t get a square hamburger with extra bacon. Not because it’s unfair that McDonald’s doesn’t offer that option. But because I made my choice about available options before I ever got to that point. The available set of my choices will include round meat patties and American cheese. If I want something different, I need to pick a different restaurant, not demand McDonald’s give me more choices.

Artificially limiting choices for people making decisions sometimes isn’t about resource availability. It’s about product creation. It’s about assembly and complexity in making devices. It’s about what people make important in their decision making process. Let’s take a pretty well-known example:

On the left is the way cell phones were designed before 2007. Every one of them looks interesting in some way. Some had keyboards. Some had flip options. Some had pink cases or external antennas. Now, look at the cluster on the right. They all look identical. They all look like the Apple Mobile Phone Device that debuted in 2007. Why? Because customers were making a choice based on form factor that informed the rest of their choices. In 2008, if you wanted a mobile device that was a black rectangle with a multi-touch screen and a software keyboard, your options were pretty limited. Now, I challenge you to find a phone that doesn’t have those options. It’s like trying to find a car without power windows. They do exist, but you have to make some very specific choices to find one.


Tom’s Take

Some choices are going to be made for us before we ever make our first decision. We can’t buy networking switches at McDonald’s. We can’t eat our mobile phones. But what we can do to give ourselves more choices is realize that the trade-off for that is expending energy. We must do more to find alternatives that meet our requirements. We have to “think outside the box”, whether that means finding a used device we really like somewhere or rolling our own Linux distribution from Slackware instead of taking the default Ubuntu installation. It means that we’re going to have to make an effort to include more choices instead of making choices that automatically exclude certain options. And after you’ve done that for more than a few things, you will realize that the illusion of few choices isn’t really an illusion, but a mask that helps you preserve your energy for other things.

Automating Documentation

Tedium is the enemy of productivity. The fastest way for a task to not be done is to make it long, boring, and somewhat complicated. People who feel that something is tedious or repetitive are the ones more likely to marginalize a task. And I think I speak for the entire industry when I say that there is no task more tedious and boring than documentation. So how can we fix it?

Tell Me What You Did

I’m not a huge fan of documentation. When I decide on a plan of action, I rarely write it down step-by-step unless I’m trying to train someone. Even then, it looks more like notes with keywords instead of a narrative to follow. It’s a habit that has been born out of years of firefighting in networks and calls to “do it faster”. The essential items of a task are refined and reduced until all that remains is the work and none of the ancillary items, like documentation.

Based on my previous life as a network engineer, I can honestly say that I’m not alone in this either. My old company made lots of money doing network discovery engagements. Sometimes these came because the previous admins walked out the door with no documentation. Other times, it was simply because the network had changed so much since the last person made any notes that what was going on didn’t resemble anything like what they thought it was supposed to look like.

This happens everywhere. It doesn’t take many instances of a network or systems professional telling themselves, “Oh, I’ll write it down later…” for later to never come. Devices get added, settings get changed, and not one word is ever written down. That’s the kind of chaos that causes disorganization at best and outages at worst. And I doubt there’s any networking pro out there that hasn’t been affected by bad documentation at one time or another.

So, how do we fix documentation? It’s tedious for sure. Requiring it as part of the process just invites people to find ways around it. And good documentation takes time. Is there a way to combine the lack of time, lack of requirement, and repetition and make documentation something that is done again? I think there is. And it requires a little help from process.

Not Too Late To Automate

Automation is a big thing right now. SDN is driving it. Network complexity is practically requiring it. Yet networking professionals are having a hard time embracing it. Why?

In part, networking pros don’t like to spend hours solving a problem that can be done in minutes. If you don’t believe me, watch one of the old SNL Nick Burns sketches. Nick is more likely to tell you to move than tell you how to fix your problem. Likewise, if a network pro is spending four hours writing an automation script that is supposed to execute a change that can be made in 20 minutes, they’re not going to want to do it. It’s just the nature of the job and the desire of the network professional to make every minute count.

So, how can we drive adoption of automation? As it turns out, automating documentation can be a huge driver. Tedious tasks are exactly the thing that scripting and automation were designed to solve. Instead of focusing on the automation of the task, like adding VLANs to a set of switches, focus on the ability of the system to create documentation on the fly from the change.

Let’s walk through an example. In order for documentation to matter, it has to answer the 5 Ws. How can we automate that?

Let’s start with Who. Automation can create documentation saying user Hollingsworth made a change through an automated process. That helps the accounting side of the house figure out the person making changes in the network. If that person is actually a script, the Who can be changed to reflect that it was an automated process called by a person related to a change ticket. That gives everyone the ability to track the changes back to a given problem. And it can all be pulled in without user intervention.
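Here’s a rough sketch of what capturing the Who could look like in Python. The field names and the ticket format are my own assumptions for illustration, not the schema of any particular tool.

```python
import getpass
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeActor:
    """Who made the change. All field names here are hypothetical."""
    user: str                     # the person who kicked off the change
    ticket: str                   # the change ticket the work is tied to
    script: Optional[str] = None  # set when an automated process did the work

    def describe(self) -> str:
        # If a script did the work, say so, but still name the person behind it
        if self.script:
            return (f"automated process {self.script} called by "
                    f"{self.user} for ticket {self.ticket}")
        return f"user {self.user} for ticket {self.ticket}"


def current_actor(ticket: str, script: Optional[str] = None) -> ChangeActor:
    # getpass.getuser() picks up the login name, so nobody has to type it in
    return ChangeActor(user=getpass.getuser(), ticket=ticket, script=script)
```

Calling `ChangeActor("hollingsworth", "CHG-1234").describe()` gives you a line that can be dropped straight into a change log without anyone filling out a form.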

What is also an easy automation task. List the configuration being applied. At first, the system can simply list the configuration to be programmed. But for menial and repetitive tasks like VLAN additions you can program the system with a real description like “Adding VLANs to $Switch to support $ticket”. Those variables can be autopopulated based on the work to be done. Again, we reference a ticket number in order to prove that these changes are coming from somewhere.
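The description template above maps almost directly onto Python’s built-in `string.Template`. This is only a sketch. The switch name, ticket number, and config lines are made-up placeholders.

```python
from string import Template
from typing import List

# The same description used above, with $Switch and $ticket as placeholders
DESCRIPTION = Template("Adding VLANs to $Switch to support $ticket")


def describe_change(switch: str, ticket: str, config_lines: List[str]) -> str:
    """Build the What: a readable summary followed by the config applied."""
    summary = DESCRIPTION.substitute(Switch=switch, ticket=ticket)
    # Indent the raw configuration under the summary so both are recorded
    body = "\n".join(f"    {line}" for line in config_lines)
    return f"{summary}\n{body}"


# Hypothetical values, for illustration only
print(describe_change("sw-branch-01", "CHG-1234", ["vlan 100", "name VOICE"]))
```

`substitute()` raises a `KeyError` if a placeholder is left unfilled, which is handy here. A change with no ticket number fails loudly instead of being documented badly.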

When is also critical. Are these changes happening in a maintenance window? Or did someone check them in in the middle of the day because they won’t cause any problems? (SPOILER ALERT: They will) By requiring a timestamp for changes, you can track which professionals are being cavalier with their change management. You can also find out if someone is getting into the system after hours to cause problems or attempt to compromise things. Even if the cause of the change is “immediately” due to downtime or emergency, knowing why it had to be checked in right away is a clue to finding problems that recur in the network.
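A minimal sketch of the When check might look like this. The 22:00 to 04:00 UTC window and the function names are assumptions I picked for the example. Your window will be whatever your change board says it is.

```python
from datetime import datetime, time, timezone
from typing import Optional

# Hypothetical maintenance window: 22:00 to 04:00 UTC, wrapping past midnight
WINDOW_START = time(22, 0)
WINDOW_END = time(4, 0)


def in_maintenance_window(ts: datetime) -> bool:
    """True if a timezone-aware timestamp falls inside the window."""
    t = ts.astimezone(timezone.utc).time()
    if WINDOW_START <= WINDOW_END:
        return WINDOW_START <= t < WINDOW_END
    # The window wraps midnight, so it is late evening OR early morning
    return t >= WINDOW_START or t < WINDOW_END


def stamp_change(ts: Optional[datetime] = None, emergency_reason: str = "") -> dict:
    """Record the When, and demand a reason for out-of-window changes."""
    ts = ts or datetime.now(timezone.utc)
    inside = in_maintenance_window(ts)
    if not inside and not emergency_reason.strip():
        raise ValueError("change is outside the maintenance window; give a reason")
    return {
        "when": ts.isoformat(),
        "in_window": inside,
        "emergency_reason": emergency_reason.strip(),
    }
```

The midday change still goes through, but only with a reason attached. That reason is the clue you’ll want later when the same emergency shows up for the third time.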

Where is a two-pronged reason. It’s important to check where the changes are going to be applied. Is it going to be done to all switches in the organization? Or just a set in a remote office? Sanity checking via documentation will keep you from bricking your entire organization in one fell swoop. Likewise, knowing where the change is being checked in from is important. Is a remote office trying to change config on HQ switches? Is a remote engineer dialed in making changes related to an open support case? Is someone from a foreign nation making changes via VPN at 4:30am local time? In every case, you’d really want to know what’s going on before those changes get made.
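Both prongs of Where can be checked before anything gets pushed. The sketch below uses Python’s standard `ipaddress` module. The inventory, the site names, and the source networks are all invented for the example, so treat it as a starting point and not a real policy engine.

```python
import ipaddress
from typing import Dict, List

# Hypothetical inventory: which site each switch lives in
INVENTORY = {"hq-core-01": "hq", "hq-core-02": "hq", "branch-sw-01": "branch"}

# Hypothetical policy: which sites each source network is allowed to change
ALLOWED_SOURCES = {
    "10.0.0.0/16": {"hq", "branch"},  # HQ management network
    "10.20.0.0/16": {"branch"},       # remote office network
}


def where_report(targets: List[str], source_ip: str) -> Dict[str, object]:
    """Document where a change lands and where it was submitted from."""
    unknown = [t for t in targets if t not in INVENTORY]
    if unknown:
        raise KeyError(f"switches not in inventory: {unknown}")
    sites = {INVENTORY[t] for t in targets}

    # Collect every site this source address is permitted to touch
    addr = ipaddress.ip_address(source_ip)
    allowed_sites = set()
    for net, net_sites in ALLOWED_SOURCES.items():
        if addr in ipaddress.ip_network(net):
            allowed_sites |= net_sites

    return {
        "targets": sorted(targets),
        "sites": sorted(sites),
        "touches_everything": set(targets) == set(INVENTORY),
        "source": source_ip,
        "source_allowed": sites <= allowed_sites,
    }
```

A report with `touches_everything` set to true or `source_allowed` set to false is exactly the kind of thing a human should look at before the change goes out.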

Why is the one that will trip up everything. If you don’t believe me, I’d like to give you the top two reasons why Windows Server 2003 is shut down and rebooted with the shutdown justification dialog box:

  1. a;lkdjfalkdflasdfkjadlf;kja;d
  2. JUST ****ing SHUT DOWN!!!!!

People don’t like justifying their decisions. Even when I worked for Gateway 2000 on their national help desk, our required call documentation was a bit spotty when it came to justification for changes. Why did you decide to FDISK and reload? Why are you going into the registry to fix the icon colors? Change justification is half of documentation. It gives people something to audit. It gives people a way to look at things and figure out why you started down the path of a particular reasoning for problem solving. It also provides context for you after the fact when you can’t figure out why you did it the way you did.
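You can’t automate a good reason, but you can at least refuse the obviously bad ones. Here’s a deliberately crude sketch. The five-word threshold is an arbitrary number I picked, and a check like this only filters out keyboard mashing. It doesn’t replace someone actually reading the justification.

```python
import re
from typing import Optional

MIN_WORDS = 5  # arbitrary threshold, assumed for illustration


def justification_problem(text: str) -> Optional[str]:
    """Return why a justification is unusable, or None if it passes."""
    # Count runs of two or more letters as words; punctuation and
    # single stray characters from keyboard mashing do not count
    words = re.findall(r"[A-Za-z]{2,}", text)
    if len(words) < MIN_WORDS:
        return f"needs at least {MIN_WORDS} real words, found {len(words)}"
    return None


# The two shutdown "reasons" quoted above both get rejected
for reason in ("a;lkdjfalkdflasdfkjadlf;kja;d", "JUST ****ing SHUT DOWN!!!!!"):
    print(repr(reason), "->", justification_problem(reason))
```

Pair the rejection with a required ticket number and the script already has most of the context it needs to write the Why for you.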


Tom’s Take

Automation isn’t going to take away your job. Automation is going to do the jobs you hate doing. It’s going to make your life easier to concentrate on the tasks that need to be done by freeing you from the tasks that should be done and aren’t. If we can make automation document our networks for just six months, I think you’ll find the value in programming things to work this way. I also think you’ll be happier with the level of detail on your network. And once you can prove the value of automating just one task to your teams, I’m sure they’ll see the value of increasing automation all around.