Failure Is Fine, Learning Is Mandatory

“Failure is a harsh teacher because it gives the test first and the lesson afterward.” — Vernon Law

I’m seeing a thread going around on Twitter today that is encouraging people to share their stories of failure in their career. Maybe it was a time they created a security hole in a huge application. Perhaps it was creating a routing loop in a global corporation. Or maybe it was something as simple as getting confused about two mailboxes and deleting the wrong one and realizing your mail platform doesn’t have undelete functionality.

We fail all the time. We try our hardest and whatever happens isn’t what we want. Some of those that fail just give up and assume that juggling isn’t for them or that they can never do a handstand. Others keep persevering through the pain and challenge and eventually succeed because they learn what they need to know in order to complete their tasks. Failure is common.

What is different is how we process the learning. Some people repeat the same mistakes over and over again because they never learn from them. In a professional setting, toggling the wrong switch when you create someone’s new account has a very low learning potential because it doesn’t affect you down the road. If you accidentally check a box that requires them to change their password every week you’re not going to care because it’s not on your account. However, if the person you do that to has some kind of power to make you switch it back or if the option puts your job in jeopardy you’re going to learn very quickly to change your behavior.

Object Failure

Here’s a quick one that illustrates how the motivation to learn from failure sometimes needs to be more than just “oops, I screwed up”. I’ll make it a bullet point list to save time:

  • Installed new phone system for school district
  • Used MGCP as the control protocol
  • Need to solve a PRI caller ID issue at the middle school
  • Gateway is at the high school
  • Need to see all the call in the system
  • Type debug mgcp packet detail in a telnet session
  • A. Telnet. Session.
  • Router locks up tight and crashes
  • Hear receptionist from the other room say, “Did you just hang up on me?”
  • Panic
  • Panic some more
  • Jump in my car and break a couple of laws getting across town to restart router that I’m locked out of
  • Panic a lot in the five minutes it takes to reboot and reassociate with CallManager
  • Swear I will never do that again

Yes, I did the noob CCIE thing of debugging packets on a processing device in production because I underestimated the power of phone calls as well as my own stupidity. I got better!

But I promise that if I’d have done this and it would have shut down one phone call or caused an issue for one small remote site I wouldn’t have leaned a lesson. I might even still be doing that today to look at issues. The key here is that I shut down call processing for the entire school district for 20 minutes at the end of the school day. You have no idea how many elementary school parents call the front office at the end of the day. I know now.

Lessons have more impact with stress. It’s something we see in a lot of situations where we train people about how to behavior in high pressure situations. I once witnessed a car accident right in front of me on a busy highway and it took my brain almost ten seconds to process that I needed to call emergency services (911 in the US) even though I had spent the last four years programming that dial peer into phone systems and dialing it for Calling Line ID verification. I’d practiced calling 911 for years and when I had to do it for real I almost forgot what to do. We have to know how people are going to react under stress. Or at least anticipate how people are going to behave. Which is why I always configured 9.911 as a dial peer.

Lessons Learned

The other important thing about failure is that you have to take stock of what you learn in the post-mortem. Even if it’s just an exercise you do for yourself. As soon as you realize you made a mistake you need to figure out how to learn from it and prevent that problem again. And don’t just say to yourself, “I’m never doing that again!” You need to think about what caused the issue and how you can ingrain the learning process into your brain.

Maybe it’s something simple like creating a command alias to prevent you from making the wrong typo again and deleting a file system. Maybe it’s forcing yourself to read popup dialog boxes as you click through the system to make sure you’re deleting the right file or formatting the right disk. Someone I used to work with would write down the name of the thing he was deleting and hold it up to the screen before he committed the command to be sure they matched. I never asked what brought that about but I’m sure it was a ton of stress.


Tom’s Take

I screw up. More often than even I realize. I try to learn as much as I can when I’m sifting through the ashes. Maybe it’s figuring out how I went wrong. Perhaps it’s learning why the thing I wanted to do didn’t work the way I wanted it to. It could even be as simple as writing down the steps I took to know where I went wrong and sharing that info with a bunch of strangers on the Internet to keep me from making the same mistake again. As long as you learn something you haven’t failed completely. And if you manage to avoid making the exact same mistake again then you haven’t failed at all.

Tale From The Trenches: The Debug Of Damocles

My good friend and colleague Rich Stroffolino (@MrAnthropology) is collecting Tales from the Trenches about times when we did things that we didn’t expect to cause problems. I wanted to share one of my own here about the time I knocked a school offline with a debug command.

I Got Your Number

The setup for this is pretty simple. I was deploying a CallManager setup for a multi-site school system deployment. I was using local gateways at every site to hook up fax lines and fire alarms with FXS/FXO ports for those systems to dial out. Everything else got backhauled to a voice gateway at the high school with a PRI running MGCP.

I was trying to figure out why the station IDs that were being send by the sites weren’t going out over caller ID. Everything was showing up as the high school number. I needed to figure out what was being sent. I was at the middle school location across town and trying to debug via telnet. I logged into the router and figured I would make a change, dial my cell phone from the VoIP phone next to me, and see what happened. Simple troubleshooting, right?

I did just that. My cell phone got the wrong caller ID. And my debug command didn’t show me anything. So I figured I must be debugging the wrong thing. Not debug isdn q931 after all. Maybe it’s a problem with MGCP. I’ll check that. But I just need to make sure I’m getting everything. I’ll just debug it all.

debug mgcp packet detail

Can You Hear Me Now?

Veterans of voice are probably screaming at me right now. For those who don’t know, debug anything detail generates a ton of messages. And I’m not consoled into the router. I’m remote. And I didn’t realize how big of a problem that was until my console started scrolling at 100 miles an hour. And then froze.

Turns out, when you overwhelm a router CPU with debug messages, it shuts off the telnet window. It also shuts off the console as well, but I wouldn’t have known that because I was way far way from that port. But I did starting hearing people down the hall saying, “Hello? Hey, are you still there? Weird, it just went dead.”

Guess what else a router isn’t doing when it’s processing a tidal wave of debug messages? It’s not processing calls. At all. For five school sites. I looked down at my watch. It was 2:00pm. That’s bad. Elementary schools get a ton of phone calls within the last hour of being in session. Parents calling to tell kids to wait in a pickup line or ride a certain bus home. Parents wanting to check kids out early. All kinds of things. That need phones.

I raced out of my back room. I acknowledged the receptionists comment about the phones not working. I jumped in my car and raced across town to the high school. I managed not to break any speed limits, but I also didn’t loiter one bit. I jumped out of my car and raced into the building. The look on my face must have warded off any comments about phone system issues because no one stopped me before I got to the physical location of the voice gateway.

I knew things were bad. I didn’t have time to console in and remove the debug command. I did what ever good CCIE has been taught since the beginning of time when they need to remove a bad configuration that broke their entire lab.

I pulled the power cable and cycled the whole thing.

I was already neck deep in it. It would have taken me at least five minutes to get my laptop ready and consoled in. In hindsight, that would have been five wasted minutes since the MGCP debugger would have locked out the console anyway. As the router was coming back up, I nervously looked at the terminal screen for a login prompt. Now that the debugger wasn’t running, everything looked normal. I waiting impatiently for the MGCP process to register with CallManager once more.

I kept repeating the same status CLI command while I refreshed the gateway page in CallManager over and over. After a few more tense minutes, everything was back to normal. I picked up a phone next to the rack and dialed my cell phone. It rang. I was happy. I walked back to the main high school office and told them that everything was back to normal.

Tom’s Take

My post-mortem was simple. I did dumb things. I shouldn’t have debugged remotely. I shouldn’t have used the detail keyword for something so simple. In fact, watching my screen fill up with five sites worth of phone calls in a fraction of a second told me there was too much going on behind the scenes for me to comprehend anyway.

That was the last time I ever debugged anything in detail. I made sure from that point forward to start out small and then build from there to find my answers. I also made sure that I did all my debugging from the console and not a remote access window. And the next couple of times I did it were always outside of production hours with a hand ready to yank the power cable just in case.

I didn’t save the day. At best, all I did was cover for my mistake. If it had been a support call center or a hospital I probably would have been fired for it. I made a bad decision and managed to get back to operational without costing money or safety.

Remember when you’re doing your job that you need to keep an eye on how your job will affect everything else. We don’t troubleshoot in a vacuum after all.