My good friend and colleague Rich Stroffolino (@MrAnthropology) is collecting Tales from the Trenches about times when we did things that we didn’t expect to cause problems. I wanted to share one of my own here about the time I knocked a school offline with a debug command.
I Got Your Number
The setup for this is pretty simple. I was deploying CallManager for a multi-site school system. I was using local gateways at every site to hook up fax lines and fire alarms with FXS/FXO ports for those systems to dial out. Everything else got backhauled to a voice gateway at the high school with a PRI running MGCP.
I was trying to figure out why the station IDs that were being sent by the sites weren’t going out over caller ID. Everything was showing up as the high school number. I needed to figure out what was being sent. I was at the middle school location across town and trying to debug via telnet. I logged into the router and figured I would make a change, dial my cell phone from the VoIP phone next to me, and see what happened. Simple troubleshooting, right?
I did just that. My cell phone got the wrong caller ID. And my debug command didn’t show me anything. So I figured I must be debugging the wrong thing. Not debug isdn q931 after all. Maybe it’s a problem with MGCP. I’ll check that. But I just need to make sure I’m getting everything. I’ll just debug it all.
debug mgcp packet detail
Can You Hear Me Now?
Veterans of voice are probably screaming at me right now. For those who don’t know, debug anything detail generates a ton of messages. And I’m not consoled into the router. I’m remote. And I didn’t realize how big of a problem that was until my console started scrolling at 100 miles an hour. And then froze.
Turns out, when you overwhelm a router CPU with debug messages, it shuts off the telnet window. It shuts off the console as well, but I wouldn’t have known that because I was way far away from that port. But I did start hearing people down the hall saying, “Hello? Hey, are you still there? Weird, it just went dead.”
Guess what else a router isn’t doing when it’s processing a tidal wave of debug messages? It’s not processing calls. At all. For five school sites. I looked down at my watch. It was 2:00pm. That’s bad. Elementary schools get a ton of phone calls within the last hour of being in session. Parents calling to tell kids to wait in a pickup line or ride a certain bus home. Parents wanting to check kids out early. All kinds of things. That need phones.
I raced out of my back room. I acknowledged the receptionist’s comment about the phones not working. I jumped in my car and raced across town to the high school. I managed not to break any speed limits, but I also didn’t loiter one bit. I jumped out of my car and raced into the building. The look on my face must have warded off any comments about phone system issues because no one stopped me before I got to the physical location of the voice gateway.
I knew things were bad. I didn’t have time to console in and remove the debug command. I did what every good CCIE has been taught since the beginning of time when they need to remove a bad configuration that broke their entire lab.
I pulled the power cable and cycled the whole thing.
I was already neck deep in it. It would have taken me at least five minutes to get my laptop ready and consoled in. In hindsight, that would have been five wasted minutes since the MGCP debugger would have locked out the console anyway. As the router was coming back up, I nervously looked at the terminal screen for a login prompt. Now that the debugger wasn’t running, everything looked normal. I waited impatiently for the MGCP process to register with CallManager once more.
I kept repeating the same status CLI command while I refreshed the gateway page in CallManager over and over. After a few more tense minutes, everything was back to normal. I picked up a phone next to the rack and dialed my cell phone. It rang. I was happy. I walked back to the main high school office and told them that everything was back to normal.
My post-mortem was simple. I did dumb things. I shouldn’t have debugged remotely. I shouldn’t have used the detail keyword for something so simple. In fact, watching my screen fill up with five sites’ worth of phone calls in a fraction of a second told me there was too much going on behind the scenes for me to comprehend anyway.
That was the last time I ever debugged anything in detail. I made sure from that point forward to start out small and then build from there to find my answers. I also made sure that I did all my debugging from the console and not a remote access window. And the next couple of times I did it were always outside of production hours with a hand ready to yank the power cable just in case.
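For anyone who wants to see what “start out small” can look like on an IOS voice gateway, here is a rough sketch. It is an illustration, not the exact commands I ran back then. It also swaps in a technique I didn’t describe above: sending debug output to a memory buffer instead of the console, so a burst of messages has less chance of swamping the terminal line. The buffer size is an assumed value, and the debug shown is the narrow Q.931 one I started with, not the MGCP detail debug that caused the outage.

```
! Illustrative sketch only. The 64000-byte buffer size is an assumption.
configure terminal
 ! Keep debug output off the console line
 no logging console
 ! Capture debug-level messages in memory instead
 logging buffered 64000 debugging
end
! Start with the narrowest debug that might answer the question
debug isdn q931
! Place a single test call, then read what was captured
show logging
! Turn all debugging off as soon as you have your answer
undebug all
```

Even with output buffered, a chatty debug on a busy gateway can still eat CPU, so the rest of the advice stands: do it outside production hours and know how you will back out.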
I didn’t save the day. At best, all I did was cover for my mistake. If it had been a support call center or a hospital I probably would have been fired for it. I made a bad decision and managed to get back to operational without costing money or safety.
Remember when you’re doing your job that you need to keep an eye on how your job will affect everything else. We don’t troubleshoot in a vacuum, after all.