The Splinter In Your Mind

You don’t know what it is, but it’s there, like a splinter in your mind, driving you mad. – Morpheus

We’ve all had a moment when we’re troubleshooting an issue and something just doesn’t feel right.  Or we’ve put together a solution and it works, but there’s a little voice in the back of our minds telling us that something is missing.  We can’t quite put our finger on it, but we know we can’t rest until we’ve figured it out.

The quote above comes from the first Matrix movie, when Morpheus is trying to explain to Thomas Anderson exactly why he feels so out of place in the world.  The term actually predates that movie, having been the title of an excellent Star Wars novel.  Splinter of the Mind’s Eye was released in 1978, so it’s almost as old as I am!  The term describes that feeling you get when something is nagging away at you and won’t let go.  I get it quite often.  Sometimes I recognize a person but can’t remember who they are.  Other times I can’t remember a critical step to a project.  But it comes very often when I’m troubleshooting a problem and the solution is just agonizingly out of reach.

How do you combat the splinter?  What can you do to overcome that feeling that will drive you crazy in short order?  Here are a few things I do.  Sometimes they help, sometimes they don’t.  But the idea is to try and dislodge the splinter and get your thought process rolling again.

Think It Through

This is probably my favorite solution.  When I’m faced with a tough problem and an elusive solution, my first step is to walk through the problem step by step.  If it’s a routing loop, I talk my way through the installation of the route into the routing table.  If the problem is a layer 2 issue, I think through the packet as it goes through the network.  The key is that you envision every step along the way.  Often our minds get distracted by an unimportant step and leave it out.  By going back through and thinking of every piece you often force an overlooked concern to the surface.  This can cause the splinter to move and create a new line of thinking.  Perhaps that routing loop is being caused by a redistribution?  Maybe you didn’t know the network used to run RIP and now is running OSPF.  By imagining the packet moving through the network, you can understand where the problem can occur.

You can also speak out loud when thinking things through.  I find it very useful to actually speak the words as I’m thinking.  That’s because my brain runs much faster than my mouth.  By forcing myself to put the thoughts into words, I can usually slow things down long enough to figure out the missing steps.

Draw It Out

If the problem is a little more nebulous, you might need a piece of paper or a whiteboard to draw things.  I’m not the best artist in the world, but I know that a crude diagram of what I’m thinking about will help me visualize things in a new way.  Maybe I forgot to add a piece to the drawing that fixes the issue.  Other times it just helps me think about the problem.  By filtering the splinter in your mind through another creative process like drawing, you can force it out in a different way.  I was very fond of this at my old job when I had a wall-sized whiteboard.  There’s no reason you can’t do it with a regular piece of paper though.  Colored pencils or markers can also help peel apart the layers of the issue.

Forget About It

Yes, it is strange advice to just forget about a problem.  But, ask yourself how many times you’ve stumbled onto the solution when you’re taking a shower or just about to fall asleep?  The brain is a miraculous computer, but sometimes it has a focus problem.  If you think about something for too long, you can get fatigued and lose your ability to apply critical reasoning.  It’s a “forest for the trees” kind of issue.

I always made it a point when I was troubleshooting a really hard problem to walk away for a few minutes.  Whether it was stepping out to get something to eat or just walking into a conference room for five minutes, I always tried to find some time to clear my mind and refocus on the situation.  By thinking about a shopping list or an order form or even the batting order of the 1962 Yankees you can jar the splinter loose and create new connections.  I always joked with my coworkers that the most efficient way to solve high severity issues was to install a shower in my office.  They didn’t find it nearly as funny as I did.


Tom’s Take

I can’t promise that these solutions are going to fix that nagging feeling in the back of your mind.  Some problems are just that tough.  But when you’ve applied every bit of critical reasoning you can to an issue and you’ve reached the point where you’re stuck but just can’t let go, sometimes it helps to apply one of the above methods.

If you let the splinter fester in the back of your mind, you’ll constantly be asking yourself what you can do or what you need to look at to fix things.  It will eventually consume you if you let it.  Instead, you should look at a way to move the splinter.  If you can do that you’ll sleep better at night.

Troubleshooting and Triage

When troubleshooting any major issue, people tend to feel a bit lost at first.  There is the crowd that wants to fix the immediate problem.  Then there is the group that wants to look at everything going on and address the root problem no matter how long it takes.  The key to troubleshooting is to realize how each of these approaches has its place and how they are both right and wrong at the same time.

The first approach is triage.  Think of it like a medical emergency room.  Their purpose is to fix the immediate symptoms and stabilize the patient.  Especially critical is the stabilization part.  You can’t fix a network that has bouncing routes or intermittent bridging loops.  Often the true root cause of the problem is buried beneath a pile of other symptoms.  Only when the immediate issues are resolved does the real problem surface.  Learning how to triage problems is a very important troubleshooting skill.  It gives a quick response while allowing the worst of the issue to be dealt with.

It’s important to remember that triage is just a quick fix.  Emergency rooms would never triage a patient without following up with a more in-depth consult or return visit.  Triage fails when engineers leave the patch in place and consider it the final solution.  Most of the times I’ve seen this approach, it has been due to time constraints.  Rather than spending the time to research and test to find the true problem, people are content to make the majority of the symptoms go away no matter how briefly.  It happens all the time.

“Just make it work for now.  We’ll fix it later.”

“If we configure it like this, will it stay up until the end of the quarter?”

“We don’t have time to debate this.  The CEO wants things up NOW!”

True in-depth troubleshooting is what happens when we have time and a clear way to solve the deeper root issues.  Deep troubleshooting figures out that the cause of a route flap is actually a bad Ethernet cable.  That’s not something you can easily determine from a quick analysis.  It takes time and effort to figure out.  When I worked on an inbound desktop help desk, we tested for CD-ROM failures by flipping the IDE cables back and forth on the IDE ports on the motherboard.  In part, this was to test to ensure the drive failure followed the switch of cables and ports.  In addition, it also tested the cable and port to make sure the dead drive wasn’t masking a bigger failure.  It took more time to do it properly but we never ran into an issue where a good CD-ROM drive was returned and the problem persisted.

In-depth troubleshooting can fail when there are so many problems masking the real issue that you start trying to fix the wrong problem.  Tunnel vision is easy to get when working on a problem.  If you tunnel in on an ancillary symptom and fail to fix the root cause you aren’t really doing much better than simple triage.  Just like a doctor, you need to ensure that you are treating the real problem under all the symptoms.  Remember not to be sidetracked by each small issue you uncover.  Fix them and keep digging for the real issue underneath it all.


Tom’s Take

I’ve had a lot of people comment that I was able to figure out problems quickly.  They also liked how I was able to “fix” things quickly.  That’s because I was very good at triage.  In my job as a VAR engineer, I didn’t really have time to dig deeper into the issue to uncover root cause.  Thankfully, a couple of the guys that I worked with were the exact opposite of me.  They loved digging into problems and pulling everything apart until they found the real issue.  They were labeled “slow” or “methodical” by some.  I loved working with them because they complemented my style perfectly.  I fix the big issues and make people happy.  They fix the underlying cause and keep them that way.  Just like ER doctors and specialists.  We both have our place.  It’s important to realize which is more important at a given time.

IT Jugglers

I once interviewed for a job where the interviewer asked how I decided to work on tasks. He said, “There are two kinds of workers. The first concentrates on a task and does nothing else until it is completed. They can only do one thing at a time. Then, there are the jugglers. Which one are you?” When I responded that I tended toward the latter, the interviewer smiled.  That was obviously the answer he was looking for.

IT is very much defined by focus. Being able to work on a project until it is totally finished is a very admirable quality. In my experience, especially in the VAR world, it is equally important to be able to shift your focus quickly to other tasks that require attention. As indicated above, it’s not unlike juggling. Being able to focus on a project for a few hours or days and then move to a different project for a few hours can be a very critical skill for high level engineers.

Technology has been doing this for years. Think about a preemptive multitasking CPU. It appears to be doing many things at once. It’s really executing instructions for a given process for a period of time (a timeslice). Because you can process enough instructions in that time to accomplish a function it all appears to work like magic. The key is to tune the processor to use the right timeslices. If the timeslice is too long the processor will sit idle waiting for the program to generate new instructions. If the timeslice is too short the program won’t be able to execute enough instructions during the window and the program will appear unresponsive. Just like a juggler, it’s all about the timing.
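To make the timeslice idea a little more concrete, here is a minimal sketch of a round-robin scheduler in Python. The task names, the work units, and the round_robin function are all hypothetical and exist only for illustration; a real operating system scheduler is far more involved than this.

```python
from collections import deque


def round_robin(tasks, timeslice):
    """Simulate round-robin scheduling.

    tasks: dict mapping a (hypothetical) task name to the units of work it needs.
    timeslice: the most units a task may run before it is preempted.
    Returns the list of (task, units_run) slices in execution order.
    """
    if timeslice <= 0:
        raise ValueError("timeslice must be positive")
    queue = deque(tasks.items())
    schedule = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(timeslice, remaining)
        schedule.append((name, ran))
        remaining -= ran
        if remaining > 0:
            # Not finished yet: preempt the task and send it to the back of the line.
            queue.append((name, remaining))
    return schedule


if __name__ == "__main__":
    work = {"routing": 5, "logging": 2, "backup": 4}
    for name, ran in round_robin(work, timeslice=2):
        print(f"{name} ran for {ran} unit(s)")
```

Try changing the timeslice value. With a large timeslice each task runs to completion before the next one gets a turn, and with a small one the tasks trade places constantly, which is the tuning problem in miniature.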

Choosing what to juggle in IT is almost as important as knowing how to do it. When you are just starting out with juggling, you use safe, soft objects to contain the damage. You don’t start off with chainsaws and Molotov cocktails. When juggling IT projects, be sure to juggle those that don’t have hard deadlines or require critical path updates on a regular basis. If you’re required to provide a weekly update on an installation, be sure you’ve allocated enough time during the week to do something. Otherwise, that weekly installation report is going to look pretty thin.

When learning to juggle, most people spend entirely too much time worrying about the ball in their hand.  They tend to lose focus of all the other objects floating in the air.  That’s why they tend to start dropping them.  In the same way, you can’t be so dialed in on one project that you completely neglect all the other things going on.  Finding a good point to stop one task and start working on another is a very fine art.

This isn’t for everyone.  If you’re a person that can’t shift focus fast enough to keep all the balls (or projects) in the air without dropping something, you should avoid working on many things at once.  There’s no shame in having laser focus on something.  It works well for a lot of folks.  It gets hard things done right.  It’s just another way to get the job done.


Tom’s Take

I’m a juggler.  I try to keep everything going at once while I wrap up what I can.  I do my best to avoid dropping things, but something slips through from time to time.  I also taught myself to juggle in real life.  I can keep three tennis balls going with no issues.  I realize my limitations, though.  I know that more than that is too many.  In the project space, I know that having more than I can handle is bad for everything, so I try to keep my focus on a manageable amount of juggled things.  It’s better to juggle a few things well than juggle an impressive number of things poorly.  I’ll let you know when I work my way up to the chainsaws.

Solarwinds – The Right Tool For A New Job

The first presentation of Networking Field Day 5 day 2 was from our old friends at Solarwinds.  We heard from them before at NFD3, but the nice thing about Solarwinds is that they’ve always got new tools coming out.  I’ve also served as a Thwack Ambassador on their forums and been featured as an IT Spotlight Blogger.  I wanted to see what Solarwinds would bring to the table at NFD5.

The geeks from Solarwinds started out with a quick overview of the tool portfolio.  One thing to take note of: most of the tools that you use as standalone products are actually integrated into the larger Orion platform.  Solarwinds makes some of them available as free downloads for trials or point solutions.  You can get all of them together in one big toolbox, provided you have the horsepower to run it all.  I tend to lean more toward the “right tool, right job” mentality rather than getting the whole box.  For every IP SLA monitor crescent wrench I use regularly, there are a multitude of metric socket sets and emergency brake tools that I may never even touch.  That’s why it’s great when Solarwinds makes their software available to all for only the investment of a registration.

You’ll also notice in the video around 20 minutes in, I mention something about Solarwinds and SDN.  Colin McNamara (@colinmacnamara) chided me a bit about “SDN washing” of their technology.  Colin does have a point about overuse of SDN to describe everything under the sun.  Sanjay Castelino even made a post to the effect that what Solarwinds is doing isn’t SDN.  In a sense, he’s right.  These tools aren’t network programmability or overlay networking or even automation.  To me though, a part of what Solarwinds is doing falls under the SDN spectrum in that they can program different devices from a single interface.  Sure, it’s not the sexy sports car idea of network slicing and service instantiation that others are looking at.  Even the ability to quickly configure devices and pull pertinent info from them is better than some of what we’ve got going on right now.  This software allows you to define parameters and configuration in your network.  That’s SDN of some flavor to me.  Maybe not mocha SDN with sprinkles but something a bit different.

This led to a bit of a derailment of the conversation.  The delegates seized on the Solarwinds development model of “giving the customers what they want.”  I’d heard this many times before, so it wasn’t necessarily new to me.  What’s key to me in that message is that you’re going to have a lot of content customers.  Not necessarily happy, but content.  The key difference to me comes from the model.  If you give the customers what they want, they will be pacified.  All their desires are met and they can do their jobs.  However, if you can break outside of the demand-based model and show them something they never knew they needed, you have a real chance to make them deliriously happy.  Think about something like the iPad.  Did we know we needed it before it was released?  Not likely.  Now think about how many people have jumped at the chance to own a tablet device.  If those companies had simply been giving their customers what they asked for, think about the market that would have been missed.  I’m not saying that Solarwinds is doing a bad job by any means.  I just think they need to get a geek in the house working on crazy stuff that will make people say “holy cow!!!”

Solarwinds talked to us about their newest network monitoring pieces.  They’ve got some very interesting tools, including Network Performance Monitor.  There was also some discussion around their IP Address Management (IPAM) tool, which is what I wrote about during my Thwack Ambassadorship.  Thankfully, we had Terry Slattery in the room.  Terry loves the network monitoring discussions, having founded Netcordia and released NetMRI for that purpose before it was purchased by Infoblox.  Terry has seen a lot, and he’s not afraid to tell you what he thinks.  When we discussed the features of User Device Tracker (UDT), he asked if it can do a time-based report on unused switch ports.  When the answer wasn’t clear, he told the geeks, “If you can’t do that, you need to write that down.” We all had a couple of good jokes at their expense, but the fact is that when Terry tells you something is important, especially when it comes to network monitoring, the chances are it’s really important.

Solarwinds is also getting into the API game with SWIS – Solarwinds Information Service.  This SOAP interface (soon to be REST) gives you the ability to write programs to pull data from the network and insert/update the same in many devices.  See what I’m talking about with SDN and the ability to pull info from the network and push it back again?  I think Solarwinds really needs to focus their efforts in this area and drive some more programmability from their tools rather than the old methods of just hiding CLI command pushes and things of that nature.  By allowing users to code to an API, you’ve just abstracted all of the icky parts of the backend away and focused the conversation where it needs to be – on getting problems solved.

If you’d like to learn more about Solarwinds, be sure to check them out at http://www.solarwinds.com.  You can also follow them on Twitter as @solarwinds.  Be sure to check out their discussion forums at http://thwack.solarwinds.com.


Tom’s Take

Solarwinds has awesome tools.  They’re going to have awesome tools in the future.  But they’ve hit on some pieces of the puzzle that are going to do much more than that.  Beyond giving us a toolbox with fancy handles and shiny stickers, they’ve started to do what a lot of other people have done and give us designs for what we should build with the tools they’ve given us.  By expanding into that area of allowing us to program to APIs and put the pieces into a bigger context, they have the ability to transcend being a point product vendor releasing neat toys.  When you can be a meaningful discussion point in any monitoring and management meeting without being dismissed as just a niche player, that’s handy indeed.

Tech Field Day Disclaimer

Solarwinds was a sponsor of Network Field Day 5.  As such, they were responsible for covering a portion of my travel and lodging expenses while attending Network Field Day 5.  In addition, Solarwinds provided me with breakfast at the hotel.  They also gave the delegates a t-shirt and a messenger bag, along with all the stickers and buttons we could fit into our carry-ons.  At no time did they ask for, nor were they promised, any kind of consideration in the writing of this review.  The opinions and analysis provided within are my own and any errors or omissions are mine and mine alone.

Why Is My SFP Not Working?

It’s 3 am. You’ve just finished installing your new Catalyst switches into the rack and you’re ready to turn them up and complete your cutover. You’ve been fighting for months to get the funding to get these switches so your servers can run at full gigabit speed. You had to cut some corners here and there. You couldn’t buy everything new, so you’re reusing as much of your old infrastructure as possible. Thankfully, the last network guy had the foresight to connect the fiber backbone at gigabit speeds. You turn on your switches and wait for the interminably long ASIC and port tests to complete. As you watch the console spam scroll up on your screen, you catch sight of something that makes your blood run cold:

%GBIC_SECURITY_CRYPT-4-VN_DATA_CRC_ERROR: GBIC in port 65586 has bad crc
 %PM-4-ERR_DISABLE: gbic-invalid error detected on Gi1/0/50, putting Gi1/0/50 in err-disable state

Huh?!? Why aren't my fiber connections coming up? Am I going to have to roll the install back? What is going on here?!?

You will see this error message if you have a third party SFP inserted into the Catalyst switch. While Cisco (and many others) OEM their SFP transceivers from different companies, they all have a burned-in chip that contains info such as serial number, vendor ID, and security info like a Cyclic Redundancy Check (CRC). If any of this info doesn't match the database on the switch, the OS will mark the SFP as not supported and disable the port. The fiber connection won't come up and you'll find yourself screaming at a terminal window at 3:30 in the morning.
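If you have a saved console log and want to know which ports were shut down for this reason, a few lines of Python can pull them out. This is only a sketch: the regular expression is modeled on the single %PM-4-ERR_DISABLE line shown above, the gbic_invalid_ports function name is my own invention, and other IOS releases may word the message differently.

```python
import re

# Pattern modeled on the %PM-4-ERR_DISABLE line quoted above (an assumption:
# other software versions may format this message differently).
ERR_DISABLE = re.compile(
    r"%PM-4-ERR_DISABLE: (?P<cause>\S+) error detected on (?P<port>\S+),"
)


def gbic_invalid_ports(log_text):
    """Return ports err-disabled with cause gbic-invalid, in first-seen order."""
    ports = []
    for match in ERR_DISABLE.finditer(log_text):
        port = match.group("port")
        if match.group("cause") == "gbic-invalid" and port not in ports:
            ports.append(port)
    return ports


if __name__ == "__main__":
    sample = (
        "%GBIC_SECURITY_CRYPT-4-VN_DATA_CRC_ERROR: GBIC in port 65586 has bad crc\n"
        "%PM-4-ERR_DISABLE: gbic-invalid error detected on Gi1/0/50, "
        "putting Gi1/0/50 in err-disable state\n"
    )
    print(gbic_invalid_ports(sample))
```

Ports disabled for some other cause are ignored, so the list only contains the ones that are complaining about the transceiver.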

Why do vendors do this? Some claim it's vendor lock in. You are stuck ordering your modules from the vendor at an inflated cost instead of buying them from a different source. Others claim it's to help TAC troubleshoot the switch better in case of a failure. Still others say that it's because the manufacturing tolerances on the vendor SFPs are much better than the third party offerings, even from the same OEM. I don't have the answer, but I can tell you that Cisco, HP, Dell, and many others do this all the time.

HP is the most curious case that I've run into. Their old series A SFP modules (HP calls them mini-GBICs) didn't even have an HP logo. They bore the information from Finisar, an electronics OEM. The above scenario happened to me when I traded out a couple of HP 2848 switches for some newer 2610s. The fiber ports locked up solid and would not come alive for anything. I ended up putting the old switches back in place as glorified fiber media converters until I figured out that new SFPs were needed. While not horribly expensive, it did add a non-trivial cost to my project, not to mention all the extra hours of troubleshooting and banging my head against a wall.

Cisco has an undocumented and totally unsupported solution to this problem. Once you start getting the console spam from above, just enter these commands:

service unsupported-transceiver
no errdisable detect cause gbic-invalid

These commands are both hidden, so you can't ? them. When you enter the first command, you get the Ominous Warning Message of Doom:

Warning: When Cisco determines that a fault or defect can be traced to the use of third-party transceivers installed by a customer or reseller, then, at Cisco's discretion, Cisco may withhold support under warranty or a Cisco support program. In the course of providing support for a Cisco networking product Cisco may require that the end user install Cisco transceivers if Cisco determines that removing third-party parts will assist Cisco in diagnosing the cause of a support issue.

It goes without saying that calling TAC with a non-Cisco SFP in the slot is going to get you an immediate punt or request to remove said offending SFP. You'll likely argue that you know the issue isn't with the SFP that was working just fine an hour ago. They will counter with not being able to support non-Cisco gear. You'll complain that removing the SFP will create additional connectivity issues and eventually you'll hang up in frustration. So, don't call TAC if you use this command. In fact, I would counsel that you should only use this command as a short term band-aid to get you out of the data center at 3 am so you can order genuine SFPs the next morning. Sadly, I also know how budgets work and how likely you are to get several hundred dollars of extra equipment you "forgot" to order. So caveat implementor.

Data Never Lies

If you’ve been watching the media in the last couple of weeks, you’ve probably seen the spat that has developed between John Broder of the New York Times and Elon Musk of Tesla Motors.  Broder took a Tesla Model S sedan on a test drive from New Jersey to Connecticut to test out the theory that the new supercharger stations that have been installed along the way would help electric cars to take long road trips without fear of running out of electricity.  Along the way, he ran into some difficulty and ultimately needed to have the car towed to a charging station.  After the story came out, Elon Musk immediately defended his product with a promise of data to support his defense.  A couple of days later, he put up a long post on the Tesla blog with lots of charts, claiming that the data from the Model S showed it could support longer driving distances, that Broder had failed to fully charge at supercharger stations, and even that Broder had driven in circles in a parking lot.  After this post, Broder responded with another post of his own addressing the rebuttal made by Musk and reaffirming how the test was carried out.  It’s certainly made for some interesting press releases and blog posts.  There has also been a greater discussion about how we present facts and data in a case to support our argument or prove the other party is wrong.

Data Doesn’t Lie

If nothing else, Elon Musk did the right thing by attaching all manner of charts and graphs to his blog post.  He provided data (albeit collated and indexed) from the vehicle that gave a more precise picture of what went on than the recollection of a reporter that admittedly didn’t remember what he did or didn’t do during portions of the test drive.  Data never lies.  It’s a collection of facts and information that tells a single story.  If x equals 7, there’s no other thing that x could be.  However, the failing in data usually doesn’t come from the data itself.  It comes from interpretation.

Data Doesn’t Lie.  People Do.

The problem with the Elon Musk post is that he used the data to support his assertion that Broder did things like taking a long detour through Manhattan and driving in circles for half a mile in a parking lot in an attempt to force the car to completely discharge its battery.  This is the part where the narrative starts to break down and where most critics are starting their analysis.  Musk was right to include the data.  However, the analysis he offers is a bit wild.  Do rapid acceleration and deceleration over a short span of distance mean Broder was driving in circles attempting to drain the car?  Or was he lost in the dark, trying to find the charging station in the middle of the night like he claims in his rebuttal?  The data can only tell us what the car did.  It can’t explain the intentions of someone that wasn’t being monitored by sensors.

Let The Data Do The Talking

How does this situation apply to us in the networking/virtualization/IT world?  We find ourselves adrift in a sea of data.  We have protocols providing us status information and feeding us statistics around the clock.  We have systems that will correlate that data and provide a big picture.  We have systems to aggregate the correlated data and sort it into action items and critical alert levels.  With all this data, it’s very easy for us to make assumptions about what we see.  The human brain wants to make patterns out of what we see in front of us.  The problem comes when the conclusion we reach is incorrect.  We may have a preconceived notion of what we want the data to say.  Sometimes it’s confirmation bias.  Other times it’s reporting bias.  We come to incorrect conclusions because we keep trying to make the data tell our story instead of listening to what the data tells us.  Elon Musk wanted the data to tell him (and us) that his car worked just fine and that the driver must have had some ulterior motive.  John Broder used the same data to argue that while his recollection of some finer details wasn’t accurate in the original article, he harbored no malice during his test.  The data didn’t lie in either case.  We just have to decide whose story is more accurate.

Tom’s Take

The smartest thing that you can do when providing network data or server statistics is leave your opinion out of it.  I make it a habit to give all the data I can to the person requesting it before I ever open my mouth.  Sure, people pay me to look at all that information and make sense of it.  Yes, I’ve been biased in my conclusions before.  I realize that I’m nowhere near neutral in many of my interpretations, whether it be defending the actions of myself or my team or using the data to support the correctness of a customer’s assumptions.  The key to preventing a back-and-forth argument is to simply let the data do all the talking for you.  If the data never lies, it can’t possibly lose the argument.  Let the data help you.  Don’t make the data do your dirty work for you.

The Card Type Command – Don’t Flop

If you’ve ever found yourself staring at a VWIC2-1MFT-T1/E1 or an NM1-T3/E3 module, you know that you’ve got some configuration work ahead of you.  Whether it be for a PRI circuit to hook up that new VoIP system or a DS3 to get a faster network connection, the T1/T3 circuit still exists in many places today.  However, I’ve seen quite a few people that have been stymied in their efforts to get these humble interface cards connected to a router.  I have even returned a T1/E1 card myself when I thought that it was defective.  Imagine the egg on my face when I discovered that the error was mine.

It turns out that ordering the T1/E1 or T3/E3 module from Cisco requires a little more planning on the installation side of things.  These cards can have a dual identity because the delivery mechanism for these circuits is identical.  In the case of a T1/E1, the delivery mechanism is almost always over an unshielded twisted pair (UTP) cable.  Almost all of the T3/E3 circuits that I’ve installed have been delivered over fiber but terminated via coax cables with BNC connectors.  The magic, then, is in the location.  A T1 circuit is typically delivered in North America, while the E1 circuit is the European version.  There are also differences in the specifics of each circuit.  A T1 is 24 channels of 64 kbit/s each.  An E1 is 32 channels of the same size.  This means that a T1 has a line rate of 1.544 Mbit/s (the 24 channels plus 8 kbit/s of framing) while an E1 is a bit faster at 2.048 Mbit/s.  There are also framing differences and a slightly different signaling structure.  The long and short of it is that T1 and E1 circuits are incompatible with each other.  So how does Cisco manage to ship a module that supports both circuit types?
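The arithmetic behind those two line rates is easy to check. Here is a quick sketch in Python; the constant and function names are mine, and the 8 kbit/s of T1 framing overhead is the standard figure (one framing bit per frame at 8,000 frames per second) rather than something specific to the Cisco card.

```python
CHANNEL_BPS = 64_000  # one 64 kbit/s channel (a DS0)


def line_rate(channels, framing_bps=0):
    """Line rate in bit/s: the channels plus any separate framing overhead."""
    return channels * CHANNEL_BPS + framing_bps


# T1: 24 channels, plus one framing bit per frame at 8,000 frames per second.
t1_bps = line_rate(24, framing_bps=8_000)

# E1: 32 timeslots; framing rides inside timeslot 0, so nothing extra is added.
e1_bps = line_rate(32)

if __name__ == "__main__":
    print(f"T1: {t1_bps / 1_000_000:.3f} Mbit/s")
    print(f"E1: {e1_bps / 1_000_000:.3f} Mbit/s")
```

The different channel counts and framing schemes are exactly why the card can’t treat the two circuit types as interchangeable.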

The key is that you must choose which circuit you are going to support when you install the card.  The card can’t automatically flip back and forth based on circuit detection.  Where the majority of issues come from in my line of work is that the card doesn’t show up as a configurable interface until you force a circuit type.  This is accomplished by using the card type command:

RouterA(config)#card type ?
 e1 E1
 e3 E3
 t1 T1
 t3 T3

Choose your circuit type and away you go!  As soon as you enter the card type, the appropriate serial interface is created.  You will still need to enter the controller interface to set parameters like the framing and line code.  However, the controller interface only shows up when the card type has been set as well.  So unless you've done the first step, there isn't going to be a place to enter any additional commands.


Tom's Take

Sometimes there are things that seem so elementary that you forget to do them.  Checking a power plug, flipping a light switch, or even remembering to look for little blinking lights.  We don't think about doing all the easy stuff because we're concentrating on the hard problems.  After all our hard work, we know it has to be something really messed up; otherwise it would be fixed by now.  In the case of T1/E1 cards, I made that mistake.  I forgot to check everything before declaring the card dead on arrival.  Now, I find myself spending a lot of time providing that voice of reason for others when they're sure that it has to be something else.  The little voice of reason doesn't always have to be loud.  Sometimes it just has to say something at the right time.

Upgrading to Cisco Unified Presence Server 8.6(4) – Caveat Jabber

With the Jabber for Everyone initiative that Cisco has been pushing as of late, the question of compatibility between the Jabber client and Cisco Unified Presence Server (CUPS) has come up more than once.  Cisco has been pretty clear on the matter since May – you must upgrade to CUPS 8.6(4) [PDF Link] to take advantage of Jabber for Everyone.  This version was released on June 16th, and being the diligent network engineer that I am, I had already upgraded to 8.6(3) previously.  This week I finally had enough down time to upgrade to 8.6(4) to support Jabber for Everyone as well as some of the newer features of the Jabber client for Windows.  Of course, that’s where all my nightmares started.

I read the release notes and found that 8.6(4) was a refresh upgrade.  I had already done one of these on my CUCM server, so I knew what to expect.  I prepped the upgrade .COP file for download prior to installing the upgrade itself.  Luckily for me, 8.6(3) is the final version prior to the refresh upgrade, so it doesn’t require the upgrade .COP file to perform the upgrade.  The necessary schema extensions and notification fields are already present.  With all of the release note prerequisites satisfied, I fired up my FTP server and began the upgrade process.  As is my standard procedure, I didn’t let the server switch to the new version automatically.  I figured I’d let the upgrade run for a while and reboot afterwards.  After a couple of hours, I ordered the server to reboot and perform the upgrade.  Imagine my surprise when the server came back up with 8.6(4) loaded, but none of the critical services were running.  Instead, the server reported that only backups and restorations were possible.  I was puzzled at this, as the upgrade had appeared to work flawlessly.  After tinkering with things for a bit, I decided to revert my changes and roll back to 8.6(3).  After a quick reboot, the old version came back up.  Only this time, the critical services were stuck in the “starting” state.  It seemed that I was doomed either way.  After I verified the MD5 checksum of the upgrade file, I started the upgrade for the second time.  While I waited for the server to install the second time, I strolled over to the Internet to find out if anyone was having issues with this particular upgrade.
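
If you want to do that same checksum verification without hunting for a utility, a minimal Python sketch looks something like this.  The function names are hypothetical, and the published hash is whatever value you copy from the vendor's download page.  Nothing here is specific to Cisco.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks so a
    multi-gigabyte upgrade image doesn't have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path: str, published: str) -> bool:
    """Compare the file's digest with the published value, ignoring
    case and stray whitespace picked up by copy and paste."""
    return md5_of_file(path) == published.strip().lower()
```

A matching checksum only tells you the download wasn't corrupted.  As I found out, it says nothing about whether the server meets the requirements for what's inside the file.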

After some consulting, it turns out that Cisco made a bone-headed mistake with this upgrade.  Normally, one can be certain that any hardware-specific changes will be confined to major version upgrades.  For instance, upgrading from Windows XP to Windows 7 might entail hardware requirement changes, like additional RAM.  Point releases are a little more problematic.  Cisco uses the minor version to designate bi-annual system releases.  So CUCM 8.0 had a certain set of hardware requirements, but CUCM 8.5 had different ones.  In that particular case, it was a higher RAM requirement.  However, for CUPS 8.6(4), the RAM requirement doubled to 4 GB.  For a sub-minor point release.  Worse yet, this information didn’t actually appear in the release notes themselves.  Instead, I had to stumble across a totally separate page that listed specific hardware requirements for MCS server types.  Even within that page, the particular model of server that I am using (MCS-7825-I3) is listed as compatible (with caveats).  Turns out that any 8.6(x) release is supposed to require at least 4 GB of RAM to function correctly.  Except I was able to install 8.6(3) with no issues on 2 GB of RAM.  Since I knew I was going to need to test 8.6(4), I rummaged around the office until I was able to dig up the required RAM (PC2-5300 ECC in case you’re curious).  Without the necessary amount of RAM, the server will only function in “bridge mode” for migrations to new hardware.  This means that your data is still intact on the CUPS server, but none of the services will start processing user requests.  At least knowing that might prevent some stress.

For those of you who aren’t lucky enough to have RAM floating around the office and have gotten as far as I have, reverting the server back to 8.6(3) isn’t the easiest thing to do.  Turns out that moving back to 8.6(x) from 8.6(4) requires a little intervention.  As found on the Cisco Support Forums, rolling back can only be accomplished by installing the ciscocm.cup.pe_db_install.cop file.  But there are two problems.  First, this file is not available anywhere on Cisco’s website.  The only way you can get your hands on it is to request it from TAC during a support call.  That’s fortunate, because problem number 2 is that the file is unsigned.  That means that it will fail the installation integrity check when you try to install it on the CUPS server.  You have to have TAC connect remotely to the server and work some support voodoo to get it working.  Now, I suppose if you have a way to gain root access to a Cisco Telephony OS shell, you could do something like the steps outlined in the forum post (as follows):

Here's what's required to temporarily install unsigned COP files

cd /usr/local/bin/base_scripts
mv SIGNED_FILTER SIGNED_TEMP

Here's what's required in Remote Access to remove the temporary fix

cd /usr/local/bin/base_scripts
mv SIGNED_TEMP SIGNED_FILTER

Note: This is totally unsupported by me.  I'm putting it here for posterity.  Don't call me if you blow up your server.  Also, I don't have the TAC .COP file either, so don't bother asking for it.

That being said, the above instructions should get you back up and running on 8.6(3) until you can buy some RAM from Newegg or your other preferred vendor.


Tom's Take

Yes, I should have read the release notes a little more closely.  Yes, I should have verified the compatibility before I ran wild with this upgrade.  However, having fallen on my sword for my own mistakes, I think it's well within my rights to call Cisco out on this one as well.  How do you not put a big, huge, blinking red line in the release notes warning people that they need to check the amount of RAM in the server before performing an upgrade?  You'd figure something like this would be pretty important to know.  Worse yet, why did you do this on a sub-minor point release?  When I install Windows 7 Service Pack 1 or OS X 10.7.4, I feel pretty confident that the system requirements for the original OS version will suffice for the minor service pack.  Why up the hardware requirements for CUPS for what is a minor upgrade at best?  Especially one that you're driving all your people to be on to support your big Jabber initiative?  Why not hold off on the requirement until the CUCM 9 system release became final (which happened about a week later)?  If I'm moving from 8.6 to 9.0, I would at least expect a bunch of hardware to be retired and for things to not work correctly when moving to a new, big major version.  From now on I'm going to be a lot more careful when checking the release notes.  Cisco, you should be a lot more diligent in using the release notes to call attention to things that are important for that release.  The more caveats we know about up front, the less likely we are to jabber about them afterwards.

Fix The Problem, Not The Blame

Ethan Banks is really turning out some good blog posts as of late.  His latest one about failure in particular really got me thinking.  You should head over and read it before you continue.

After I read through Ethan’s post, I started thinking about why people tend to shift responsibility and fire up the “blamethrower” from time to time.  It reminded me of Rising Sun, a movie based on a Michael Crichton book of the same name.  The movie in particular stands out to me because of a quote from Sean Connery:

“The Japanese have a saying: ‘Fix the problem, not the blame.’ Find out what’s [screwed] up and fix it.  Nobody gets blamed.  We’re always after who [screwed] up.  Their way is better.”

This is the kind of thing that leads to people shirking failure.  People are so worried about getting blamed for things that they won’t admit to them.  Whether it be for something simple like misspelling someone’s name or something major like crashing the core router, people don’t want to get blamed.  Most of the time, I can’t fault them for that.  Think about what happens when something goes wrong.  More often than not someone higher up in the organization starts headhunting.  They stalk the halls asking, “Whose fault is this? I want them in my office now!”  How many times have you seen a situation where yelling at the responsible party took precedence over fixing things?  As a VAR providing support to many different types of customers, I can tell you that I’ve witnessed firsthand several occasions where my job couldn’t begin until the responsible parties were dealt with.  Precious seconds and minutes can tick by while blame is appropriately assigned.

Personally, I take the opposite approach to things.  When I find myself in a situation of troubleshooting or solving problems, I make sure that blame is the last thing that is discussed.  When the CxO comes stalking through the office looking for someone to yell at, I always make sure to direct attention away from the people doing the work.  In my mind, the key to any successful problem resolution lies not in assigning blame but in fixing the problem.  After the crisis is over and cooler heads prevail is the time to begin examining causes and discussing resolutions to prevent repeat performances.  The above quote from Rising Sun not only reflects my views about the uselessness of blame in a professional environment but serves to show how useful and refreshing fixing problems can be.  At times, I even assume more blame than necessary if it means moving things along.  My goal as a network engineer is problem resolution, not blame assignment.  That’s not to say that I won’t give someone a stern reprimand if necessary.  I’d just rather not have that happening in the heat of the moment when the network team is trying their best to keep the core from melting into a pile of slag.

To be an effective problem solver, make sure to focus all your efforts on fixing the problems.  If you get all the stakeholders to spend their efforts on the real source of stress, your reputation will grow into something amazing.  People will talk about your ability to solve any problem.  They’ll comment that you’re cool under pressure and great at motivating people when things are at their worst.  You’ll be known as the person that solves problems quickly and makes sure that your team knows what went wrong to prevent it from happening in the future.  These are all highly desirable traits for people in a troubleshooting capacity.  They can all be yours provided you spend your time looking at the real issues and not worrying about those that are generated from them.

Minimizing MacGyver

I’m sure at this point everyone is familiar with (Angus) MacGyver.  David Lee Zlotoff created a character expertly played by Richard Dean Anderson that has become beloved by geeks and nerds the world over.  This mulleted genius was able to solve any problem with a simple application of science and whatever materials he had on hand.  Mac used his brains before his brawn and always before resorting to violence of any kind.  He’s a hero to anyone that has ever had to fix an impossible problem with nothing.  My cell phone ringtone is the Season One theme song to the TV show.  It’s been that way ever since I fixed a fiber-to-Ethernet media converter with a paper clip.  So it is with great reluctance that I must insist that it’s time network rock stars move on from my dear friend MacGyver.

Don’t get me wrong.  There’s something satisfying about fixing a routing loop with a deceptively simple access list.  The elegance of connecting two switches back-to-back with a fiber patch cable that has been rigged between three different SC-to-ST connectors is equally impressive.  However, these are simply parlor tricks.  They are the last-ditch efforts of stubborn engineer-ish brains that refuse to accept failure at any cost.  I can honestly admit that I’ve been known to say out loud, “I will not allow this project to fail because of a missing patch cable!”  My reflexes kick in, and before I know it I’m working on a switch connected to the rest of the network by a strange combination of baling wire and dental floss.  But what has this gained me in the end?

Anyone that has worked in IT knows the pain of doing a project with inadequate resources or insufficient time.  It seems to be a trademark of our profession.  We seem like miracle workers because we can do the impossible with less than nothing.  Honestly though, how many times have we put ourselves into these positions because of hubris or short-sightedness?  How many times have we rationalized to ourselves that a layer 2 switch will work in this design?  Or that a firewall will be more than capable of handling the load we place on it, even if we find out later that the traffic is more than triple the original design?  Why do we subject ourselves to these kinds of tribulations knowing that we’ll be unhappy unless we can use chewing gum and duct tape to save the day?

Many times, all it takes is a little planning up front to save the day.  Even MacGyver does it.  I always wondered why he carried a roll of duct tape wherever he went.  The MacGyver Super Bowl Commercial from 2009 even lampooned his need for proper preparation.  I can’t tell you the number of times I’ve added an extra SFP module or fiber patch cable knowing that I would need it when I arrived on site.  These extra steps have saved me headaches and embarrassment.  And it is this prior proper planning that network engine…rock stars must rely on in order to do our jobs to the fullest possible extent.  We must move away from the baling wire and embrace the bill of materials.  No longer should we carry extra patch cables.  Instead we should remember to place them in the packages before they ship.  Taking things for granted will end in heartache and despair, and it will force us to rely less on our brains and more on our reflexes.

Being a Network MacGyver makes me beam with pride because I’ve done the impossible.  Never putting myself in the position to be MacGyver makes me even more pleased because I don’t have to break out the duct tape.  It means that I’ve done all my thinking up front.  I’m content because my project should just fall into place without hiccups.  The best projects don’t need MacGyver.  They just need a good plan.  I hope that all of you out there will join me in leaving dear Angus behind and instead following a good plan from the start.  We only make ourselves look like miracle workers when we’ve put ourselves in the position to need a miracle.  Instead, we should dedicate ourselves to doing the job right before we even get started.