Solarwinds – The Right Tool For A New Job


The first presentation of Networking Field Day 5 day 2 was from our old friends at Solarwinds.  We heard from them before at NFD3, but the nice thing about Solarwinds is that they’ve always got new tools coming out.  I’ve also served as a Thwack Ambassador on their forums and been featured as an IT Spotlight Blogger.  I wanted to see what Solarwinds would bring to the table at NFD5.

The geeks from Solarwinds started out with a quick overview of the tool portfolio.  One thing to take note of: most of the tools that you use as standalone products are actually integrated into the larger Orion platform.  Solarwinds makes some of them available as free downloads for trials or point solutions.  You can get all of them together in one big toolbox, provided you have the horsepower to run it all.  I tend to lean more toward the “right tool, right job” mentality rather than getting the whole box.  For every IP SLA monitor crescent wrench I use regularly, there are a multitude of metric socket sets and emergency brake tools that I may never even touch.  That’s why it’s great when Solarwinds makes their software available to all for only the investment of a registration.

You’ll also notice in the video around 20 minutes in, I mention something about Solarwinds and SDN.  Colin McNamara (@colinmacnamara) chided me a bit about “SDN washing” of their technology.  Colin does have a point about overuse of SDN to describe everything under the sun.  Sanjay Castelino even made a post to the effect that what Solarwinds is doing isn’t SDN.  In a sense, he’s right.  These tools aren’t network programmability or overlay networking or even automation.  To me though, a part of what Solarwinds is doing falls under the SDN spectrum in that they can program different devices from a single interface.  Sure, it’s not the sexy sports car idea of network slicing and service instantiation that others are looking at.  Even the ability to quickly configure devices and pull pertinent info from them is better than some of what we’ve got going on right now.  This software allows you to define parameters and configuration in your network.  That’s SDN of some flavor to me.  Maybe not mocha SDN with sprinkles but something a bit different.

This led to a bit of a derailment of the conversation.  The delegates seized on the Solarwinds development model of “giving the customers what they want.”  I’d heard this many times before, so it wasn’t necessarily new to me.  What’s key to me in that message is that you’re going to have a lot of content customers.  Not necessarily happy, but content.  The key difference to me comes from the model.  If you give the customers what they want, they will be pacified.  All their desires are met and they can do their jobs.  However, if you can break outside of the demand-based model and show them something they never knew they needed, you have a real chance to make them deliriously happy.  Think about something like the iPad.  Did we know we needed it before it was released?  Not likely.  Now think about how many people have jumped at the chance to own a tablet device.  If those companies had simply been giving their customers what they asked for, think about the market that would have been missed.  I’m not saying that Solarwinds is doing a bad job by any means.  I just think they need to get a geek in the house working on crazy stuff that will make people say “holy cow!!!”

Solarwinds talked to us about their newest network monitoring pieces.  They’ve got some very interesting tools, including Network Performance Monitor.  There was also some discussion around their IP Address Management (IPAM) tool, which is what I wrote about during my Thwack Ambassadorship.  Thankfully, we had Terry Slattery in the room.  Terry loves the network monitoring discussions, having founded Netcordia and released NetMRI for that purpose before it was purchased by Infoblox.  Terry has seen a lot, and he’s not afraid to tell you what he thinks.  When we discussed the features of User Device Tracker (UDT), he asked if it can do a time-based report on unused switch ports.  When the answer wasn’t clear, he told the geeks, “If you can’t do that, you need to write that down.” We all had a couple of good jokes at their expense, but the fact is that when Terry tells you something is important, especially when it comes to network monitoring, the chances are it’s really important.
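To make Terry's request a little more concrete, here is a rough sketch of what a time-based unused-port report boils down to.  This is not how UDT stores or reports anything; the port names, the last_seen dictionary, and the unused_ports function are all made up for illustration.  The only assumption is that you can get a "last time this port had link or traffic" timestamp for each port from somewhere.

```python
from datetime import datetime, timedelta


def unused_ports(last_seen, now, min_idle_days):
    """Return (port, idle_days) pairs for ports idle at least min_idle_days.

    last_seen maps a port name to the datetime it last showed activity,
    or None if no activity was ever recorded (hypothetical data model).
    """
    cutoff = timedelta(days=min_idle_days)
    report = []
    for port, seen in last_seen.items():
        if seen is None:
            report.append((port, None))  # never seen active
        elif now - seen >= cutoff:
            report.append((port, (now - seen).days))
    # Never-active ports first, then longest idle first, then by port name.
    return sorted(
        report,
        key=lambda item: (item[1] is not None, -(item[1] or 0), item[0]),
    )


# Illustrative data only; these are not real switch ports.
sample = {
    "Gi1/0/1": datetime(2013, 2, 28),
    "Gi1/0/2": datetime(2012, 11, 1),
    "Gi1/0/3": None,
    "Gi1/0/4": datetime(2013, 1, 15),
}

for port, idle in unused_ports(sample, datetime(2013, 3, 1), 30):
    print(port, "never active" if idle is None else f"idle {idle} days")
```

The useful part is the time window.  A port that is down right now tells you very little, but a port that has been idle for 90 days is a candidate for reclaiming.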

Solarwinds is also getting into the API game with SWIS – Solarwinds Information Service.  This SOAP interface (soon to be REST) gives you the ability to write programs to pull data from the network and insert/update the same in many devices.  See what I’m talking about with SDN and the ability to pull info from the network and push it back again?  I think Solarwinds really needs to focus their efforts in this area and drive some more programmability from their tools rather than the old methods of just hiding CLI command pushes and things of that nature.  By allowing users to code to an API, you’ve just abstracted all of the icky parts of the backend away and focused the conversation where it needs to be – on getting problems solved.

If you’d like to learn more about Solarwinds, be sure to check them out at http://www.solarwinds.com.  You can also follow them on Twitter as @solarwinds.  Be sure to check out their discussion forums at http://thwack.solarwinds.com.


Tom’s Take

Solarwinds has awesome tools.  They’re going to have awesome tools in the future.  But they’ve hit on some pieces of the puzzle that are going to do much more than that.  Beyond giving us a toolbox with fancy handles and shiny stickers, they’ve started to do what a lot of other people have done and give us designs for what we should build with the tools they’ve given us.  By expanding into that area of allowing us to program to APIs and put the pieces into a bigger context, they have the ability to transcend being a point product vendor releasing neat toys.  When you can be a meaningful discussion point in any monitoring and management meeting without being dismissed as just a niche player, that’s handy indeed.

Tech Field Day Disclaimer

Solarwinds was a sponsor of Network Field Day 5.  As such, they were responsible for covering a portion of my travel and lodging expenses while attending Network Field Day 5.  In addition, Solarwinds provided me with breakfast at the hotel.  They also gave the delegates a t-shirt and a messenger bag, along with all the stickers and buttons we could fit into our carry-ons.  At no time did they ask for, nor were they promised, any kind of consideration in the writing of this review.  The opinions and analysis provided within are my own and any errors or omissions are mine and mine alone.

Why Is My SFP Not Working?


It’s 3 am. You’ve just finished installing your new Catalyst switches into the rack and you’re ready to turn them up and complete your cutover. You’ve been fighting for months to get the funding to get these switches so your servers can run at full gigabit speed. You had to cut some corners here and there. You couldn’t buy everything new, so you’re reusing as much of your old infrastructure as possible. Thankfully, the last network guy had the foresight to connect the fiber backbone at gigabit speeds. You turn on your switches and wait for the interminably long ASIC and port tests to complete. As you watch the console spam scroll up on your screen, you catch sight of something that makes your blood run cold:

%GBIC_SECURITY_CRYPT-4-VN_DATA_CRC_ERROR: GBIC in port 65586 has bad crc
 %PM-4-ERR_DISABLE: gbic-invalid error detected on Gi1/0/50, putting Gi1/0/50 in err-disable state

Huh?!? Why aren’t my fiber connections coming up? Am I going to have to roll the install back? What is going on here?!?

You will see this error message if you have a third party SFP inserted into the Catalyst switch. While Cisco (and many others) OEM their SFP transceivers from different companies, they all have a burned-in chip that contains info such as serial number, vendor ID, and security info like a Cyclic Redundancy Check (CRC). If any of this info doesn’t match the database on the switch, the OS will mark the SFP as not supported and disable the port. The fiber connection won’t come up and you’ll find yourself screaming at a terminal window at 3:30 in the morning.
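If you're curious what that burned-in chip looks like, the SFP multi-source agreement defines a serial ID page with fixed byte offsets for the vendor name, part number, and serial number, plus a simple one-byte checksum over the base fields. The sketch below reads those standard fields from a 96-byte dump. To be clear about the assumptions: the sample data is invented, the function names are mine, and this is only the generic MSA checksum. The vendor-specific security check that the switch performs is proprietary and is not reproduced here.

```python
def cc_base(eeprom):
    """Low 8 bits of the sum of bytes 0-62 (the MSA base ID check code)."""
    return sum(eeprom[0:63]) & 0xFF


def read_id(eeprom):
    """Decode the standard vendor fields and verify the check code in byte 63."""
    def text(start, end):
        return bytes(eeprom[start:end]).decode("ascii", "replace").strip()

    return {
        "vendor": text(20, 36),        # vendor name, bytes 20-35
        "part_number": text(40, 56),   # vendor part number, bytes 40-55
        "serial": text(68, 84),        # vendor serial number, bytes 68-83
        "checksum_ok": eeprom[63] == cc_base(eeprom),
    }


def make_sample():
    """Build an invented 96-byte ID page for illustration only."""
    eeprom = bytearray(96)
    eeprom[0] = 0x03                               # identifier: SFP
    eeprom[20:36] = b"EXAMPLE VENDOR".ljust(16)    # hypothetical vendor
    eeprom[40:56] = b"PN-0001".ljust(16)           # hypothetical part number
    eeprom[68:84] = b"SN12345".ljust(16)           # hypothetical serial
    eeprom[63] = cc_base(eeprom)                   # store the check code
    return eeprom


print(read_id(make_sample()))
```

Change a single byte in the vendor name without recomputing byte 63 and checksum_ok goes false. The vendor checks layered on top of this are stricter, which is why a module that is perfectly valid per the MSA can still land the port in err-disable.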

Why do vendors do this? Some claim it’s vendor lock-in. You are stuck ordering your modules from the vendor at an inflated cost instead of buying them from a different source. Others claim it’s to help TAC troubleshoot the switch better in case of a failure. Still others say that it’s because the manufacturing tolerances on the vendor SFPs are much better than the third party offerings, even from the same OEM. I don’t have the answer, but I can tell you that Cisco, HP, Dell, and many others do this all the time.

HP is the most curious case that I’ve run into. Their old series A SFP modules (HP calls them mini-GBICs) didn’t even have an HP logo. They bore the information from Finisar, an electronics OEM. The above scenario happened to me when I traded out a couple of HP 2848 switches for some newer 2610s. The fiber ports locked up solid and would not come alive for anything. I ended up putting the old switches back in place as glorified fiber media converters until I figured out that new SFPs were needed. While not horribly expensive, it did add a non-trivial cost to my project, not to mention all the extra hours of troubleshooting and banging my head against a wall.

Cisco has an undocumented and totally unsupported solution to this problem. Once you start getting the console spam from above, just enter these commands:

service unsupported-transceiver
no errdisable detect cause gbic-invalid

These commands are both hidden, so you can’t ? them. When you enter the first command, you get the Ominous Warning Message of Doom:

Warning: When Cisco determines that a fault or defect can be traced to the use of third-party transceivers installed by a customer or reseller, then, at Cisco’s discretion, Cisco may withhold support under warranty or a Cisco support program. In the course of providing support for a Cisco networking product Cisco may require that the end user install Cisco transceivers if Cisco determines that removing third-party parts will assist Cisco in diagnosing the cause of a support issue.

It goes without saying that calling TAC with a non-Cisco SFP in the slot is going to get you an immediate punt or request to remove said offending SFP. You’ll likely argue that you know the issue isn’t with the SFP that was working just fine an hour ago. They will counter with not being able to support non-Cisco gear. You’ll complain that removing the SFP will create additional connectivity issues and eventually you’ll hang up in frustration. So, don’t call TAC if you use this command. In fact, I would counsel that you should only use this command as a short term band-aid to get you out of the data center at 3 am so you can order genuine SFPs the next morning. Sadly, I also know how budgets work and how likely you are to get several hundred dollars of extra equipment you “forgot” to order. So caveat implementor.

Data Never Lies


If you’ve been watching the media in the last couple of weeks, you’ve probably seen the spat that has developed between John Broder of the New York Times and Elon Musk of Tesla Motors.  Broder took a Tesla Model S sedan on a test drive from New Jersey to Connecticut to test out the theory that the new supercharger stations that have been installed along the way would help electric cars to take long road trips without fear of running out of electricity.  Along the way, he ran into some difficulty and ultimately needed to have the car towed to a charging station.  After the story came out, Elon Musk immediately defended his product with a promise of data to back up that defense.  A couple of days later, he put up a long post on the Tesla blog with lots of charts, claiming that the data from the Model S showed the car could have covered longer driving distances, that Broder failed to fully charge at supercharger stations, and even that Broder was driving in circles in a parking lot.  After this post, Broder responded with another post of his own addressing the rebuttal made by Musk and reaffirming how the test was carried out.  It’s certainly made for some interesting press releases and blog posts.  There has also been a greater discussion about how we present facts and data in a case to support our argument or prove the other party is wrong.

Data Doesn’t Lie

If nothing else, Elon Musk did the right thing by attaching all manner of charts and graphs to his blog post.  He provided data (albeit collated and indexed) from the vehicle that gave a more precise picture of what went on than the recollection of a reporter that admittedly didn’t remember what he did or didn’t do during portions of the test drive.  Data never lies.  It’s a collection of facts and information that tells a single story.  If x equals 7, there’s no other thing that x could be.  However, the failing in data usually doesn’t come from the data itself.  It comes from interpretation.

Data Doesn’t Lie.  People Do.

The problem with the Elon Musk post is that he used the data to support his assertion that Broder did things like taking a long detour through Manhattan and driving in circles for half a mile in a parking lot in an attempt to force the car to completely discharge its battery.  This is the part where the narrative starts to break down and where most critics are starting their analysis.  Musk was right to include the data.  However, the analysis he offers is a bit wild.  Does rapid acceleration and deceleration over a short span of distance mean Broder was driving in circles attempting to drain the car?  Or was he lost in the dark, trying to find the charging station in the middle of the night like he claims in his rebuttal?  The data can only tell us what the car did.  It can’t explain the intentions of someone that wasn’t being monitored by sensors.

Let The Data Do The Talking

How does this situation apply to us in the networking/virtualization/IT world?  We find ourselves adrift in a sea of data.  We have protocols providing us status information and feeding us statistics around the clock.  We have systems that will correlate that data and provide a big picture.  We have systems to aggregate the correlated data and sort it into action items and critical alert levels.  With all this data, it’s very easy for us to make assumptions about what we see.  The human brain wants to make patterns out of what we see in front of us.  The problem comes when the conclusion we reach is incorrect.  We may have a preconceived notion of what we want the data to say.  Sometimes it’s confirmation bias.  Other times it’s reporting bias.  We come to incorrect conclusions because we keep trying to make the data tell our story instead of listening to what the data tells us.  Elon Musk wanted the data to tell him (and us) that his car worked just fine and that the driver must have had some ulterior motive.  John Broder used the same data to support that while his recollection of some finer details wasn’t accurate in the original article, he harbored no malice during his test.  The data didn’t lie in either case.  We just have to decide whose story is more accurate.

Tom’s Take

The smartest thing that you can do when providing network data or server statistics is leave your opinion out of it.  I make it a habit to give all the data I can to the person requesting it before I ever open my mouth.  Sure, people pay me to look at all that information and make sense of it.  Yes, I’ve been biased in my conclusions before.  I realize that I’m nowhere near neutral in many of my interpretations, whether it be defending the actions of myself or my team or using the data to support the correctness of a customer’s assumptions.  The key to preventing a back-and-forth argument is to simply let the data do all the talking for you.  If the data never lies, it can’t possibly lose the argument.  Let the data help you.  Don’t make the data do your dirty work for you.

The Card Type Command – Don’t Flop

If you’ve ever found yourself staring at a VWIC2-1MFT-T1/E1 or an NM-1T3/E3 module, you know that you’ve got some configuration work ahead of you.  Whether it be for a PRI circuit to hook up that new VoIP system or a DS3 to get a faster network connection, the T1/T3 circuit still exists in many places today.  However, I’ve seen quite a few people that have been stymied in their efforts to get these humble interface cards connected to a router.  I have even returned a T1/E1 card myself when I thought that it was defective.  Imagine the egg on my face when I discovered that the error was mine.

It turns out that ordering the T1/E1 or T3/E3 module from Cisco requires a little more planning on the installation side of things.  These cards can have a dual identity because the delivery mechanism for these circuits is identical.  In the case of a T1/E1, the delivery mechanism is almost always over an unshielded twisted pair (UTP) cable.  Almost all of the T3/E3 circuits that I’ve installed have been delivered over fiber but terminated via coax cables with BNC connectors.  The magic, then, is in the location.  A T1 circuit is typically delivered in North America, while the E1 circuit is the European version.  There are also differences in the specifics of each circuit.  A T1 is 24 channels of 64kbits each.  An E1 is 32 channels of the same size.  This means that a T1 has a line rate of 1.544 Mbits (24 channels plus 8 kbits of framing overhead) while an E1 is a bit faster at 2.048 Mbits.  There are also framing differences and a slightly different signaling structure.  The long and short of it is that T1 and E1 circuits are incompatible with each other.  So how does Cisco manage to ship a module that supports both circuit types?
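If the channel math seems a little off (24 times 64 is 1536, not 1544), a quick worked example clears it up.  The sketch below is just arithmetic: the function names are mine, and the only facts it leans on are the 64 kbit channel size, the single framing bit a T1 adds to every frame at 8000 frames per second, and the fact that an E1 uses timeslot 0 for framing.

```python
DS0_BPS = 64_000  # one 64 kbit/s channel


def t1_rates():
    """Return (payload_bps, line_bps) for a T1."""
    payload = 24 * DS0_BPS   # 24 channels of user data
    framing = 8_000          # 1 framing bit per frame, 8000 frames per second
    return payload, payload + framing


def e1_rates():
    """Return (max_payload_bps, line_bps) for an E1."""
    line = 32 * DS0_BPS          # 32 timeslots on the wire
    max_payload = 31 * DS0_BPS   # timeslot 0 is reserved for framing
    return max_payload, line


for name, (payload, line) in (("T1", t1_rates()), ("E1", e1_rates())):
    print(f"{name}: payload {payload / 1e6:.3f} Mbit/s, line {line / 1e6:.3f} Mbit/s")
```

On many E1 circuits timeslot 16 is also given over to signaling, which brings the usable count down to 30 channels.  Either way, the two framing schemes don't line up, which is why the card has to be told which one it is.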

The key is that you must choose which circuit you are going to support when you install the card.  The card can’t automatically flip back and forth based on circuit detection.  Where the majority of issues come from in my line of work is that the card doesn’t show up as a configurable interface until you force a circuit type.  This is accomplished by using the card type command:

RouterA(config)#card type ?
 e1 E1
 e3 E3
 t1 T1
 t3 T3

Choose your circuit type and away you go!  As soon as you enter the card type, the appropriate serial interface is created.  You will still need to enter the controller interface to set parameters like the framing and line code.  However, the controller interface only shows up when the card type has been set as well.  So unless you’ve done the first step, there isn’t going to be a place to enter any additional commands.


Tom’s Take

Sometimes there are things that seem so elementary that you forget to do them.  Checking a power plug, flipping a light switch, or even remembering to look for little blinking lights.  We don’t think about doing all the easy stuff because we’re concentrating on the hard problems.  After all our hard work, we know it has to be something really messed up, otherwise it would be fixed by now.  In the case of T1/E1 cards, I made that mistake.  I forgot to check everything before declaring the card dead on arrival.  Now, I find myself spending a lot of time providing that voice of reason for others when they’re sure that it has to be something else.  The little voice of reason doesn’t always have to be loud.  Sometimes it just has to say something at the right time.

Upgrading to Cisco Unified Presence Server 8.6(4) – Caveat Jabber

With the Jabber for Everyone initiative that Cisco has been pushing as of late, the question of compatibility between the Jabber client and Cisco Unified Presence Server (CUPS) has come up more than once.  Cisco has been pretty clear on the matter since May – you must upgrade to CUPS 8.6(4) to take advantage of Jabber for Everyone.  This version was released on June 16th, and being the diligent network engineer that I am, I had already upgraded to 8.6(3) previously.  This week I finally had enough down time to upgrade to 8.6(4) to support Jabber for Everyone as well as some of the newer features of the Jabber client for Windows.  Of course, that’s where all my nightmares started.

I read the release notes and found that 8.6(4) was a refresh upgrade.  I’ve already done one of these previously on my CUCM server so I knew what to expect.  I prepped the upgrade .COP file for download prior to installing the upgrade itself.  Luckily for me, 8.6(3) is the final version prior to the refresh upgrade, so it doesn’t require the upgrade .COP file to perform the upgrade.  The necessary schema extensions and notification fields are already present.  With all of the release note prerequisites satisfied, I fired up my FTP server and began the upgrade process.  As is my standard procedure, I didn’t let the server upgrade to the new version automatically.  I figured I’d let the upgrade run for a while and reboot afterwards.  After a couple of hours, I ordered the server to reboot and perform the upgrade.  Imagine my surprise when the server came back up with 8.6(4) loaded, but none of the critical services were running.  Instead, the server reported that only backups and restorations were possible.  I was puzzled at this, as the upgrade had appeared to work flawlessly.  After tinkering with things for a bit, I decided to revert my changes and roll back to 8.6(3).  After a quick reboot, the old version came back up.  Only this time, the critical services were stuck in the “starting” state.  Seemed that I was doomed either way.  After I verified the MD5 checksum of the upgrade file, I started the upgrade for the second time.  While I waited for the server to install the second time, I strolled over to the Internet to find out if anyone was having issues with this particular upgrade.
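As an aside, the MD5 check is worth doing before the first attempt rather than the second.  Here is a minimal sketch of the comparison, assuming you have the upgrade file on local disk and the checksum string from the vendor’s download page.  The function names are mine and nothing here is specific to Cisco’s tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so a multi-gigabyte image doesn't fill RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_md5):
    """Compare against the published value, ignoring case and stray whitespace."""
    return md5_of_file(path) == published_md5.strip().lower()
```

A call would look like matches_published("upgrade.iso", "value-from-download-page"), where both arguments are placeholders.  A match only tells you the download isn’t corrupt.  As I was about to find out, it says nothing about whether the server can actually run what you just installed.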

After some consulting, it turns out that Cisco made a bone-headed mistake with this upgrade.  Normally, one can be certain that any hardware-specific changes will be confined to major version upgrades.  For instance, upgrading from Windows XP to Windows 7 might entail hardware requirement changes, like additional RAM.  Point releases are a little more problematic.  Cisco uses the minor version to constitute bi-annual system releases.  So CUCM 8.0 had a certain set of hardware requirements, but CUCM 8.5 had different ones.  In that particular case, it was a higher RAM requirement.  However, for CUPS 8.6(4), the RAM requirement doubled to 4 GB.  For a sub-minor point release.  Worse yet, this information didn’t actually appear in the release notes themselves.  Instead, I had to stumble across a totally separate page that listed specific hardware requirements for MCS server types.  Even within that page, the particular model of server that I am using (MCS-7825-I3) is listed as compatible (with caveats).  Turns out that any 8.6(x) release is supposed to require more than 4GB of RAM to function correctly.  Except I was able to install 8.6(3) with no issues on 2GB of RAM.  Since I knew I was going to need to test 8.6(4), I rummaged around the office until I was able to dig up the required RAM (PC2-5300 ECC in case you’re curious).  Without the necessary amount of RAM, the server will only function in “bridge mode” for migrations to new hardware.  This means that your data is still intact on the CUPS server, but none of the services will start to begin processing user requests.  At least knowing that might prevent some stress.

For those of you that aren’t lucky enough to have RAM floating around the office and you’ve gotten as far as I have, reverting the server back to 8.6(3) isn’t the easiest thing to do.  Turns out that moving back to 8.6(x) from 8.6(4) requires a little intervention.  As found on the Cisco Support Forums, rolling back can only be accomplished by installing the ciscocm.cup.pe_db_install.cop file.  But there are two problems.  First, this file is not available anywhere on Cisco’s website.  The only way you can get your hands on it is to request it from TAC during a support call.  That’s fortunate, because problem number 2 is that the file is unsigned.  That means that it will fail the installation integrity check when you try to install it on the CUPS server.  You have to have TAC remote connect to the server and work some support voodoo to get it working.  Now, I suppose if you have a way to gain root access to a Cisco Telephony OS shell, you could do something like the outlined steps in the forum post (as follows):

Here's what's required to temporarily install unsigned COP files

cd /usr/local/bin/base_scripts
mv SIGNED_FILTER SIGNED_TEMP

Here's what's required in Remote Access to remove the temporary fix

cd /usr/local/bin/base_scripts
mv SIGNED_TEMP SIGNED_FILTER

Note: This is totally unsupported by me.  I’m putting it here for posterity.  Don’t call me if you blow up your server.  Also, I don’t have the TAC .COP file either, so don’t bother asking for it.

That being said, the above instructions should get you back up and running on 8.6(3) until you can buy some RAM from Newegg or your other preferred vendor.


Tom’s Take

Yes, I should have read the release notes a little more closely.  Yes, I should have verified the compatibility before I ran wild with this upgrade.  However, having fallen on my sword for my own mistakes, I think it’s well within my rights to call Cisco out on this one as well.  How do you not put a big, huge, blinking red line in the release notes warning people that you need to check the amount of RAM in the server before performing an upgrade?  You figure something like this would be pretty important to know?  Worse yet, why did you do this on a sub-minor point release?  When I install Windows 7 Service Pack 1 or OS X 10.7.4, I feel pretty confident that the system requirements for the original OS version will suffice for the minor service pack.  Why up the hardware requirements for CUPS for a minor upgrade at best?  Especially one that you’re driving all your people to be on to support your big Jabber initiative?  Why not hold off on the requirement until the CUCM 9 system release became final (which happened about a week later)?  If I’m moving from 8.6 to 9.0, I would at least expect a bunch of hardware to be retired and for things to not work correctly when moving to a new, big major version.  From now on I’m going to be a lot more careful when checking the release notes.  Cisco, you should be a lot more diligent in using the release notes to call attention to things that are important for that release.  The more caveats we know about up front, the less likely we are to jabber about them afterwards.

Fix The Problem, Not The Blame

Courtesy of Zazzle.com

Ethan Banks is really turning out some good blog posts as of late.  His latest one about failure in particular really got me to thinking.  You should head over and read it before you continue.

After I read through Ethan’s post, I started thinking about why people tend to shift responsibility and fire up the “blamethrower” from time to time.  It reminded me of Rising Sun, a movie based on a Michael Crichton book of the same name.  The movie in particular stands out to me because of a quote from Sean Connery:

“The Japanese have a saying: ‘Fix the problem, not the blame.’ Find out what’s [screwed] up and fix it.  Nobody gets blamed.  We’re always after who [screwed] up.  Their way is better.”

This is the kind of thing that leads to people shirking failure.  People are so worried about getting blamed for things that they won’t admit to them.  Whether it be for something simple like misspelling someone’s name or something major like crashing the core router, people don’t want to get blamed.  Most of the time, I can’t fault them for that.  Think about what happens when something goes wrong.  More often than not someone higher up in the organization starts head hunting.  They stalk the halls asking, “Whose fault is this? I want them in my office now!”  How many times have you seen a situation where yelling at the responsible party took precedence over fixing things?  As a VAR providing support to multiple different types of customers, I can tell you that I’ve witnessed first hand several occasions where my job couldn’t begin until the responsible parties were dealt with.  Precious seconds and minutes can tick by while blame is appropriately assigned.

Personally, I take the opposite approach to things.  When I find myself in a situation of troubleshooting or solving problems, I make sure that blame is the last thing that is discussed.  When the CxO comes stalking through the office looking for someone to yell at, I always make sure to direct attention away from the people doing the work.  In my mind, the key to any successful problem resolution lies not in assigning blame but in fixing the problem.  After the crisis is over and cooler heads are prevalent is the time to begin examining causes and discussing resolutions to prevent repeat performances.  The above quote from Rising Sun not only reflects my views about the uselessness of blame in a professional environment but serves to show how useful and refreshing fixing problems can be.  At times, I even assume more blame than necessary if it means moving things along.  My goal as a network engineer is problem resolution, not blame assignment.  That’s not to say that I won’t give someone a stern reprimand if necessary.  I’d just rather not have that happening in the heat of the moment when the network team is trying their best to keep the core from melting into a pile of slag.

To be an effective problem solver, make sure to focus all your efforts on fixing the problems.  By forcing all the stakeholders to expend their efforts on the real source of stress, your reputation will grow into something amazing.  People will talk about your ability to solve any problem.  They’ll comment that you’re cool under pressure and great at motivating people when things are at their worst.  You’ll be known as the person that solves problems quickly and makes sure that your team knows what went wrong to prevent it from happening in the future.  These are all very desired traits for people in a troubleshooting capacity.  They can all be yours provided you spend your time looking at the real issues and not worrying about those that are generated from them.

Minimizing MacGyver

I’m sure at this point everyone is familiar with (Angus) MacGyver.  David Lee Zlotoff created a character expertly played by Richard Dean Anderson that has become beloved by geeks and nerds the world over.  This mulleted genius was able to solve any problem with a simple application of science and whatever materials he had on hand.  Mac used his brains before his brawn and always before resorting to violence of any kind.  He’s a hero to anyone that has ever had to fix an impossible problem with nothing.  My cell phone ringtone is the Season One theme song to the TV show.  It’s been that way ever since I fixed a fiber-to-Ethernet media converter with a paper clip.  So it is with great reluctance that I must insist that it’s time network rock stars move on from my dear friend MacGyver.

Don’t get me wrong.  There’s something satisfying about fixing a routing loop with a deceptively simple access list.  The elegance of connecting two switches back-to-back with a fiber patch cable that has been rigged between three different SC-to-ST connectors is equally impressive.  However, these are simply parlor tricks.  Last-ditch efforts of our stubborn engineer-ish brains, which refuse to accept failure at any cost.  I can honestly admit that I’ve been known to say out loud, “I will not allow this project to fail because of a missing patch cable!”  My reflexes kick in, and before I know it I’m working on a switch connected to the rest of the network by a strange combination of baling wire and dental floss.  But what has this gained me in the end?

Anyone that has worked in IT knows the pain of doing a project with inadequate resources or insufficient time.  It seems to be a trademark of our profession.  We seem like miracle workers because we can do the impossible from less than nothing.  Honestly though, how many times have we put ourselves into these positions because of hubris or short-sightedness?  How many times have we rationalized to ourselves that a layer 2 switch will work in this design?  Or that a firewall will be more than capable of handling the load we place on it, only to find out later that the traffic is more than triple the original design?  Why do we subject ourselves to these kinds of tribulations knowing that we’ll be unhappy unless we can use chewing gum and duct tape to save the day?

Many times, all it takes is a little planning up front to save the day.  Even MacGyver does it.  I always wondered why he carried a roll of duct tape wherever he went.  The MacGyver Super Bowl Commercial from 2009 even lampooned his need for proper preparation.  I can’t tell you the number of times I’ve added an extra SFP module or fiber patch cable knowing that I would need it when I arrived on site.  These extra steps have saved me headaches and embarrassment.  And it is this prior proper planning that network engine…rock stars must rely on in order to do our jobs to the fullest possible extent.  We must move away from the baling wire and embrace the bill of materials.  No longer should we carry extra patch cables.  Instead we should remember to place them in the packages before they ship.  Taking things for granted will end in heartache and despair.  And force us to rely less on our brains and more on our reflexes.

Being a Network MacGyver makes me gleam with pride because I’ve done the impossible.  Never putting myself in the position to be MacGyver makes me even more pleased because I don’t have to break out the duct tape.  It means that I’ve done all my thinking up front.  I’m content because my project should just fall into place without hiccups.  The best projects don’t need MacGyver.  They just need a good plan.  I hope that all of you out there will join me in leaving dear Angus behind and instead following a good plan from the start.  We only make ourselves look like miracle workers when we’ve put ourselves in the position to need a miracle.  Instead, we should dedicate ourselves to doing the job right before we even get started.

Moving to CUCM 8.6 – You’ll Never Upgrade Me Alive COPpers!

Upgrades are a fact of life for network rock stars.  Whether we are patching bugs or adding new features to our systems, the installation of software never seems to end.  If you are a Cisco voice rock star, you all too often find yourself upgrading to newer releases of Cisco Unified Communications Manager (CUCM) to support new devices like the Cius or fix show stopping bugs like the 180-day uptime lockup.  However, if you are a user of CUCM 8.x and you’re trying to move to 8.6, you’ve probably had a couple of head scratching moments so far.

If you’ve popped a freshly burned 8.6(1) ISO into your DVD drive or copied it via SFTP and kicked off your installation, you likely saw the following error message:

09/18/2011 19:31:48 refresh_upgrade|********** Upgrade Failed **********|<LVL::Info>
09/18/2011 19:31:48 refresh_upgrade|*** Please install the Refresh Upgrade COP, and reattempt the upgrade ***|<LVL::Info>
09/18/2011 19:31:48 refresh_upgrade|************************************|<LVL::Info>

Huh? What’s a refresh upgrade?  Why isn’t this ISO file working?  Well, it turns out Cisco needs you to take an additional step first.
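If you would rather have a script spot this failure than scroll through the install log yourself, here is a minimal sketch in Python.  It assumes the log lines follow the same pipe-delimited layout as the three lines above.  The names LogEntry, parse_line and needs_refresh_cop are made up for illustration and are not part of any Cisco tool.

```python
from typing import NamedTuple, Optional

# The three log lines quoted in this post.
SAMPLE_LOG = """\
09/18/2011 19:31:48 refresh_upgrade|********** Upgrade Failed **********|<LVL::Info>
09/18/2011 19:31:48 refresh_upgrade|*** Please install the Refresh Upgrade COP, and reattempt the upgrade ***|<LVL::Info>
09/18/2011 19:31:48 refresh_upgrade|************************************|<LVL::Info>
"""


class LogEntry(NamedTuple):
    timestamp: str
    component: str
    message: str
    level: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one 'date time component|message|<LVL::level>' line into fields."""
    parts = line.strip().split("|")
    if len(parts) != 3:
        return None  # not in the assumed layout
    head, message, level_tag = parts
    fields = head.split(" ", 2)
    if len(fields) != 3:
        return None
    date, time, component = fields
    level = level_tag.strip("<>").split("::")[-1]
    # Drop the decorative asterisks around the message text.
    return LogEntry(f"{date} {time}", component, message.strip("* "), level)


def needs_refresh_cop(log_text: str) -> bool:
    """True if the log asks for the Refresh Upgrade COP to be installed."""
    for line in log_text.splitlines():
        entry = parse_line(line)
        if entry and entry.component == "refresh_upgrade" \
                and "Refresh Upgrade COP" in entry.message:
            return True
    return False


print(needs_refresh_cop(SAMPLE_LOG))
```

The check keys on the message text, so it will miss the failure if a different release words the message differently.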

CUCM runs on an operating system.  Up until version 5, that was Windows 2000 with some hardening and customizations.  Cisco eventually ported CallManager 4.3 to Windows Server 2003, but in the end the decision was made to move to an appliance-based OS that utilized Linux.  The Telephony OS in CUCM 5.x was new for those used to working in Windows but somewhat familiar to those that have seen Linux before, even if the login shell looked nothing like bash.  Cisco provides patches for the OS with every release of CUCM software and the user never knows what’s going on because of the way the system installs the patches transparently.  However, much like the shift from Windows 2000 to Windows 2003, software eventually reaches the end of its life.  Development stops on the old version and it’s time to move to the new one.  Such is the case in CUCM.  With version 8.6, Cisco has moved away from an OS platform based on Red Hat Enterprise Linux (RHEL) 4 and upgraded the underlying OS to RHEL 5.  This is good news that allows the system to stay current and support a larger variety of hardware.  The bad news is that the upgrade of the OS can be a bit destructive.  This is part of the reason for the extra steps in moving to CUCM 8.6.

Firstly, Cisco wants you to install a special Cisco Options Package (COP) file on 8.5(1) systems.  This file is ciscocm.refresh_upgrade_v1.0.cop.sgn.  The 8.6 installer checks for the presence of this file and won’t kick off unless it’s present.  It needs to be installed on every server in the cluster.  It’s also going to reboot the server after installation.  As near as I can tell, it makes some changes to the Tomcat service on the server as well as adding two new fields to the Install/Upgrade window:


Notice the new options for email.  This allows the server to send you an email whenever the upgrade is completed.  Probably a long overdue option that comes in handy for those of us that spend more than a few stress-filled moments clicking the Refresh buttons on our web browsers waiting for CUCM to come back to life after an upgrade.  There’s another reason for putting this email field in here now, though.

It turns out that when you upgrade from 8.5 to 8.6, it’s going to take a while.  Quite a while, in fact.  The system is going to reboot at least twice, perhaps even three times.  Considering that a CUCM reboot can take 15-20 minutes to complete each time during an upgrade, you’re looking at nearly an hour of rebooting time under certain circumstances.  During the upgrade, CUCM is going to do things in 3 phases:

Phase 1: Export all the pertinent CUCM data to a safe partition

Phase 2: Reboot and install RHEL 5, then reboot and install the CUCM applications

Phase 3: Import all the data from the export partition
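To put numbers on the rebooting time, here is a quick back-of-the-envelope sketch in Python.  It uses only the figures from this post (two to three reboots at 15-20 minutes each) and counts reboot time only, not the export, install, or import work in the phases themselves.  The function name is made up for illustration.

```python
def reboot_window_minutes(reboots: int, minutes_per_reboot: int) -> int:
    """Total time spent rebooting, ignoring the work done between reboots."""
    return reboots * minutes_per_reboot


# Best case: two reboots at 15 minutes each.
best_case = reboot_window_minutes(2, 15)
# Worst case: three reboots at 20 minutes each.
worst_case = reboot_window_minutes(3, 20)

print(f"Plan for {best_case} to {worst_case} minutes of rebooting alone")
```

That works out to somewhere between half an hour and a full hour before any of the actual upgrade work is counted, so size your maintenance window accordingly.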

On the 7825H3 MCS server, there isn’t enough hard drive space to contain the safe partition during the reformat and installation of CUCM 8.6.  In that case, you’re going to need to plug a 16 GB USB drive into the system to serve as a target for the data export.  If you’re trying to upgrade a CUCM Business Edition system on a 7828H3 server, you better bust out the credit card because you’re going to need a 128 GB USB drive to hold all the CUCMBE data during the upgrade.  The IBM servers aren’t affected by this little caveat, as I’ve done the 8.6 refresh upgrade on a 7825I4 and not had any issues.  Be sure to leave the USB drives plugged in the whole time the system is upgrading.  Also, whatever is on the drive is going to be overwritten without warning, so be sure it’s blank before you start.
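If you are planning several upgrades, the USB drive sizes above are easy to fold into a pre-flight check.  The sketch below is a plain Python lookup containing only the two models and sizes mentioned in this post.  A result of None means the model isn’t listed here, not that Cisco guarantees no drive is needed, so check the release notes for anything else.

```python
from typing import Optional

# USB drive sizes (in GB) from this post. The 7828H3 figure is for
# CUCM Business Edition. Other models are deliberately left out.
USB_DRIVE_GB = {
    "7825H3": 16,
    "7828H3": 128,
}


def usb_drive_needed_gb(model: str) -> Optional[int]:
    """Return the USB drive size for a listed model, or None if unlisted."""
    return USB_DRIVE_GB.get(model.strip().upper())


for model in ("7825H3", "7828H3", "7825I4"):
    size = usb_drive_needed_gb(model)
    if size is None:
        print(f"{model}: not listed here, check the release notes")
    else:
        print(f"{model}: bring a blank {size} GB USB drive")
```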

After you’ve completed the whole installation with all the reboots, you’re going to have a fresh new system with CUCM 8.6 to support all kinds of wonderful things, like finally being able to use Google Chrome to administer things.

Tom’s Take

I kicked off an upgrade to 8.6 without reading the release notes or documentation.  Thankfully Cisco prevented me from screwing things up big time by halting the installation with the above error message.  The more I dug into things, the more interesting it was.  It also took me two hours to finish things up with many reboots and even more nail biting (Fun fact: I was doing the upgrade during Packet Pushers Show 56, which is one of the reasons why I was quiet – I was trying not to scream at my CUCM server).  However, I think I could have avoided some pain and stress if I’d just read the docs first or even searched for refresh upgrade before I got started.

Unable To Access User-Defined Storage Service

In my VMware vSphere: What’s New [5.0] class this week, I learned why having a lab environment to test things is very important.  I also learned that some bugs are fun to try and fix.

vSphere 5 introduced a lot of new features focused on storage.  One of these is Profile Driven Storage.  This allows users to create tiers for datastores and ensure that those profiles can be attached to VMs at a later date.  This would be very useful for someone that has ultra-fast SSD arrays like those from PureStorage alongside SAS or SATA arrays.  You can define the gold tier as the SSD array for VMs that need fast storage access, silver tier for slightly slower SAS drives and bronze tier for the large-but-slow SATA datastore.  I like this idea of allowing users to define their storage capabilities into easy to assign tiers.  However, we hit a bug when we tried to implement it in the lab.
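The tiering idea itself is simple enough to sketch in a few lines of Python.  This is only an illustration of matching a tier to the datastores that satisfy it, not the vSphere API.  The tier names and media types come from the example above, while the datastore names and the compliant_datastores function are invented for the sketch.

```python
from typing import Dict, List

# Tier to media type, as described in this post.
TIER_MEDIA = {"gold": "SSD", "silver": "SAS", "bronze": "SATA"}

# Hypothetical lab datastores and the media each one sits on.
DATASTORES = {
    "ds-ssd-01": "SSD",
    "ds-sas-01": "SAS",
    "ds-sata-01": "SATA",
    "ds-sata-02": "SATA",
}


def compliant_datastores(tier: str,
                         datastores: Dict[str, str] = DATASTORES) -> List[str]:
    """Return the datastores whose media type matches the requested tier."""
    media = TIER_MEDIA[tier.lower()]
    return sorted(name for name, kind in datastores.items() if kind == media)


print(compliant_datastores("gold"))
print(compliant_datastores("bronze"))
```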

After we created the tiers in VIClient, we went to assign them to the datastores from the Home -> Datastores and Datastore Clusters section.  When we right clicked on the datastore and chose “Assign User-Defined Storage Capability” we got hit with this error:

Unable To Access User-Defined Storage Service

Huh?  You let me configure the silly thing?  It’s got to be there somewhere!  Let me assign it to something.

Odds are good that if you are seeing this error, you’ve also installed the vSphere Web Client.  Another great option for users that don’t want to install the VMware Infrastructure Client, the Web Client allows you to access VMs from Firefox or Internet Explorer and manage them just like you would from the VIClient.  This would be useful for those out there that are running OS X and currently don’t have a way to manage VMs unless they launch the VIClient from a virtual machine or other emulated environment.  The Web Client software needs to be installed on a Windows (or Linux) machine in order to respond to requests from web browsers.  For many users that run OS X, the logical choice would be to install the Web Client service on the Windows-based vCenter Server and then use Firefox to remotely access the web client afterwards.  That’s what we did in the lab.

The problem is that the Web Client service conflicts with the Profile Driven Storage service.  I’m not sure if they use the same port numbers or if they just collide in memory space or something.  As long as the Web Client service is running, the Profile Driven Storage options cannot be configured on a datastore.  The fix is somewhat simple:

1.  Open the Services console on your vCenter server.

2.  Find the VMware Web Client service.

3.  Stop or disable it.

4.  Restart VIClient.

Simple, huh?  You can now assign the User-Defined Storage profiles to all the datastores you’d like.  When you finish, close out VIClient and restart the Web Client Service so your Mac folks can administer VMs.  Just remember that every time you want to use Profile Driven Storage, you’re going to have to bounce the Web Client service.
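If you get tired of bouncing the service by hand, the stop-then-restart routine can be wrapped in a small script.  Here is a minimal sketch in Python, meant to run on the Windows vCenter server.  It uses the standard Windows net stop and net start commands.  The service name below is a placeholder I made up, so look up the real name in the Services console before using it.  Note that net stop returns an error if the service is already stopped, which this sketch treats as a failure.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List

# Placeholder only: replace with the name shown in your Services console.
WEB_CLIENT_SERVICE = "VMware Web Client"


def net_command(action: str, service: str) -> List[str]:
    """Build a 'net stop' or 'net start' command for a Windows service."""
    if action not in ("stop", "start"):
        raise ValueError(f"unsupported action: {action}")
    return ["net", action, service]


def run_checked(cmd: List[str]) -> None:
    """Run a command and raise CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)


@contextmanager
def service_stopped(service: str,
                    run: Callable[[List[str]], None] = run_checked) -> Iterator[None]:
    """Stop a service for the duration of the block, then start it again."""
    run(net_command("stop", service))
    try:
        yield
    finally:
        # Restart even if the work inside the block raised an error.
        run(net_command("start", service))


# Usage on the vCenter server (left as a comment so nothing runs on import):
#
# with service_stopped(WEB_CLIENT_SERVICE):
#     input("Assign your storage capabilities in VIClient, then press Enter")
```

The restart sits in a finally clause so your Mac folks get their Web Client back even if you bail out halfway through.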

One can only hope that this particular bug gets fixed in an upcoming point release of vSphere 5.  Not a show stopper, but I can see how it could cause issues for those that don’t know from the less-than-helpful error message where to look for help.  I’m just glad I found it in a learning lab and not in production.

Crunch Time

Everyone in IT has been there before.  The core switches are melting down.  The servers are formatting themselves.  Packets are being shuffled off to their doom.  Human sacrifice, dogs and cats living together, mass hysteria.  You know, the usual.  What happens next?

Strangely enough, how IT people react to stressful situations such as these has become a rather interesting study habit of mine.  I know how I react when these kinds of things start happening.  I go into my own “panic mode”.  It’s interesting to think about what changes happen when the stress levels get turned up and problems start mounting.  I start becoming short with people.  Not yelling or screaming, per se.  I start using short declarative sentences at an elevated tone of voice to get my point across.  I begin looking for solutions to problems, however inelegant they may be.  Quick fixes rule over complicated designs.  I’ve trained myself to eliminate the source of stress or the cause of the problem.  I tend to tune out any other distractions until the issues at hand are sorted out.  Should I find myself in a situation where I can’t effect a solution to the problem, or if I’m waiting on someone or something to happen outside my direct control, that is the time when the stress starts mounting.  To those that share my “can do” attitude, this makes me look efficient and helpful in times of crisis.  To others, I look like a complete jerk.

I’ve also found that there are others in IT (and elsewhere) that have an entirely different method of dealing with stress: they shut down.  My observations have shown that these people become overwhelmed with the pressure of the situation almost immediately and begin finding ways to cope through indirect action.  Some begin blaming the problem on someone or something else.  Rather than search out the source of the trouble, they try to pin it on someone other than them, maybe in the hopes they won’t have to deal with it.  These people begin to withdraw into their own world.  They sit down and stare off into space.  They become quiet.  Some of them even break down and start to cry (yes, I’ve seen that happen before).  Until the initial shock of the situation has passed, they find themselves incapable of rendering any kind of assistance.

How do we as IT professionals deal with these two disparate types of panic modes?  You need to work out how to do that now so that you don’t have to come up with things on the fly when the core switches are dropping packets and the CxOs are screaming for heads.  It’s funny how the second category, the blamers and inaction people, always seems to be in management.

For people like me, the “doers”, we need to be doing something that can impact the problem.  No busy work, no research.  We need to be attacking things head-on.  Any second we aren’t in attack mode compounds the stress we’re under.  Even if we try a hundred things and ninety nine of them fail, we have to try to keep from going crazy.  Think of these “doers” like a wind-up toy: get us working on something and let us go.  You might not want to be around us while we’re working, lest you want some curt answers followed by looks of distaste when we have to stop and explain what we’re doing.  We’ll share…when we’re done.

For the other type of people, those that have a stress-induced Blue Screen of Death (BSoD), I’ve found that you have to do something to get them out of their initial funk.  Sometimes, this involves busy work.  Have them research the problem.  Have them go get coffee.  In most cases, have them do something other than be around you while you’re troubleshooting.  Once you can get them past the blame/sulk/cry state, they can become a useful resource for whatever needs to happen to get the problem solved.  Usually, they come back to me later and thank me for letting them help.  Of course, they also usually tell me I was a bit of an ass and should really be nicer when I’m in panic mode.  Oh well…

Tom’s Take

I don’t count on anyone in a stressful situation that isn’t me.  Most often, I don’t have the luxury of time to figure out how a person is going to react.  If you can help me I’ll get you doing something useful.  If not, I’m going to ignore or marginalize you until the problem is fixed.  Over the last couple of years, though, I’ve found that I really need to start working with every different group to ensure that communications are kept alive during stressful situations and no one’s feelings get hurt (even though I don’t normally care).  By consciously realizing that people generally fall into the “doer” or “BSoD” category, I can better plan for ways to utilize them when the time comes and make sure that the only thing going CRUNCH at crunch time is the problem.  And not someone’s head.