Intel’s Ticking Atom Bomb


It started somewhat innocently. Cisco released a field notice that there was an issue with the signal clocks on a range of their networking devices. That by itself was a huge issue. There had been rumblings about the problem for a few months, along with some proactive replacement of affected devices to test things, followed by panicked customer visits when the news broke on February 2nd. Cisco looked like it was about to get a black eye.

The big question that arose was whether this issue was specific to Cisco devices or something much bigger. Some investigative work from enterprising folks like Tony Mattke (@tonhe) turned up a spec document from Intel listing a specific issue with the Intel Atom C2000 System on Chip (SoC) that caused it to stop providing the clock signal for onboard chips. The more digging that was done, the more dire the issue turned out to be.

Tick, Tick, Tick

Clock signaling is very important in modern electronics. It ensures that all the chips on the board are using the correct timing to process electrical impulses. If the clock signal starts drifting, you start getting “glitches” in the system: unpredictable results in the outputs. The further out of phase the clock drifts, the more unpredictable the outputs become. Clock signals are set up to keep things on task.
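To illustrate why drift matters, here is a toy simulation (my own sketch, not from any vendor documentation) of a receiver sampling a bit stream while its clock drifts a little further each cycle. Once the accumulated error carries the sample point across a bit boundary, the receiver reads the wrong bit, which is exactly the kind of glitch described above:

```python
def sample_bits(bits, period, drift_per_cycle):
    """Sample a bit stream with a receiver clock that drifts each cycle.

    The receiver aims for the middle of each bit period. Once the
    accumulated drift pushes the sample point past a bit boundary, it
    reads a neighboring bit instead -- an unpredictable "glitch".
    """
    out = []
    phase = 0.0  # accumulated drift relative to the transmitter clock
    for i in range(len(bits)):
        sample_time = i * period + period / 2 + phase
        idx = int(sample_time // period)  # which bit actually gets sampled
        out.append(bits[idx] if 0 <= idx < len(bits) else None)
        phase += drift_per_cycle
    return out
```

With zero drift the sampled output matches the transmitted stream exactly; with even a quarter-period of drift per cycle, the mismatch shows up within a few bits.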

In this case, when the Intel Atom C2000 is installed in a system, it’s providing the clock signal for the Low Pin Count bus. This is the bus that tends to contain simple connections, like serial ports or console ports. But, one of the other things that often connects to the LPC bus is the boot ROM for a system. If the C2000 dies, it denies access to the boot ROM of the device. That’s why the problem is usually not apparent until the device reboots. The boot ROM wouldn’t be accessed otherwise.

The other problem here is that the issue doesn’t have any telltale markers. It just happens one day. When the C2000 dies, it’s gone for good. Intel is working on a way to fix the chips already in the wild, but even they admit that the fix is only temporary and that the real solution is getting new chips and boards in place. Cisco has already embarked on a replacement program, as have the numerous other manufacturers that have used the C2000 chips.

Laying Landmines

This does cast some shade on the future of merchant silicon usage in devices. Back when this appeared to be a simple issue with Cisco devices only, people were irritated at Cisco’s insistence that custom fabricated silicon was the way to go. But now that the real culprit appears to be Intel, it should give switch vendors pause about standardizing on a specific platform.

What if this issue had been present in Broadcom Trident or Tomahawk chips? What if a component of OCP or Wedge had a hidden fault? The larger the install base of a particular chip, the more impactful the outage could be. Imagine a huge Internet of Things (IoT) deployment that relies on a weak link in the chain that can fail at any point and brick the device in question. The recall and replacement costs would be astronomical. Even for Cisco in this case, replacing the subset of affected devices is going to be very costly.

Tom’s Take

Reliance on single components for these kinds of applications is a huge risk, but it’s also good business. Intel provides the C2000 at low cost because it’s designed to be widely deployed. Just like any of the other components you’d find at a Fry’s or an old Radio Shack, they are mass produced to serve a purpose. As we move more toward merchant silicon and whitebox as the starting point for delivering value in switching, we have to realize that we need to stay on top of the components themselves instead of just taking the hardware for granted. Because one little glitch here and there can lead to a lot of trouble down the road. And the clock is always ticking on things like this.

Are We Living In A Culture Of Beta?

Cisco released a new wireless LAN controller last week, the 5760.  Blake and Sam have a great discussion about it over at the NSA Show.  It’s the next generation of connection speeds and AP support.  It also runs a new version of the WLAN controller code that unifies development with the IOS code team.  That last point generated a bit of conversation between wireless rock stars Scott Stapleton (@scottpstapleton) and George Stefanick (@wirelesssguru) earlier this week.  In particular, a couple of tweets stood out to me:

Overall, the number of features missing from this new IOS-style code release is bordering on the point of concern.  I understand that porting code to a new development base is never easy.  Being a fan of video games, I’ve had to endure the pain of watching features be removed because they needed to be recoded the “right way” in a code base instead of being hacked together.  Cisco isn’t the only culprit in this whole mess.  Software quality has been going downhill for quite a while now.

Our culture is living in a perpetual state of beta testing.  There’s a lot of blame to go around on this.  We as consumers and users want cutting edge technology.  We’re willing to sacrifice things like stability or usability for a little peek at future awesomeness.  Companies are rushing to be the first to market on new technologies.  Being the first at anything is an edge when it comes to marketing and, now, patent litigation.  Producers just want to ship stuff.  They don’t really care if it’s finished or not.

Stability can be patched.  Bugs can be coded out in the next release.  What’s important is that we hit our release date.  Who cares if it’s an unrealistic arbitrary day on the calendar picked by the marketing launch team?  We have to be ready otherwise Vendor B will have their widget out and our investors will get mad and sell off the stock!  The users will all leave us for the Next Big Thing and we’ll go out of business!!!  

Okay, maybe not every conversation goes like that, but you can see the reasoning behind it.

Google is probably the worst offender of the bunch here.  How long was GMail in beta?  As it turns out…five years.  I think they probably worked out most of the bugs of getting electronic communications from one location to another after the first nine months or so.   Why keep it in beta for so long?  I think it was a combination of laziness and legality.  Google didn’t really want to support GMail beyond cursory forum discussion or basic troubleshooting steps.  By keeping it “beta” for so long, they could always fall back to the excuse that it wasn’t quite finished so it wasn’t supposed to be in production.  That also protected them from the early adopters that moved their entire enterprise mail system into GMail.  If you lost messages it wasn’t a big deal to Google.  After all, it’s still in beta, right?  Google’s reasoning for finally dropping the beta tag after five years was that it didn’t fit the enterprise model that Google was going after.  Turns out that the risk analysts really didn’t like having all their critical communication infrastructure running through a project with a “beta” tag on it, even if GMail had ceased being beta years before.

Software companies thrive off of getting code into consumers’ hands, because we’ve effectively become an unpaid quality assurance (QA) platform for them.  Apple beta code for iOS gets leaked onto the web hours after it’s posted to the developer site.  There’s even a cottage industry of sites that will upload your device’s UDID to a developer account so you can use the beta code.  You actually pay money to someone for the right to use code that will be released for free in a few months’ time.  In essence, you are paying money for a free product in order to find out how broken it is.  Silly, isn’t it?

Think about Microsoft.  They’ve started offering free Developer Preview versions of new Windows releases to the public.  In previous iterations, the hardy beta testers of yore would get a free license for the new version as a way of saying thanks for enduring a long string of incremental builds and constant reloading of the OS, only to hit a work-stopping bug that erased your critical data.  Nowadays, MS releases those buggy builds with a new name and people happily download them and use them on their hardware with no promise of any compensation.  Who cares if it breaks things?  People will complain about it and it will get fixed.  No fuss, no muss.  How many times have you heard someone say “Don’t install a new version of Windows until the first service pack comes out”?  It’s become such a huge deal that MS never even released a second Service Pack for Windows 7, just an update rollup.

Even Cisco’s flagship NX-OS on the Nexus 7000 series switches has been accused of being a beta in progress by bloggers such as Greg Ferro (@etherealmind) in this Network Computing article (comment replies).  If the core of our data center is running on buggy, unreliable code, what hope have we for the desktop OS or mobile platform?

That’s not to say that every company rushes products out the door.  Two of the most stalwart defenders of full, proper releases are Blizzard and Valve.  Blizzard is notorious for letting release dates slip in order to ensure code quality.  Diablo 2 was delayed several times between the original projected date of December 1998 and its eventual release in 2000 and went on to become one of the best selling computer games of all time.  Missing an unrealistic milestone didn’t hurt them one bit.  Valve has one of the most famous release strategies in recent memory.  Every time someone asks founder Gabe Newell when Valve will release their next big title, his response is almost always the same – “When it’s done.”  Their apparent hesitance to ship unfinished software hasn’t run them out of business yet.  By most accounts, they are one of the most respected and successful software companies out there.  Just goes to show that you don’t have to be a slave to a release date to make it big.

Tom’s Take

The culture of beta is something I’m all too familiar with.  My iDevices run beta code most of the time.  My laptop runs developer preview software quite often.  I’m always clamoring for the newest nightly build or engineering special.  I’ve mellowed a bit over the years as my needs have gone from bleeding edge functionality to rock solid stability.  I still jump the gun from time to time and break things in the name of being the first kid on my block to play with something new.

However, I often find that when the final stable release comes out to much fanfare in the press, I’m disappointed.  After all, I’ve already been using this stuff for months.  All you did was make it stable?  Therein lies the rub in the whole process.  I’ve survived months of buggy builds, bad battery life, and driver incompatibility only to see the software finally pushed out the door and hear my mom or my wife complain that it changed the fonts on an application or that the maps look funny now.  I want to scream and shout and complain that my pain was more than you could imagine.  That’s when I usually realize what’s really going on.  I’m an unpaid employee fixing problems that should never even be in the build in the first place.

I’ve joked before about software release names, but it’s sadly more true than funny.  We spend too much time troubleshooting prerelease software.  Sometimes the trouble is of our own doing.  Other times it’s because the company has outsourced or fired their whole QA department.  In the end, my productivity is wasted fixing problems I should never see.  All because our culture now seems to care more about how shiny something is and less about how well it works.

Mountain Lion PL-2303 Driver Crash Fix

Now that I’ve switched to using my Mac full time for everything, I’ve been pretty happy with the results.  I even managed to find time to upgrade to Mountain Lion in the summer.  Everything went smoothly with that except for one little hitch with a piece of hardware that I use almost daily.

If you are a CLI jockey like me, you have a USB-to-serial adapter in your kit.  Even though the newer Cisco devices are starting to use USB-to-mini-USB cables for console connections, I find these to be fiddly and problematic at times.  Add in the amount of old, non-USB Cisco gear that I work on regularly and you can see my need for a good old fashioned RJ-45-to-serial rollover cable.  My first laptop was the last that IBM made with a built-in serial port.  Since then, I’ve found myself dependent on a USB adapter.

The one that I have is some no-name brand, but like most of those cables it has the Prolific PL-2303 chipset.  This little bugger seems to be the basis for almost all USB-to-serial connectivity except for Keyspan adapters.  While the PL-2303 is effective and cheap, it’s given me no end of problems over the past couple of years.  When I upgraded my Lenovo to Windows 7 64-bit, the drivers available at the time caused random BSOD crashes when consoled into a switch.  I could never nail down the exact cause, but a driver point release fixed it for the time being.  When I got my Macbook Air, it came preinstalled with Lion.  There were lots of warnings that I needed to upgrade the PL-2303 drivers to the latest available on the Prolific support site in order to avoid problems with the Lion kernel.  I dutifully followed the directions and had no troubles with my USB adapter.  Until I upgraded to Mountain Lion.

After I upgraded to 10.8, I started seeing some random behaviors I couldn’t quite explain.  Normally, after I’m done consoling into a switch or a router, I just close my laptop and throw it back in my bag.  I usually remember after I’ve closed it and put it to sleep that I need to pull out the USB adapter.  After Mountain Lion, I was finding that I would open my laptop back up and see that it had rebooted at some point.  All my apps were still open and had their data preserved, but I found it odd that things would spontaneously reboot for no reason.  I found the culprit one day when I yanked the USB adapter out while my terminal program (ZTerm) was still open.  Almost instantly, I got a kernel panic followed by a reboot.  I had finally narrowed down my problem.  I tried closing ZTerm before unplugging the cable and everything behaved as it should.  It appeared that the issue stemmed from having the terminal program actively accessing the port and then unplugging it.  I searched around and found that there were a few people reporting the same issue.  I even complained about it a bit on Twitter.

Santino Rizzo (@santinorizzo) heard my pleas for sanity and told me about a couple of projects that created open source versions of the PL-2303 driver.  Thankfully, someone else had noticed that Prolific was taking their sweet time updating things and took matters into their own hands.  The best set of directions to go along with the KEXT that I can find are here:

For those not familiar with OS X, a KEXT is basically a driver or DLL file.  Copying it to /System/Library/Extensions places it in the folder where OS X looks for device drivers.  Make sure you get rid of the old Prolific driver if you have it installed before you install the open source PL-2303 driver.  Once you’ve run the commands listed on the site above, you should be able to plug in your adapter and then unplug it without any nasty crashes.  One other note – the port used to connect in ZTerm changed when I used the new drivers.  Instead of it being /dev/USBSerial or something of that nature, it’s now PL2303-<random digits>.  It also changed the <random digits> when I moved it from one USB port to another.  Thankfully for me, ZTerm remembers the USB ports and will try them all when I launch it until it finds the right adapter.  There is some discussion in the comments of the post above about creating a symlink for a more consistent pointer.
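Since the node name under /dev changes with the driver and even with the physical USB port, an alternative to a manual symlink is discovering the adapter at run time, much like ZTerm probing its remembered ports. A minimal sketch (the `tty.PL2303-*` pattern is an assumption based on the naming behavior described above, and the fallback name is hypothetical):

```python
import glob
import os

def find_serial_port(dev_dir="/dev"):
    """Return the path of the first likely USB-serial device node, or None.

    Checks the open-source PL-2303 driver's naming first (nodes like
    tty.PL2303-00001004, where the suffix varies by USB port), then falls
    back to a generic usbserial name some other drivers use. Both
    patterns are assumptions, not documented guarantees.
    """
    for pattern in ("tty.PL2303-*", "tty.usbserial*"):
        matches = sorted(glob.glob(os.path.join(dev_dir, pattern)))
        if matches:
            return matches[0]
    return None
```

A terminal script can call this at launch instead of hard-coding a device path that breaks every time the adapter moves to a different port.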

Tom’s Take

Writing drivers is hard.  I’ve seen stats that say up to 90% of all Windows crashes are caused by buggy drivers.  Even when drivers appear to work just fine, things can be a little funny.  Thankfully, in the world of *NIX, people that get fed up with the way things work can just pull out their handy IDE and write their own driver.  Not exactly the easiest thing in the world to do but the results speak for themselves.  When the time comes that vendors either can’t or won’t support their hardware in a timely fashion, I take comfort in knowing that the open source community is ready to pick up the torch and make things work for us.

CSCsb42763, or Why I Hate Hunt Groups Sometimes

Ah, CSCsb42763.  My old nemesis.  Those of you in the voice realm may never have heard of this little jewel.  However, you may have heard of its more common name: “Can you make it so that all the members of a hunt group can pick up a call when it’s ringing (call pickup groups)?”  Fine, you say.  You enable the call pickup group for each phone.  Except being logged into a hunt group negates that particular behavior for calls ringing in the hunt group.  Okay, so I’ll just enable the call pickup group on the hunt pilot…ARGGGGGGGGGGGGGGG!!!!!  IT’S NOT THERE!!!!!!!!!!!!!

My friends, say hello to CSCsb42763.  This particular bug has been around since CallManager 4.0(2a)SR1.  According to the TAC Case Collection, this behavior caused a memory leak in 4.x versions of CallManager and was disabled by default to keep the calls to TAC down to a minimum.  But wait!  It’s 2010!  Why hasn’t this been fixed/patched/enabled?  We’ve moved off of Windows and we’re now on our fourth major generation of appliance *coughcoughLinuxcough* support.  I honestly can’t say why this is still an issue.  If I knew exactly how TAC worked I wouldn’t be doing the job I am doing now.  Or I’d be getting paid a LOT more for it.

How do I fix this little behavior without going crazy?  Well, it’s time to get acquainted with the mysterious “Cisco Support Use 1” parameter:

1.  Log into CallManager and click on System -> Enterprise Parameters.

2.  Scroll down until you find the Cisco Support use section (It’s toward the bottom).  Under there, you’ll find the field “Cisco Support Use 1” (Yeah, real descriptive there, guys.  Why didn’t you just call it “DON’T TOUCH THIS UNLESS WE TELL YOU!!!!!!!!!!!”)

3.  Enter “CSCsb42763” (without quotes) exactly as it appears. Yes, case and spelling are very important.  The system is looking for this exact string to enable the feature that you seek.

4.  Go to the hunt group in question and you should now see the “Call Pickup Group” field.

BTW, if you are still on CallManager 4.2 this field has a different name, naturally.  You’ll find it listed under the more descriptive “Unsupported Pickup” parameter under the CCMAdmin parameters page.  But, really, if you’re still on 4.2 you’re probably using tin cans and string to talk to each other and don’t really need hunt groups, now do you?