It started somewhat innocently. Cisco released a field notice about a problem with the clock signals on a range of its networking devices. That by itself was a huge issue. There had been rumblings about it for a few months, along with some proactive replacement of affected devices to test things, followed by panicked customer visits when the news broke on February 2nd. Cisco looked like it was about to get a black eye.
The big question was whether this issue was specific to Cisco devices or part of something much bigger. Some investigative work from enterprising folks like Tony Mattke (@tonhe) turned up a spec document from Intel that listed a specific issue with the Intel Atom C2000 System on Chip (SoC) that caused it to fail to provide a clock signal for onboard chips. The more digging that was done, the more dire this issue turned out to be.
Tick, Tick, Tick
Clock signaling is very important in modern electronics. It ensures that all the chips on the board are using the correct timing to process electronic impulses. If the clock signal starts drifting, you start getting “glitches” in the system: unpredictable results in the outputs. The further out of phase the clock drifts, the more unpredictable those results become. Clock signals are set up to keep things on task.
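To put rough numbers on how quickly a drifting clock falls out of phase, here is a minimal sketch. It models generic drift between a free-running clock and an ideal one, not the specific C2000 failure. The 33 MHz frequency, the 50 ppm error, and the quarter-cycle tolerance are illustrative assumptions, not figures from Intel or Cisco, and the function names are made up for this example.

```python
# Illustrative numbers only: neither value comes from Intel's documentation.
NOMINAL_HZ = 33_000_000   # assumed bus clock frequency
OFFSET_PPM = 50           # assumed frequency error, in parts per million


def phase_error_cycles(freq_hz, offset_ppm, elapsed_s):
    """Cycles gained or lost against an ideal clock after elapsed_s seconds."""
    return freq_hz * (offset_ppm / 1_000_000) * elapsed_s


def seconds_until_out_of_phase(freq_hz, offset_ppm, tolerance_cycles=0.25):
    """Time until the accumulated error reaches tolerance_cycles of one period."""
    if offset_ppm == 0:
        return float("inf")  # a perfect clock never drifts
    drift_per_second = freq_hz * abs(offset_ppm) / 1_000_000
    return tolerance_cycles / drift_per_second


if __name__ == "__main__":
    print(phase_error_cycles(NOMINAL_HZ, OFFSET_PPM, 1.0))
    print(seconds_until_out_of_phase(NOMINAL_HZ, OFFSET_PPM))
```

Under these assumed values, an uncorrected clock ends up about 1,650 cycles away from where it should be after a single second. That is why chips on a shared bus take their timing from one common clock signal instead of each keeping their own.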
In this case, when the Intel Atom C2000 is installed in a system, it provides the clock signal for the Low Pin Count (LPC) bus. This is the bus that tends to carry simple connections, like serial ports or console ports. But one of the other things that often connects to the LPC bus is the boot ROM for a system. If the C2000 dies, it denies access to the boot ROM of the device. That’s why the problem is usually not apparent until the device reboots: the boot ROM isn’t accessed otherwise.
The other problem here is that the issue doesn’t have any telltale markers. It just happens one day. When the C2000 dies, it’s gone for good. Intel is working on a way to try and fix the chips already in the wild, but even they are admitting that the fix is only temporary and the real solution is getting new chips and boards in place. Cisco has already embarked on a replacement program, as have the numerous other manufacturers that have used the C2000 chips.
This does cast some shade on the future of merchant silicon usage in devices. Back when this appeared to be a simple issue with Cisco devices only, people were irritated at Cisco’s insistence that custom fabricated silicon was the way to go. But now that the real culprit appears to be Intel, it should give switch vendors pause about standardizing on a specific platform.
What if this issue had been present in Broadcom Trident or Tomahawk chips? What if a component of OCP or Wedge had a hidden fault? The larger the install base of a particular chip, the more impactful the outage could be. Imagine a huge Internet of Things (IoT) deployment that relies on a weak link in the chain that can fail at any point and brick the device in question. The recall and replacement costs would be astronomical. Even for Cisco in this case, replacing the subset of affected devices is going to be very costly.
Reliance on single components for these kinds of applications is a huge risk, but it’s also good business. Intel provides the C2000 at low cost because it’s designed to be widely deployed. Just like any of the other components you’d find at a Fry’s or old Radio Shack, these chips are mass produced to serve a purpose. As we move more toward merchant silicon and whitebox as the starting point for delivering value in switching, we have to stay on top of the components themselves instead of just taking the hardware for granted, because one little glitch here and there can lead to a lot of trouble down the road. And the clock is always ticking on things like this.