Intel’s Ticking Atom Bomb

It started somewhat innocently. Cisco released a field notice about an issue with the clock signals on a range of their networking devices. This by itself was a huge issue. There had been rumblings about it for a few months, along with some proactive replacement of affected devices to test things, followed by panicked customer visits when the news broke on February 2nd. Cisco looked like they were about to get a black eye.

The big question that arose was whether this issue was specific to Cisco devices or part of something much bigger. Some investigative work from enterprising folks like Tony Mattke (@tonhe) found a spec document from Intel that listed a specific issue with the Intel Atom C2000 System on Chip (SoC) that caused it to fail to provide a clock signal to onboard chips. The more digging that was done, the more dire this issue turned out to be.

Tick, Tick, Tick

Clock signaling is very important in modern electronics. It ensures that all the chips on the board are using the correct timing to process electronic impulses. If the clock signal starts drifting, you start getting “glitches” in the system. Those glitches are unpredictable results in the outputs. The further out of phase the clock drifts, the more unpredictable the results in the output. Clock signals are set up to keep things on task.
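To make the drift idea concrete, here is a toy model in Python. It is only a sketch under loud assumptions: every bit on the wire lasts exactly one time unit, the receiver tries to sample in the middle of each bit, and the receiver's clock period is off by a fixed fraction. The names sample_bits, count_errors, and drift are made up for this illustration and have nothing to do with how the C2000 or any real bus works.

```python
import random


def sample_bits(bits, drift):
    """Sample a bit stream with a receiver clock whose period is off by `drift`.

    Assumption: each transmitted bit lasts exactly 1.0 time unit. The receiver
    aims for the middle of every bit, but its period is (1 + drift), so the
    sample point slides a little further off target on every tick.
    """
    received = []
    for k in range(len(bits)):
        t = (k + 0.5) * (1.0 + drift)  # when the receiver thinks bit k is centered
        index = int(t)                 # which bit is actually on the wire at time t
        if index >= len(bits):
            break                      # the receiver has run past the end of the stream
        received.append(bits[index])
    return received


def count_errors(bits, drift):
    """Count positions where the sampled bit differs from the bit that was sent."""
    received = sample_bits(bits, drift)
    return sum(1 for sent, got in zip(bits, received) if sent != got)


if __name__ == "__main__":
    rng = random.Random(2017)  # fixed seed so repeated runs match
    stream = [rng.randint(0, 1) for _ in range(1000)]
    for drift in (0.0, 0.001, 0.01):
        print(f"drift {drift:.3f}: {count_errors(stream, drift)} bit errors")
```

With no drift, every sample lands on the right bit. With even a 1% error in the period, the sample point slips onto the neighboring bit after about fifty ticks, and from there the receiver is reading data that was never meant for that slot.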

In this case, when the Intel Atom C2000 is installed in a system, it’s providing the clock signal for the Low Pin Count bus. This is the bus that tends to contain simple connections, like serial ports or console ports. But, one of the other things that often connects to the LPC bus is the boot ROM for a system. If the C2000 dies, it denies access to the boot ROM of the device. That’s why the problem is usually not apparent until the device reboots. The boot ROM wouldn’t be accessed otherwise.

The other problem here is that the issue doesn’t have any telltale markers. It just happens one day. When the C2000 dies, it’s gone for good. Intel is working on a way to try and fix the chips already in the wild, but even they are admitting that the fix is only temporary and the real solution is getting new chips and boards in place. Cisco has already embarked on a replacement program, as have the numerous other manufacturers that have used the C2000 chips.

Laying Landmines

This does cast some shade on the future of merchant silicon usage in devices. Back when this appeared to be a simple issue with Cisco devices only, people were irritated at Cisco’s insistence that custom fabricated silicon was the way to go. But now that the real culprit appears to be Intel, it should give switch vendors pause about standardizing on a specific platform.

What if this issue had been present in Broadcom Trident or Tomahawk chips? What if a component of OCP or Wedge had a hidden fault? The larger the install base of a particular chip, the more impactful the outage could be. Imagine a huge Internet of Things (IoT) deployment that relies on a weak link in the chain that can fail at any point and brick the device in question. The recall and replacement costs would be astronomical. Even for Cisco in this case, replacing the subset of affected devices is going to be very costly.


Tom’s Take

Reliance on single components for these kinds of applications is a huge risk, but it’s also good business. Intel provides the C2000 at low cost because it’s designed to be widely deployed. Just like any of the other components you’d find at a Fry’s or old Radio Shack, they are mass produced to serve a purpose. As we move more toward merchant silicon and whitebox as the starting point for delivering value in switching, we have to realize that we need to stay on top of the components themselves instead of just taking the hardware for granted. Because one little glitch here and there can lead to a lot of trouble down the road. And the clock is always ticking on things like this.

Intel and the Network Arms Race

Networking is undergoing a huge transformation. Software is surely a huge driver for enabling technology to grow by leaps and bounds and increase functionality. But the hardware underneath is growing just as much. We don’t seem to notice as much because the port speeds we deal with on a regular basis haven’t gotten much faster than the specs we read about years ago. But the chips behind the ports are where the real action is right now.

Fueling The Engines Of Forwarding

Intel has jumped into networking with both feet and is looking to land on someone. Their work on the Data Plane Development Kit (DPDK) is helping developers write code that is highly portable across CPU architectures. We used to deal with specific microprocessors in unique configurations. A good example is Dynamips.

Most everyone is familiar with this program or the projects it spawned, Dynagen and GNS3. Dynamips worked at first because it emulated the MIPS processor found in Cisco 7200 routers. It just happened that the software used the same code for those routers all the way up to the first releases of the 15.x train. Dynamips allowed for the emulation of Cisco router software but it was very, very slow. It almost didn’t allow for packets to be processed. And most of the advanced switching features didn’t work at all thanks to ASICs.

Running networking code on generic x86 processors doesn’t provide the kinds of performance that you need in a network switching millions of packets per second. That’s why DPDK is helping developers accelerate their packet forwarding to approach the levels of custom ASICs. This means that a company could write software for a switch using Intel CPUs as the base of the system and expect to get good performance out of it.
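A little back-of-the-envelope math shows why “millions of packets per second” is so hard on a general-purpose CPU. The Python sketch below assumes a 10 Gb/s Ethernet link carrying nothing but minimum-size 64-byte frames, 20 bytes of per-frame overhead on the wire (preamble plus inter-frame gap), and a single 3 GHz core doing all the work. Those numbers are illustrative assumptions, not figures from Intel or DPDK, and the function names are invented for this example.

```python
def packets_per_second(link_bps, frame_bytes, overhead_bytes=20):
    """Maximum frames per second on an Ethernet link running at line rate.

    overhead_bytes covers the 8-byte preamble/start delimiter and the
    12-byte inter-frame gap that accompany every frame on the wire.
    """
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_on_wire


def per_packet_budget(link_bps, frame_bytes, cpu_hz):
    """Return (packets/sec, nanoseconds per packet, CPU cycles per packet)."""
    pps = packets_per_second(link_bps, frame_bytes)
    nanoseconds = 1e9 / pps
    cycles = cpu_hz / pps
    return pps, nanoseconds, cycles


if __name__ == "__main__":
    # Assumed scenario: 10 Gb/s link, 64-byte frames, one 3 GHz core.
    pps, ns, cycles = per_packet_budget(10e9, 64, 3e9)
    print(f"{pps / 1e6:.2f} Mpps, {ns:.1f} ns per packet, {cycles:.0f} cycles per packet")
```

Under those assumptions the link can deliver roughly 14.88 million packets per second, which leaves about 67 nanoseconds, or around 200 clock cycles, to handle each one. That is not much room for a general-purpose software stack, and that gap is the one a toolkit like DPDK is trying to close.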

Not only can you write code that’s almost as good as the custom stuff network vendors are creating, but you can also have a relative assurance that the code will be portable. Look at the pfSense project. It can run on some very basic hardware. But the same code can also run on a Xeon if you happen to have one of those lying around. That performance boost means a lot more packet switching and processing. No modifications to the code needed. That’s a powerful way to make sure that your operating system doesn’t need radical modifications to work across a variety of platforms, from SMB and ROBO all the way to an enterprise core device.

Fighting The Good Fight

The other reason behind Intel’s drive to get DPDK to everyone is to fight off the advances of Broadcom. It used to be that the term merchant silicon meant using off-the-shelf parts instead of rolling your own chips. Now, it means “anything made by Broadcom that we bought instead of making”. Look at your favorite switching vendor and the odds are better than average that the chipset inside their most popular switches is a Broadcom Trident, Trident 2, or even a Tomahawk. Yes, even the Cisco Nexus 9000 runs on Broadcom.

Broadcom is working their way to the position of arms dealer to the networking world. It soon won’t matter what switch wins because they will all be the same. That’s part of the reason for the major differentiation in software recently. If you have the same engine powering all the switches, your performance is limited by that engine. You also have to find a way to make yourself stand out when everything on the market has the exact same packet forwarding specs.

Intel knows how powerful it is to become the arms dealer in a market. They own the desktop, laptop, and server market space. Their only real competition is AMD, and one could be forgiven for arguing that the only reason AMD hasn’t gone under yet is through a combination of video card sales and Intel making sure they won’t get in trouble for having a monopoly. But Intel also knows what it feels like to miss the boat on a chip transition. Intel missed the mobile device market, which is now ruled by ARM and custom SoC manufacturing. Intel needs to pull off a win in the networking space with DPDK to ensure that the switches running in the data center tomorrow are powered by x86, not Broadcom.


Tom’s Take

Intel’s on the right track to make some gains in networking. Their new Xeon chips with lots and lots of cores can do parallel processing of workloads. Their contributions to CoreOS will help accelerate the adoption of containers, which are becoming a standard part of development. But the real value for Intel is helping developers create portable networking code that can be deployed on a variety of devices. That enables all kinds of new things to come, from system scaling to cloud deployment and beyond.