SDN and Elephants


There is a parable about six blind men who try to describe an elephant based on touch alone.  One feels the tail and thinks the elephant is a rope.  One feels the trunk and thinks it's a tree branch.  Another feels the leg and thinks it is a pillar.  Eventually, the men realize they have to discuss what they have learned to get the bigger picture of what an elephant truly is.  As you have likely guessed, a lot of this applies to SDN as well.

Point of View

Your experience with IT is going to dictate your view of SDN.  If you are a DevOps person, you are going to see the programmability aspects of SDN as the important part.  If you are a network administrator, you are going to latch on to the automation and orchestration pieces as a way to alleviate the busy work of your day.  If you are a person who likes to have a big picture of your network, you will gravitate to the controller-based aspects of technologies like OpenFlow and ACI.

However, if you concentrate only on the parts of SDN that are most familiar, you will miss out on the bigger picture.  Just as an elephant is more than a trunk or a tail, SDN is more than its individual pieces.  Programmable interfaces and orchestration hooks are means to an end.  The goal is to take all the pieces and make them into something greater.  That takes vision.

The Best Policy

I like what’s going on both with Cisco’s ACI and the OpenStack Congress project.  They’ve moved past the individual parts of SDN and instead are working on creating a new framework to apply those parts. Who cares how orchestration works in a vacuum?  Now, if that orchestration allows a switch to be auto-provisioned on boot, that’s something.  APIs are a part of a bigger push to program networks.  The APIs themselves aren’t the important part.  It’s the interface that interacts with the API that’s crucial.  Congress and ACI are that interface.

Policy is something we all understand.  Think for a moment about quality of service (QoS).  Most of the time, engineers don't know LLQ from CBWFQ or PQ.  But if you describe what you want to do, you get it quickly.  If you want a queue that guarantees a certain amount of bandwidth but has a cap, you know which implementation you need to use.  With policies, you don't even need to know that.  You can create a policy that says to reserve a certain amount of bandwidth but not to go past a hard limit.  You don't need to know whether that's LLQ or some other form of queuing.  You also don't need to worry about implementing it on specific hardware or via a specific vendor.
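Here's a rough sketch of what an intent-only policy like that could look like in code.  The BandwidthPolicy class and its fields are invented for illustration; they aren't part of ACI, Congress, or any shipping API.  The point is that the operator states the outcome, not the queuing mechanism.

from dataclasses import dataclass

@dataclass
class BandwidthPolicy:
    """Describe the outcome you want, not the mechanism that delivers it."""
    app: str              # the traffic this policy applies to
    guaranteed_kbps: int  # bandwidth that must always be available
    ceiling_kbps: int     # hard cap the traffic may never exceed

# The operator states intent; whether the device implements it with LLQ,
# CBWFQ, or something else entirely is the controller's problem.
voice = BandwidthPolicy(app="voice", guaranteed_kbps=512, ceiling_kbps=1024)
print(voice)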

Policies are agnostic.  So long as the API has the right descriptors, you can program a Cisco switch just as easily as a Juniper router.  The policy will do all the work.  You just have to figure out the policy.  To tie it back to the elephant example, you find someone with the vision to recognize the individual pieces of the elephant and make them into something greater than a rope and a pillar.  You then realize that the elephant isn't quite what you were thinking, but instead has applications above and beyond what you could envision before you saw the whole picture.
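Continuing the hypothetical BandwidthPolicy sketch above, here is what vendor-agnostic rendering might look like.  The driver functions and the strings they emit are deliberately generic placeholders, not real Cisco or Juniper configuration syntax.

def render_cisco(policy):
    # Placeholder output only -- not actual Cisco configuration syntax.
    return (f"[cisco driver] reserve {policy.guaranteed_kbps} kbps for "
            f"{policy.app}, cap at {policy.ceiling_kbps} kbps")

def render_juniper(policy):
    # Placeholder output only -- not actual Juniper configuration syntax.
    return (f"[juniper driver] reserve {policy.guaranteed_kbps} kbps for "
            f"{policy.app}, cap at {policy.ceiling_kbps} kbps")

DRIVERS = {"cisco": render_cisco, "juniper": render_juniper}

def apply_policy(policy, inventory):
    """One policy, many vendors: the driver handles the translation."""
    for hostname, vendor in inventory.items():
        print(hostname, "->", DRIVERS[vendor](policy))

apply_policy(voice, {"core-sw1": "cisco", "edge-rtr1": "juniper"})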


Tom’s Take

As far as I’m concerned, vendors that are still carrying on about individual aspects of SDN are just trying to distract from a failure to have vision.  VMware and Cisco know where that vision needs to be.  That’s why policy is the coin of the realm.  Congress is the entry point for the League of Non-Aligned Vendors.  If you want to fight the bigger powers, you need to back the solution to your problems.  If Congress could program Brocade, Arista, and HP switches effortlessly then many small-to-medium enterprise shops would rush to put those devices into production.  That would force the superpowers to take a look at Congress and interface with it.  That’s how you build a better elephant – by making sure all the pieces work together.

Security’s Secret Shame


Heartbleed captured quite a bit of news these past few days.  A hole in the most secure of web services tends to make people a bit anxious.  Racing to release patches and assess the damage consumed people for days.  While I was a bit concerned about the possibilities of exposed private keys on any number of web servers, the overriding thought in my mind was instead about the speed at which we went from “totally secure” to “Swiss cheese security” almost overnight.

Jumping The Gun

As it turns out, the release of the information about Heartbleed was a bit sudden.  The rumor is that the people who discovered the bug raced to release the information as soon as the OpenSSL patch was ready because they were afraid the details had already leaked out to the wider exploit community.  I was a bit surprised when I learned this little bit of information.

Exploit disclosure has gone through many phases in recent years.  In days past, the procedure was to report the exploit to the vendor responsible.  The vendor in question would create a patch for the issue and prepare their support organization to answer questions about the patch.  The vendor would then release the patch with a report on the impact.  Users would read the report and patch the systems.

Today, researchers who aren't willing to wait for vendors to patch systems perform an end-run around the process.  Rather than waiting for the vendors to release patches on their own cycle, they release the information about the exploit as soon as they can.  The nice ones give the vendors a chance to fix things first.  The less savory folks want the press of discovering a new hole and trumpet the news far and wide at every opportunity.

Shame Is The Game

Part of the reason to release exploit information quickly is to shame the vendor into quickly releasing patch information.  Researchers taking this route are fed up with the slow quality assurance (QA) cycle inside vendor shops.  Instead, they short circuit the system by putting the issue out in the open and driving the vendor to release a fix immediately.

While this approach does have its place among vendors that patch at a glacial pace, one must wonder how much good is really being accomplished.  Patching systems isn't a quick fix.  If you've ever been forced to turn out work under a crucial deadline while people were shouting at you, you know the kind of work that gets put out.  Vendor patches must be tested against released code, and there can be no issues that would cause existing functionality to break.  Imagine the backlash if a fix to the SSL module caused the web service to fail on startup.  The fix would be worse than the problem.

Rather than rushing to release news of an exploit, researchers need to understand the greater issues at play.  Releasing an exploit for the sake of saving lives is understandable.  Releasing it for the fame of being first isn't nearly as defensible.  Instead of trying to shame vendors into releasing a patch rapidly to plug some hole, why not work with them to identify the issue and push the patch through?  Shaming vendors will only pressure them into releasing questionable code.  It will also alienate the vendors from researchers doing things the right way.


Tom’s Take

Shaming is all the rage now.  We shame vendors, users, and even pets.  Problems have taken a back seat to assigning blame.  We try to force people to change or fix things by making a mockery of what they've messed up.  It's time to stop.  Rather than pointing and laughing at those making the mistake, we should pick up a keyboard and help them fix it.  Shaming doesn't do anything other than upset people.  Let's put it to bed and make things better by working together instead of screaming "Ha ha!" when things go wrong.

End Of The CLI? Or Just The Start?


Windows 8.1 Update 1 launches today. The latest chapter in Microsoft's newest OS includes a feature people have been asking for since release: the Start Menu. The biggest single UI change in Windows 8 was the removal of the familiar Start button in favor of a combined dashboard / Start screen. While the new screen is much better for touch devices, the desktop population has been screaming for the return of the Start Menu. Windows 8.1 brought the button back, although it only linked to the Start screen. Update 1 promises to add functionality to the button once more. As I thought about it, I realized there are parallels here that we in the networking world can learn from as well.

Some very smart people out there, like Colin McNamara (@ColinMcNamara) and Matt Oswalt (@Mierdin), have been talking about the end of the command line interface (CLI). With the advent of programmable networks and API-driven configuration, the CLI is archaic and unnecessary, or so the argument goes. Yet there is a very strong contingent of the networking world that is clinging to the comfortable glow of a terminal screen and 80-column text entry.

Command The Line

API-driven interfaces provide flexibility that we can't hope to match in a human interface. There is no doubt that a large portion of the configuration of future devices will be done via API call or some sort of centralized interface that programs the end device automatically. Yet, as I've said before, engineers don't like losing visibility into a system. Getting rid of the CLI for the sake of streamlining a device is a bad idea.

I've worked with many devices that don't have a CLI. Cisco Catalyst Express switches leap immediately to mind. Other devices, like the Cisco UC500 SMB phone system, have a CLI, but its use is discouraged. In fact, when you configure the UC500 using the CLI, you start getting warnings about not being able to use the GUI tool any longer. Yet there are functions that are only visible through the CLI.

Non-Starter

Will the programmable networking world make the same mistake Microsoft did with Windows 8? Even a token CLI is better than cutting it out entirely. Programmable networking will allow all kinds of neat tricks. For instance, we can present a Cisco-like CLI for one group of users and a Juniper-like CLI for a different group, with both accomplishing the same results. We don't need to have these CLIs sitting around in resident memory. We should be able to generate them on the fly or call the appropriate interfaces from a centralized library. Extensibility, even in the archaic interface of last resort.
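Here's a toy sketch of that idea in Python. The dialect templates and the FakeDeviceAPI class are invented for illustration; the point is that a familiar-looking text interface can be a thin layer generated on the fly over programmatic calls.

DIALECTS = {
    "cisco-like":   "switchport access vlan {vlan}",
    "juniper-like": "set interface {port} vlan members {vlan}",
}

class FakeDeviceAPI:
    def set_port_vlan(self, port, vlan):
        print(f"  api call -> port={port}, vlan={vlan}")

def make_cli(dialect, api):
    """Generate a one-command CLI on the fly for the chosen dialect."""
    template = DIALECTS[dialect]
    def handle(line):
        port, vlan = line.split()           # naive parser: "<port> <vlan>"
        print(f"({dialect}) {template.format(port=port, vlan=vlan)}")
        api.set_port_vlan(port, int(vlan))
    return handle

api = FakeDeviceAPI()
for dialect in ("cisco-like", "juniper-like"):
    cli = make_cli(dialect, api)
    cli("Gi1/0/1 42")   # same API call underneath, different CLI on top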

If all our talk revolves around removing the tool people have been using for decades to program devices, we will make enemies quickly. The talk needs to shift away from the death of the CLI and toward the advantages gained by adding API interfaces to the way we program. Even if our interface for calling those APIs looks similar to a comfortable CLI, we're going to win more converts up front if we give them something they recognize as a transition mechanism.


Tom’s Take

Microsoft bit off more than they could chew when they exiled the Start Menu to the same pile as DOSShell and Microsoft Bob. People have spent almost 20 years humming the Rolling Stones' "Start Me Up" as they click on that menu. Microsoft drove users to that approach in the first place. Pulling it out from under them all at once with no transition plan made for unhappy users. Networking advocates need to be just as cognizant of the fact that we're headed down the same path. We need to provide transition options for the die-hard engineers out there so they can learn how to program devices via non-traditional interfaces. If we try to make them quit cold turkey, you can be sure the Start Menu discussion will pale in comparison.

The Sunset of Windows XP


The end is nigh for Microsoft’s most successful operating system of all time. Windows XP is finally set to reach the end of support next Tuesday. After twelve and a half years, and having its death sentence commuted at least twice, it’s time for the sunset of the “experienced” OS. Article after article has been posted as of late discussing the impact of the end of support for XP. The sky is falling for the faithful. But is it really?

Yes, as of April 8, 2014, Windows XP will no longer be supported from a patching perspective. You won't be able to call in and get help on a cryptic error message. But will your computer spontaneously combust? Is your system going to refuse to boot entirely and force you at gunpoint to go out and buy a new Windows 8.1 system?

No. That's silly. XP is going to continue to run just as it has for the last decade. It will be exactly as secure on April 9th as it was on April 7th, and it will still function. Rather than writing about how evil Microsoft is for abandoning an operating system after one of the longest support cycles in their history, let's instead look at why XP is still so popular and how we can fix that.

XP is still a popular OS with manufacturing systems and things like automated teller machines (ATMs). That might be because of the ease with which XP could be installed onto commodity hardware. It could also be due to the difficulty in writing drivers for Linux for a large portion of XP’s life. For better or worse, IT professionals have inherited a huge amount of embedded systems running an OS that got the last major service pack almost six years ago.

For a moment, I'm going to take the ATMs out of the equation. I'll come back to them in a minute. For the other embedded systems that don't dole out cash, why is support so necessary? If it's a manufacturing system that's been running for the last three or four years, what is another year of support from Microsoft going to get you? Odds are good that any support that manufacturing system needs is going to be entirely unrelated to the operating system. If we treat these systems just like an embedded OS that can't be changed or modified, we find that we can still develop patches for the applications running on top of the OS. And since XP is one of the most well documented systems out there, finding folks to write those patches shouldn't be difficult.

In fact, I'm surprised there hasn't been more talk of third party vendors writing patches for XP. I saw more than a few start popping up once Windows 2000 neared the end of its life. It's all a matter of money. Banks have already started negotiating with Microsoft to get an extension of support for their ATM networks. It's funny how a few million dollars will do that. SMBs are likely to be left out in the cold for specialized systems due to the prohibitive cost of an extended maintenance contract, either from Microsoft or another third party. After all, the money to pay those developers needs to come from somewhere.


Tom’s Take

Microsoft is not the bad guy here. They supported XP as long as they could. Technology changes a lot in 12 years. The users aren't to blame either. The myth of a fast upgrade cycle doesn't exist for most small businesses and enterprises. Every month that your PC can keep running the accounting software is another month of profits. So whose fault is the end of the world?

Instead of looking at it as the end, we need to start learning how to cope with unsupported software. Rather than tilting at the windmills in Redmond and begging for just another month or two of token support, we should be investigating ways to transition XP systems that can't be upgraded within 6-8 months to an embedded systems plan. We've reached the point where we can't count on anyone else to fix our XP problems but ourselves. Once we accept the known, immutable fact of no more XP support, we can start planning for the inevitable – life after retirement.

The Alignment of Net Neutrality

Net neutrality has been getting a lot of press as of late, especially as AT&T and Netflix have been sparring back and forth in the press.  The FCC has already said they are going to take a look at net neutrality to make sure everyone is on a level playing field.  ISPs have already made their position clear.  Where is all of this posturing going to leave the users?

Chaotic Neutral

Broadband service usage has skyrocketed in the past few years.  Ideas that would never have been possible even 5 years ago are now commonplace.  Netflix and Hulu have made it possible to watch television without cable.  Internet voice over IP (VoIP) allows a house to have a phone without a phone line.  Amazon has replaced weekly trips to the local department store for all but the most crucial staple items.  All of this is made possible by high-speed network connectivity.

But broadband doesn’t just happen.  ISPs must build out their networks to support the growing hunger for faster Internet connectivity.  Web surfing and email aren’t the only game in town.  Now, we have streaming video, online multiplayer, and persistently connected devices all over the home.  The Internet of Things is going to consume a huge amount of bandwidth in an average home as more smart devices are brought online.  ISPs are trying to meet the needs of their subscribers.  But are they going far enough?

ISPs want to build networks their customers will use, and more importantly, pay to use.  They want to ensure that complaints are kept to a minimum while providing the services that customers demand.  Those ISP networks cost a hefty sum.  Given the choice between paying to upgrade a network and trying to squeeze another month or two out of existing equipment, you can guarantee the ISPs are going to take the cheaper route.  Coincidentally, that's one of the reasons why the largest backers of 802.1aq Shortest Path Bridging were ISP-oriented.  SPB doesn't require new equipment to forward frames the way TRILL does.  ISPs can use existing equipment to deliver SPB with no out-of-pocket expenditure on hardware.  That little bit of trivia should give you an idea why ISPs are trying to do away with net neutrality.

True Neutral

ISPs want to keep using their existing equipment as long as possible.  Every dollar they make from this cycle's capital expenditure means a dollar of profit in their pocket before they have to replace a switch.  If there were a way to charge even more money for existing services, you'd better believe they would do it.  Which is why this infographic hits home for most:

[Infographic: net-neutrality]

Charging for service tiers would suit ISPs just fine.  After all, as the argument goes, you are using more than the average user.  Shouldn’t you shoulder the financial burden of increased network utilization?  That’s fine for corner cases like developers or large consumers of downstream bandwidth.  But with Netflix usage increasing across the board, why should the ISP charge you more on top of a Netflix subscription?  Shouldn’t their network anticipate the growing popularity of streaming video?

The other piece of the tiered offering above that should give pause is the common carrier rules for service providers.  Common carriers are absolved of liability for the things they transport because they have to agree to transport everything offered to them.  What do you think would happen if those carriers suddenly decided they wanted to discriminate over what they transport?  If that discrimination revokes their common carrier status, what's to stop them from acting like a private carrier and refusing to transport certain applications or content?  Maybe forcing a video service to negotiate a separate peering agreement for every ISP they want to use?  Who would do that?

Neutral Good

Net Neutrality has to exist to ensure that we are free to use the services we want to consume.  Sure, this means that things like Quality of Service (QoS) can't be applied to packets, so that they are all treated equally.  The alternative is guaranteed delivery for an additional fee.  And every service you add on top would incur more fees.  New multiplayer game launching next week?  The ISP will charge you an extra $5 per month to ensure you have a low ping time to beat the other guy.  If you don't buy the package, your multiplayer traffic gets dumped in with Netflix and the rest of the bulk traffic.

This is part of the reason why Google Fiber is such a threat to existing ISPs.  When the only options for local loop delivery are the cable company and the phone company, it's difficult to find options that aren't tiered in the absence of neutrality.  With viable third-party fiber buildouts like Google starting to spring up, there is finally a bargaining chip to increase speeds to users and upgrade backbones to support heavy usage.  If you don't believe that, look at what AT&T did immediately after Google announced Google Fiber in Austin, TX.


Tom’s Take

ISPs shouldn't be able to play favorites with their customers.  End users are paying for a connection.  End users are also paying services to use their offerings.  Why should we have to pay for a service twice because the ISP wants to charge us more in a tiering setup?  That smells of a protection racket in many ways.  I can imagine the ISP techs sitting there in a slick suit saying, "That's a nice connection you got there.  It would be a shame if something were to happen to it."  Instead, it's up to the users to demand that ISPs offer free and unrestricted access to all content.  In some cases, that will mean backing alternatives and "voting with your dollar" to make the message heard loud and clear.  I won't sign up for services that have data usage caps or metered speed limits past a certain ceiling.  I would drop any ISP that wants me to pay extra just because I decide to start using a video streaming service or a smart thermostat.  It's time for ISPs to understand that hardware should be an investment in future customer happiness and not a tool that's used to squeeze another dime out of their user base.

Throw More Storage At It

All of this has happened before.  All of this will happen again.

The more time I spend listening to storage engineers talk about the pressing issues they face in designing systems in this day and age, the more I'm convinced that we fight the same problems over and over again with different technologies.  Whether it be networking, storage, or even wireless, the same architecture problems crop up in new ways and require different engineers to solve them all over again.

Quality is Problem One

A chance conversation with Howard Marks (@DeepStorageNet) at Storage Field Day 4 led me to start thinking about these problems.  He was talking during a presentation about the difficulty that storage vendors have faced in implementing quality of service (QoS) in storage arrays.  As Howard described some of the issues with isolating neighboring workloads and ensuring they can’t cause performance issues for a specific application, I started thinking about the implementation of low latency queuing (LLQ) for voice in networking.

LLQ was created to solve a specific problem.  High volume, low bandwidth flows can starve traditional priority queuing systems.  In much the same way, applications that demand high amounts of input/output operations per second (IOPS) while storing very little data can cause huge headaches for storage admins.  Storage has tried to solve this problem with hardware in the past by creating things like write-back caching or even super fast flash storage caching tiers.
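The fix in both worlds tends to follow the same shape: give the latency-sensitive workload strict priority, but police it so it cannot starve its neighbors.  Here's a rough sketch of that idea; the class, the numbers, and the workload names are invented for illustration and don't represent any vendor's implementation.

# Strict priority plus a policer, the shared idea behind LLQ and storage QoS.
class PolicedPriorityQueue:
    def __init__(self, priority_limit):
        self.priority_limit = priority_limit  # max priority items served before
        self.served = 0                       # bulk gets a turn (a stand-in for
        self.priority, self.bulk = [], []     # a per-interval policer)

    def enqueue(self, item, priority=False):
        (self.priority if priority else self.bulk).append(item)

    def dequeue(self):
        # Serve priority traffic first, but only up to its policed limit.
        if self.priority and self.served < self.priority_limit:
            self.served += 1
            return self.priority.pop(0)
        if self.bulk:
            return self.bulk.pop(0)
        return self.priority.pop(0) if self.priority else None

q = PolicedPriorityQueue(priority_limit=3)
for i in range(5):
    q.enqueue(f"voice-{i}", priority=True)   # or: small-block, high-IOPS I/O
q.enqueue("backup-job")                      # or: bulk sequential I/O
print([q.dequeue() for _ in range(6)])       # the backup job still gets a turn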

Make It Go Away

In fact, a lot of the problems in storage mirror those of the networking world many years ago.  Performance issues used to have a very simple solution – throw more hardware at the problem until it goes away.  In networking, you throw bandwidth at the issue.  In storage, more IOPS are the answer.  When hardware isn't pushed to the absolute limit, the answer will always be to keep driving it higher and higher.  But what happens when performance can't fix the problem any more?

Think of a sports car like the Bugatti Veyron.  It is the fastest production car available today, with a top speed well over 250 miles per hour.  In fact, Bugatti refuses to talk about the absolute top speed, instead countering, "Would you ask how fast a jet can fly?"  One of the limiting factors in attaining a true measure of that speed isn't found in the car's engine.  Instead, the subsystems around the engine begin to fail at such stressful speeds.  At 258 miles per hour, the tires on the car will completely disintegrate in 15 minutes.  The fuel tank will be emptied in 12 minutes.  Bugatti wisely installed a governor on the engine limiting it to a maximum of 253 miles per hour in an attempt to prevent people from pushing the car to its limits.  A software control designed to prevent performance issues by creating artificial limits.  Sound familiar?

Storage has hit the wall when it comes to performance.  PCIe flash storage devices are capable of hundreds of thousands of IOPS.  A single PCIe card has the throughput of an entire data center.  Hungry applications can quickly saturate the weak link in a system.  In days gone by, that was the IOPS capacity of a storage device.  Today, it's the link that connects the flash device to the rest of the system.  Not until the advent of PCIe was the flash storage device fast enough to keep pace with workloads starved for performance.
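To put rough numbers on it (assumed figures for illustration, not a benchmark of any particular product), a flash device rated around 500,000 IOPS at 4 KB per operation needs on the order of 2 GB/s from its link, far more than a SATA-era interface can carry.

# Back-of-the-envelope math with assumed, illustrative numbers only.
iops = 500_000                           # assumed device capability
block_kb = 4                             # assumed I/O size
required_mb_s = iops * block_kb / 1024   # MB/s the link has to carry

sata3_mb_s = 600                         # ~usable throughput of a SATA 6 Gb/s link
pcie3_x4_mb_s = 3900                     # ~usable throughput of a PCIe 3.0 x4 link

print(f"link throughput required: ~{required_mb_s:.0f} MB/s")
print(f"SATA 6 Gb/s ceiling:      ~{sata3_mb_s} MB/s (saturated)")
print(f"PCIe 3.0 x4 ceiling:      ~{pcie3_x4_mb_s} MB/s (headroom left)")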

Storage isn’t the only technology with hungry workloads stressing weak connection points.  Networking is quickly reaching this point with the shift to cloud computing.  Now, instead of the east-west traffic between data center racks being a point of contention, the connection point between the user and the public cloud will now be stressed to the breaking point.  WAN connection speeds have the same issues that flash storage devices do with non-PCIe interfaces.  They are quickly saturated by the amount of outbound traffic generated by hungry applications.  In this case, those applications are located in some far away data center instead of being next door on a storage array.  The net result is the same – resource contention.


Tom’s Take

Performance is king.  The faster, stronger, better widgets win in the marketplace.  When architecting systems it is often much easier to specify a bigger, faster device to alleviate performance problems.  Sadly, that only works to a point.  Eventually, the performance problems will shift to components that can’t be upgraded.  IOPS give way to transfer speeds on SATA connectors.  Data center traffic will be slowed by WAN uplinks.  Even the CPU in a virtual server will eventually become an issue after throwing more RAM at a problem.  Rather than just throwing performance at problems until they disappear, we need to take a long hard look at how we build things and make good decisions up front to prevent problems before they happen.

Like the Veyron engineers, we have to make smart choices about limiting the performance of a piece of hardware to ensure that other bottlenecks do not happen.  It’s not fun to explain why a given workload doesn’t get priority treatment.  It’s not glamorous to tell someone their archival data doesn’t need to live in a flash storage tier.  But it’s the decision that must be made to ensure the system is operating at peak performance.

Replacing Nielsen With Big Data

I'm a fan of television. I like watching interesting programs. It's been a few years since one has caught my attention long enough to keep me watching for multiple seasons. Part of the reason is my fear that an awesome program is going to fail to reach a "target market" and will end up canceled just when it's getting good. It's happened to several programs that I liked in the past.

Sampling the Goods

Part of the issue with tracking the popularity of television programs comes from the archaic method in which programs are measured. Almost everyone has heard of the Nielsen ratings system. This sampling method was created in the 1930s as a way to measure radio advertising reach. In the 50s, it was adapted for use in television.

Nielsen selects target audiences that represent the greater whole. They ask users to keep written diaries of their television watching habits. They also have the ability to install a device called a set meter which allows viewers to punch in a code to identify themselves via age groups and lock in a view for a program. The set meter can tell the instant a channel is changed or the TV is powered off.

In theory, the sampling methodology is sound. In practice, it's a bit shaky. Viewing diaries are unreliable because people tend to overreport their viewing habits. If they feel guilty that they haven't been writing anything down, they tend to look up something that was on TV and write it down. Diaries also can't determine whether a viewer watched the entire program or changed the channel in the middle. Set meters aren't much better. The reliance on PIN codes to identify users can lead to misreported results. Adults in a hurry will sometimes punch in an easier code assigned to their children, leading to skewed age results.

Both the diary and the set meter fail to take into account the shift in viewing habits in modern households. Most of the TV viewing in my house takes place through time-shifted DVR recordings. My kids tend to monopolize the TV during the day, but even they are moving to services like Netflix to watch all episodes of their favorite cartoons in one sitting. Neither of these viewing habits is easily tracked by Nielsen.

How can we find a happy medium? Sample sizes have been reduced significantly as cord-cutting households move to Internet distribution models. People tend to exaggerate or manipulate self-reported viewing results. Even "modern" Nielsen technology can't keep up. What's the answer?

Big Data

I know what you’re saying: “We’ve already got that with Nielsen, right?” Not quite. TV viewing habits have shifted in the past few years. So has TV technology. Thanks to the shift from analog broadcast signals to digital and the explosion of set top boxes for cable decryption and movie service usage, we now have a huge portal into the living room of every TV watcher in the world.

Think about it for a moment. The idea of a sample size works provided it's a good representative sample. But tracking this data reliably is problematic. If we have access to a way to crunch the actual data instead of extrapolating from incomplete sets, shouldn't we use that instead? I'd rather believe the real numbers than try to guess from unreliable sources.

This also fixes the issue of time-shifted viewing. Those same set top boxes are often responsible for recording the programs. They could provide information such as number of shows recorded versus viewed and whether or not viewers skip through commercials. For those that view on mobile devices that data could be compiled as well through integration with the set top box. User logins are required for mobile apps as it is today. It’s just a small step to integrating the whole package.

It would require a bit of technical upgrading on the client side. We would have to enable the set top boxes to report data back to a service. We could anonymize the data enough to be sure that people aren't being unnecessarily exposed. It would also have to be configured as an opt-out setting to ensure that the majority is represented. Opt-in won't work because those checkboxes never get checked.
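As a sketch of what that might look like, here's a hypothetical anonymized, opt-out-aware viewing report from a set top box. The field names, the salted-hash scheme, and the opt-out flag are all invented for illustration, not a description of any real system.

import hashlib
import json

def anonymize_device(device_id, salt):
    """Replace the raw device ID with a salted, one-way hash."""
    return hashlib.sha256((salt + device_id).encode()).hexdigest()[:16]

def build_event(device_id, opted_out, show, watched_pct, skipped_ads,
                salt="rotate-this-salt"):
    if opted_out:
        return None  # honor the opt-out: nothing leaves the box
    return json.dumps({
        "viewer": anonymize_device(device_id, salt),  # no raw identifier
        "show": show,
        "watched_pct": watched_pct,                   # partial views count too
        "skipped_ads": skipped_ads,
    })

print(build_event("STB-0012345", opted_out=False,
                  show="Jericho S01E01", watched_pct=85, skipped_ads=True))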

Advertisers are going to demand the most specific information about people that they can. The ratings service exists to work for the advertisers. If this plan is going to work, a new company will have to be created to collect and analyze this data. This way the analysis company can ensure that the data is specific enough to be of use to the advertisers while at the same time ensuring the protection of the viewers.


Tom’s Take

Every year, promising new TV shows are yanked off the airwaves because advertisers don't see any revenue. Shows that have a great premise can't get up to steam because of ratings. We need to fix this system. In the old days, the deluge of data would have drowned Nielsen. Today, we have the technology to collect, analyze, and store that data for eternity. We can finally get the real statistics on how many people watched Jericho or After MASH. Armed with real numbers, we can make intelligent decisions about what to keep on TV and what to jettison. And that's a big data project I'd be willing to watch.