Your Data Center Isn’t Facebook And That’s Just Fine


While at the Software Defined Data Center Symposium, I had the good fortune to moderate a panel on application-focused networking in the data center. There were some really smart engineers on that panel. One of the most impressive people was Najam Ahmad from Facebook. He is their Director of Technical Operations. He told me some things about Facebook that made me look at what they are doing in a new light.

When I asked him about stakeholder perceptions, Najam said that he felt a little out of sorts on stage because Ivan Pepelnjak (@IOSHints) and David Cheperdak (@DavidCheperdak) had spent the last fifteen minutes talking about virtual networking. Najam said that he didn’t really know what a hypervisor or a vSwitch was because they don’t run them at Facebook. All of their operating systems and servers run directly on bare metal. That shocked me a bit. Najam said that inserting anything between the server and its function added unnecessary overhead. That’s a pretty unique take on things when you look at how many data centers are driving toward full virtualization.

Old Tools, New Uses

Facebook also runs BGP to the top-of-rack (ToR) switches in their environment. That means they are doing layer 3 all the way to their access layer. What’s funny is that while BGP in the ToR switches provides scalability and resiliency, they don’t use BGP as the primary decision-maker when exchanging routes with providers. For Facebook, BGP at the edge doesn’t provide enough control over network egress. They take the information that BGP provides and crunch it a bit further, feeding it all into a controller-based solution that applies business logic and policies to determine the best path for a given network scenario.
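To make that idea concrete, here’s a rough sketch of what “crunch BGP data, then apply business logic” might look like. This is purely my own illustration, not anything Facebook has published; the path attributes and scoring weights are invented for the example.

```python
# Hypothetical controller logic: BGP is one input among several, not the decider.
from dataclasses import dataclass

@dataclass
class EgressPath:
    provider: str
    as_path_length: int          # learned from BGP
    measured_latency_ms: float   # learned from active probing
    cost_per_gb: float           # learned from contract/business data

def best_egress(paths, latency_weight=1.0, cost_weight=50.0, bgp_weight=5.0):
    """Score each candidate path; lower is better."""
    def score(p):
        return (latency_weight * p.measured_latency_ms
                + cost_weight * p.cost_per_gb
                + bgp_weight * p.as_path_length)
    return min(paths, key=score)

paths = [
    EgressPath("ProviderA", as_path_length=3, measured_latency_ms=40.0, cost_per_gb=0.02),
    EgressPath("ProviderB", as_path_length=2, measured_latency_ms=75.0, cost_per_gb=0.01),
]
print(best_egress(paths).provider)  # ProviderA wins; the shorter AS path isn't everything
```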

Najam also said that they had used NetFlow for a while to collect data from their servers in order to build a picture of what was going on inside the network. What they found is that the collectors were becoming overwhelmed by the amount of data they were being hammered with. So instead of installing bigger, faster collectors, the Facebook engineers broke the problem apart by putting a small shim program on every server to collect the data and then forward it to a system designed to accept many kinds of data inputs, not just NetFlow. Najam lovingly called their system “FBFlow”.
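The shim pattern itself is easy to sketch. FBFlow’s internals weren’t disclosed, so the record format, the interval, and the collector address below are all my own inventions to show the shape of the approach:

```python
# Illustrative per-server shim: sample locally, summarize, ship a small record.
import json
import socket
import time

COLLECTOR = ("127.0.0.1", 9999)  # stand-in; a real shim would target the aggregation tier

def sample_flows():
    """Stand-in for reading local connection stats (e.g., from /proc/net/tcp)."""
    return [{"src": "10.0.0.5", "dst": "10.0.1.9", "bytes": 12345}]

def ship(records):
    payload = json.dumps({"host": socket.gethostname(),
                          "ts": time.time(),
                          "flows": records}).encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, COLLECTOR)  # fire-and-forget; the collector aggregates

while True:
    ship(sample_flows())
    time.sleep(60)  # one summary per minute instead of a NetFlow firehose
```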

I thought about this for a while before having a conversation with Colin McNamara (@ColinMcNamara). He told me that this design was a lot more common than I previously thought and that he had implemented it a few times already. At service providers. That’s when things really hit home for me.

Providing Services

Facebook is doing the same things that you do in your data center today. They’re just doing it at a scale that’s one or two orders of magnitude bigger. The basics are all still there: Facebook pushes packets around a network to feed servers and provide applications for consumption by users. What is so different is that the scale at which Facebook does this begins to look less and less like a traditional data center and more and more like a service provider. After all, they *are* providing a service to their users.

I’ve talked before about how Facebook’s Open Compute Project (OCP) switch wasn’t going to be the death knell for traditional networking. Now you see some of that validated in my opinion. Facebook is building hardware to meet their needs because they are a strange hybrid of data center and service provider. Things that we would do successfully in a 500 VM system don’t scale at all for them. Crazy ideas like running exterior gateway routing protocols on ToR switches work just fine for them because of the scale at which they are operating.

Which brings me to the title of the post. People are always holding Facebook and Google in such high regard for what they are doing in their massive data centers. Those same people want to try to emulate that in their own data centers and often find that it just plain doesn’t work.  It’s the same set of protocols.  Why won’t this work for me?

Facebook is solving problems just like a service provider would.  They are building not for continuous uptime, but instead for detectable failures that are quickly recoverable.  If I told you that your data center was going to be down for ten minutes next month you’d probably be worried.  If I told you that those outages were all going to be one minute long and occur ten times, you’d probably be much less worried.  Service providers try to move around failure instead of pouring money into preventing it in the first place.  That’s the whole reasoning behind Facebook’s “Fail Harder” mentality.

Failing Harder means making big mistakes and catching them before they become real problems. Little issues tend to get glossed over and forgotten about. Think about something like Weighted Random Early Detection (WRED). WRED works because you can drop a few packets from a TCP session and it will keep chugging along and request the missing bits. If you kill the entire connection or blow up a default gateway, then you’ve got a real issue. WRED fixes a problem, global TCP synchronization, by failing quietly once in a while. And it works.
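For the curious, the core of the WRED idea fits in a few lines. This is a simplified sketch with made-up thresholds, not any vendor’s implementation:

```python
# As the average queue depth grows between a minimum and maximum threshold,
# drop a randomly chosen fraction of packets so individual TCP sessions back
# off at different times instead of all at once.
import random

MIN_TH, MAX_TH, MAX_DROP_PROB = 20, 40, 0.1  # thresholds in packets; illustrative values

def wred_drop(avg_queue_depth):
    if avg_queue_depth < MIN_TH:
        return False     # queue is healthy; never drop
    if avg_queue_depth >= MAX_TH:
        return True      # queue is full; this is just tail drop
    # Linearly ramp the drop probability between the two thresholds.
    prob = MAX_DROP_PROB * (avg_queue_depth - MIN_TH) / (MAX_TH - MIN_TH)
    return random.random() < prob

drops = sum(wred_drop(30) for _ in range(10000))
print(f"~{drops / 100:.1f}% dropped at depth 30")  # roughly 5% at the midpoint
```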


Tom’s Take

Instead of comparing your data center to Facebook or Google, you should be taking a hard look at what you are actually trying to do. If you are doing Hadoop, your data center is going to look radically different than a web services company’s. There are lessons you can learn from what the big boys are doing. Failing harder and using old tools in novel ways are a good start to your own data center analysis and planning. Just remember that those big data centers aren’t alien environments. They just have different needs to meet.

Here’s the entire SDDC Symposium Panel with Najam if you’d like to watch it.  He’s got a lot of interesting insights into things besides what I wrote about above.

The Vision Of A ThousandEyes


Scott Adams wrote a blog post once about career advice and whether it was better to be excellent at one thing or good at several things. Basically, being the best at something is fairly hard. There’s always going to be someone smarter or faster than you doing it just a bit better. Many times it’s just as good to be very good at what you do. The magic comes when you take two or three things that are very good and combine them in a way that no one has seen before to make something amazing. The kind of thing that makes people gaze in wonder and then immediately start figuring out how to use your thing to be great.

During Networking Field Day 6, ThousandEyes showed the delegates something very similar to what Scott Adams was talking about. ThousandEyes uses tools like Traceroute, Ping, and BGP data aggregation to collect data. These tools aren’t overly special in and of themselves. Ping and Traceroute are built into almost every networking stack. BGP looking glass servers and data analysis have been available publicly for a while and can be leveraged in a tool like BGPMon. All very good tools. What ThousandEyes did was combine them in a way to make them better.

ThousandEyes can show data all along the path of a packet. I can see response times and hop-by-hop trajectory. I can see my data leave one autonomous system (AS) and land in another. Want to know what upstream providers your ISP is using? ThousandEyes can tell you that. All that data can be collected in a cloud dashboard. You can keep tabs on it to know if your service level agreements (SLAs) are being met. Or, you could think outside the box and do something that I found very impressive.
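To show why the combination matters, here’s a toy example of merging per-hop latency with AS ownership into a single view. This is emphatically not ThousandEyes’ code; the addresses and AS names are fabricated:

```python
# Pretend these came from a traceroute run and a BGP looking glass, respectively.
hops = [("192.0.2.1", 1.2), ("203.0.113.5", 8.7), ("198.51.100.9", 24.3)]
prefix_to_asn = {"192.0.2.": "AS64500 (my ISP)",
                 "203.0.113.": "AS64501 (upstream)",
                 "198.51.100.": "AS64502 (content provider)"}

def asn_for(ip):
    for prefix, asn in prefix_to_asn.items():
        if ip.startswith(prefix):
            return asn
    return "unknown AS"

for n, (ip, rtt_ms) in enumerate(hops, start=1):
    print(f"hop {n}: {ip:<15} {rtt_ms:>6.1f} ms  {asn_for(ip)}")
# The merged path shows *where* latency jumps and *whose* network it jumps in.
```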

Let’s say you are a popular website that angered someone. Maybe you published an unflattering article. Maybe you cut off a user doing something they shouldn’t have. Maybe someone out there just has a grudge. With the nuclear options available to most “hackers” today, the distributed denial of service (DDoS) attack seems to be a popular choice. So popular that DDoS mitigation services have sprung up to shoulder the load. The basic idea is that when you determine that you’re being slammed with gigabits of traffic, you just swing the DNS for your website to a service that starts scrubbing away attack traffic and steering legitimate traffic to your site. In theory it should prevent the attackers from taking you offline. But how can you prove it’s working?

ThousandEyes can do just that. In the above video, they show what happened when Bank of America (BoA) was recently knocked offline by a huge DDoS attack. The information showed two of the three DDoS mitigation services were engaged. The third changeover didn’t happen. All that traffic was still being dumped on BoA’s servers. Those BoA boxes couldn’t keep up with what they were seeing, so even the legitimate traffic that was being forwarded on by the mitigation scrubbers got lost in the noise. Now, if ThousandEyes can tell you which mitigation provider failed to engage then that’s a powerful tool to have on your side when you go back to them and tell them to get their act together. And that’s just one example.

I hate calling ISPs to fix circuits because it never seems to be their fault. No matter what I do or who I talk to it never seems to be anything inside the provider network. Instead, it’s up to me to fiddle with knobs and buttons to find the right combination of settings to make my problem go away, especially if it’s packet loss. Now, imagine if you had something like ThousandEyes on your side. Not only could you see the path that your packets are taking through your ISP, you can check latency and see routing loops and suboptimal paths. And, you can take a screenshot of it to forward to the escalation tech during those uncomfortable phone arguments about where the problem lies. No fuss, no muss. Just the information you need to make your case and get the problem fixed.

If you’d like to learn more about ThousandEyes and their monitoring solutions, check out their website at http://www.thousandeyes.com. You can also follow them on Twitter as @ThousandEyes.


Tom’s Take

Vision is a funny thing. Some have it. Some don’t. Having vision can mean many things. It can be someone who assembles tools in a novel way to solve a problem. It can be the ability to collect data and “see” what’s going on in a network path. It can also mean being able to take that approach and use it in a non-obvious way to provide a critical service to application providers that they’ve never had before. Or, as we later found out at Networking Field Day 6 during a presentation with SolarWinds, it can mean having the sense to realize when someone is doing something right, as Joel Dolisy said when asked about ThousandEyes, “Oh, we’ve got our eye on them.” That’s a lot of vision. A ThousandEyes worth.

Special thanks to Ivan Pepelnjak (@IOSHints) for giving me some ideas on this review.

Networking Field Day Disclaimer

While I was not an official delegate at Networking Field Day 6, I did participate in the presentations and discussions. ThousandEyes was a sponsor of Networking Field Day 6. In addition to hosting a presentation in their offices, they provided snacks and drinks for the delegates. They also provided a gift bag with a vacuum water bottle, luggage tag, T-shirt, and stickers (which I somehow managed to misplace). At no time did they ask for any consideration in the writing of this review, nor were they offered any. Independence means no restrictions. The analysis and conclusions contained in this post are mine and mine alone.

Know the Process, Not the Tool


If there is one thing that amuses me as of late, it’s the “death of CLI” talk that I’m starting to see coming from many proponents of software defined networking. They like to talk about programmatic APIs and GUI-based provisioning and how everything that network engineers have learned is going to fall by the wayside.  Like this Network World article. I think reports of the death of CLI are a bit exaggerated.

Firstly, the CLI will never go away. I learned this when I started working with an Aerohive access point I got at Wireless Field Day 2. I already had a HiveManager account provisioned thanks to Devin Akin (@DevinAkin), so all I needed to do was add the device to my account and I would be good to go. Except it never showed up. I could see it on my local network, but it never showed up in the online database. I rebooted and reset several times before flipping the device over and finding a curious port labeled “CONSOLE”. Why would a cloud-based device need a console port? In the next hour, I learned a lot about the way Aerohive APs are provisioned, and a few commands available only from that console helped me narrow down the problem. After fixing a provisioning glitch in HiveManager the next day, I was ready to go. The CLI didn’t fix my problem, but I did learn quite a bit from it.

Basic interfaces give people a great way to see what’s going on under the hood. Given that most folks in networking are from the mold of “take it apart to see why it works,” the CLI is great for them. I agree that memorizing a 10-argument command to configure something like route redistribution is a pain in the neck, but that doesn’t come from the difficulty of networking. Instead, the difficulty lies in speaking the language.

I’ve traveled to a foreign country once or twice in my life. I barely have a grasp of the English language at times. I can usually figure out some Spanish. My foreign language skills have pretty much left me at this point. However, when I want to make myself understood to people that speak another language, I don’t focus on syntax. Instead, I focus on ideas. Pointing at an object and making gestures for money usually gets the point across that I want to buy something. Pantomiming a drinking gesture will get me to a restaurant.

Networking is no different. When I started trying to learn CLI terminology for Brocade, Arista, and HP I found they were similar in some respects but very different in others. When you try to take your Cisco CLI skills to a Juniper router, you’ll find that you aren’t even in the neighborhood when it comes to syntax. What becomes important is *what* you’re trying to do. If you can think through what you’re trying to accomplish, there’s usually a help file or a Google search that can pull up the right way to do things.

This extends its way into a GUI/API-driven programming interface as well. Rather than trying to intuit the interface, just think about what you want to do instead. If you want two hosts to talk to each other through a low-cost link with basic security, you just have to figure out what the drag-and-drop is for that. If you want to force application-specific traffic to transit a host running an intrusion prevention system, you already know what you want to do. It’s just a matter of finding the right combination of interface programming to accomplish it. If you’re working on an API call using Python or Java, you probably have to define the constraints of the system anyway. The hard part is writing the code that interfaces with the system to accomplish the task.
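As a sketch of that mindset, here’s what expressing intent against a hypothetical controller API might look like in Python. The URL and JSON schema are invented, and every real controller’s API differs; the point is that the intent you express barely changes between platforms:

```python
import requests  # the common third-party HTTP library

CONTROLLER = "https://controller.example.com/api/v1"  # hypothetical endpoint

def create_vlan(vlan_id, name, ports):
    """State the goal (a VLAN with these ports), then hand it to the interface."""
    intent = {"vlan": vlan_id, "name": name, "ports": ports}
    resp = requests.post(f"{CONTROLLER}/vlans", json=intent, timeout=10)
    resp.raise_for_status()
    return resp.json()

create_vlan(100, "voice", ["eth1/1", "eth1/2"])
```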


Tom’s Take

Learning the process is the key to making it in networking. So many entry level folks are worried about *how* to do something. Configuring a route or provisioning a VLAN becomes the end goal. It’s only when those folks take a step back and think about their task without the commands that they begin to become real engineers. When you can visualize what you want to do without thinking about the commands you need to enter to do it, you are taking the logical step beyond being tied to a platform. Some of the smartest people I know break a task down into component parts and steps. When you spend more time on *what* you are doing and less on *how* you are doing it, you don’t need to concern yourself with radical shifts in networking, whether they be SDN, NFV, or the next big thing. Because the process will never change even if the tools might.

IT Jugglers


I once interviewed for a job where the interviewer asked how I decided to work on tasks. He said, “There are two kinds of workers. The first concentrates on a task and does nothing else until it is completed. They can only do one thing at a time. Then, there are the jugglers. Which one are you?” When I responded that I tended toward the latter, the interviewer smiled.  That was obviously the answer he was looking for.

IT is very much defined by focus. Being able to work on a project until it is totally finished is an admirable quality. In my experience, especially in the VAR world, it is equally important to be able to shift your focus quickly to other tasks that require attention. As indicated above, it’s not unlike juggling. Being able to focus on a project for a few hours or days and then move to a different project for a few hours can be a very critical skill for high level engineers.

Technology has been doing this for years. Think about a preemptive multitasking CPU. It appears to be doing many things at once. It’s really executing instructions for a given process for a period of time (a timeslice). Because you can process enough instructions in that time to accomplish a function, it all appears to work like magic. The key is to tune the processor to use the right timeslices. If the timeslice is too long, the processor will sit idle waiting for the program to generate new instructions. If the timeslice is too short, the program won’t be able to execute enough instructions during the window and will appear unresponsive. Just like a juggler, it’s all about the timing.
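If you want to see that trade-off in action, here’s a toy round-robin scheduler. The task sizes and timeslice are arbitrary:

```python
from collections import deque

def run(tasks, timeslice):
    """tasks: dict of name -> remaining work units. Returns completion order."""
    queue, finished = deque(tasks.items()), []
    while queue:
        name, remaining = queue.popleft()
        remaining -= timeslice               # give the task one slice of attention
        if remaining <= 0:
            finished.append(name)            # done, like a wrapped-up project
        else:
            queue.append((name, remaining))  # back in the air with the others
    return finished

print(run({"install": 5, "report": 2, "upgrade": 8}, timeslice=3))
# -> ['report', 'install', 'upgrade']; everything progresses, nothing gets dropped
```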

Choosing what to juggle in IT is almost as important as knowing how to do it. When you are just starting out with juggling, you use safe, soft objects to contain the damage. You don’t start off with chainsaws and Molotov cocktails. When juggling IT projects, be sure to juggle those that don’t have hard deadlines or require critical path updates on a regular basis. If you’re required to provide a weekly update on an installation, be sure you’ve allocated enough time during the week to make some progress on it. Otherwise, that weekly installation report is going to look pretty thin.

When learning to juggle, most people spend entirely too much time worrying about the ball in their hand. They tend to lose track of all the other objects floating in the air. That’s why they start dropping them. In the same way, you can’t be so dialed in on one project that you completely neglect all the other things going on. Finding a good point to stop one task and start working on another is a very fine art.

This isn’t for everyone. If you’re a person that can’t shift focus fast enough to keep all the balls (or projects) in the air without dropping something, you should avoid working on many things at once. There’s no shame in having laser focus on something. It works well for a lot of folks. It gets hard things done right. It’s just another way to get the job done.


Tom’s Take

I’m a juggler. I try to keep everything going at once while I wrap up what I can. I do my best to avoid dropping things, but something slips through from time to time. I also taught myself to juggle in real life. I can keep three tennis balls going with no issues. I realize my limitations, though. I know that more than that is too many. In the project space, I know that having more than I can handle is bad for everything, so I try to keep my focus on a manageable amount of juggled things. It’s better to juggle a few things well than juggle an impressive number of things poorly. I’ll let you know when I work my way up to the chainsaws.

I Can Fix Gartner


I’ve made light of my issues with Gartner before. From mild twitching when the name is mentioned to outright physical acts of dismissal. Aneel Lakhani did a great job on an episode of the Packet Pushers dispelling a lot of the bad blood that most people have for Gartner. I listened and my attitude toward them softened somewhat. It wasn’t until recently that I finally realized that my problem isn’t necessarily with Gartner. It’s with those that use Gartner as a blunt instrument against me. Simply put, Gartner has a perception problem.

Because They Said So

Gartner produces a lot of data about companies in a number of technology-related spaces. Switches, firewalls, and wireless devices are all subject to ranking and data mining by Gartner analysts. Gartner takes all that data and uses it to give form to a formless part of the industry. They take inquiries from interested companies and produce a simple ranking for them to use as a yardstick for measuring how one cloud hosting provider ranks against another. That’s a good and noble cause. It’s what happens afterwards that shows what data in the wrong hands can do.

Gartner makes their reports available to interested parties for a price. The price covers the cost of the analysts and the research they produce. It’s no different than the work that you or I do. Because this revenue from the reports is such a large percentage of Gartner’s income, the only folks that can afford it are large enterprise customers or vendors. Enterprise customers are unlikely to share that information with anyone outside their organization. Vendors, on the other hand, are more than willing to share that information with interested parties. Provided that those parties offer up their information as a lead generation exercise and the Gartner report is favorable to the company. Vendors that aren’t seen as a leader in their particular slice of the industry aren’t particularly keen on doing any kind of advertising for their competitors. Leaders, on the other hand, are more than willing to let Gartner do their dirty work for them. Often, that conversation goes like this:

Vendor: You should buy our product. We’re the best.
Customer: Why are you the best? Are you the fastest or the lowest cost? Why should I buy your product?
Vendor: We’re the best because Gartner says so.

The only way that users outside the large enterprises see these reports is when a vendor publishes them as the aforementioned lead generation activity. This skews things considerably for a lot of potential buyers. This disparity becomes even more insulting when the club in question is a polygon.

Troubling Trigonometry

Gartner reports typically include a lot of data points. Those data points tell a story about performance, cost, and value. People don’t like reading data points. They like graphs and charts. In order to simplify the data into something visual, Gartner created their Magic Quadrant (MQ). The MQ distills the entire report into four squares of ranking. The MQ is the real issue here. It’s the worst kind of graph. It doesn’t have any labels on either axis. There’s no way to rank the data points without referring to the accompanying report. However, most readers rarely open the report, so the MQ becomes the *only* basis for comparison.

How much better is Company A at service provider routing than Company B? An inch? Half an inch? $2 billion in revenue? $2,000 gross margin? This is the key data that allows the MQ to be built. Would you know where to find it in the report if you had to? Most readers don’t. They take the MQ as the gospel truth and the only source of data. And the vendors love to point out that they are further to the top and right of the quadrant than their competitors. Sometimes, the ranking seems arbitrary. What makes a company be in the middle of the leaders quadrant versus toward the middle of the graph? Are all companies in the leaders quadrant ranked and placed against each other only? Or against all companies outside the quadrant? Details matter.

Assisting the Analysis

Gartner can fix their perception problems. It’s not going to be easy, though. They have the same issue as the Consumers Union, producer of Consumer Reports. The CU publishes a magazine that has no advertising, using donations and subscription revenues to offset operating costs. You don’t see television or print ads with Consumer Reports reviews pasted all over them. That’s because the Consumers Union specifically forbids their inclusion for commercial purposes.

Gartner needs to take a similar approach if they want to fix the issues with how they’re seen by others. Sell all the reports you want to end users that want to know the best firewall to buy. You can even sell those reports to the firewall vendors themselves. But the vendors should be forbidden from using those reports to resell their products. The integrity gained from that stance may not offset the loss of vendor revenue right away. But in the long run it will gain you customers that respect your refusal to let Gartner reports be misused as third-party advertising copy.

Put a small disclaimer at the bottom of every report: “Gartner provides analysis for interested parties only. Any use of this information as a sales tool or advertising instrument is unintended and prohibited.” That makes the purpose of the report clear and discourages its use simply to sell another hundred widgets.

Another idea that might work to dispel advertising usage of the MQ is releasing last year’s report for little to no cost after 12 months.  That way, the small-to-medium enterprises gain access to the information without sacrificing their independence from a particular vendor.  I don’t think there will be any loss of revenue from these reports, as those that typically buy them will do so within 6-8 months of the release.  That will give the vendors very little room to leverage information that should be in the public domain anyway.  If you feel bad for giving that info away, charge a nominal printing fee of $5 or something like that.  Either way, you’ll blunt the advertising advantage quickly and still accomplish your goal of being seen as the leader in information gathering.


Tom’s Take

I don’t have to whinny like a horse every time someone says Gartner. It’s become a bit of a legend by now. What I do take umbrage with is vendors using data points intended for customers to rank purchases and declaring that the unlabeled graph of those data points is the sole arbiter of winners and losers in the industry. What if your company doesn’t fit neatly into a Magic Quadrant category? It’s hard to call a company like Palo Alto a laggard in traditional firewalls when they have something that is entirely non-traditional. Reader discretion is key. Use the data in the report as your guide, not the pretty pictures with dots all over them. Take that data and fold it into your own analysis. Don’t take anyone’s word for it. Make your own decisions. Then, give feedback. Tell people what you found and how accurate those Gartner reports were in making your decision. Don’t give your email address to a vendor that wants to harvest it simply to gain access to the latest report that (surprisingly) shows them to be the best. When the advertising angle dries up, vendors will stop using Gartner to sell their wares. When that day comes, Gartner will have a real opportunity to transcend their current image and become something more. And that’s a fix worth implementing.

Objective Lessons


“Experience is a harsh teacher because it gives the test first and the lesson afterwards.” – Vernon Law

When I was in college, I spent a summer working for my father. He works in the construction business as a superintendent. I agreed to help him out in exchange for a year’s tuition. Along the way, I got exposure to all kinds of fun methods of digging ditches and pouring concrete. One story that sticks out in my mind taught me the value of the object lesson.

One of the carpenters that worked for my father had a really bad habit of breaking sledgehammer handles.  When he was driving stakes for concrete forms, he never failed to miss the head of the 2×4 by an inch and catch the top of the handle on it instead.  The force of the swing usually caused the head to break off after two or three misses.  After the fourth or fifth broken handle, my father finally had enough.  He took an old sledgehammer head and welded a steel pipe to it to serve as a handle.  When the carpenter brought him his broken hammer yet again, my father handed him the new steel-handle hammer and said, “This is your new tool.  I don’t want to see you using any hammer but this one.”  Sure enough, the carpenter started driving the 2×4 form stakes again.  Only this time when he missed his target, the steel handle didn’t offer the same resistance as the wooden one.  The shock of the vibration caused the carpenter to drop the hammer and shake his hand in a combination of frustration and pain.  When he picked up the hammer again, he made sure to measure his stance and swing to ensure he didn’t miss a second time.  By the end of the summer, he was an expert sledgehammer swinger.

Amusing as it may be, this story does have a purpose.  People need to learn from failure.  For some, the lesson needs to be a bit more direct.  My father’s carpenter had likely been breaking hammer handles his entire life.  Only when confronted with a more resilient handle did he learn to adjust his processes and fix the real issue – his aim.  In technology, we often find that incorrect methods are as much to blame for problems as bad hardware or buggy software.

Thanks to object lessons, I’ve learned to never bridge the two terminals of an analog 66-block connection with a metal screwdriver lest I get a shocking reward.  I’ve watched others try to rack fully populated chassis switches by brute force alone.  And we won’t talk about the time I watched a technician rewire a 220 volt UPS receptacle without turning off the breaker (he lived).  Each time, I knew I needed to step in at some point to prevent physical harm to the person or prevent destruction of the equipment.  But for these folks, the lesson could only be learned after the mistake had been made.  I think this recent tweet from Teren Bryson (@SomeClown) sums it up nicely:

Some people don’t listen to advice. That’s a fact borne out over years and years of working in the industry. They know that their way is better or more appropriate even against the advice of multiple experts with decades of experience. For those people that can’t be told anything, a lesson in reality usually serves as the best instructor. The key is not to immediately jump to the I Told You So mentality afterward. It is far too easy to watch someone create a bridging loop against your advice and crash a network, only to walk up to them and gloat a little about how you knew better. Instead of stroking your own ego against an embarrassed and potentially worried co-worker, take the time to discuss with them why things happened the way they did and coach them not to make the same mistakes again. Make them learn from their lesson rather than covering it up and repeating the same mistake.


Tom’s Take

I’ve screwed up before.  Whether it was deleting mailboxes or creating a routing loop I think I’ve done my fair share of failing.  Object lessons are important because they quickly show the result of failure and give people a chance to learn from it.  You naturally feel embarrassed and upset when it happens.  So long as you gather your thoughts and channel all that frustration into learning from your mistake then things will work out.  It’s only the people that ignore the lesson or assume that the mistake was a one-time occurrence that will continually subject themselves to object lessons.  And those lessons will eventually hit home with the force of a sledgehammer.

Disruption in the New World of Networking

This is one of the most exciting times to be working in networking. New technologies and fresh takes on existing problems are keeping everyone on their toes when it comes to learning new protocols and integration systems. VMworld 2013 served both as an announcement of VMware’s formal entry into the larger networking world as well as putting existing network vendors on notice. What follows is my take on some of these announcements. I’m sure that some aren’t going to like what I say. I’m even more sure a few will debate my points vehemently. All I ask is that you consider my position as we go forward.

Captain Over, Captain Under

VMware, through their Nicira acquisition and development, is now *the* vendor to go to when you want to build an overlay network. Their technology augments existing deployments to provide software features such as load balancing and policy deployment. In order to do this and ensure that these features are utilized, VMware uses VxLAN tunnels between the devices. VMware calls these constructs “virtual wires”. I’m going to call them vWires, since they’ll likely be called that soon anyway. vWires are deployed between hosts to provide a pathway for communications. Think of it like a GRE tunnel or a VPN tunnel between the hosts. This means the traffic rides on the existing physical network but that network has no real visibility into the payload of the transit packets.
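To make the encapsulation concrete, here’s a minimal sketch of the 8-byte VXLAN header defined in RFC 7348. The underlay only ever sees the outer UDP packet; the VXLAN Network Identifier (VNI) inside is what tells the endpoints which virtual wire a payload belongs to:

```python
import struct

def vxlan_header(vni):
    """Flags byte 0x08 sets the 'valid VNI' bit; the 24-bit VNI sits in the
    upper bits of the second 32-bit word, with the reserved bits zeroed."""
    return struct.pack("!II", 0x08 << 24, (vni & 0xFFFFFF) << 8)

header = vxlan_header(vni=5001)
print(header.hex())  # 0800000000138900: flags, reserved, VNI 5001 (0x001389), reserved
```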

Nicira’s brainchild, NSX, has the ability to function as a layer 2 switch and a layer 3 router as well as a load balancer and a firewall. VMware is integrating many existing technologies with NSX to provide consistency when provisioning and deploying a new software-based network. For those devices that can’t be virtualized, VMware is working with HP, Brocade, and Arista to provide NSX agents that can decapsulate the traffic and send it to a physical endpoint that can’t participate in NSX (yet). As of the launch during the keynote, most major networking vendors are participating with NSX. There’s one major exception, but I’ll get to that in a minute.

NSX is a good product. VMware wouldn’t have released it otherwise. It is the vSwitch we’ve needed for a very long time. It also extends the ability of the virtualization/server admin to provision resources quickly. That’s where I’m having my issue with the messaging around NSX. During the second day keynote, the CTOs on stage said that the biggest impediment to application deployment is waiting on the network to be configured. Note that this is my paraphrasing of what I took their intent to be. In order to work around the lag in network provisioning, VMware has decided to build a VxLAN/GRE/STT tunnel between the endpoints and eliminate the network admin as a source of delay. NSX turns your network into a fabric for the endpoints connected to it.

Under the Bridge

I also have some issues with NSX and the way it’s supposed to work on existing networks. Network engineers have spent countless hours optimizing paths and reducing delay and jitter to provide applications and servers with the best possible network. Now, that all doesn’t matter. vAdmins just have to click a couple of times and build their vWire to the other server and all that work on the network is for naught. The underlay network exists to provide VxLAN transport. NSX assumes that everything working beneath is running optimally. No loops, no blocked links. NSX doesn’t even participate in spanning tree. Why should it? After all, that vWire ensures that all the traffic ends up in the right location, right? People would never bridge the networking cards on a host server. Like building a VPN server, for instance. All of the things that network admins and engineers think about in regards to keeping the network from blowing up due to excess traffic are handwaved away in the presentations I’ve seen.

The reference architecture for NSX looks pretty. Prettier than any real network I’ve ever seen. I’m afraid that suboptimal networks are going to impact application and server performance now more than ever. And instead of the network using mechanisms like QoS to battle issues, those packets are now invisible bulk traffic. When network folks have no visibility into the content of the network, they can’t help when performance suffers. Who do you think is going to get blamed when that goes on? Right now, it’s the network’s fault when things don’t run right. Do you think that moving the onus for server network provisioning to NSX and vCenter is going to forgive the network people when things go south? Or are the underlay engineers going to take the brunt of the yelling because they are the only ones that still understand the black magic outside the GUI drag-and-drop used to create vWires?

NSX is for service enablement. It allows people to build network components without knowing the CLI. It also means that network admins are going to have to work twice as hard to build resilient networks that work at high speed. I’m hoping that means that TRILL-based fabrics are going to take off. Why use spanning tree now? Your application and service network sure isn’t. No sense adding any more bells and whistles to your switches. It’s better to just tie them into spine-and-leaf Clos fabrics and be done with it. It now becomes much more important to concentrate on the user experience. Or maybe the wireless network. As long as at least one link exists between your ESX box and the edge switch, let the new software networking guys worry about it.

The Recumbent Incumbent?

Cisco is the only major networking manufacturer not publicly on board with NSX right now. Their CTO Padmasree Warrior has released a response to NSX that talks about lock-in and vertical integration. Still others have released responses to that response. There’s a lot of talk right now about the war brewing between Cisco and VMware and what that means for VCE. One thing is for sure – the landscape has changed. I’m not sure how this is going to fall out on both sides. Cisco isn’t likely to stop selling switches any time soon. NSX still works just fine with Cisco as an underlay. VCE is still going to make a whole bunch of money selling vBlocks in the next few months. Where this becomes a friction point is in the future.

Cisco has been building APIs into their software for the last year. They want to be able to use those APIs to directly program the network through devices like the forthcoming OpenDaylight controller. Will they allow NSX to program them as well? I’m sure they would – if VMware wrote those instructions into NSX. Will VMware demand that Cisco use the NSX-approved APIs and agents to expose network functionality to their software network? They could. Will Cisco scrap OnePK to implement NSX? I doubt that very much. We’re left with a standoff. Cisco wants VMware to use their tools to program Cisco networks. VMware wants Cisco to use the same tools as everyone else and make the network a commodity compared to the way it is now.

Let’s think about that last part for a moment. Aside from some speed differences, networks are largely going to look identical to NSX. It won’t care if you’re running HP, Brocade, or Cisco. Transport is transport. Someone down the road may build some proprietary features into their hardware to make NSX run better, but that day is far off. What if a manufacturer builds a switch that is twice as fast as the nearest competition? Three times? Ten times? At what point does the underlay become so important that the overlay starts preferring it exclusively?


Tom’s Take

I said a lot during the Tuesday keynote at VMworld. Some of it was rather snarky. I asked about full BGP tables and vMotioning the machines onto the new NSX network. I asked because I tend to obsess over details. Forgotten details have broken more of my networks than grand design disasters. We tend to fuss over the big things. We make more out of someone that can drive a golf ball hundreds of yards than we do about the one that can consistently sink a ten foot putt. I know that a lot of folks were pre-briefed on NSX. I wasn’t, so I’m playing catch up right now. I need to see it work in production to understand what value it brings to me. One thing is for sure – VMware needs to change the messaging around NSX to be less antagonistic towards network folks. Bring us into your solution. Let us use our years of experience to help rather than making us seem like pariahs responsible for all your application woes. Let us help you help everyone.

Why An iPhone Fingerprint Scanner Makes Sense


It’s hype season again for the Cupertino Fruit and Phone Company. We are mere days away from a press conference that should reveal the specs of a new iPhone, likely to be named the iPhone 5S. As is customary before these events, the public is treated to all manner of Wild Mass Guessing as to what will be contained in the device. Will it have dual flashes? Will it have a slow-motion camera? NFC? 802.11ac? The list goes on and on. One of the most spectacular rumors comes in a package the size of your thumb.

Apple quietly bought a company called AuthenTec last year. AuthenTec made fingerprint scanners for a variety of companies, including those that included the technology in some Android devices. After the $365 million acquisition, AuthenTec disappeared into a black hole. No one (including Apple) said much of anything about them. Then a few weeks ago, a patent application was revealed that came from Apple and included fingerprint technology from AuthenTec. This sent the rumor mill into overdrive. Now all signs point to a convex sapphire home button that contains a fingerprint scanner that will allow iPhones to use biometrics for security. A developer even managed to ferret out a link to a BiometricKitUI bundle in one of the iOS 7 beta releases (which was quickly removed in the next beta).

Giving Security The Finger

I think adding a fingerprint scanner to the hardware of an iDevice is an awesome idea. Passcode locks are good for a certain amount of basic device security, but the usefulness of a passcode is inversely proportional to its security level. People don’t make complex passcodes because they take far too long to type in. If you make a complex alphanumeric code, typing the code in quickly one-handed isn’t easy. That leaves most people choosing to use a 4-digit code or forgoing it altogether. That doesn’t bode well for people whose phones are lost or stolen.
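The math behind that trade-off is stark. A quick back-of-the-envelope comparison:

```python
# Keyspace of the two common choices; the numbers explain the behavior.
four_digit = 10 ** 4      # digits only, 4 characters
eight_alnum = 62 ** 8     # upper + lower + digits, 8 characters

print(f"4-digit PIN:         {four_digit:,} combinations")
print(f"8-char alphanumeric: {eight_alnum:,} combinations")
# 10,000 versus roughly 218 trillion, but only one of them can be typed
# one-handed in about a second. Convenience wins; security loses.
```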

Apple has already publicly revealed that it will include enhanced security in iOS 7 in the form of an activation lock that prevents a thief from erasing the phone and reactivating it for themselves.  This makes sense in that Apple wants to discourage thieves.  But that step only makes sense if you consider that Apple wants to beef up the device security as well.  Biometric fingerprint scanners are a quick method of inputting a unique unlock code quickly.  Enabling this technology on a new phone should show a sharp increase in the number of users that have enabled an unlock code (or finger, in this case).

Not all people think fingerprint scanners are a good idea. A link from Angelbeat says that Apple should forget about the finger and instead use a combination of picture and voice to unlock the phone. The writer says that this would provide more security because it requires your face as well as your voice. The writer also says that it’s more convenient than taking a glove off to use a finger in cold weather. I happen to disagree on a couple of points.

A Face For Radio

Facial recognition unlock for phones isn’t new.  It’s been in Android since the release of Ice Cream Sandwich.  It’s also very easy to defeat.  This article from last year talks about how flaky the system is unless you provide it several pictures to reference from many different angles.  This video shows how snapping a picture on a different phone can easily fool the facial recognition.  And that’s only the first video of several that I found on a cursory search for “Android Facial Recognition”.  I could see this working against the user if the phone is stolen by someone that knows their target.  Especially if there is a large repository of face pictures online somewhere.  Perhaps in a “book” of “faces”.

Another issue I have is Siri. As far as I know, Siri can’t be trained to recognize a user’s voice. In fact, I don’t believe Siri can distinguish one user from another at all. To prove my point, go pick up a friend’s phone and ask Siri to find something. Odds are good Siri will comply even though you aren’t the phone’s owner. In order to defeat the old, unreliable voice command systems that have been around forever, Apple made Siri able to recognize a wide variety of voices and accents. In order to cover that wide use case, Apple had to sacrifice resolution of a specific voice. Apple would have to build in a completely new set of Siri APIs that query a user to speak a specific set of phrases in order to build a custom unlock code. Based on my experience with those kinds of old systems, if you didn’t utter the phrase exactly the way it was originally recorded it would fail spectacularly. What happens if you have a cold? Or there is background noise? Not exactly easier than putting your thumb on a sensor.

Don’t think that means that fingerprints are infallible.  The Mythbusters managed to defeat an unbeatable fingerprint scanner in one episode.  Of course, they had access to things like ballistics gel, which isn’t something you can pick up at the corner store.  Biometrics are only as good as the sensors that power them.  They also serve as a deterrent, not a complete barrier.  Lifting someone’s fingerprints isn’t easy and neither is scanning them into a computer to produce a sharp enough image to fool the average scanner.  The idea is that a stolen phone with a biometric lock will simply be discarded and a different, more vulnerable phone would be exploited instead.


Tom’s Take

I hope that Apple includes a fingerprint scanner in the new iPhone. I hope it has enough accuracy and resolution to make biometric access easy and simple. That kind of implementation across so many devices will drive the access control industry to take a new look at biometrics and begin integrating them into more products. Hopefully that will spur things like home door locks, vehicle locks, and other personal devices to begin using these same kinds of sensors to increase security. Fingerprints aren’t perfect by any stretch, but they are the best option of the current generation of technology. One day we may reach the stage of retinal scanners or brainwave pattern matches for security locks. For now, a fingerprint scanner on my phone will get a “thumbs up” from me.

SDN and NFV – The Ups and Downs


I was pondering the dichotomy between Software Defined Networking (SDN) and Network Function Virtualization (NFV) the other day. I’ve heard a lot of vendors and bloggers talking about how one inevitably leads to the other. I’ve also seen a lot of folks saying that the two couldn’t be further apart on the scale of software networking. The more I thought about these topics, the more I realized they are two sides of the same coin. The problem, at least in my mind, is the perspective.

SDN – Planning The Paradigm

Software Defined Networking telegraphs everything about what it is trying to accomplish right there in the name.  Specifically, the “Definition” part of the phrase.  I’ve made jokes in the past about the lack of definition in SDN as vendors try to adapt their solutions to fit the buzzword mold.  What I finally came to realize is that the SDN folks are all about definition. SDN is the Top Down approach to planning.  SDN seeks to decompose the network into subsystems that can be replaced or reprogrammed to suit the needs of those things which utilize the network.

As an example, SDN breaks the idea of a switch down into things like “forwarding plane” and “control plane” and seeks to replace the control plane with alternative software, whether it be a controller-based architecture like OpenFlow or an overlay network similar to that of VMware/Nicira. We can replace the OS of a switch with a concept like OpenFlow easily. It’s just a mechanism for determining which entries are populated in the Content Addressable Memory (CAM) tables of the forwarding plane. In top down design, it’s easy to create a stub entry or “black box” to hold information that flows into it. We don’t particularly care how the black box works from the top of the design, just that it does its job when called upon.
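Here’s a toy model of that decomposition. It is purely illustrative (not the actual OpenFlow protocol): a forwarding plane that only knows its table, and a swappable controller that fills in the table on a miss:

```python
class ForwardingPlane:
    def __init__(self):
        self.flow_table = {}                    # stands in for CAM/TCAM entries

    def forward(self, packet, controller):
        match = (packet["dst_mac"],)
        if match not in self.flow_table:
            controller.packet_in(self, packet)  # table miss: punt to the control plane
        return self.flow_table.get(match, "drop")

class Controller:
    """The replaceable 'black box': decides policy and installs flow entries."""
    def packet_in(self, switch, packet):
        port = "port2" if packet["dst_mac"].startswith("aa:") else "port3"
        switch.flow_table[(packet["dst_mac"],)] = port

fp, ctrl = ForwardingPlane(), Controller()
print(fp.forward({"dst_mac": "aa:bb:cc:dd:ee:ff"}, ctrl))  # port2 (installed on miss)
print(fp.forward({"dst_mac": "aa:bb:cc:dd:ee:ff"}, ctrl))  # port2 (table hit this time)
```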

Top Down designs tend to run into issues when those black boxes lack detail or are missing some critical functionality. What happens when OpenFlow isn’t capable of processing flows fast enough to keep the CAM table of a campus switch populated with entries? Is the switch going to fall back to process switching the packets? That could be a big issue. Top Down designs are usually very academic and elegant. They also have a tendency to lack concrete examples and real world experience. When you think about it, that does speak a lot about the early days of SDN – lots of definition of terminology and technology, but a severe lack of actual packet forwarding.

NFV – Working From The Ground Up

Network Function Virtualization takes a very different approach to the idea of turning hardware networks into software networks.  The driving principle behind NFV is replication of existing technology in a software state.  This is classic Bottom Up design.  Rather than spending a large amount of time planning and assembling the perfect system, Bottom Up designers tend to build as they go.  They concentrate on making something work first, then making their things work together second.

NFV is great for hands-on folks because it gives concrete, real results almost immediately. Once you’ve converted a load balancer or a router to a purely software-based construct, you can see right away how it works and what the limitations might be. Does it consume too many resources on the hypervisor? Does it excel at forwarding small packets? Does switching a large packet locally cause a fault? These are problems that can be corrected in the individual system rapidly rather than waiting to modify the overall plan to account for difficulties in the virtualization process.
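As an example of how immediate that feedback is, here is about the smallest possible network function in software, a round-robin load balancer. Real NFV appliances are far more involved, but the point stands: it runs, and its limits can be probed right away:

```python
import itertools

class RoundRobinLB:
    """A load balancer reduced to its essence: rotate through the backends."""
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)

lb = RoundRobinLB(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
for _ in range(4):
    print(lb.pick())  # 10.0.0.1, 10.0.0.2, 10.0.0.3, then wraps to 10.0.0.1
```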

Bottom Up design does suffer from some issues as well. The focus in Bottom Up is on getting things done on a case-by-case basis. What do you do when you’ve converted all your hardware to software? Do your NFV systems need to talk to one another? That’s usually where Bottom Up design starts breaking down. Without a grand plan at a higher level to ensure that systems can talk to each other, this design methodology falls back to a series of “hacks” to get them connected. Units developed in isolation aren’t required to play nice with everyone else until they are forced to do so. That leads to increasingly complex and fragile interconnection systems that could fail spectacularly should the wrong thread be yanked with sufficient force.


Tom’s Take

Which method is better?  Should we spend all our time planning the system and hope that our Powerpoint Designs work the right way when someone codes them in a few months?  Or should we say “damn the torpedoes” and start building things left and right and hope that someone will figure out a way to tie all these individual pieces together at some point?

Surprisingly, the most successful design requires elements of both.  People need to have a basic plan at the least when starting out on a plan to change the networking world.  Once the ideas are sketched out, you need a team of folks willing to burn the midnight oil and get the ideas implemented in real life to ensure that the plan works the right way.  The guidance from the top is essential to making sure everything works together in the end.

Whether you are leading from the top or the bottom, remember that everything has to meet in the middle sooner or later.

Layoffs – Blessing or Curse?


On August 15, Cisco announced that it would be laying off about 4,000 workers across various parts of the organization. The timing of the announcement comes after the end of Cisco’s fiscal year. Most of the time when Cisco has announced large layoffs of this sort, it has come in the middle of August, after they analyze the previous year’s performance. Reducing their workforce by 5% isn’t inconsequential by any means. For the individual employee, a layoff means belt tightening and resume updating. It’s never a good thing. But looking at the layoffs in a bigger frame, I think this reduction in force will have some good benefits on both sides.

Good For The Goose

If the headline had instead read “Cisco Removes Two Product Lines No One Uses Anymore” I think folks would have cheered. Cisco is forever being chastised that it needs to focus on the core networking strategy and stop looking at all these additional market adjacencies. Cisco made 13 acquisitions in the last twelve months. Some of them were rather large, like Meraki and Sourcefire. Consolidating development and bringing that talent on board almost certainly would have required that some other talent be removed. Suppose that the layoffs really did only come from product lines that had been removed, such as the Application Control Engine (ACE). Is it bad that Cisco is essentially pruning away unneeded product lines? With the storm of software defined networking on the horizon, I think a slimmer, more focused Cisco is going to come out better in the long run. Given that the powers that be at Cisco are actively trying to transform into a software company, I’d bet that this round of layoffs likely serves to refocus the company towards that end.

Good For The Gander

Cisco’s loss is the greater industry’s gain. You’ve got about 4,000 very bright people looking for work in the industry now. Startups and other networking companies should be snapping those folks up as soon as they come onto the market. I’m sure there’s a hotshot startup out there yelling at their screen as I type this about how they don’t want to hire some washed-up traditional network developer and their hidebound thinking. You know what those old fuddy duddies bring to your environment? Experience. They’ve made a ton of mistakes and learned from all of them. Odds are good they won’t be making the same ones in your startup. They also bring a calm voice of reason that tells you not to ship this bug-ridden batch of code and instead tell the venture capital mouthpieces to shove it for a week while you keep this API from deleting all the data in the payroll system when you query it from Internet Explorer. But, if you don’t want that kind of person keeping you from shooting yourself in the foot with a Howitzer, then you don’t really care who is being laid off this week. Unless it just happens to be you.


Tom’s Take

Layoffs suck. Having been a part of a couple in my life, I can honestly say that the uncertainty and doubt of not having a job tomorrow weighs heavily on the mind. The bright side is that you have an opportunity to go out and make an impact in the world that you might not have otherwise had if you had stayed in your old position. Likewise, if a company is laying off folks for the right reasons, then things should work out well for them. If the layoffs serve to refocus the business or change a line of thinking, that is an acceptable loss. If it’s just a cost cutting measure to serve the company up on a silver platter for acquisition, or a shameless attempt to boost the bottom line and get a yearly bonus, that’s not the right way to do things. Companies and talent are never immediately better off when layoffs occur. In the end you have to hope that it all works out for everyone.