What Can You Learn From Facebook’s Meltdown?

I wanted to wait before putting out a hot take on the Facebook issues from earlier this week because failures of this magnitude always have details that come out well after the actual excitement is done. A company like Facebook isn’t going to publish the kind of in-depth post-mortem that we might like to see, but the information coming out from other areas does point to some interesting circumstances behind this situation.

Let me start off the whole thing by reiterating something important: Your network looks absolutely nothing like Facebook. The scale of what goes on there is unimaginable to the normal person. The average person has no conception of what one billion looks like. Likewise, the scale of the networking that goes on at Facebook is beyond the ken of most networking professionals. I’m not saying this to make your network feel inferior. More that I’m trying to help you understand that your network operations resemble those at Facebook in the same way that a model airplane resembles a space shuttle. They’re alike on the surface only.

Facebook has unique challenges that they have to face in their own way. Network automation there isn’t a bonus. It’s a necessity. The way they deploy changes and analyze results doesn’t look anything like any software we’ve ever used. I remember moderating a panel that had a Facebook networking person talking about some of the challenges they faced all the way back in 2013:

That technology that Najam Ahmad is talking about is two or three generations removed from what is being used today. They don’t manage switches. They manage racks and rows. They don’t buy off-the-shelf software to do things. They write their own tools to scale the way they need them to scale. It’s not unlike a blacksmith making a tool for a very specific use case that would never be useful to any non-blacksmith.

Ludicrous Case Scenarios

One of the things that compounded the problems at Facebook was the inability to see what the worst case scenario could bring. The little clever things that Facebook has done to make their lives easier and improve reaction times ended up harming them in the end. I’ve talked before about how Facebook writes things from a standpoint of unlimited resources. They build their data centers as if the network will always be available and bandwidth is an unlimited resource that never has contention. The average Facebook programmer likely never lived in a world where a dial-up modem was high-speed Internet connectivity.

To that end, the way they build the rest of their architecture around those presumptions creates the possibility of absurd failure conditions. Take the report of the door entry system. According to reports, part of the reason why things were slow to come back up was that the door entry system for the Facebook data centers wouldn’t allow access to the people who knew how to revert the changes that caused the issue. Usually, card readers will retain their last good configuration in the event of a power outage to ensure that people with badges can still get in. It could be that the ones at Facebook work differently or just went down with the rest of the network. But whatever the case, the card readers weren’t letting people into the data center. Another report says the doors couldn’t even be opened with a key. That’s the kind of planning you do when you’ve never had to break open a locked door.

Likewise, I find the situation with the DNS servers to be equally crazy. Per other reports, the DNS servers at Facebook constantly monitor connectivity to the internal network. If that connectivity goes down for some reason, the DNS servers withdraw the BGP routes being advertised for the Facebook AS until the issue is resolved. That withdrawal is what caused the outage from the perspective of the outside world. Why would you do this? Sure, it’s clever to have your infrastructure pull its own routing information when you’re offline so that users aren’t hammering your systems with massive amounts of retries. But why put that decision in the hands of your DNS servers? Why not have some other, more reliable system do it instead?
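To make that mechanism a little more concrete, here’s a minimal sketch of the logic being described. This is plain Python with placeholder functions, not Facebook’s actual tooling, and the prefix and threshold are made up for illustration:

```python
import random
import time

# A rough sketch of the "DNS host withdraws its own BGP routes" behavior described
# above. This is NOT Facebook's code; the probe and BGP calls are placeholders.

ANYCAST_PREFIXES = ["192.0.2.0/24"]   # documentation prefix standing in for the real ones
FAIL_THRESHOLD = 3                    # consecutive failed probes before pulling routes

def backbone_is_healthy():
    # Placeholder probe. A real check would test reachability of internal
    # services across the backbone, not roll dice.
    return random.random() > 0.01

def announce(prefixes):
    print("announce", prefixes)       # stand-in for telling the local BGP speaker to advertise

def withdraw(prefixes):
    print("withdraw", prefixes)       # stand-in for telling the local BGP speaker to withdraw

def health_loop():
    failures = 0
    advertised = True
    while True:
        if backbone_is_healthy():
            failures = 0
            if not advertised:
                announce(ANYCAST_PREFIXES)   # backbone is back, so resume advertising
                advertised = True
        else:
            failures += 1
            if failures >= FAIL_THRESHOLD and advertised:
                withdraw(ANYCAST_PREFIXES)   # the step that takes you off the Internet
                advertised = False
        time.sleep(5)

if __name__ == "__main__":
    health_loop()
```

The point of the sketch is the coupling: the same box that answers DNS queries is also deciding whether the rest of the world can reach the AS at all.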

I get that the mantra at Facebook has always been “fail fast” and that their architecture is built in such a way as to encourage individual systems to go down independently of others. That’s why Messenger can be down but the feed stays up, or why WhatsApp can have issues but you can still use Instagram. However, why was there no test of “what happens when it all goes down?” It could be that the idea of the entire network going offline is unthinkable to the average engineer. It could also be that the response to the whole network going down all at once was to just shut everything down anyway. But what about the plan for getting back online? Or, worse yet, what about all the things that impacted the ability to get back online?

Fruits of the Poisoned Tree

That’s where the other part of my rant comes into play. It’s not enough that Facebook didn’t think ahead to plan on a failure of this magnitude. It’s also that their teams didn’t think of what would be impacted when it happened. The door entry system. The remote tools used to maintain the networking equipment. The ability for anyone inside the building to do anything. There was no plan for what could happen when every system went down all at once. Whether that was because no one knew how interdependent those services were or because no one could think of a time when everything would go down all at once is immaterial. You need to plan for the worst and figure out what dependencies look like.

Amazon learned this the hard way a few years ago when US-East-1 went offline. No one believed it at the time because the status dashboard still showed green lights. The problem? The dashboard was hosted in the region that went down and the lights couldn’t change! That problem was remedied soon afterwards, but it was a chuckle-worthy issue for sure.

Perhaps it’s because I work in an area where disasters are a bit more common, but I’ve always tried to think ahead to where the issues could crop up and how to combat them. What if you lose power completely? What if your network connection is offline for an extended period? What if the same tornado that takes out your main data center also wipes out your backup tapes? It might seem a bit crazy to consider these things, but the alternative is not having an answer in the off chance it happens.

In the case of Facebook, the question should have been “what happens if a rogue configuration deployment takes us down?” The answer had better not be “roll it back,” because that isn’t thinking far enough ahead. At the scale of their systems, it isn’t hard for a single change to knock a huge chunk of the network offline quickly. Most of the controls that are put in place are designed to prevent that from happening, but you need to have a plan for what to do if it does. No one expects a disaster. But you still need to know what to do if one happens.
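As an illustration of the kind of guardrail I mean, here’s a hedged sketch of a deployment wrapper that verifies reachability after a change and rolls it back automatically if the verification fails. The helper functions and addresses are hypothetical stand-ins, not any vendor’s real automation:

```python
import subprocess
import time

# A sketch of "verify or roll back" change control, not a real deployment tool.
# apply_change() and rollback_change() are hypothetical hooks into whatever
# actually pushes configuration.

CRITICAL_HOSTS = ["192.0.2.1", "198.51.100.1"]   # stand-in addresses to verify against

def apply_change(change_id):
    print("applying", change_id)       # placeholder for the real deployment step

def rollback_change(change_id):
    print("rolling back", change_id)   # placeholder for the real rollback step

def still_reachable(host):
    # Linux ping: one packet, two-second timeout. The vantage point running this
    # check should not depend on the network being changed.
    return subprocess.call(["ping", "-c", "1", "-W", "2", host],
                           stdout=subprocess.DEVNULL) == 0

def deploy(change_id):
    apply_change(change_id)
    time.sleep(30)                     # let routing converge before judging the change
    if not all(still_reachable(h) for h in CRITICAL_HOSTS):
        rollback_change(change_id)     # automatic escape hatch, no humans required
        raise RuntimeError(change_id + " failed verification and was rolled back")
    print(change_id, "verified")

if __name__ == "__main__":
    deploy("change-0001")
```

The important part is that the verification and the rollback path can’t depend on the network you just changed, which is exactly the dependency that bit Facebook when the remote tools for managing their equipment went down with everything else.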

Thus Endeth The Lesson

What we need to take away from this is that our best intentions can’t defeat the unexpected. Most major providers resisted any schadenfreude over the situation because they know they could just as easily have been the ones suffering from it. You may not have a network like Facebook’s, but you can absolutely take away some lessons from this situation.

You need to have a plan. You need to have a printed copy of that plan. It needs to be stored in a place where people can find it. It needs to be written in a way that whoever finds it can implement it step-by-step. You need to keep it updated to reflect changes. You need to practice for disaster and quit assuming that everything will keep working correctly 100% of the time. And you need to have a backup plan for everything in your environment. What if the doors seal shut? What if the person with the keys to unlock the racks is missing? How do we ensure the systems don’t come back up in a degraded state before they’re ready? The list is endless, but that’s only because you haven’t started writing it yet.


Tom’s Take

There is going to be a ton of digital ink spilled on this outage. People are going to ask questions that don’t have answers and pontificate about how it could have been avoided. Hell, I’m doing it right now. However, I think the issues that compounded the problems are ones that can be addressed no matter what technology you’re using. Backup plans are important for everything you do, from campouts to dishwasher installations to social media websites. You need to plan for the worst and make sure that the people you work with know where to find the answers when everything fails. This is the best kind of learning experience because so many eyes are on it. Take what you can from this and apply it where needed in your enterprise. Your network may not look anything like Facebook, but with some planning now you don’t have to worry about it crashing like theirs did either.

Facebook’s Mattress Problem with Privacy

If you haven’t had a chance to watch the latest episode of the Gestalt IT Rundown that I do with my co-workers every Wednesday, make sure you check this one out. Because it’s the end of the year, it’s customary to do all kinds of fun wrap-up stories. This episode focused on what we each thought was the biggest story of the year. For me, it was the way that Facebook completely trashed our privacy. And worse yet, I don’t see a way for this to get resolved any time soon. Because of the difference between assets and liabilities.

Contact The Asset

It’s no secret that Facebook knows a ton about us. We tell it all kinds of things every day we’re logged into the platform. We fill out our user profiles with all kinds of interesting details. We click Like buttons everywhere, including the one for the Gestalt IT Rundown. Facebook then keeps all the data somewhere.

But Facebook is collecting more data than that. They track where our mouse cursor is on the screen when we’re logged in on the desktop. They track the amount of time we spend with the mobile app open. They track information in the background. And they collect all of this secret data and store it somewhere as well.

This data allows them to build an amazingly accurate picture of who we are. And that kind of picture is extremely valuable to the right people. At first, I thought it might be the advertisers that crave this kind of data. Advertisers are the people that want to know exactly who is watching their programs. The more data they have about demographics the better they can tailor the message. We’ve already seen that with specific kinds of targeted posts on Facebook.

But the people that really salivate over this kind of data live in the shadows. They look at the data as a way to offer new kinds of services. Don’t just sell people things. Make them think differently. Change their opinions about products or ideas without them even realizing it. The really dark and twisted stuff. Like propaganda on a whole new scale. Enabled by the fact that we have all the data we could ever want on someone without even needing to steal it from them.

The problem with Facebook collecting all this data about us is that it’s an asset. It’s not too dissimilar from an older person keeping all their money under a mattress. We scoff at that person because a mattress is a terrible place to keep money. It’s not safe. And a bank will pay you to keep your money there, right?

On the flip side, depending on the age of that person, they may not believe that banks are safe. Before the FDIC, there was no guarantee your money would be repaid in a pinch. And if the bank went out of business, you couldn’t get your money back. For a person who lived through the Great Depression and had to endure bank holidays and the like, keeping your assets under a mattress feels way safer than giving them to someone else.

As an aside here, remember that banks don’t like leaving your money laying around either. If you deposit money in a bank, they take that money and invest it in other places. They put the money to work for them making money. The interest that you get paid for savings accounts and the like is just a small bonus to encourage you to keep your money in the bank and not to pull it out. That’s why they even have big disclaimers saying that your money may not be available to withdraw at a moment’s notice. Because if you do decide to get all of your money out of the bank at once, they need to go find the money to give you.

Now, let’s examine our data. Or, at least, the data that Facebook has been storing on us. How do you think Facebook looks at that data? Do you believe they want to keep it under the mattress where it’s safe from the outside world? Do you think that Facebook wants to keep all this information locked in a vault somewhere where no one can get to it?

Or perhaps Facebook looks at your data as an asset like a bank does. Instead of keeping it around and letting it sit fallow they’d rather put it to work. That’s the nature of a valuable asset. To the average person, their privacy is one of the most important parts of their lives. To Facebook, your privacy is simply an asset. It can either sit by itself and make them nothing. Or it can be put to use by Facebook or third-party companies to make more money from the things that they can do with good data sources. To believe that a company like Facebook has your best interests at heart when it comes to privacy is not a good bet to make.

Would I Lie-ability To You?

In fact, the only thing that can make Facebook really sit up and pay attention is if that asset they have farmed out and working for them were to suddenly become a liability for some reason. Liabilities are a problem for companies because they are the exact opposite of making money. They cost money. Just as the grandmother in the above example sees an insolvent bank as a liability, so too would someone see a bad asset as a possible exposure.

Liabilities are a problem. Anything that can be an exposure is an issue for a company, especially one with investors that like to get dividends. Any reduction in profit equals a loss. Liabilities on a balance sheet are giant red flags for anyone taking a close look at the operations of a business.

Turning Facebook’s data assets into a liability is the only way to make them sit up and realize that what they’re doing is wrong. Selling access to our data to anyone that wants it is a horrible idea. But it won’t stop until there is some way to make them pay through the nose for screwing up. Up until this year, that was a long shot at best. Most fines were in the thousands of dollars, whereas most companies would pay millions for access to data. A carefully crafted statement admitting no fault after the exposure was uncovered means that Facebook and the offending company get away without a black mark and get to pocket all their gains.

The European GDPR law is a great step in the right direction. It clearly spells out what has to happen to keep a person’s data safe. That eliminates wiggle room in the laws. It also puts a stiff fine in place to ensure that any violations can be compounded quickly to drain a company and turn data into a liability instead of an asset. There are moves in the US to introduce legislation similar to GDPR, either at the federal level or in individual states like California, the location of Facebook’s headquarters.

That’s not to say that these laws are going to get it right every time. There are people out there who live to find ways to turn liabilities back into assets. They want to find ways around the laws so that they can keep putting those assets to work even when the possibility of exposure is high. It’s one thing when that exposure is the money of the people that invested in them. It’s another thing entirely when it’s personally identifiable information (PII) or other protected information about people. We’re not imaginary money on a balance sheet. We live and breathe and exist long past the losses. And trying to get your life back on track after an exposure is not easy, for sure.


Tom’s Take

If I sound grumpy, it’s because I am tired of this mess. When I was researching my discussion for the Gestalt IT Rundown, I simply Googled “Facebook data breach 2018” looking for examples that weren’t Cambridge Analytica. The number was higher than it should have been. We cry about Target and Equifax and the many other exposures that have happened in the last five years, but we also punish those companies by not doing business with them or moving our information elsewhere. Facebook has everyone hooked. We share photos on Facebook. We RSVP to events on Facebook. And we talk to people on Facebook as much or more than we do on the phone. That kind of reach requires a company to be more careful about who has access to our data. And if the solution is building the world’s biggest mattress to keep it all safe, put me down for a set of box springs.


The Cargo Cult of Google Tools

You should definitely watch this amazing video from Ben Sigelman of LightStep that was recorded at Cloud Field Day 4. The good stuff comes right up front.

In less than five minutes, he takes apart some crazy notions that we have in the world today. I like the observation that you can’t build a single system that works across more than three or four orders of magnitude of scale. Yes, you really shouldn’t be using Hadoop for simple things. And Machine Learning is not a magic wand that fixes every problem.

However, my favorite thing was the quick mention of how emulating Google for the sake of using their tools for every solution is folly. Ben should know, because he is an ex-Googler. I think I can sum up this entire discussion in less than a minute of his talk here:

Google’s solutions were built for scale that basically doesn’t exist outside of maybe a handful of companies with trillion-dollar valuations. It’s foolish to assume that their solutions are better. They’re just more scalable. But they are actually very feature-poor. There’s a tradeoff there. We should not be imitating what Google did without thinking about why they did it. Sometimes the “whys” will apply to us, sometimes they won’t.

Gee, where have I heard something like this before? Oh yeah. How about this post. Or maybe this one on OCP. If I had a microphone I would have handed it to Ben so he could drop it.

Building a Laser Mousetrap

We’ve reached the point in networking and other IT disciplines where we have built cargo cults around Facebook and Google. We practically worship every tool they release into the wild and try to emulate that style in our own networks. And it’s not just the tools we use, either. We also keep trying to emulate the service provider style of Facebook and Google where they treated their primary users and consumers of services like your ISP treats you. That architectural style is being lauded by so many analysts and forward-thinking firms that you’re probably sick of hearing about it.

Guess what? You are not Google. Or Facebook. Or LinkedIn. You are not solving massive problems at the scale that they are solving them. Your 50-person office does not need Cassandra or Hadoop or TensorFlow. Why?

  • Google Has Massive Scale – Ben mentioned it in the video above. The published scale of Google is massive, and even that’s probably on the low side. The real numbers could be an order of magnitude higher than we realize. When you have to start quoting throughput in units of “Libraries of Congress” to make sense to normal people, you’re in a class by yourself.
  • Google Builds Solutions For Their Problems – It’s all well and good that Google has built a ton of tools to solve their issues. It’s even nice of them to have shared those tools with the community through open source. But realistically speaking, when are you ever going to need Cassandra for anything but the most complicated and complex database issues? It’s like a guy who goes out and buys a pneumatic impact wrench to fix the training wheels on his daughter’s bike. Sure, it will get the job done. But it’s going to be way overpowered and cause more problems than it solves.
  • Google’s Tools Don’t Solve Your Problems – This is the crux of Ben’s argument above. Google’s tools aren’t designed to solve a small flow issue in an SME network. They’re designed to keep the lights on in an organization that maps the world and provides video content to billions of people. Google tools are purpose built. And they aren’t flexible outside that purpose. They are built to be scalable, not flexible.

Down To Earth

Since Google’s scale numbers are hard to comprehend, let’s look at a better example from days gone by: the Cisco Aironet-to-LWAPP Upgrade Tool.

I used this a lot back in the day to upgrade autonomous APs to LWAPP controller-based APs. It was a very simple tool. It did exactly what it said in the title. And it didn’t do much more than that. You fed it an image and pointed it at an AP and it did the rest. There was some magic on the backend to remove and install certificates and do the other necessary things to pave the way for the upgrade, but it was essentially a batch TFTP server.

It was simple. It didn’t check that you had the right image for the AP. It didn’t throw out good error codes when you blew something up. It only ran on a maximum of 5 APs at a time. And you had to close the tool every three or four uses because it had a memory leak! But it was still a better choice than trying to upgrade those APs by hand through the CLI.
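For flavor, here’s a hedged sketch of what that kind of narrowly scoped tool boils down to. It’s written in Python rather than whatever Cisco actually used, and push_image() is a hypothetical stand-in for the TFTP transfer and certificate plumbing the real tool handled:

```python
from concurrent.futures import ThreadPoolExecutor

# A sketch in the spirit of the LWAPP upgrade utility: batch work, hard limits,
# no frills. This is not the actual Cisco tool.

MAX_CONCURRENT = 5   # the original tool's hard cap on simultaneous upgrades

def push_image(ap_address, image_path):
    # Hypothetical placeholder for the real work: TFTP the image to the AP,
    # swap the certificates, and reboot it into controller-based mode.
    print("upgrading", ap_address, "with", image_path)
    return True

def upgrade_batch(ap_addresses, image_path):
    results = {}
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        futures = {pool.submit(push_image, ap, image_path): ap for ap in ap_addresses}
        for future, ap in futures.items():
            results[ap] = future.result()   # pass/fail per AP, nothing fancier
    return results

if __name__ == "__main__":
    aps = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]
    print(upgrade_batch(aps, "lwapp-recovery-image.tar"))
```

That’s the whole design philosophy in a couple dozen lines: one job, done adequately, and nothing more.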

This tool is over ten years old at this point and is still available for download on Cisco’s site. Why? Because you may still need it. It doesn’t scale to 1,000 APs. It doesn’t give you any functionality other than upgrading 5 Aironet APs at a time to LWAPP (or CAPWAP) images. That’s it. That’s the purpose of the tool. And it’s still useful.

Tools like this aren’t built to be the ultimate solution to every problem. They don’t try to pack in every possible feature to be a “single pane of glass” problem solver. Instead, they focus on one problem and solve it better than anything else. Now, imagine that tool running at a scale your mind can’t comprehend. And now you’ll know why Google builds their tools the way they do.


Tom’s Take

I have a constant discussion on Twitter about the phrase “begs the question”. Begging the question is a logical fallacy; almost every time, the speaker really means “raises the question”. Likewise, every time you think you need to use a Google tool to solve a problem, you’re almost always wrong. You’re not operating at the scale necessary to need that solution. Instead, the majority of people looking to implement Google solutions in their networks are like people who chrome everything on their car. They’re looking to show off instead of getting things done. It’s time to retire the Google Cargo Cult and instead ask ourselves what problems we’re really trying to solve, as Ben Sigelman mentions above. I think we’ll end up much happier in the long run and find our work lives much less complicated.

Facebook Wedge 100 – The Future of the Data Center?



Facebook is back in the news again. This time, it’s because of the release of their new Wedge 100 switch into the Open Compute Project (OCP). Wedge was already making headlines when Facebook announced it two years ago. A fast, open-sourced 40Gig Top-of-Rack (ToR) switch was huge. Now, Facebook is letting everyone in on the fun with a faster Wedge that has been deployed into production in Facebook data centers and is also being offered for sale through Edgecore Networks, which is itself a division of Accton. Accton has been leading the way in the whitebox switching market, and Wedge 100 may be one of the ways it climbs to the top.

Holy Hardware!

Wedge 100 is pretty impressive from the spec sheet. They paid special attention to making sure the modules were expandable, especially for faster CPUs and special purpose devices down the road. That’s possible because Wedge is a highly specialized micro server already. Rather than rearchitecting the guts of the whole thing, Facebook kept the CPU and the monitoring stack and just put newer, faster modules on it to ramp to 32x100Gig connectivity.


As suspected, Facebook is using the Broadcom Tomahawk as the base silicon in their switch, which isn’t surprising. Tomahawk is the roadmap for all vendors to get to 100Gig. It also means that the downlink connectivity for these switches could conceivably work in 25/50Gig increments. However, given the enormous amount of east/west traffic that Facebook must generate, Facebook has created a server platform they call Yosemite that has 100Gig links as well. Given the probable backplane there, you can imagine the data that’s getting thrown around those data centers.

That’s not all. Omar Baldonado has said that they are looking at going to 400Gig connectivity soon. That’s the kind of mind-blowing speed that you see in places like Google and Facebook. Remember that this hardware is built for a specific purpose. They don’t just have elephant flows. They have flows the size of an elephant herd. That’s why they fret about the operating temperature of optics or the rack design they want to use (standard versus Open Racks). Because every little change matters a thousandfold at that scale.

Software For The People

The other exciting announcement from Facebook was on the software front. Of course, FBOSS has been updated to work with Wedge 100. I found it very interesting that, per the press release, much of the programming in FBOSS went into interoperability with Wedge 40 and into fixing the hardware side of things. This makes some sense when you realize that Facebook didn’t need to spend a lot of time making Wedge 40 interoperate with anything, since it was a wholesale replacement. But Wedge 100 will need to coexist with Wedge 40 as the rollout happens, so making everything play nice is a huge point on the checklist.

The other software announcement that got the community talking was support for third-party operating systems running on Wedge 100. The first one up was Open Network Linux from Big Switch Networks. ONL ran on the original Wedge 40 and now runs on the Wedge 100. This means that if you’re familiar with running BSN OSes on your devices, you can drop a Wedge 100 into your spine or fabric and be ready to go.

The second exciting announcement about software comes from a new company, Apstra. Apstra announced their entry into OCP and their intent to get their Apstra Operating System (AOS) running on Wedge 100 by next year. That has a big potential impact for Apstra customers that want to deploy these switches down the road. I hope to hear more about this from Apstra during their presentation at Networking Field Day 13 next month.


Tom’s Take

Facebook is blazing a trail for fast ToR switches. They’ve got the technical chops to build what they need and release the designs to the rest of the world to be used for a variety of ideas. Granted, your data center looks nothing like Facebook. But the ideas they are pioneering are having an impact down the line. If Open Rack catches on you may see different ideas in data center standardization. If the Six Pack catches on as a new chassis concept, it’s going to change spines as well.

If you want to get your hands dirty with Wedge, build a new 100Gig pod and buy one from Edgecore. The downlinks can break out into 10Gig and 25Gig links for servers and knowing it can run ONL or Apstra AOS (eventually) gives you some familiar ground to start from. If it runs as fast as they say it does, it may be a better investment right now than waiting for Tomahawk II to come to your favorite vendor.


Is The Blog Dead?

I couldn’t help but notice an article that kept getting tweeted about and linked all over the place last week.  It was a piece by Jason Kottke titled “R.I.P. The Blog, 1997-2013”.  It’s actually a bit of commentary on a longer piece he wrote for the Nieman Journalism Lab called “The Blog Is Dead, Long Live The Blog”.  Kottke talks about how people today are more likely to turn to the various social media channels to spread their message rather than the tried-and-true arena of the blog.

Kottke admits in both pieces that blogging isn’t going away.  He even admits that blogging is going to be his go-to written form for a long time to come.  But the fact that the article spread around like wildfire got me to thinking about why blogging is so important to me.  I didn’t start out as a blogger.  My foray into the greater online world first came through Facebook.  Later, as I decided to make it more professional, I turned to Twitter to interact with people.  Blogging wasn’t even the first thing on my mind.  As I started writing, though, I realized how important it is to the greater community.  The reason?  Blogging is thought without restriction.

Automatic Filtration

Social media is wonderful for interaction.  It allows you to talk to friends and followers around the world.  I’m still amazed when I have conversations in real time with Aussies and Belgians.  However, social media facilitates these conversations through an immense filtering system.  Sometimes, we aren’t aware of the filters and restrictions placed on our communications.

Twitter forces users to think in 140-ish characters.  Ideas must be small enough to digest and easily recirculate.  I’ve even caught myself cutting down on thoughts in order to hit the smaller target of being able to put “RT: @networkingnerd” at the beginning for tweet attribution.  Part of the reason I started a blog was because I had thoughts that were more than 140 characters long.  The words just flow for some ideas.  There’s no way I could really express myself if I had to make ten or more tweets to express what I was thinking on a subject.  Not to mention that most people on Twitter are conditioned to unfollow prolific tweeters when they start firing off tweet after tweet in rapid succession.

Facebook is better for longer discussion, but it’s worse in the filtering department.  The changes to their news feed algorithm this year weren’t the first time that Facebook has tweaked the way that users view their firehose of updates.  They believe in curating a given user’s feed to display what they think is relevant.  At best this smacks of arrogance.  Why does Facebook think they know what’s more important to me than I do?  Why must my Facebook app always default to Most Important rather than my preferred Most Recent?  Facebook has been searching for a way to monetize their product since even before their rocky IPO.  By offering advertisers a prime spot in a user’s news feed, they can guarantee that the ad will be viewed thanks to the heavy-handed way that they curate the feed.  As much reach as Facebook has, I can’t trust them to put my posts and articles where they belong for the people that want to read what I have to say.

Other social platforms suffer from artificial restrictions.  Pinterest is great for those that post pictures with captions or comments.  It’s not the best place for me to write long pieces, especially when they aren’t about a craft or a gift wish list.  Tumblr is more suited to blogging, but the comment system is geared toward sharing rather than constructive discussion.  Add in the fact that Tumblr is blocked in many enterprise networks due to questionable content, and you can see how much corporate policy can limit the reach of a single person.  I had to fight this battle in my old job more than once in order to read some very smart people that blogged on Tumblr.

Blogging for me is about unrestricted freedom to pour out my thoughts.  I don’t want to worry about who will see it or how it will be read.  I want people to digest my thoughts and words and have a reaction.  Whether they choose to share it via numerous social media channels or leave a comment makes no difference to me.  I like seeing people share what I’ve committed to virtual paper.  A blog gives me an avenue to write and write without worry.  Sometimes that means it’s just a few paragraphs about something humorous.  Other times it’s an activist rant about something I find abhorrent.  The key is that those thoughts can co-exist without fear of being pigeonholed or categorized by an algorithm or other artificial filter.


Tom’s Take

Sometimes, people make sensationalist posts to call attention to things.  I’ve done it before and will likely do it again in the future.  The key is to read what’s offered and make your own conclusion.  For some, that will be via retweeting or liking.  For others, it will be adding a +1 or a heart.  For me, it’s about collecting my thoughts and pouring them out via a well-worn keyboard on WordPress.  It’s about sharing everything rattling around in my head and offering up analysis and opinion for all to see.  That part isn’t going away any time soon, despite what others might say about blogging in general.  So long as we continue to express ourselves without restriction, the blog will never really die no matter how we choose to share it.

Why Facebook’s Open Compute Switches Don’t Matter. To You.

Facebook announced at Interop that they are soliciting ideas for building their own top-of-rack (ToR) switch via the Open Compute Project.  This sent the tech media into a frenzy.  People are talking about the end of the Cisco monopoly on switches.  Others claimed that the world would be a much different place now that switches are going to be built by non-vendors and open sourced to everyone.  I yawned and went back to my lunch.  Why?

BYO Networking Gear

As you browse the article that you’re reading about how Facebook is going to destroy the networking industry, do me a favor and take note of what kind of computer you’re using.  Is it a home-built desktop?  Is it something ordered from a vendor?  Is it a laptop or mobile device that you built?  Or bought?

The idea that Facebook is building switches isn’t far-fetched to me.  They’ve been doing their own servers for a while.  That’s because their environment looks wholly different from any other enterprise on the planet, with the exception of maybe Google (who also builds their own stuff).  Facebook has some very specialized needs when it comes to servers and networking.  As they mention at conferences, the amount of data rushing into their network on an hourly, let alone daily, basis is mind-boggling.  Shaving milliseconds off query times or reducing traffic by a few KB per flow translates into massive savings when you consider the scale they are operating at.

To that end, anything they can do to optimize their equipment to meet their needs is going to be a big deal.  They’ve got a significant motivation to ensure that the devices doing the heavy lifting for them are doing the best job they can.  That means they can invest a significant amount of capital into building their own network devices and still get a good return on the investment.  Much like the last time I built my own home desktop.  I didn’t find a single machine that met all of my needs and desires.  So I decided to cannibalize some parts out of an old machine and build the rest myself.  Sure, it took me about a month to buy all the parts, ship them to my house, and then assemble the whole package.  But in the end I was very happy with the design.  In fact, I still use it at home today.

That’s not to say that my design is the best for everyone, or anyone for that matter.  The decisions I made in building my own computer were ones that suited me.  In much the same way, Facebook’s ToR switches probably serve very different needs than those of existing data centers.  Are your ToR switches optimized for east-west traffic flow?  I don’t see a lot of data at Facebook directed to other internal devices.  I think Facebook is really pushing their systems for north-south flow.  Data requests coming in from users and going back out to them are more in line with what they’re doing.  If that’s the case, Facebook will have a switch optimized for really fast data flows.  Only they’ll be flowing in the wrong direction for the way most people use their data centers today.  It’s like having a Bugatti Veyron and living in a city with dirt roads.

Facebook admitted that there are things about networking vendors they don’t like.  They don’t want to be locked into a proprietary OS like IOS, EOS, or Junos.  They want a whitebox solution that will run any OS on the planet efficiently.  I think that’s because they don’t want to get locked into a specific hardware supplier either.  They want to buy what’s cheapest at the time and build large portions of their network rapidly as needed to embrace new technology and data flows.  You can’t get married to a single supplier in that case.  If you do, a hiccup in the production line or a delay could cost you thousands, if not millions.  Just look at how Apple ensures diversity in the iPhone supply chain to get an idea of what Facebook is trying to do.  If Apple were to lose a single part supplier there would be chaos in the supply chain.  In order to ensure that everything works like a well-oiled machine, they have multiple companies supplying each outsourced part.  I think Facebook is driving for something similar in their switch design.

One Throat To Choke

The other thing that gives me pause here is support.  I’ve long held that one of the reasons why people still buy computers from vendors or run Windows and OS X on their machines is that they don’t want the headache of fixing things.  A warranty or support contract is a very reassuring thing.  Knowing that you can pick up the phone and call someone to get a new power supply or to tell you why you’re getting a MAC flap error lets you sleep at night.  When you roll your own devices, the buck stops with you when something needs support.  Can’t figure out how to get your web server running on Ubuntu?  Better head to the support forums.  Wondering why your BYOSwitch is dropping frames under load?  Hope you’re a Wireshark wizard.  Most enterprises don’t care that a support contract costs them money.  They want the assurance that things are going to get fixed when they break.  When you develop everything yourself, you are putting a tremendous amount of faith in those developers to ensure that bugs are worked out and hardware failures are taken care of.  Again, when you consider the scale of what Facebook is doing, the idea of having purpose-built devices makes sense.  It also makes sense for them to have people on staff who can fix those specialized devices.

Face it.  The idea that Facebook is going to destroy the switching market is ludicrous.  You’re never going to buy a switch from Facebook.  Maybe you want to tinker around with Intel’s DPDK on a lab switch so you can install OpenFlow or something similar.  But when it comes time to forklift the data center or populate a new campus building with switches, I can almost guarantee that you’re going to pick up the phone and call Cisco, Arista, Juniper, Brocade, or HP.  Why?  Because they can build those switches faster than you can.  Because even though they are a big capital expenditure (capex), it’s still cheaper in the long run if you don’t have the time to dedicate to building your own gear.  And when something blows up (and something always blows up), you’re going to want a TAC engineer on the phone sharing the heat with you when the CxOs come headhunting in the data center after everything goes down.

Facebook will go on doing their thing their way with their own servers and switches.  They’ll do amazing things with data that you never dreamed possible.  But just like buying a Sherman tank for city driving, their solution isn’t going to work for most people.  Because it’s built by them, for them.  Just like Google’s server farms and search appliances.  Facebook may end up contributing a lot to the Open Compute Project and advancing the overall knowledge and independence of networking hardware.  But to think they’re starting a revolution in networking is about as far-fetched as thinking that Myspace was going to be the top social network forever.