While at the Software Defined Data Center Symposium, I had the good fortune to moderate a panel on application-focused networking in the data center. There were some really smart engineers on that panel. One of the most impressive was Najam Ahmad, Director of Technical Operations at Facebook. He told me some things about Facebook that made me look at what they are doing in a new light.
When I asked him about stakeholder perceptions, Najam said that he felt a little out of sorts on stage because Ivan Pepelnjak (@IOSHints) and David Cheperdak (@DavidCheperdak) had spent the last fifteen minutes talking about virtual networking. Najam said that he didn’t really know what a hypervisor or a vSwitch was because they don’t run them at Facebook. All of their operating systems and servers run directly on bare metal. That shocked me a bit. Najam said that inserting anything between a server and its function adds unnecessary overhead. That’s a pretty unusual take when you look at how many data centers are driving toward full virtualization.
Old Tools, New Uses
Facebook also runs BGP to the top-of-rack (ToR) switches in their environment. That means that they are doing layer 3 all the way to their access layer. What’s funny is that while BGP in the ToR switches provides scalability and resiliency, they don’t use BGP as their primary protocol when exchanging routes with providers. For Facebook, BGP at the edge doesn’t provide enough control over network egress. They take the information that BGP is providing and crunch it a bit further before feeding it all into a controller-based solution that applies business logic and policies to determine the best solution for a given network scenario.
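To make the idea of layering business logic on top of BGP data a little more concrete, here is a rough Python sketch. It is not Facebook's controller, which wasn't described in any detail. The `EgressPath` fields, the weights, and the provider names are all hypothetical, and a real system would take in far more inputs than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EgressPath:
    """One candidate exit for a prefix. Every field here is an illustrative assumption."""
    provider: str
    as_path_len: int      # what plain BGP best-path selection would mostly care about
    latency_ms: float     # measured out-of-band, not carried in BGP
    loss_pct: float       # measured out-of-band
    cost_per_mbps: float  # a business input BGP knows nothing about


def score(path, weights):
    """Lower is better: a weighted sum of routing, performance, and cost inputs."""
    return (
        weights["as_path_len"] * path.as_path_len
        + weights["latency_ms"] * path.latency_ms
        + weights["loss_pct"] * path.loss_pct
        + weights["cost_per_mbps"] * path.cost_per_mbps
    )


def pick_egress(paths, weights):
    """Return the candidate path with the lowest policy score."""
    return min(paths, key=lambda p: score(p, weights))


# Hypothetical policy weights and candidate paths.
WEIGHTS = {"as_path_len": 1.0, "latency_ms": 0.1, "loss_pct": 10.0, "cost_per_mbps": 2.0}
CANDIDATES = [
    EgressPath("transit_a", as_path_len=2, latency_ms=80.0, loss_pct=1.0, cost_per_mbps=1.0),
    EgressPath("peer_b", as_path_len=3, latency_ms=20.0, loss_pct=0.0, cost_per_mbps=0.5),
]

best = pick_egress(CANDIDATES, WEIGHTS)
print(best.provider)
```

With these made-up numbers, the shortest AS path (`transit_a`) loses to `peer_b` once latency, loss, and cost are weighed in. That is the kind of decision plain BGP best-path selection can't make on its own.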
Najam also said that they had used NetFlow for a while to collect data from their servers in order to build a picture of what was going on inside the network. What they found is that the collectors were becoming overwhelmed by the amount of data being thrown at them. So instead of installing bigger, faster collectors, the Facebook engineers broke the problem apart by putting a small shim program on every server to collect the data and then forward it to a system designed to collect data inputs, not just NetFlow inputs. Najam lovingly called their system “FBFlow”.
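Here is a minimal sketch of what a per-server shim might do, assuming it summarizes flow samples locally and ships them as generic JSON lines. This is not FBFlow itself, whose internals weren't described. The record fields and function names are hypothetical, and the local aggregation step is my own assumption about how you would lighten the load on the collectors.

```python
import json
from collections import defaultdict


def aggregate_flows(records):
    """Collapse raw samples into one summary row per (src, dst, dst_port, proto)."""
    totals = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for r in records:
        key = (r["src"], r["dst"], r["dst_port"], r["proto"])
        totals[key]["packets"] += r["packets"]
        totals[key]["bytes"] += r["bytes"]
    return [
        {"src": k[0], "dst": k[1], "dst_port": k[2], "proto": k[3], **v}
        for k, v in sorted(totals.items())
    ]


def to_json_lines(summaries):
    """Serialize summaries as newline-delimited JSON for a generic data collector."""
    return "\n".join(json.dumps(s, sort_keys=True) for s in summaries)


# Hypothetical samples observed on one server during one reporting interval.
SAMPLES = [
    {"src": "10.0.0.1", "dst": "10.0.1.9", "dst_port": 443, "proto": "tcp", "packets": 10, "bytes": 12000},
    {"src": "10.0.0.1", "dst": "10.0.1.9", "dst_port": 443, "proto": "tcp", "packets": 5, "bytes": 6000},
    {"src": "10.0.0.1", "dst": "10.0.2.7", "dst_port": 53, "proto": "udp", "packets": 2, "bytes": 180},
]

summaries = aggregate_flows(SAMPLES)
payload = to_json_lines(summaries)
```

Three samples become two summary rows before anything leaves the server. Multiply that by every host in the data center and the collectors see a lot less traffic.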
I thought about this for a while before having a conversation with Colin McNamara (@ColinMcNamara). He told me that this design was a lot more common than I previously thought and that he had implemented it a few times already. At service providers. That’s when things really hit home for me.
Facebook is doing the same things that you do in your data center today. They’re just doing it at a scale that’s one or two orders of magnitude bigger. The basics are all still there: Facebook pushes packets around a network to feed servers and provide applications for consumption by users. What is so different is that the scale at which Facebook does this begins to look less and less like a traditional data center and more and more like a service provider. After all, they *are* providing a service to their users.
I’ve talked before about how Facebook’s Open Compute Project (OCP) switch wasn’t going to be the death knell for traditional networking. Now you see some of that validated in my opinion. Facebook is building hardware to meet their needs because they are a strange hybrid of data center and service provider. Things that we would do successfully in a 500 VM system don’t scale at all for them. Crazy ideas like running exterior gateway routing protocols on ToR switches work just fine for them because of the scale at which they are operating.
Which brings me to the title of the post. People are always holding Facebook and Google in such high regard for what they are doing in their massive data centers. Those same people want to try to emulate that in their own data centers and often find that it just plain doesn’t work. It’s the same set of protocols. Why won’t this work for me?
Facebook is solving problems just like a service provider would. They are building not for continuous uptime, but instead for detectable failures that are quickly recoverable. If I told you that your data center was going to be down for ten straight minutes next month you’d probably be worried. If I told you instead that the same downtime would come as ten separate outages of one minute each, you’d probably be much less worried. Service providers try to move around failure instead of pouring money into preventing it in the first place. That’s the whole reasoning behind Facebook’s “Fail Harder” mentality.
Failing Harder means making big mistakes and catching them before they become real problems. Little issues tend to get glossed over and forgotten about. Think about something like Weighted Random Early Detection (WRED). WRED works because you can drop a few packets from a TCP session and it will keep chugging and request the missing bits. If you kill the entire connection or blow up a default gateway then you’ve got a real issue. WRED fixes a problem, global TCP synchronization, by failing quietly once in a while. And it works.
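The “failing quietly” part of WRED is easy to see in the drop curve. Below is a small Python sketch of the textbook RED curve with per-class profiles standing in for the “weighted” part. The class names and threshold values are made up for illustration, and real switch implementations differ in detail (for example, in how they average queue depth).

```python
import random


def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Textbook RED curve: no drops below min_th, a linear ramp up to max_p,
    and every packet dropped once the average queue reaches max_th."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


# Hypothetical per-class profiles; the numbers are illustrative only.
PROFILES = {
    "best_effort": {"min_th": 20, "max_th": 40, "max_p": 0.10},
    "priority": {"min_th": 30, "max_th": 40, "max_p": 0.05},
}


def wred_drop_probability(traffic_class, avg_queue):
    """Look up the class profile and evaluate its drop curve."""
    p = PROFILES[traffic_class]
    return red_drop_probability(avg_queue, p["min_th"], p["max_th"], p["max_p"])


def should_drop(traffic_class, avg_queue, rng=random):
    """Randomly decide whether to drop one arriving packet."""
    return rng.random() < wred_drop_probability(traffic_class, avg_queue)
```

At an average queue depth of 30 packets, the best-effort class in this sketch is already shedding a small share of packets while the priority class hasn't started dropping yet. A few flows back off early instead of every flow hitting a full queue at the same moment.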
Instead of comparing your data center to Facebook or Google you should be taking a hard look at what you are actually trying to do. If you are doing Hadoop your data center is going to look radically different from that of a web services company. There are lessons you can learn from what the big boys are doing. Failing harder and using old tools in novel ways are a good start to your own data center analysis and planning. Just remember that those big data centers aren’t alien environments. They just have different needs to meet.
Here’s the entire SDDC Symposium Panel with Najam if you’d like to watch it. He’s got a lot of interesting insights into things besides what I wrote about above.
We’re not that big and we’re using an off-the-shelf appliance to accomplish similar goals. This is the box we’re using ( http://www.internap.com/business-internet-connectivity-services/route-optimization-flow-control/ ) but there are others out there. You can gain visibility via taps/spans for inbound and outbound traffic and adjust your internally and externally used routes accordingly. We’re still using BGP for route exchange (and I expect FB is too); we just internally adjust the “metric”, for lack of a better term, to steer traffic across paths that show less latency, loss, and jitter.
The rest of our DCs aren’t very FB-like, but we also use WRED. 🙂
Enjoyed the post.
Pingback: Your Data Center Isn’t Facebook And That’s Just Fine