
Ten Years of Cisco Live – Community Matters Most of All

[Photo: the Cisco Live 2016 sign pic]

Hey! I made the sign pic this year!

I’ve had a week to get over my Cisco Live hangover this year. I’ve been going to Cisco Live for ten years and been involved in the social community for five of them. And I couldn’t be prouder of what I’ve seen. As the picture above shows, the community is growing by leaps and bounds.

People Are What Matter

[Photo: Tom's Corner selfie]

I was asked many, many times about Tom’s Corner. What was it? Why was it important? Did you really start it? The real answer is that I’m a bit curious. I want to meet people. I want to talk to them and learn their stories. I want to understand what drives people to learn about networking or wireless or fax machines. Talking to a person is one of the best parts of my job, whether it be my Bruce Wayne day job or my Batman night job.

Social media helps us all stay in touch when we aren’t face-to-face, but meeting people in real life is just as important. You learn who likes to hug. You find out who tells good stories. Little things matter, like finding out how tall someone is in real life. You don’t get that unless you find a way to meet them in person.

[Photo: Hugging Denise Fishburne]

Technology changes every day. We change from hardware to software and back again. Routers give way to switches. Fabrics rise. Analytics tell all. But all this technology still has people behind it. Those people make the difference. People learn and grow and change. They figure out how to make SDN work today after learning ISDN and Frame Relay yesterday. They have the power to expand beyond their station and be truly amazing.

Conferences Are Still King

Cisco Live is huge. Almost 30,000 attendees this year. The Mandalay Bay Convention Center was packed to the gills. The World of Solutions took up two entire halls this year. The number of folks coming to the event keeps going up every year. The networking world has turned this show into the biggest thing going on. Just like VMworld, it’s become synonymous with the industry.

People have a desire to learn. They want to know things. They want high-quality introductions to content and deep dives into the things they want to know inside and out. So long as those sessions are offered at conferences like Cisco Live and Interop, people will continue to flock to them. For the shows that assemble content from the community, this is an easy proposition. People are going to want to talk where others are willing to listen. For single-sourced talks like Cisco Live, it’s very important to identify great speakers like Denise Fishburne (@DeniseFishburne) and Peter Jones (@PeterGJones) and find ways to get them involved. It’s also crucial to listen to feedback from attendees about what worked and what they want to see more of in the coming years.

Keeping The Community Growing

[Photo: the Cisco Live 2016 tweetup]

One thing that I’m most proud of is seeing the community grow and grow. I love seeing new faces come in and join the group. This year had people from many different social circles taking part in the Cisco Live community. Reddit’s /r/networking group was there. Kilted Monday happened. Engineering Deathmatches happened. Everywhere you looked, communities were doing great things.

As great as it was to see so many people coming together, it’s just as important to understand that we have to keep the momentum going. Networking doesn’t keep rolling along without new ideas and new people expressing them. Four years ago I could never have guessed the impact that Matt Oswalt (@Mierdin) and Jason Edelman (@JEdelman8) would have on the networking community. They didn’t start out on top of the world. They fought their way up with new ideas and perspectives. The community adopted what they had to say and ran with it.

We need to keep that going, and not just at Cisco Live either. We need to identify the people doing great things and shine a spotlight on them. Thankfully, my day job affords me an opportunity to do just that. But the whole community needs to be doing it as well. If you can find just one person to tell the world about, it’s a win for all of us. Convince a friend to write a blog post. Make a co-worker join Twitter. In the end, every new voice is a chance for us all to learn something.


Tom’s Take

As Denis Leary said in Demolition Man,

I’m no leader. I do what I have to do. Sometimes people come with me.

That’s what Cisco Live is to me. It’s not about a corner or a table or a suite at an event. It’s about people coming together to do things. People talking about work and having a good time. The last five years of Cisco Live have been some of the happiest of my life. More than any other event, I look forward to seeing the community and catching up with old friends. I am thankful to have a job that allows me to go to the event. I’m grateful for a community full of wonderful people that are some of the best and brightest at what they do. For me, Cisco Live is about each of you. The learning and access to Cisco is a huge benefit. But I would go for the people time and time and time again. Thanks for making the fifth year of this community something special to me.

Fixing The CCIE Written – A Follow Up


I stirred up quite the hornet’s nest last week, didn’t I? I posted about how I thought the CCIE Routing and Switching Written Exam needed to be fixed. I got 75 favorites on Twitter and 40 retweets of my post, not to mention the countless people that shared it on a variety of forums and other sites. Since I was at Cisco Live, I had a lot of people coming up to me saying that they agreed with my views. I also had quite a few people that weren’t thrilled with my perspective. Thankfully, I had the chance to sit down with Yusuf Bhaiji, head of the CCIE program, and chat about things. I wanted to share some thoughts here.

Clarity Of Purpose

One of the biggest complaints that I’ve heard is that I was being “malicious” in my post with regards to the CCIE. I was also told that it was a case of “sour grapes” and even that the exam was as hard as it was on purpose because the CCIE is supposed to be hard. Mostly, I felt upset that people were under the impression that my post was designed to destroy, harm, or otherwise defame the CCIE in the eyes of the community. Let me state for the record what my position is:

I still believe the CCIE is the premier certification in networking. I’m happy to be a CCIE and love the program.

Why did I write the post? Not because I couldn’t pass the written. Not because I wanted people to tell me that I was wrong and being mean to them. I wrote the post because I saw a problem and wanted to address it. I felt that the comments being made by so many people that had recently taken the test needed to be collected and discussed. Sure, airing these kinds of issues in a public forum won’t make everyone happy. But, as I said to the CCIE team, would you rather know about it or let it fester quietly?

Yusuf assured me that the CCIE program holds itself to the highest standards. All questions are evaluated by three subject matter experts (SMEs) for relevance and correctness before being included in the exam. If those three experts don’t sign off, the question doesn’t go in. There are also quite a few metrics built into the testing software that give the CCIE team feedback on questions and answer choices. Those programs can index all manner of statistics to figure out if questions are creating problems for candidates. Any given test can produce pages worth of valuable information for the people creating the test and trying to keep it relevant.

Another point that was brought up was the comment section on the exam. If you have any problem with a question, you need to fill out the comment form. Yes, I know that taking time out of the test to provide feedback can cause issues. It also interrupts your flow of answering questions. But if you even think for an instant that the question is unfair or misleading or incorrect, you have to leave a detailed comment to make sure the question is flagged properly for review. Which of the following comments means more to you?

  • Trivia question

or

  • This question tests on an obscure command and isn’t valid for a CCIE-level test.

I can promise I know which one is going to be evaluated more closely. And yes, every comment with substance is reviewed. The exam creators can print off every comment ever left on a question. The more detailed the comment, the more likely it is to trigger a review. So please make sure to leave a comment if you think there is a problem with a question.

Clarity Of Vision

Some of the conversations that I had during Cisco Live revolved around the relevance of the questions on the test to a CCIE candidate. Most of the people that I talked to were CCIEs already and using the test for recertification. A few came to me to talk about the relevance of the test questions to candidates that are qualifying for the lab.

While I’m not able to discuss any of the specific plans for the future of the program, I will say that there are ideas in place that could make this distinction matter less. Yusuf told me that the team will be releasing more details as soon as they are confirmed.

The most important point is that the issues that I have with the CCIE Written exam are fixable. I also believe that criticism without a suggested solution is little more than whining. So I decided to put my money where my mouth is with regard to the CCIE written exam.

I volunteered to fix it.

I stepped up and offered my time as an SME to review the questions on the written exam for relevance, correctness, and grammar. That’s not a light undertaking. There are a ton of questions in the pool that need to be examined. So for every person that agreed with my post or told me that they thought the exam needed to be fixed, I’m putting you all on the spot as well.

It’s time for us as a community of CCIEs to do our part for the exam. Yusuf told me the easiest way to take part in the program is to visit the following URL:

http://www.cisco.com/go/certsme

Sign up for the SME program. Tell them that you want to help fix the CCIE. Maybe you only have to look at 5-10 questions. If the hundred or so people that agreed with me volunteered today, the entire test question pool could be analyzed in a matter of weeks. We could do our part to ensure that people taking the exam have the best possible test in front of them.

But I also challenge you to do more. Don’t just correct grammar or tell them they spelled “electricity” wrong in the question. Challenge them. Ask yourself if this is a question a CCIE candidate should know the answer to. There’s a chance that you could make a difference there. But you can’t do that unless you step up to the plate.


Tom’s Take

I had at least ten people tell me that they would do whatever it took to fix the CCIE test last week after I talked to the CCIE cert team. They were excited and hopeful that the issues they saw with the test could be sorted out. I’ll admit that I stepped out on a pretty big limb here by doing this in public as opposed to over email or through official channels. And I do admit that I didn’t clarify my intent to build the program up as opposed to casting the whole exam team and process in a bad light.

Mea culpa.

But my post succeeded in getting people to talk about the CCIE written. There are many of you that are ready to do your part to help. Please, go sign up at the link above to join the SME program. Maybe you’ll never look at a single question. Maybe you’ll look at fifty. The point is that you step up and tell Cisco that you’re willing. If even fifteen people come forward and agree to help, that message will sound loud and clear: each and every one of us is proud of being a CCIE and wants the program to continue long past the time when we’re retired and telling our grandchildren about the good old days of hard but fair tests.

If you have any questions about participating in the program or you want to reach out to me with your thoughts, don’t hesitate to contact me. Let’s put the power of community behind this!

The CCIE Routing And Switching Written Exam Needs To Be Fixed


The CCIE logo originally shown in this post was removed at Cisco’s request.

I’m having a great time at Cisco Live this year talking to networking professionals about the state of things. Most are optimistic about where their jobs are going to fit in with networking and software and the new way of doing things. But there is an undercurrent of dissatisfaction with one of the most fundamental pieces of network training in the world. The discontent is palpable. From what I’ve heard around Las Vegas this week, it’s time to fix the CCIE Written Exam.

Whadda Ya Know?!?

The CCIE written is the bellwether of network training. It’s a chance for network engineers that use Cisco gear to prove they have what it takes to complete a difficult regimen of training to connect networks of impressive size. It’s also a rite of passage to show others that you know how to study, prep, and complete a difficult practical examination without losing your cool. But all that hard work starts with a written test.

The CCIE written has always been a tough test. It’s the only barrier to entry to the CCIE lab. Because the CCIE has never had prerequisites, and likely never will due to long-standing tradition, the only thing standing in the way of your ability to sit the grueling lab test is a 100-question multiple choice exam that gauges your ability to understand networking at a deep technical level.

But within the last year or so, the latest version of the CCIE written exam has begun to get very bad reviews from all takers of the test. There are quite a few people that have talked about how bad the test is for candidates. Unlike a lot of “sour grapes” cases of people railing against a test they failed, the feedback for the CCIE written is entirely different. It tends to fall into a couple of categories:

The Test Is Poorly Written

The most resounding critique of the exam is that it is a poorly constructed and executed test. The question quality is subpar. There are spelling mistakes throughout and test questions that have poor answer selections. Having spent a large amount of time helping construct the CCNA exam years ago, I can tell you that you will spend the bulk of your time creating wrong answers as distractors to the right ones. Guidelines say that a candidate should have no better than a 25% chance to guess the correct answer from all the choices. If you’ve ever taken a math test with four multiple choice answers, where the wrong choices are exactly what you’d get by making common mistakes in working the problem, you know just how insidious proper distractors can be (and math teachers too).

The CCIE written is riddled with bad distractors according to reports. It also has questions that don’t have a true proper answer or a set of answers that are all technically correct with no way to select them all. That frustrates test takers and makes it very difficult to study for the exam. The editing and test mechanics errors must be rectified quickly in order to restore confidence to the people taking the test.

The Test Doesn’t Cover The Material

Once people stop telling me how badly the test is constructed, they start telling me that the questions are bad on a conceptual level as well. No NDAs are violated during these discussions to protect everyone involved, but the general opinion is that the test has skewed in the wrong direction. Cisco seems to be creating a test that focuses more on the Cisco and less on the Internetworking part of the CCIE.

The test has never been confused for a vendor-neutral exam. Any look at the blueprint will tell you that there are plenty of proprietary protocols and implementation methods there. But the older versions of the exam did do a good job of teaching you how to build a network that could behave itself alongside non-Cisco equipment. Redistributing EIGRP and OSPF is a prime example. The focus of the new exam, though, seems to be skewed toward very specific Cisco proprietary protocols and the minutiae of how they operate. I’ve always thought that knowing the hello and dead timers of OSPF NBMA areas is a huge time sink really only justified for test takers, but I also see why knowing that would be important in multi-vendor operations. Knowing the same thing for an EIGRP DMVPN seems a bit pointless.

The other problem is that, by the admission of most test takers, the current CCIE Written Exam study guide doesn’t cover the areas of the blueprint that are potentially on the test. I feel very sorry for my friend Narbik Kocharians here. He worked very hard to create a study guide that would help test takers pass the exam with the knowledge necessary to do well on the lab. Having a test that covers a completely different area than his guide makes him look bad in the eyes of test takers without good cause. It’s like a college class where the professor tells you to study the book but then tests you on his or her lectures. It’s not fair, because you studied what you were told and failed because they tested something else.

CCIEs Feel There Are Better Recert Options

This is the most damaging problem in my mind. About half the test takers for the CCIE written are candidates looking to qualify for the lab. That requires them to take the written exam for their specific track. But the other half of the test takers are CCIEs that have passed the lab and are looking to recertify. For these professionals, any CCIE written exam is valid for recertification.

Many CCIEs look to broaden their horizons by moving to a different track to keep their CCIE current, studying service provider, data center, or even collaboration. For them, the CCIE is a stepping stone to keep the learning process going. But many CCIEs I’ve spoken to in the past few months are starting to take other exams not because they want to learn new things, but because the CCIE Routing and Switching written exam is such a terrible test.

Quite a few CCIEs are using the CCDE written to recertify. They feel it is a better overall test even though it doesn’t test the material to the level that the CCIE R&S written exam does. They would even be willing to take the chance of getting a question on an area of technology that they know nothing about to avoid having to deal with poor questions in their areas of study. Still more CCIEs are choosing to become Emeritus and “retire” so as to avoid the pain of the written exam. While this has implications for partner status and a host of other challenges for practicing engineers, you have to wonder how bad things must be to make retirement of your CCIE number look like a better option.


Tom’s Take

I took the CCIE R&S written last year at Cisco Live. I was so disgusted with the exam that I immediately switched to the CCDE written and recertified my number while simultaneously vowing never to take the R&S written again. From what I’ve heard this year, the test quality is still slipping with no relief in sight. It’s a sad state of affairs when you realize that the flagship test for Cisco engineers is so horribly broken that those same engineers believe it can’t be fixed. They feel that all the comments and feedback in the world are ignored and that their expertise is pushed aside in favor of higher cut scores and a more exclusive pool of candidates. The darker worry is that there is an agenda to push official training materials or other kinds of shortcuts that would help candidates while charging them more and/or locking out the third-party training providers that work hard to help people study for the lab.

Cisco needs to fix this problem now. They need to listen to feedback and get their written exam problems under control. If they don’t, they may soon find that the only people taking the R&S written are the same kinds of dumpers and cheaters they’re trying to keep out with a poorly constructed test.

NOTE: I have published an update to this post here: Fixing The CCIE Written – A Follow Up

The Complexity Conundrum


Complexity is the enemy of understanding. Think about how much time you spend in your day trying to simplify things. Complexity is the reason why things like Reddit’s Explain Like I’m Five exist. We strive in our daily lives to find ways to simplify the way things are done. Well, except in networking.

Building On Shifting Sands

Networking hasn’t always been a super complex thing. Back when bridges tied together two sections of Ethernet, networking was fairly simple. We’ve spent years trying to make the network do bigger and better things faster with less input. Routing protocols have become more complicated. Network topologies grow and become harder to understand. Protocols do magical things with very little documentation beyond “Pure Freaking Magic”.

Part of this comes from applications. I’ve made my feelings on application development clear. Ivan Pepelnjak had some great comments on this post as well, as did Steve Chalmers and Derick Winkworth (@CloudToad). I especially like this one:

https://twitter.com/cloudtoad/status/750830744902127616

Derick is right. The application developers have forced us to make networking do more and more faster with less requirement for humans to do the work to meet crazy continuous improvement and continuous development goalposts. Networking, when built properly, is a static object like the electrical grid or a plumbing system. Application developers want it to move and change and breathe with their needs when they need to spin up 10,000 containers for three minutes to run a test or increase bandwidth 100x to support a rollout of a video streaming app or a sticker-based IM program designed to run during a sports championship.

We’ve risen to meet this challenge with what we’ve had to work with. In part, it’s because we don’t like being the scapegoat for every problem in the data center. We tire of sitting next to the storage admins and complaining about the breakneck pace of IT changes. We have embraced software enhancements and tried to find ways to automate, orchestrate, and accelerate. Which is great in theory. But in reality, we’re just covering over the problem.

Abstract Complexity

The solution to our software networking issues seems simple on the surface. Want to automate? Add a layer to abstract away the complexity. Want to build an orchestration system on top of that? Easy to do with another layer of abstraction to tie automation systems together. Want to make it all go faster? Abstract away!

“All problems in computer science can be solved with another layer of indirection.”

This is a quote from Butler Lampson often attributed to David Wheeler. It’s absolutely true. Developers, engineers, and systems builders keep adding layers of abstraction and indirection on top of complex systems and proclaiming that everything is now easier because it looks simple. But what happens when the abstraction breaks down?

Automobiles are a perfect example of this. Not too many years ago, automobiles were relatively simple things. Sure, internal combustion engines aren’t toys. But most mechanics could disassemble the engine and fix most issues with a wrench and some knowledge. Today’s cars have computers, diagnostics systems, and require lots and lots of dedicated tools to even diagnose the problem, let alone fix it. We’ve traded simplicity and ease of repairability for the appearance of “simple,” which conceals a huge amount of complexity under the surface.

To refer back to the Lampson/Wheeler quote, the completion of it is, “Except, of course, for the problem of too many indirections.” Even forty years ago it was understood that too many layers of abstraction would eventually lead to problems. We are quickly reaching this point in networking today. With all the reliance on complex tools providing an overwhelming amount of data about every point of the network, we find ourselves forced to use dashboards and data lakes to keep up with the rapid pace of changes dictated to the network by systems integrations being driven by developer desires and not sound network systems thinking.

Networking professionals can’t keep up. Just as other systems now must be maintained by algorithms to keep pace, so too does the network find itself being run by software instead of augmented by it. Even if people wanted to make a change they would be unable to do so because validating those changes manually would cause issues or interactions that could create havoc later on.

Simple Solutions

So how do we fix the issues? Can we just scrap it all and start over? Sadly, the answer here is a resounding “no”. We have to keep moving the network forward to match pace with the rest of IT. But we can do our part to cut down on the amount of complexity and abstraction being created in the process. Documentation is as critical as ever. Engineers and architects need to make sure to write down all the changes they make as well as their proposed designs for adding services and creating new features. Developers writing for the network need to document their APIs and their programs liberally so that troubleshooting and extension are easily accomplished instead of just guessing about what something is or isn’t supposed to be doing.

When the time comes to build something new, instead of trying to plaster over it with an abstraction, we need to break things down into their basic components and understand what we’re trying to accomplish. We need to augment existing systems instead of building new ones on top of the old just to make things look easy. When we can extend existing ideas or augment them in such a way that they coexist, we can worry less about hiding problems and more about solving them.


Tom’s Take

Abstraction has a place, just like NAT. It’s when things spiral out of control and hide the very problems we’re trying to fix that it becomes an abomination. Rather than piling things on the top of the issue and trying to hide it away until the inevitable day when everything comes crashing down, we should instead do the opposite. Don’t hide it, expose it instead. Understand the complexity and solve the problem with simplicity. Yes, the solution itself may require some hard thinking and some pretty elegant programming. But in the end that means that you will really understand things and solve the complexity conundrum.

Will Dell Networking Wither Away?


The behemoth merger of Dell and EMC is nearing conclusion. The first week of August is the target date for the final wrap up of all the financial and legal parts of the acquisition. After that is done, the long task of analyzing product lines and finding a way to reduce complexity and product sprawl begins. We’ve already seen the spin out of Quest and Sonicwall into a separate entity to raise cash for the final stretch of the acquisition. No doubt other storage and compute products are going to face a go/no go decision in the future. But one product line which is in real danger of disappearing is networking.

Whither Whitebox?

The first indicator of the problems with Dell and networking comes from whitebox switching. Dell released OS 10 earlier this year as a way to capitalize on the growing market of free operating systems running on commodity hardware. Right now, OS 10 can run on Dell equipment. In the future, they are hoping to spread it to whitebox devices. That means that soon you could see Dell-branded OSes running on switches purchased from non-Dell sources and booting with ONIE.

Once OS 10 pushes forward, what does that mean for Dell’s hardware business? Dell would naturally want to keep selling devices to customers. Whitebox switches would undercut their ability to offer cheap ports to customers in data center deployments. Rather than give up that opportunity, Dell is positioning themselves to run some form of Dell software on top of that hardware for management purposes, which has always been a strong point for Dell. Losing the hardware means little to Dell if they have to lose profit margin to keep it there in the first place.

The second indicator of networking issues comes from comments from Michael Dell at EMCworld this year. Check out this short video featuring him with outgoing EMC CEO Joe Tucci:

Some of the telling comments in here involve Michael Dell’s praise for the NSX business model and how it is being adopted by a large number of other vendors in the industry. Also telling is their reaffirmation that Cisco is an important partnership in VCE and won’t be going away any time soon. While these two things don’t seem to be related on the surface, they both point to a truth Dell is trying hard to accept.

In the future, with overlay network virtualization models gaining traction in the data center, the underlying hardware will matter little. In almost every case, the hardware choice will come down to one of two options:

  1. Which switch is the cheapest?
  2. Which switch is on the Approved List?

That’s it. That’s the whole decision tree. No one will care what sticker is on the box. They will only care that it didn’t cost a fortune and that they won’t get fired for buying it. That’s bad for companies that aren’t making white boxes or aren’t named Cisco. Other network vendors are going to try and add value in some way, but the overlay sitting on top of those bells and whistles will make it next to impossible to differentiate in anything but software. Whether that’s superior management capabilities, an open plug-in model, or some other thing we haven’t thought of will make no difference in the end. Software will still be king, and the hardware will be either an inexpensive pawn or a costly piece that has been pre-approved.

Whither Wireless?

The other big inflection point that makes me worry about the Dell networking story is the lack of movement in the wireless space. Dell has historically been a company to partner first and acquire second. But with HPE’s acquisition of Aruba Networks last year, the dominoes in the wireless space are still waiting to fall. Brocade raced out to buy Ruckus. Meru offered itself on a platter to anyone that would buy them. Now Aerohive stands as the last independent wireless vendor without a dance partner. Yes, they’ve announced that they are partnering with Dell, but have you been to the Dell Wireless Networking page? Can you guess what the Dell W-series is? Here’s a hint: it rhymes with “Peruba”.

Every time Dell leads with a W-series deployment, they are effectively paying their biggest competitor. They are opening the door to allowing HPE/Aruba to come in and start talking about not only wireless but servers, storage, and other networking as well. Dell would do well at this point to start deemphasizing the W-series and start highlighting the “new generation” of Aerohive APs and how they are going to be the focus moving forward.

The real solution would be for Dell to buy a wireless company and take all the wireless expertise they are selling in-house. That would show they are serious about both the campus network of the future and the data center network needed to support their other server and storage infrastructure. Sadly, with the company still leveraged from its privatization just two years ago and carrying mounting debt from this mega merger, Dell is looking to raise cash with spin-offs instead of spending it on yet another company to ingest and subsume. Which means a real non-partner wireless solution is still many years away.


Tom’s Take

Dell’s networking strategy is in maintenance mode. Make switches to support faster speeds for now, probably with Tomahawk support soon, and hope that this whole networking thing goes software sooner rather than later. Otherwise, the need to shore up the campus wireless portfolio, along with the coming decision about whether to throw full support behind NSX and partnerships, is going to be a bitter pill to swallow. Perhaps Dell Networking will exist as an option for companies wanting a 100% Dell solution? Or maybe they are waiting for a new offering from Dell/EMC in the data center to drive profits to research and development to keep pace with Cisco and Arista? One can only hope that their networking flower doesn’t wither on the vine.

DockerCon Thoughts – Secure, Sufficient Applications


I got to spend a couple of days this week at DockerCon and learn a bit more about software containers. I’d always assumed that containers were a slightly different form of virtualization, but thankfully I’ve learned my lesson there. What I did find out about containers gives me a bit of hope about the future of applications and security.

Minimum Viable App

One of the things that made me excited about Docker is that the process isolation idea behind building a container to do one thing has fascinating ramifications for application developers. In the past, we’ve spent our time building servers to do things. We build hardware, boot it with an operating system, and then we install the applications or the components thereof. When we started to virtualize hardware into VMs, the natural progression was to take the hardware resource and turn it into a VM. Thanks to tools that would migrate a physical resource to a virtual one in a single step, most of the first generation VMs were just physical copies of servers. Right down to phantom drivers in the Windows Device Manager.

As we started building infrastructure around the idea of virtualization, we stopped migrating physical boxes and started building completely virtual systems from the ground up. That meant using things like deployment templates, linked clones, and other constructs that couldn’t be done in hardware alone. As time has rolled on, we have developed methods of quickly deploying virtual resources in ways we never could on purely physical devices. We finally figured out how to use virtual platforms efficiently.

Containers are now at the crossroads we saw early on in virtualization. As explained by Mike Coleman (@MikeGColeman), many application developers are starting their container journey by taking an existing app and importing it directly into a container. It’s a bit more involved than the preferred method, but Mike mentioned that even running the entire resource pool in a container does have some advantages. I’m sure the Docker people see container adoption as the first step toward increased market share. Even if it’s a bit clumsy at the start.

The idea then moves toward the breakdown of containers into the necessary pieces, much as it did with virtual machines years ago. Instead of being forced to think about software as a monolithic construct that has to live on a minimum of one operating system, developers can break the system down into application pieces that can execute one program or thread at a time on a container. Applications can be built using the minimum amount of software constructs needed for an individual process. That means that those processes can be spread out and scaled up or down as needed to accomplish goals.

If your database query function is running as a containerized process instead of running on a query platform in a VM then scaling that query to thousands or tens of thousands of instances only requires spinning up new containers instead of new VMs. Likewise, scaling a web-based app to accommodate new users can be accomplished with an explosion of new containers to meet the need. And when the demand dies back down again, the containers can be destroyed and resources returned to the available pool or turned off to save costs.

Segment Isolation

The other exciting thing I saw with containers was the opportunity for security. The new buzzword of the day in security and networking is microsegmentation. VMware is selling it heavily with NSX. Cisco has countered with a similar function in ACI. At the heart of things, microsegmentation is simply ensuring that processes that shouldn’t be talking to each other won’t be talking to each other. It prevents exposure from, say, having your app’s database server visible on the public Internet.

Microsegmentation is great in overlay and network virtualization systems where we have to take steps to prevent systems from talking to each other. That means policies and safeguards in place to prevent communications. It’s a great way to work on existing networks where the default mode is to let everything on the same subnet talk to everything else. But what if the default were something different?

With containers, there is a sandbox environment for each container to talk to other containers in the same area. If you create a named container network and attach a container to it, that container gains a network interface on that particular named network. It won’t be able to talk to other containers on different networks without creating an explicit connection between the two networks. That means that the default mode of communications for the containers is restricted out of the box.

Imagine how nice it will be to create a network that isn’t insecure by default. Rather than having to disconnect all the things that shouldn’t speak, you can spend your time building connections between the things that should be speaking. That means a little mistake or forgotten connection will prevent communications instead of opening them up, which makes it much less likely that you’re going to cause an incident.
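To make that default-deny behavior concrete, here is a minimal sketch using the Docker SDK for Python. The network and container names (frontend, backend, web, db) are purely illustrative assumptions, not anything prescribed by Docker:

    import docker

    client = docker.from_env()

    # Two user-defined bridge networks. Containers attached to different
    # networks have no path to each other until one is created explicitly.
    frontend = client.networks.create("frontend", driver="bridge")
    backend = client.networks.create("backend", driver="bridge")

    # A web container lives on the frontend network, a database on the backend.
    web = client.containers.run("nginx:alpine", name="web",
                                network="frontend", detach=True)
    db = client.containers.run("redis:alpine", name="db",
                               network="backend", detach=True)

    # At this point, 'web' cannot reach 'db' at all. The connection has to be
    # made deliberately, and that deliberate step is your audit trail:
    backend.connect(web)

Forgetting that last line means the two containers simply can’t talk, which fails in the safe direction instead of quietly leaving something exposed.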

There are still some issues with scaling the networking aspect of Docker right now. The key/value store doesn’t provide a lot of visibility and definitely won’t scale up to tens or hundreds of thousands of connections. My hope is that down the road Docker will implement a more visible solution that can perform drag-and-drop connectivity between containers and leave an audit trail so networking pros can figure out who connected what and how that exposed everything. It also makes it much easier when the connection between the devices has to be explicit to prove intent or malice. But those features are likely to come down the road as Docker builds a bigger, better management platform.


Tom’s Take

I think Docker is doing things right. By making developers look at the core pieces they need to build apps and justify why things are being done the way they’ve always been done, containers are allowing for flexibility and new choices to be made. At the same time, those choices are inherently more secure because resources are only shared when necessary. It’s the natural outgrowth of sandboxing and Jails in the OS from so many years ago. Docker has a chance to make application developers better without making them carry the baggage of years of thinking along with them to a new technology.

Running Barefoot – Thoughts on Tofino and P4


The big announcement this week is that Barefoot Networks leaped out of stealth mode and announced that they’re working on a very, very fast datacenter switch. The Barefoot Tofino can do up to 6.5 Tbps of throughput. That’s a pretty significant number. But what sets the Tofino apart is that it also uses the open source P4 programming language to configure the device for everything, from forwarding packets to making routing decisions. Here’s why that may be bigger than another fast switch.

Feature Presentation

Barefoot admits in their announcement post that one of the ways they were able to drive the performance of the Tofino platform higher was to remove a lot of the accumulated cruft that has been added to switch software for the past twenty years. For Barefoot, this is mostly about pushing P4 as the software component of their switch platform and driving adoption of it in a wider market.

Let’s take a look at what this really means for you. Modern network operating systems typically fall into one of two categories. The first is the “kitchen sink” system. This OS has every possible feature you could ever want built in at runtime. Sure, you get all the packet forwarding and routing features you need. But you also carry the legacy of frame relay, private VLANs, Spanning Tree, and a host of other things that were good ideas at one time and now mean little to nothing to you.

Worse yet, kitchen sink OSes require you to upgrade in big leaps to get singular features that you need but carry a whole bunch of others you don’t want. Need routing between SVIs? That’s an Advanced Services license. Sure, you get BGP with that license too, but will you ever use that in a wiring closet? Probably not. Too bad though, because it’s built into the system image and can’t be removed. Even newer operating systems like NX-OS have the same kitchen sink inclusion mentality. The feature may not be present at boot time, but a simple command turns it on. The code is still baked into the kernel, it’s just loaded as a module instead.

On the opposite end of the scale, you have newer operating systems like OpenSwitch. The idea behind OpenSwitch is to have a purpose built system that does a few things really, really well. OpenSwitch can build a datacenter fabric very quickly and make it perform well. But if you’re looking for additional features outside of that narrow set, you’re going to be out of luck. Sure, that means you don’t need a whole bunch of useless features. But what about things like OSPF or Spanning Tree? If you decide later that you’d like to have them, you either need to put in a request to have it built into the system or hope that someone else did and that the software will soon be released to you.

We Can Rebuild It

Barefoot is taking a different track with P4. Instead of delivering the entire OS for you in one binary image, they are allowing you to build the minimum number of pieces that you need to make it work for your applications. Unlike OpenSwitch, you don’t have to wait for other developers to build in a function that you need in order to deploy things. You drop to an IDE and write the code you need to forward packets in a specific way.

There are probably some people reading this post that are nodding their heads in agreement right now about this development process. That’s good for Barefoot. That means that their target audience wants functionality like this. But Barefoot isn’t for everyone. The small and medium enterprise isn’t going to jump at the chance to spend even more time programming forwarding engines into their switches. Sure, the performance profile is off the chart. But it’s also a bit like buying a pricy supercar to drive back and forth to the post office. Overkill for 98% of your needs.

Barefoot is going to do well in financial markets where speed is very important. They’re also going to sell into big development shops where the network team needs pared-down software and a forwarding chip that can blow the doors off the rest of the network for east-west traffic flow. Given that we haven’t seen a price tag on Tofino just yet, I would imagine that it’s priced well into those markets and beyond the reach of a shop that just needs two leaf nodes and a spine to connect them. But that’s exactly what needs to happen.


Tom’s Take

Barefoot isn’t going to appeal to shops that plug in a power cable and run a command to provision a switch. Barefoot will shine where people can write code that will push a switch to peak performance and do amazing things. Perhaps Barefoot will start offering code later on that gives you the ability to program basic packet forwarding into a switch or routing functions when needed without the requirement of taking hours of classes on P4. But for the initial release, keeping Tofino in the hands of dev shops is a great idea. If for no other reason than to cut down on support costs.

The 25GbE Datacenter Pipeline


SDN may have made networking more exciting thanks to making hardware less important than it has been in the past, but that’s not to say that hardware isn’t important at all. The certainty with which new hardware will come out and make things a little bit faster than before is right there with death and taxes. One of the big announcements yesterday from Hewlett Packard Enterprise (HPE) during HPE Discover was support for a new 25GbE / 100GbE switch architecture built around the FlexFabric 5950 and 12900 products. This may be the tipping point for things.

The Speeds of the Many

I haven’t always been high on 25GbE. Almost two years ago I couldn’t see the point. Things haven’t gotten much different in the last 24 months from a speed perspective. So why the change now? What makes this 25GbE offering any different from the nascent ideas presented by Arista?

First and foremost, the 25GbE released by HPE this week is based on the Broadcom Tomahawk chipset. When 25GbE was first presented, it was a collection of vendors trying to convince you to upgrade to their slightly faster Ethernet. But in the past two years, most of the merchant offerings on the market have coalesced around using Broadcom as the primary chipset. That means that odds are good your favorite switching platform is running Trident 2 or Trident 2+ under the hood.

With Broadcom backing the silicon, that means wider adoption of the specification. Why would anyone buy 25GbE from Brocade or Dell or HPE if the only vendor supporting it was that vendor of choice? If you can’t ever be certain that you’ll have support for the hardware in three or five years time, making an investment today seems silly. Broadcom’s backing means that eventually everyone will be adopting 25GbE.

Likewise, one of my other impediments to adoption was the lack of server NICs to ramp hosts to 25GbE. Having fast access ports means nothing if the servers can’t take advantage of them. HPE addressed this with the release of FlexFabric networking adapters that can run 25GbE Ethernet. More importantly, those adapters (and switches) can run at 10GbE as well. This means that adoption of higher bandwidth is no longer an all-or-nothing proposition. You don’t have to abandon your existing investment to get to 25GbE right away. You don’t have to build a lab pod to test things and then sneak it into production. You can just buy a 5950 today and clock the ports down to 10GbE while you await the availability and purchasing cycle to buy 25GbE NICs. Then you can flip some switches in the next maintenance window and be running at 25GbE speeds. And you can leave some ports enabled at 10GbE to ensure that there is maximum backwards compatibility.

The Needs of the Few

Odds are good that 25GbE isn’t going to be right for you today. HPE is even telling people that 25GbE only really makes sense in a few deployment scenarios, among which are large container-based hosts running thousands of virtual apps, flash storage arrays that use Ethernet as a backplane, or specialized high-performance computing (HPC) tricks with RDMA and such. That means the odds are good that you won’t need 25GbE first thing tomorrow morning.

However, the need for 25GbE is going to be there. As applications grow more bandwidth hungry and data centers keep shrinking in footprint, the network hardware you do have left needs to work harder and faster to accomplish more with less. If the network really is destined to become a faceless underlay that serves as a utility for applications, it needs to run flat out fast to ensure that developers can’t start blaming their utility company for problems. Multi-core server architectures and flash storage have solved two of the three legs of this problem. 25GbE host connectivity and the 100GbE backbone connectivity tied to it, solve the other side of the equation so everything balances properly.

Don’t look at 25GbE as an immediate panacea for your problems. Instead, put it on a timeline with your other server needs and see what the adoption rate looks like going forward. If server NICs are bought in large quantities, that will drive manufacturers to push the technology onto the server boards. If there is enough need for connectivity at these speeds, the switch vendors will accelerate adoption of Tomahawk chipsets. That cycle will push things forward much faster than the 10GbE / 40GbE marathon that’s been going on for the past six years.


Tom’s Take

I think HPE is taking a big leap with 25GbE. Until the Dell/EMC merger is completed, Dell won’t be in a position to adopt Tomahawk quickly in the Force10 line. That means the need to grab 25GbE server NICs won’t materialize if there’s nothing to connect them to. Cisco won’t care either way so long as switches are purchased, and the other networking vendors don’t sell servers. So that leaves HPE to either push this forward or watch it fall off the edge of the cliff. Time will tell how this will all work out, but it would be nice to see HPE get a win here and make the network the least of application developers’ problems.

Disclaimer

I was a guest of Hewlett Packard Enterprise for HPE Discover 2016. They paid for my travel, hotel, and meals during the event. While I was briefed on the solution discussed here and many others, there was no expectation of coverage of the topics discussed. HPE did not ask for, nor were they guaranteed any consideration in the writing of this article. The conclusions and analysis contained herein are mine and mine alone.

Flash Needs a Highway


Last week at Storage Field Day 10, I got a chance to see Pure Storage and their new FlashBlade product. Storage is an interesting creature, especially now that we’ve got flash memory technology changing the way we think about high performance. Flash transformed the industry from slow spinning gyroscopes of rust into a flat out drag race to see who could provide enough input/output operations per second (IOPS) to get to the moon and back.

Take a look at this video about the hardware architecture behind FlashBlade:

It’s pretty impressive. Very fast flash storage on blades that can outrun just about anything on the market. But this post isn’t really about storage. It’s about transport.

Life Is A Network Highway

Look at the backplane of the FlashBlade chassis. It’s not something custom or even typical for a unit like that. The key is when the presenter says that the architecture of the unit is more like a blade server chassis than a traditional SAN. In essence, Pure has taken the concept of a midplane and used it very effectively here. But their choice of midplane is interesting in this case.

Pure is using the Broadcom Trident II switch as their networking midplane for FlashBlade. That’s pretty telling from a hardware perspective. Trident II runs a large majority of switches in the market today that are merchant silicon based. They are essentially becoming the Intel of the switch market. They are supplying arms to everyone that wants to build something quickly at low cost without doing any kind of custom research and development of their own silicon manufacturing.

Using a Trident II in the backplane of the FlashBlade means that Pure evaluated all the alternatives and found that putting something merchant-based in the midplane is cost effective and provides the performance profile necessary to meet the needs of flash storage. Saturating backplanes with IOPS can be accomplished. But as we learned from Coho Data, it takes a lot of CPU horsepower to create a flash storage system that can saturate 10Gig Ethernet links.

I Am Speed

Using Trident II as a midplane or backplane for devices like this has huge implications. First and foremost, networking technology has a proven track record. If Trident II wasn’t a stable and reliable platform, no one would have used it in their products. And given that almost everyone in the networking space has a Trident platform for sale, it speaks volumes about reliability.

Second, Trident II is available. Broadcom is throwing these units off the assembly line as fast as they can. That means that there’s no worry about silicon shortages or plant shutdowns or any one of a number of things that can affect custom manufacturing. Even if a company wants to look at a custom fabrication, it could take months or even years to bring things online. By going with a reference design like Trident II, you can have your software engineers doing the hard work of building a system to support your hardware. That speeds time to market.

Third, Trident is a modular platform. That part can’t be overstated, even though I think it wasn’t called out very much in the presentation from Pure. By having a midplane that is built as a removable module, it’s very easy to replace it should problems arise. That’s the point of field replaceable units (FRUs). But in today’s market, it’s just as easy to create a system that can run multiple different platforms as well. The blade chassis idea extends equally to both the blades and the midplanes or backplanes.

Imagine being able to order a Tomahawk-based controller unit for FlashBlade that only requires you to swap the units at the back of the system. Now, that investment in 10Gig blade connectivity with 40Gig uplinks just became 25Gig blade connectivity with 100Gig uplinks to the network. All for the cost of two network controller blades. There may be some software that needs to be written to make the transition smooth for the consumers in the system, but the hardware is more than capable of supporting a change like that.


Tom’s Take

I was thrilled to see Pure Storage building a storage platform that tightly integrates with networking the way that FlashBlade does. This is how the networking stack is going to be completely integrated with storage and compute. We should still look at things through the lens of APIs and programmability, but making networking a consistent transport layer for all things in the datacenter is a good start.

The funny thing about making something a consistent transport layer is that by design it has to be extensible. That means more and more folks are going to be driving those pieces into the network. Software can be created on top of this common transport to differentiate, much like we’re seeing with network operating systems right now. Even Pure was able to create a different kind of transport protocol to do the heavy lifting at low latency.

It’s funny that it took a presentation from a storage company to make me see the value of the network as something agnostic. Perhaps I just needed some perspective from the other side of the fence.

Pushing Everyone’s Buttons In IT


We have officially reached the point in our long and storied IT careers where we, as old fogies, have earned the right to complain about the next generation of users and professionals. Just as the graybeards before us complained about the way we did things, so now it’s our turn to moan about the state of affairs. Today, I’d like to point out how driving IT to the point of pushing simple buttons is destroying the way we do things.

Easy Buttons

The fact that IT work has been able to be distilled into a series of simple button pushing exercises is very thrilling. We’ve spent a lot of time and effort building devices and frameworks that take the hard part out of building devices and frameworks. We no longer have to invent languages to build things or hardware to do things. Instead, we can refine our programming capabilities or use general purpose hardware in new combinations to provide environments for our users.

That’s one of the things that is driving people to the cloud. Cloud isn’t just about exciting hardware or keeping your data in other places. It is just as much about predictable, repeatable frameworks and workflows that let people accomplish tasks without much thought. Just like the assembly line of the eighties, we are taking boring, repetitive tasks away from people and giving them to machines that don’t get bored and love repetition. Doing that drives costs down and makes people more productive.

But what happens when those processes break down? What happens when the magic smoke is let out of the machine? That’s when the real issues start cropping up. Perhaps the framework developers are great at figuring out what exactly went wrong in their solution. However, they’re likely clueless about all of the things that their solution is built upon. Imagine if a mechanic couldn’t diagnose the engine or the various subsystems of your car? You’d have a fit, right?

The troubleshooting level of modern framework engineering only extends to the edge of their solution. Once you get into a problem with a service they’ve leveraged, such as AWS, then the buck is passed and the troubleshooting must move to a new team. The problem with having your teams building next-generation IT services on top of existing things is that they lose visibility into the things they didn’t build. It becomes a vicious cycle when those first-level services aren’t reliable enough to keep your solution running.

Push A Button, Push A Button!

As bad as things are for IT departments in a push-button world, it’s even crazier for users in the same boat. That’s because users understand two extremes of service delivery. Either the thing is working or it isn’t. There is no in between. In the old days of non-instant IT, users would just keep asking until a thing was done.

Today, things are much different. Automation and orchestration have allowed frameworks and platforms to create services instantly. That means users have been able to create their own platforms for building things. So anything that isn’t instant is broken. Take, for example, a recent discussion of bandwidth from the spring ONUG meeting. An IT professional was frustrated that it took the better part of a day to transfer 1 petabyte of information across the country. The transfer didn’t fail, mind you. He was mad because it didn’t happen in five minutes. Non-instant things are broken.

The story repeats itself over and over again. Networking resources that can’t be automatically provisioned are broken. Cloud services that need more than a few minutes to spin up aren’t working correctly. Mobile apps and sites that don’t instantly pop up on your phone must be working incorrectly. The patience level of a user isn’t even a tenth of that of the average IT professional. IT pros know why something is running slow. Users fall back on slow things being broken.


Tom’s Take

The problem is investment. Given an infinite amount of funding, everything can be fast. But people are trying to find ways to not pay for things in today’s IT world. Maybe they want to pay for AWS because it works. But OpenStack gives the promise of AWS for free. Except things like OpenStack and Ansible aren’t free. The currency you use to pay for them is time.

Time is just as precious a commodity as money. We don’t get refunds on time. We can’t ask for discounts on time. We can only invest and hope that there is a payoff down the road. The real outcome of this time investment looks very similar to the environment we have today. Things should just work and allow us to build new things on top of them. But that missing time investment is the key to the whole enterprise. If we don’t spend the time building and extending things, then we won’t have the expertise we need to fix things when they break. That time investment also helps everyone appreciate just how hard it is to build things in the first place. And that’s something no button will ever duplicate.