Is Bandwidth A Precious Resource?

During a recent episode of the Packet Pushers Podcast, Greg and Drew talked about the fact that bandwidth just keeps increasing and we live in a world where the solution to most problems is to just increase the pipeline to the data center or to the Internet. I came into networking after the heady days of ISDN lines everywhere and trying to do traffic shaping on slow frame relay links. But I also believe that we’re going to quickly find ourselves in a pickle when it comes to bandwidth.

Too Depressing

My grandparents were alive during the Great Depression. They remember what it was like to have to struggle to find food or make ends meet. That one singular experience transformed the way they lived their lives. If you have a relative or know of someone that lived through that time, you probably have noticed they have some interesting habits. They may keep lots of cash on hand stored in various places around the house. They may do things like peel labels from jelly jars and use them as cups. They may even go to great lengths to preserve as much as they can for reuse later “just in case”.

It’s not uncommon for this to happen in the IT world as well. People that have been marked by crazy circumstance develop defense mechanisms against. Maybe it’s making a second commit to a configuration to ensure it’s correct before being deployed. Maybe it’s always taking a text backup of a switchport before shutting it down in case an old bug wipes it clean. Or, as it relates to the topic above, maybe it’s a network engineer that grew up on slow ISDN circuits trying to optimize links for traffic when there is absolutely no need to do so.

People will work with what they’re familiar with. If they treat every link as slow and prone to congestion they’ll configure they QoS policies and other shaping features like they were necessary to keep a 128k link alive with a massive traffic load. Even if the link has so much bandwidth that it will never even trigger the congestion features.

Rational Exuberance

The flip side of the Great Depression grandparent is the relative that grew up during a time when everything was perfect and there was nothing to worry about. The term coined by Alan Greenspan to define this phenomenon was Irrational Exuberance, which is the idea that the stock market of 1996 was overvalued. It also has the connotation of meaning that people will believe that everything is perfectly fine and dandy when all is well right up to the point when the rug gets pulled out from underneath them.

Going back to our bandwidth example, think about a network engineer that has only ever known a world like we have today where bandwidth is plentiful and easily available. I can remember installing phone systems for a school that had gigabit fiber connectivity between sites. QoS policies were non-existent for them because they weren’t needed. When you have all the pipeline you can use you don’t worry about restraining yourself. You have a plethora of bandwidth capabilities.

However, you also have an issue with budgeting. Turns out that there’s no such thing as truly unlimited bandwidth. You’re always going to hit a cap somewhere. It could be the uplink from the server to the switch. Maybe it’s the uplink between switches on the leaf-spine fabric. It could even be the WAN circuit that connects you to the Internet and the public cloud. You’re going to hit a roadblock somewhere. And if you haven’t planned for that you’re going to be in trouble. Because you’re going to realize that you should have been planning for the day when your options ran out.

Building For Today

If you’re looking at a modern enterprise, you need to understand some truths.

  1. Bandwidth is plentiful. Until it isn’t. You can always buy bigger switches. Run more fiber. Create cross-connects to increase bandwidth between east-west traffic. But once you hit the wall of running out of switch ports or places to pull fiber you’re going to be done no matter what.
  2. No Matter How Much You Have, It Won’t Be Enough. I learned this one at IBM back in 2001. We had a T3 that ran the entire campus in Minnesota. They were starting to get constrained on bandwidth so they paid a ridiculous amount of money to have another one installed to increase the bandwidth for users. We saturated it in just a couple of months. No matter how big the circuit or how many you install, you’re eventually going to run out of room. And if you don’t plan for that you’re going to be in a world of trouble.
  3. Plan For A Rainy Day. If you read the above, you know you’re going to need to have a plan in place when the day comes that you run out of unlimited bandwidth. You need to have QoS policies ready to go. Application inspection engines can be deployed in monitor mode and ready at a moment’s notice to be enabled in hopes of restricting usage and prioritizing important traffic. Remember that QoS doesn’t magically create bandwidth from nothing. Instead, it optimizes what you have and ensures that it can be used properly by the applications that need it. So you have to know what’s critical and what can be left to best effort. That means you have to do the groundwork ahead of time so there are no surprises. You have to be vigilant too. Who would have expected last year that video conference traffic would be as important as it is today?

Tom’s Take

Bandwidth is just like any other resource. It’s not infinite. It’s not unlimited. It only appears that way because we haven’t found a way to fill up that pipe yet. For every protocol that tries to be a good steward and not waste bandwidth like OSPF, you have a newer protocol like Open/R that has never known the harsh tundra of an ISDN line. We can make the bandwidth look effectively limitless but only by virtue of putting smart limits in place early and understanding how to make things work more smoothly when the time comes. Bandwidth is precious and you can make it work for you with the right outlook.

QoS Is Dead. Long Live QoS!

Ah, good old Quality of Service. How often have we spent our time as networking professionals trying to discern the archaic texts of Szigeti to learn how to make you work? QoS is something that seemed so necessary to our networks years ago that we would spend hours upon hours trying to learn the best way to implement it for voice or bulk data traffic or some other reason. That was, until a funny thing happened. Until QoS was useless to us.

Rest In Peace and Queues

QoS didn’t die overnight. It didn’t wake up one morning without a home to go to. Instead, we slowly devalued and destroyed it over a period of years. We did it be focusing on the things that QoS was made for and then marginalizing them. Remember voice traffic?

We spent years installing voice over IP (VoIP) systems in our networks. And each of those systems needed QoS to function. We took our expertise in the arcane arts of queuing and applied it to the most finicky protocols we could find. And it worked. Our mystic knowledge made voice better! Our calls wouldn’t drop. Our packets arrived when they should. And the world was a happy place.

That is, until voice became pointless. When people started using mobile devices more and more instead of their desk phones, QoS wasn’t as important. When the steady generation of delay-sensitive packets instead moved back to LTE instead of IP it wasn’t as critical to ensure that FTP and other protocols in the LAN interfered with it. Even when people started using QoS on their mobile devices the marking was totally inconsistent. George Stefanick (@WirelesssGuru) found that Wi-Fi calling was doing some weird packet marking anyway:

So, without a huge packet generation issue, QoS was relegated to some weird traffic shaping roles. Maybe it was video prioritization in places where people cared about video? Or perhaps it was creating a scavenger class for traffic in order to get rid of unwanted applications like BitTorrent. But overall QoS languished as an oddity as more and more enterprises saw their collaboration traffic moving to be dominated by mobile devices that didn’t need the old dark magic of QoS.

QoupS de Gras

The real end of QoS came about thanks to the cloud. While we spent all of our time trying to find ways to optimize applications running on our local enterprise networks, developers were busy optimizing applications to run somewhere else. The ideas were sound enough in principle. By moving applications to the cloud we could continually improve them and push features faster. By having all the bit off the local network we could scale massively. We could even collaborate together in real time from anywhere in the world!

But applications that live in the cloud live outside our control. QoS was always bounded by the borders of our own networks. Once a packet was launched into the great beyond of the Internet we couldn’t control what happened to it. ISPs weren’t bound to honor our packet markings without an SLA. In fact, in most cases the ISP would remark all our packets anyway just to ensure they didn’t mess with the ISP’s ideas of traffic shaping. And even those were rudimentary at best given how well QoS plays with MPLS in the real world.

But cloud-based applications don’t worry about quality of service. They scale as large as you want. And nothing short of a massive cloud outage will make them unavailable. Sure, there may be some slowness here and there but that’s nothing less than you’d expect to receive running a heavy application over your local LAN. The real genius of the cloud shift is that it forced developers to slim down applications and make them more responsive in places where they could be made to be more interactive. Now, applications felt snappier when they ran in remote locations. And if you’ve every tried to use old versions of Outlook across slow links you now how critical that responsiveness can be.

The End is The Beginning

So, with cloud-based applications here to stay and collaboration all about mobile apps now, we can finally carve the tombstone for QoS right? Well, not quite.

As it turns out, we are still using lots and lots of QoS today in SD-WAN networks. We’re just not calling it that. Instead, we’ve upgraded the term to something more snappy, like “Application Visibility”. Under the hood, it’s not much different than the QoS that we’ve done for years. We’re still picking out the applications and figuring out how to optimize their traffic patterns to make them more responsive.

The key with the new wave of SD-WAN is that we’re marrying QoS to conditional routing. Now, instead of being at the mercy of the ISP link to the Internet we can do something else. We can push bulk traffic across slow cheap links and ensure that our critical business applications have all the space they want on the fast expensive ones instead. We can push our out-of-band traffic out of an attached 4G/LTE modem. We can even push our traffic across the Internet to a gateway closer to the SaaS provider with better performance. That last bit is an especially delicious piece of irony, since it basically serves the same purpose as Tail-end Hop Off did back in the voice days.

And how does all this magical new QoS work on the Internet outside our control? That’s the real magic. It’s all tunnels! Yes, in order to make sure that we get our traffic where it needs to be in SD-WAN we simply prioritize it going out of the router and wrap it all in a tunnel to the next device. Everything moves along the Internet and the hop-by-hop treatment really doesn’t care in the long run. We’re instead optimizing transit through our network based on other factors besides DSCP markings. Sure, when the traffic arrives on the other side it can be optimized based on those values. However, in the real world the only thing that most users really care about is how fast they can get their application to perform on their local machine. And if SD-WAN can point them to the fastest SaaS gateway, they’ll be happy people.


Tom’s Take

QoS suffered the same fate as Ska music and NCIS. It never really went away even when people stopped caring about it as much as they did when it was the hot new thing on the block. Instead, the need for QoS disappeared when our traffic usage moved away from the usage it was designed to augment. Sure, SD-WAN has brought it back in a new form, QoS 2.0 if you will, but the need for what we used to spend hours of time doing with ancient tomes on knowledge is long gone. We should have a quiet service for QoS and acknowledge all that it has done for us. And then get ready to invite it back to the party in the form that it will take in the cloud future of tomorrow.

Why is Lync The Killer SDN Application?

lync-logo

The key to showing the promise of SDN is to find a real-world application to showcase capabilities.  I recently wrote about using SDN to slice education networks.  But this is just one idea.  When it comes to real promise, you have to shelve the approach and trot out a name.  People have to know that SDN will help them fix something on their network or optimize an troublesome program.  And it appears that application is Microsoft Lync.

MIssing Lync

Microsoft Lync (neè Microsoft Office Communicator) is a software application designed to facilitate communications.  It includes voice calling capability, instant messaging, and collaboration tools.  The voice part is particularly appealing to small businesses.  With a Microsoft Office 365 for Business subscription, you gain access to Lync.  That means introducing a voice soft client to your users.  And if it’s available, people are going to use it.

As a former voice engineer, I can tell you that soft clients are a bit of a pain to configure.  They have their own way of doing things.  Especially when Quality of Service (QoS) is involved.  In the past, tagging soft client voice packets with Cisco Jabber required setting cluster-wide parameters for all clients.  It was all-or-nothing.  There were also plans to use things like Cisco MediaNet to tag Jabber packets, but this appears to be an old method.  It was much easier to use physical phones and set their QoS value and leave the soft phones relegated to curiosities.

Lync doesn’t use a physical phone.  It’s all software based.  And as usage has grown, the need to categorize all that traffic for optimal network transmission has become important.  But configuring QoS for Lync is problematic at best.  Microsoft guidelines say to configure the Lync servers with QoS policies.  Some enterprising users have found ways to configure clients with Group Policy settings based on port numbers.  But it’s all still messy.

A Lync To The Future

That’s where SDN comes into play.  Dynamic QoS policies can be pushed into switches on the fly to recognize Lync traffic coming from hosts and adjust the network to suit high traffic volumes.  Video calls can be separated from audio calls and given different handling based on a variety of dynamically detected settings.  We can even guarantee end-to-end QoS and see that guarantee through the visibility that protocols like OpenFlow enable in a software defined network.

SDN QoS is very critical to the performance of soft clients.  Separating the user traffic from the critical communication traffic requires higher-order thinking and not group policy hacking.  Ensuring delivery end-to-end is only possible with SDN because of overall visibility.  Cisco has tried that with MediaNet and Cisco Prime, but it’s totally opt-in.  If there’s a device that Prime doesn’t know about inline, it will be a black hole.  SDN gives visibility into the entire network.

The Weakest Lync

That’s not to say that Lync doesn’t have it’s issues.  Cisco Jabber was an application built by a company with a strong networking background.  It reports information to the underlying infrastructure that allows QoS policies to work correctly.  The QoS marking method isn’t perfect, but at least it’s available.

Lync packets don’t respect the network.  Lync always assumes there will be adequate bandwidth.  Why else would it not allow for QoS tagging?  It’s also apparent when you realize that some vendors are marking packets with non-standard CoS/DSCP markings.  Lync will happily take override priority on the network.  Why doesn’t Lync listen to the traffic conditions around it?  Why does it exist in a vacuum?

Lync is an application written by application people.  It’s agnostic of networks.  It doesn’t know if it’s running on a high-speed LAN or across a slow WAN connection.  It can be ignorant of the network because that part just gets figured out.  It’s a classic example of a top-down program.  That’s why SDN holds such promise for Lync.  Because the app itself is unaware of the networks, SDN allows it to keep chugging along in bliss while the controllers and forwarding tables do all the heavy lifting.  And that’s why the tie between Lync and SDN is so strong.  Because SDN makes Lync work better without the need to actually do anything about Lync, or your server infrastructure in general.


Tom’s Take

Lync is the poster child for bad applications that can be fixed with SDN.  And when I say poster child, I mean it.  Extreme Networks, Aruba Networks, and Meru are all talking about using SDN in concert with Lync.  Some are using OpenFlow, others are using proprietary methods.  The end result is making a smarter network to handle an application living in a silo.  Cisco Jabber is easy to program for QoS because it was made by networking folks.  Lync is a pain because it lives in the same world as Office and SQL Server.  It’s only when networks become contentious that we have to find novel ways of solving problems.  Lync is the use case for SDN for small and medium enterprises focused primarily on wireless connectivity.  Because making Lync behave in that environment is indistinguishable from magic, at least without SDN.


If you want to see some interesting conversations about Lync and SDN, especially with OpenFlow, tune into SDN Connect Live on September 18th.  Meru Networks and Tech Field Day will have a roundtable discussion about Lync, featuring Lync experts and SDN practitioners.

Throw More Storage At It

Embed from Getty Images

All of  this has happened before.   All of this will happen again.

The more time I spend listening to storage engineers talk about the pressing issues they face in designing systems in this day and age, the more I’m convinced that we fight the same problems over and over again with different technologies.  Whether it be networking, storage, or even wireless, the same architecture problems crop in new ways and require different engineers to solve them all over again.

Quality is Problem One

A chance conversation with Howard Marks (@DeepStorageNet) at Storage Field Day 4 led me to start thinking about these problems.  He was talking during a presentation about the difficulty that storage vendors have faced in implementing quality of service (QoS) in storage arrays.  As Howard described some of the issues with isolating neighboring workloads and ensuring they can’t cause performance issues for a specific application, I started thinking about the implementation of low latency queuing (LLQ) for voice in networking.

LLQ was created to solve a specific problem.  High volume, low bandwidth flows can starve traditional priority queuing systems.  In much the same way, applications that demand high amounts of input/output operations per second (IOPS) while storing very little data can cause huge headaches for storage admins.  Storage has tried to solve this problem with hardware in the past by creating things like write-back caching or even super fast flash storage caching tiers.

Make It Go Away

In fact, a lot of the problems in storage mirror those from networking world many years ago.  Performance issues used to have a very simple solution – throw more hardware at the problem until  it goes away.  In networking, you throw bandwidth at the issue.  In storage, more IOPS are the answer.  When hardware isn’t pushed to the absolute limit, the answer will always be to keep driving it higher and higher.  But what happens when performance can’t fix the problem any more?

Think of a sports car like the Bugatti Veyron.  It is the fastest production car available today, with a top speed well over 250 miles per hour.  In fact, Bugatti refuses to talk about the absolute top speed, instead countering “Would you ask how fast a jet can fly?”  One of the limiting factors in attaining a true measure of the speed isn’t found in the car’s engine.  Instead, the sub systems around the engine being to fail at such stressful speeds.  At 258 miles per hour, the tires on the car will completely disintegrate in 15 minutes.  The fuel tank will be emptied in 12 minutes.  Bugatti wisely installed a governor on the engine limiting it to a maximum of 253 miles per hour in an attempt to prevent people from pushing the car to its limits.  A software control designed to prevent performance issues by creating artificial limits.  Sound familiar?

Storage has hit the wall when it comes to performance.  PCIe flash storage devices are capable of hundreds of thousands of IOPS.  A single PCI card has the throughput of an entire data center.  Hungry applications can quickly saturate the weak link in a system.  In days gone by, that was the IOPS capacity of a storage device.  Today, it’s the link that connects the flash device to the rest of the system.  Not until the advent of PCIe was the flash storage device fast enough to keep pace with workloads starved for performance.

Storage isn’t the only technology with hungry workloads stressing weak connection points.  Networking is quickly reaching this point with the shift to cloud computing.  Now, instead of the east-west traffic between data center racks being a point of contention, the connection point between the user and the public cloud will now be stressed to the breaking point.  WAN connection speeds have the same issues that flash storage devices do with non-PCIe interfaces.  They are quickly saturated by the amount of outbound traffic generated by hungry applications.  In this case, those applications are located in some far away data center instead of being next door on a storage array.  The net result is the same – resource contention.


Tom’s Take

Performance is king.  The faster, stronger, better widgets win in the marketplace.  When architecting systems it is often much easier to specify a bigger, faster device to alleviate performance problems.  Sadly, that only works to a point.  Eventually, the performance problems will shift to components that can’t be upgraded.  IOPS give way to transfer speeds on SATA connectors.  Data center traffic will be slowed by WAN uplinks.  Even the CPU in a virtual server will eventually become an issue after throwing more RAM at a problem.  Rather than just throwing performance at problems until they disappear, we need to take a long hard look at how we build things and make good decisions up front to prevent problems before they happen.

Like the Veyron engineers, we have to make smart choices about limiting the performance of a piece of hardware to ensure that other bottlenecks do not happen.  It’s not fun to explain why a given workload doesn’t get priority treatment.  It’s not glamorous to tell someone their archival data doesn’t need to live in a flash storage tier.  But it’s the decision that must be made to ensure the system is operating at peak performance.