QoS Is Dead. Long Live QoS!

Ah, good old Quality of Service. How often have we spent our time as networking professionals trying to discern the archaic texts of Szigeti to learn how to make you work? QoS is something that seemed so necessary to our networks years ago that we would spend hours upon hours trying to learn the best way to implement it for voice or bulk data traffic or some other reason. That was, until a funny thing happened. Until QoS was useless to us.

Rest In Peace and Queues

QoS didn’t die overnight. It didn’t wake up one morning without a home to go to. Instead, we slowly devalued and destroyed it over a period of years. We did it by focusing on the things that QoS was made for and then marginalizing them. Remember voice traffic?

We spent years installing voice over IP (VoIP) systems in our networks. And each of those systems needed QoS to function. We took our expertise in the arcane arts of queuing and applied it to the most finicky protocols we could find. And it worked. Our mystic knowledge made voice better! Our calls wouldn’t drop. Our packets arrived when they should. And the world was a happy place.

That is, until voice became pointless. When people started using mobile devices more and more instead of their desk phones, QoS wasn’t as important. When the steady generation of delay-sensitive packets moved to LTE instead of IP, it wasn’t as critical to ensure that FTP and other protocols in the LAN didn’t interfere with it. Even when people started using QoS on their mobile devices the marking was totally inconsistent. George Stefanick (@WirelesssGuru) found that Wi-Fi calling was doing some weird packet marking anyway.

So, without a huge packet generation issue, QoS was relegated to some weird traffic shaping roles. Maybe it was video prioritization in places where people cared about video? Or perhaps it was creating a scavenger class for traffic in order to get rid of unwanted applications like BitTorrent. But overall QoS languished as an oddity as more and more enterprises saw their collaboration traffic moving to be dominated by mobile devices that didn’t need the old dark magic of QoS.

QoupS de Gras

The real end of QoS came about thanks to the cloud. While we spent all of our time trying to find ways to optimize applications running on our local enterprise networks, developers were busy optimizing applications to run somewhere else. The ideas were sound enough in principle. By moving applications to the cloud we could continually improve them and push features faster. By having all the bits off the local network we could scale massively. We could even collaborate together in real time from anywhere in the world!

But applications that live in the cloud live outside our control. QoS was always bounded by the borders of our own networks. Once a packet was launched into the great beyond of the Internet we couldn’t control what happened to it. ISPs weren’t bound to honor our packet markings without an SLA. In fact, in most cases the ISP would remark all our packets anyway just to ensure they didn’t mess with the ISP’s ideas of traffic shaping. And even those were rudimentary at best given how well QoS plays with MPLS in the real world.

But cloud-based applications don’t worry about quality of service. They scale as large as you want. And nothing short of a massive cloud outage will make them unavailable. Sure, there may be some slowness here and there but that’s nothing less than you’d expect to receive running a heavy application over your local LAN. The real genius of the cloud shift is that it forced developers to slim down applications and make them more responsive in places where they could be made to be more interactive. Now, applications felt snappier when they ran in remote locations. And if you’ve ever tried to use old versions of Outlook across slow links you know how critical that responsiveness can be.

The End is The Beginning

So, with cloud-based applications here to stay and collaboration all about mobile apps now, we can finally carve the tombstone for QoS, right? Well, not quite.

As it turns out, we are still using lots and lots of QoS today in SD-WAN networks. We’re just not calling it that. Instead, we’ve upgraded the term to something more snappy, like “Application Visibility”. Under the hood, it’s not much different than the QoS that we’ve done for years. We’re still picking out the applications and figuring out how to optimize their traffic patterns to make them more responsive.

The key with the new wave of SD-WAN is that we’re marrying QoS to conditional routing. Now, instead of being at the mercy of the ISP link to the Internet we can do something else. We can push bulk traffic across slow cheap links and ensure that our critical business applications have all the space they want on the fast expensive ones instead. We can push our out-of-band traffic out of an attached 4G/LTE modem. We can even push our traffic across the Internet to a gateway closer to the SaaS provider with better performance. That last bit is an especially delicious piece of irony, since it basically serves the same purpose as Tail-end Hop Off did back in the voice days.
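To make the idea of marrying QoS to conditional routing a little more concrete, here is a minimal Python sketch. It is not how any particular SD-WAN product works. The link names, the metrics, the thresholds in POLICY, and the pick_link function are all assumptions made up for illustration. The only thing it tries to show is the decision itself: classify the application, then pick the cheapest link that still meets that class's needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    name: str
    latency_ms: float
    loss_pct: float
    cost: int  # relative cost; higher means more expensive


# Hypothetical per-class requirements, invented for this sketch.
POLICY = {
    "critical": {"max_latency_ms": 50.0, "max_loss_pct": 1.0},
    "bulk": {"max_latency_ms": 500.0, "max_loss_pct": 5.0},
}


def pick_link(app_class, links):
    """Return the cheapest link that meets the class thresholds.

    If no link qualifies, fall back to the lowest-latency link.
    """
    limits = POLICY[app_class]
    qualifying = [
        link
        for link in links
        if link.latency_ms <= limits["max_latency_ms"]
        and link.loss_pct <= limits["max_loss_pct"]
    ]
    if qualifying:
        return min(qualifying, key=lambda link: link.cost)
    return min(links, key=lambda link: link.latency_ms)


# Made-up measurements for three uplinks.
LINKS = [
    Link("mpls", latency_ms=20.0, loss_pct=0.1, cost=10),
    Link("broadband", latency_ms=80.0, loss_pct=1.5, cost=2),
    Link("lte", latency_ms=120.0, loss_pct=3.0, cost=5),
]
```

With these made-up numbers, critical traffic ends up on the fast expensive link because nothing else meets its thresholds, while bulk traffic lands on the cheap broadband link.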

And how does all this magical new QoS work on the Internet outside our control? That’s the real magic. It’s all tunnels! Yes, in order to make sure that we get our traffic where it needs to be in SD-WAN we simply prioritize it going out of the router and wrap it all in a tunnel to the next device. Everything moves along the Internet and the hop-by-hop treatment really doesn’t care in the long run. We’re instead optimizing transit through our network based on other factors besides DSCP markings. Sure, when the traffic arrives on the other side it can be optimized based on those values. However, in the real world the only thing that most users really care about is how fast they can get their application to perform on their local machine. And if SD-WAN can point them to the fastest SaaS gateway, they’ll be happy people.


Tom’s Take

QoS suffered the same fate as Ska music and NCIS. It never really went away even when people stopped caring about it as much as they did when it was the hot new thing on the block. Instead, the need for QoS disappeared when our traffic usage moved away from the usage it was designed to augment. Sure, SD-WAN has brought it back in a new form, QoS 2.0 if you will, but the need for what we used to spend hours of time doing with ancient tomes of knowledge is long gone. We should have a quiet service for QoS and acknowledge all that it has done for us. And then get ready to invite it back to the party in the form that it will take in the cloud future of tomorrow.

Why is Lync The Killer SDN Application?

The key to showing the promise of SDN is to find a real-world application to showcase capabilities.  I recently wrote about using SDN to slice education networks.  But this is just one idea.  When it comes to real promise, you have to shelve the approach and trot out a name.  People have to know that SDN will help them fix something on their network or optimize a troublesome program.  And it appears that application is Microsoft Lync.

Missing Lync

Microsoft Lync (née Microsoft Office Communicator) is a software application designed to facilitate communications.  It includes voice calling capability, instant messaging, and collaboration tools.  The voice part is particularly appealing to small businesses.  With a Microsoft Office 365 for Business subscription, you gain access to Lync.  That means introducing a voice soft client to your users.  And if it’s available, people are going to use it.

As a former voice engineer, I can tell you that soft clients are a bit of a pain to configure.  They have their own way of doing things.  Especially when Quality of Service (QoS) is involved.  In the past, tagging soft client voice packets with Cisco Jabber required setting cluster-wide parameters for all clients.  It was all-or-nothing.  There were also plans to use things like Cisco MediaNet to tag Jabber packets, but this appears to be an old method.  It was much easier to use physical phones and set their QoS value and leave the soft phones relegated to curiosities.

Lync doesn’t use a physical phone.  It’s all software based.  And as usage has grown, the need to categorize all that traffic for optimal network transmission has become important.  But configuring QoS for Lync is problematic at best.  Microsoft guidelines say to configure the Lync servers with QoS policies.  Some enterprising users have found ways to configure clients with Group Policy settings based on port numbers.  But it’s all still messy.
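For illustration, the port-based approach boils down to a lookup like the one below.  This is a minimal Python sketch, not Microsoft's guidance or anyone's actual policy.  The port ranges are hypothetical placeholders, and dscp_for_port and tos_byte are names invented for the example.  Only the DSCP code points themselves (EF = 46, AF41 = 34) are standard values.

```python
# Standard DSCP code points: EF = 46, AF41 = 34, best effort = 0.
DSCP_EF = 46
DSCP_AF41 = 34
DSCP_DEFAULT = 0

# Hypothetical client source-port ranges. Substitute whatever ranges
# your own policy actually pins the client to.
PORT_CLASSES = [
    (range(50000, 50020), DSCP_EF),    # assumed audio ports
    (range(50020, 50040), DSCP_AF41),  # assumed video ports
]


def dscp_for_port(src_port):
    """Map a source port to a DSCP value, defaulting to best effort."""
    for ports, dscp in PORT_CLASSES:
        if src_port in ports:
            return dscp
    return DSCP_DEFAULT


def tos_byte(dscp):
    """DSCP sits in the upper six bits of the IP ToS / Traffic Class byte."""
    return dscp << 2
```

The messy part is easy to see here.  Anything that happens to use a port in those ranges gets the same treatment, whether it is a voice call or not.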

A Lync To The Future

That’s where SDN comes into play.  Dynamic QoS policies can be pushed into switches on the fly to recognize Lync traffic coming from hosts and adjust the network to suit high traffic volumes.  Video calls can be separated from audio calls and given different handling based on a variety of dynamically detected settings.  We can even guarantee end-to-end QoS and see that guarantee through the visibility that protocols like OpenFlow enable in a software defined network.

SDN QoS is very critical to the performance of soft clients.  Separating the user traffic from the critical communication traffic requires higher-order thinking and not group policy hacking.  Ensuring delivery end-to-end is only possible with SDN because of overall visibility.  Cisco has tried that with MediaNet and Cisco Prime, but it’s totally opt-in.  If there’s a device that Prime doesn’t know about inline, it will be a black hole.  SDN gives visibility into the entire network.

The Weakest Lync

That’s not to say that Lync doesn’t have its issues.  Cisco Jabber was an application built by a company with a strong networking background.  It reports information to the underlying infrastructure that allows QoS policies to work correctly.  The QoS marking method isn’t perfect, but at least it’s available.

Lync packets don’t respect the network.  Lync always assumes there will be adequate bandwidth.  Why else would it not allow for QoS tagging?  It’s also apparent when you realize that some vendors are marking packets with non-standard CoS/DSCP markings.  Lync will happily override priority on the network.  Why doesn’t Lync listen to the traffic conditions around it?  Why does it exist in a vacuum?

Lync is an application written by application people.  It’s agnostic of networks.  It doesn’t know if it’s running on a high-speed LAN or across a slow WAN connection.  It can be ignorant of the network because that part just gets figured out.  It’s a classic example of a top-down program.  That’s why SDN holds such promise for Lync.  Because the app itself is unaware of the networks, SDN allows it to keep chugging along in bliss while the controllers and forwarding tables do all the heavy lifting.  And that’s why the tie between Lync and SDN is so strong.  Because SDN makes Lync work better without the need to actually do anything about Lync, or your server infrastructure in general.


Tom’s Take

Lync is the poster child for bad applications that can be fixed with SDN.  And when I say poster child, I mean it.  Extreme Networks, Aruba Networks, and Meru are all talking about using SDN in concert with Lync.  Some are using OpenFlow, others are using proprietary methods.  The end result is making a smarter network to handle an application living in a silo.  Cisco Jabber is easy to program for QoS because it was made by networking folks.  Lync is a pain because it lives in the same world as Office and SQL Server.  It’s only when networks become contentious that we have to find novel ways of solving problems.  Lync is the use case for SDN for small and medium enterprises focused primarily on wireless connectivity.  Because making Lync behave in that environment is indistinguishable from magic, at least without SDN.


If you want to see some interesting conversations about Lync and SDN, especially with OpenFlow, tune into SDN Connect Live on September 18th.  Meru Networks and Tech Field Day will have a roundtable discussion about Lync, featuring Lync experts and SDN practitioners.

Throw More Storage At It

All of this has happened before.  All of this will happen again.

The more time I spend listening to storage engineers talk about the pressing issues they face in designing systems in this day and age, the more I’m convinced that we fight the same problems over and over again with different technologies.  Whether it be networking, storage, or even wireless, the same architecture problems crop up in new ways and require different engineers to solve them all over again.

Quality is Problem One

A chance conversation with Howard Marks (@DeepStorageNet) at Storage Field Day 4 led me to start thinking about these problems.  He was talking during a presentation about the difficulty that storage vendors have faced in implementing quality of service (QoS) in storage arrays.  As Howard described some of the issues with isolating neighboring workloads and ensuring they can’t cause performance issues for a specific application, I started thinking about the implementation of low latency queuing (LLQ) for voice in networking.

LLQ was created to solve a specific problem.  High volume, low bandwidth flows can starve traditional priority queuing systems.  In much the same way, applications that demand high amounts of input/output operations per second (IOPS) while storing very little data can cause huge headaches for storage admins.  Storage has tried to solve this problem with hardware in the past by creating things like write-back caching or even super fast flash storage caching tiers.
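LLQ is usually described as a strict-priority queue bolted onto class-based weighted fair queuing, with the priority queue policed so it can't starve everything else.  The Python sketch below shows that shape in miniature.  It is a simplification under stated assumptions: real implementations police in terms of bandwidth and schedule bytes, while this one counts packets per round, and the LLQScheduler class and its method names are invented for the example.

```python
from collections import deque


class LLQScheduler:
    """Toy scheduler: one policed priority queue plus weighted classes."""

    def __init__(self, priority_budget, weights):
        self.priority = deque()
        # Assumed policing rule: at most this many priority packets per round.
        self.priority_budget = priority_budget
        # Class name -> packets that class may send per round.
        self.weights = dict(weights)
        self.queues = {name: deque() for name in self.weights}

    def enqueue(self, packet, cls=None):
        """Queue a packet; cls=None means the priority (voice) queue."""
        if cls is None:
            self.priority.append(packet)
        else:
            self.queues[cls].append(packet)

    def next_round(self):
        """Return the packets sent in one scheduling round, in order."""
        sent = []
        # 1. Priority traffic goes first, but only up to its budget.
        for _ in range(self.priority_budget):
            if not self.priority:
                break
            sent.append(self.priority.popleft())
        # 2. Remaining classes share the round according to their weights.
        for name, weight in self.weights.items():
            queue = self.queues[name]
            for _ in range(weight):
                if not queue:
                    break
                sent.append(queue.popleft())
        return sent
```

The budget on the priority queue is the whole point.  Voice always goes first, but a flood of it can't lock the other classes out of a round.  A storage array trying to isolate a noisy neighbor is making the same trade.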

Make It Go Away

In fact, a lot of the problems in storage mirror those from the networking world many years ago.  Performance issues used to have a very simple solution: throw more hardware at the problem until it goes away.  In networking, you throw bandwidth at the issue.  In storage, more IOPS are the answer.  When hardware isn’t pushed to the absolute limit, the answer will always be to keep driving it higher and higher.  But what happens when performance can’t fix the problem any more?

Think of a sports car like the Bugatti Veyron.  It is the fastest production car available today, with a top speed well over 250 miles per hour.  In fact, Bugatti refuses to talk about the absolute top speed, instead countering “Would you ask how fast a jet can fly?”  One of the limiting factors in attaining a true measure of the speed isn’t found in the car’s engine.  Instead, the subsystems around the engine begin to fail at such stressful speeds.  At 258 miles per hour, the tires on the car will completely disintegrate in 15 minutes.  The fuel tank will be emptied in 12 minutes.  Bugatti wisely installed a governor on the engine limiting it to a maximum of 253 miles per hour in an attempt to prevent people from pushing the car to its limits.  A software control designed to prevent performance issues by creating artificial limits.  Sound familiar?

Storage has hit the wall when it comes to performance.  PCIe flash storage devices are capable of hundreds of thousands of IOPS.  A single PCI card has the throughput of an entire data center.  Hungry applications can quickly saturate the weak link in a system.  In days gone by, that was the IOPS capacity of a storage device.  Today, it’s the link that connects the flash device to the rest of the system.  Not until the advent of PCIe was the flash storage device fast enough to keep pace with workloads starved for performance.

Storage isn’t the only technology with hungry workloads stressing weak connection points.  Networking is quickly reaching this point with the shift to cloud computing.  Now, instead of the east-west traffic between data center racks being a point of contention, the connection point between the user and the public cloud will be stressed to the breaking point.  WAN connection speeds have the same issues that flash storage devices do with non-PCIe interfaces.  They are quickly saturated by the amount of outbound traffic generated by hungry applications.  In this case, those applications are located in some far away data center instead of being next door on a storage array.  The net result is the same: resource contention.


Tom’s Take

Performance is king.  The faster, stronger, better widgets win in the marketplace.  When architecting systems it is often much easier to specify a bigger, faster device to alleviate performance problems.  Sadly, that only works to a point.  Eventually, the performance problems will shift to components that can’t be upgraded.  IOPS give way to transfer speeds on SATA connectors.  Data center traffic will be slowed by WAN uplinks.  Even the CPU in a virtual server will eventually become an issue after throwing more RAM at a problem.  Rather than just throwing performance at problems until they disappear, we need to take a long hard look at how we build things and make good decisions up front to prevent problems before they happen.

Like the Veyron engineers, we have to make smart choices about limiting the performance of a piece of hardware to ensure that other bottlenecks do not happen.  It’s not fun to explain why a given workload doesn’t get priority treatment.  It’s not glamorous to tell someone their archival data doesn’t need to live in a flash storage tier.  But it’s the decision that must be made to ensure the system is operating at peak performance.
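A governor like that is easy to picture in software.  One common way to cap a workload is a token bucket, sketched below in Python.  This is an illustration under assumptions rather than how any particular array or router does it: the TokenBucket class is invented for the example, the rate and burst numbers are arbitrary, and the caller passes in the current time so the behavior is easy to follow.

```python
class TokenBucket:
    """Cap a workload at `rate` operations per second, with a small burst."""

    def __init__(self, rate, burst):
        self.rate = float(rate)      # tokens (operations) refilled per second
        self.burst = float(burst)    # maximum tokens the bucket can hold
        self.tokens = float(burst)   # start with a full bucket
        self.last = 0.0              # time of the previous call, in seconds

    def allow(self, now, cost=1.0):
        """Return True if an operation costing `cost` tokens may run at `now`."""
        elapsed = max(0.0, now - self.last)
        # Refill for the time that has passed, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A workload behind TokenBucket(rate=4, burst=2) can fire two operations back to back and then has to wait for the bucket to refill, no matter how fast the hardware underneath could go.  That wait is the artificial limit doing its job.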