QoS Is Dead. Long Live QoS!

Ah, good old Quality of Service. How often have we spent our time as networking professionals trying to discern the archaic texts of Szigeti to learn how to make you work? QoS is something that seemed so necessary to our networks years ago that we would spend hours upon hours trying to learn the best way to implement it for voice or bulk data traffic or some other reason. That was, until a funny thing happened. Until QoS was useless to us.

Rest In Peace and Queues

QoS didn’t die overnight. It didn’t wake up one morning without a home to go to. Instead, we slowly devalued and destroyed it over a period of years. We did it be focusing on the things that QoS was made for and then marginalizing them. Remember voice traffic?

We spent years installing voice over IP (VoIP) systems in our networks. And each of those systems needed QoS to function. We took our expertise in the arcane arts of queuing and applied it to the most finicky protocols we could find. And it worked. Our mystic knowledge made voice better! Our calls wouldn’t drop. Our packets arrived when they should. And the world was a happy place.

That is, until voice became pointless. When people started using mobile devices more and more instead of their desk phones, QoS wasn’t as important. When the steady generation of delay-sensitive packets instead moved back to LTE instead of IP it wasn’t as critical to ensure that FTP and other protocols in the LAN interfered with it. Even when people started using QoS on their mobile devices the marking was totally inconsistent. George Stefanick (@WirelesssGuru) found that Wi-Fi calling was doing some weird packet marking anyway:

So, without a huge packet generation issue, QoS was relegated to some weird traffic shaping roles. Maybe it was video prioritization in places where people cared about video? Or perhaps it was creating a scavenger class for traffic in order to get rid of unwanted applications like BitTorrent. But overall QoS languished as an oddity as more and more enterprises saw their collaboration traffic moving to be dominated by mobile devices that didn’t need the old dark magic of QoS.

QoupS de Gras

The real end of QoS came about thanks to the cloud. While we spent all of our time trying to find ways to optimize applications running on our local enterprise networks, developers were busy optimizing applications to run somewhere else. The ideas were sound enough in principle. By moving applications to the cloud we could continually improve them and push features faster. By having all the bit off the local network we could scale massively. We could even collaborate together in real time from anywhere in the world!

But applications that live in the cloud live outside our control. QoS was always bounded by the borders of our own networks. Once a packet was launched into the great beyond of the Internet we couldn’t control what happened to it. ISPs weren’t bound to honor our packet markings without an SLA. In fact, in most cases the ISP would remark all our packets anyway just to ensure they didn’t mess with the ISP’s ideas of traffic shaping. And even those were rudimentary at best given how well QoS plays with MPLS in the real world.

But cloud-based applications don’t worry about quality of service. They scale as large as you want. And nothing short of a massive cloud outage will make them unavailable. Sure, there may be some slowness here and there but that’s nothing less than you’d expect to receive running a heavy application over your local LAN. The real genius of the cloud shift is that it forced developers to slim down applications and make them more responsive in places where they could be made to be more interactive. Now, applications felt snappier when they ran in remote locations. And if you’ve every tried to use old versions of Outlook across slow links you now how critical that responsiveness can be.

The End is The Beginning

So, with cloud-based applications here to stay and collaboration all about mobile apps now, we can finally carve the tombstone for QoS right? Well, not quite.

As it turns out, we are still using lots and lots of QoS today in SD-WAN networks. We’re just not calling it that. Instead, we’ve upgraded the term to something more snappy, like “Application Visibility”. Under the hood, it’s not much different than the QoS that we’ve done for years. We’re still picking out the applications and figuring out how to optimize their traffic patterns to make them more responsive.

The key with the new wave of SD-WAN is that we’re marrying QoS to conditional routing. Now, instead of being at the mercy of the ISP link to the Internet we can do something else. We can push bulk traffic across slow cheap links and ensure that our critical business applications have all the space they want on the fast expensive ones instead. We can push our out-of-band traffic out of an attached 4G/LTE modem. We can even push our traffic across the Internet to a gateway closer to the SaaS provider with better performance. That last bit is an especially delicious piece of irony, since it basically serves the same purpose as Tail-end Hop Off did back in the voice days.

And how does all this magical new QoS work on the Internet outside our control? That’s the real magic. It’s all tunnels! Yes, in order to make sure that we get our traffic where it needs to be in SD-WAN we simply prioritize it going out of the router and wrap it all in a tunnel to the next device. Everything moves along the Internet and the hop-by-hop treatment really doesn’t care in the long run. We’re instead optimizing transit through our network based on other factors besides DSCP markings. Sure, when the traffic arrives on the other side it can be optimized based on those values. However, in the real world the only thing that most users really care about is how fast they can get their application to perform on their local machine. And if SD-WAN can point them to the fastest SaaS gateway, they’ll be happy people.


Tom’s Take

QoS suffered the same fate as Ska music and NCIS. It never really went away even when people stopped caring about it as much as they did when it was the hot new thing on the block. Instead, the need for QoS disappeared when our traffic usage moved away from the usage it was designed to augment. Sure, SD-WAN has brought it back in a new form, QoS 2.0 if you will, but the need for what we used to spend hours of time doing with ancient tomes on knowledge is long gone. We should have a quiet service for QoS and acknowledge all that it has done for us. And then get ready to invite it back to the party in the form that it will take in the cloud future of tomorrow.

Throw More Storage At It

Embed from Getty Images

All of  this has happened before.   All of this will happen again.

The more time I spend listening to storage engineers talk about the pressing issues they face in designing systems in this day and age, the more I’m convinced that we fight the same problems over and over again with different technologies.  Whether it be networking, storage, or even wireless, the same architecture problems crop in new ways and require different engineers to solve them all over again.

Quality is Problem One

A chance conversation with Howard Marks (@DeepStorageNet) at Storage Field Day 4 led me to start thinking about these problems.  He was talking during a presentation about the difficulty that storage vendors have faced in implementing quality of service (QoS) in storage arrays.  As Howard described some of the issues with isolating neighboring workloads and ensuring they can’t cause performance issues for a specific application, I started thinking about the implementation of low latency queuing (LLQ) for voice in networking.

LLQ was created to solve a specific problem.  High volume, low bandwidth flows can starve traditional priority queuing systems.  In much the same way, applications that demand high amounts of input/output operations per second (IOPS) while storing very little data can cause huge headaches for storage admins.  Storage has tried to solve this problem with hardware in the past by creating things like write-back caching or even super fast flash storage caching tiers.

Make It Go Away

In fact, a lot of the problems in storage mirror those from networking world many years ago.  Performance issues used to have a very simple solution – throw more hardware at the problem until  it goes away.  In networking, you throw bandwidth at the issue.  In storage, more IOPS are the answer.  When hardware isn’t pushed to the absolute limit, the answer will always be to keep driving it higher and higher.  But what happens when performance can’t fix the problem any more?

Think of a sports car like the Bugatti Veyron.  It is the fastest production car available today, with a top speed well over 250 miles per hour.  In fact, Bugatti refuses to talk about the absolute top speed, instead countering “Would you ask how fast a jet can fly?”  One of the limiting factors in attaining a true measure of the speed isn’t found in the car’s engine.  Instead, the sub systems around the engine being to fail at such stressful speeds.  At 258 miles per hour, the tires on the car will completely disintegrate in 15 minutes.  The fuel tank will be emptied in 12 minutes.  Bugatti wisely installed a governor on the engine limiting it to a maximum of 253 miles per hour in an attempt to prevent people from pushing the car to its limits.  A software control designed to prevent performance issues by creating artificial limits.  Sound familiar?

Storage has hit the wall when it comes to performance.  PCIe flash storage devices are capable of hundreds of thousands of IOPS.  A single PCI card has the throughput of an entire data center.  Hungry applications can quickly saturate the weak link in a system.  In days gone by, that was the IOPS capacity of a storage device.  Today, it’s the link that connects the flash device to the rest of the system.  Not until the advent of PCIe was the flash storage device fast enough to keep pace with workloads starved for performance.

Storage isn’t the only technology with hungry workloads stressing weak connection points.  Networking is quickly reaching this point with the shift to cloud computing.  Now, instead of the east-west traffic between data center racks being a point of contention, the connection point between the user and the public cloud will now be stressed to the breaking point.  WAN connection speeds have the same issues that flash storage devices do with non-PCIe interfaces.  They are quickly saturated by the amount of outbound traffic generated by hungry applications.  In this case, those applications are located in some far away data center instead of being next door on a storage array.  The net result is the same – resource contention.


Tom’s Take

Performance is king.  The faster, stronger, better widgets win in the marketplace.  When architecting systems it is often much easier to specify a bigger, faster device to alleviate performance problems.  Sadly, that only works to a point.  Eventually, the performance problems will shift to components that can’t be upgraded.  IOPS give way to transfer speeds on SATA connectors.  Data center traffic will be slowed by WAN uplinks.  Even the CPU in a virtual server will eventually become an issue after throwing more RAM at a problem.  Rather than just throwing performance at problems until they disappear, we need to take a long hard look at how we build things and make good decisions up front to prevent problems before they happen.

Like the Veyron engineers, we have to make smart choices about limiting the performance of a piece of hardware to ensure that other bottlenecks do not happen.  It’s not fun to explain why a given workload doesn’t get priority treatment.  It’s not glamorous to tell someone their archival data doesn’t need to live in a flash storage tier.  But it’s the decision that must be made to ensure the system is operating at peak performance.

SDN and Appeal of Abstraction

MeadowSunrise

During the recent Storage Field Day 4, I was listening to Howard Marks (@DeepStorageNet) talking about storage Quality of Service (QoS).  He was describing something he wanted his storage array to do that sounded suspiciously similar to strict priority queuing (or Low Latency Queuing if you speak Cisco).

As I explained how we solved the problem of allowing a specific amount of priority for a given packet stream, it finally dawned on me how software defined networking had the ability to increase productivity in organizations.  It’s not via automation, although that is a very alluring feature.  It’s because SDN can act as a language interpreter for those that don’t speak the right syntax.

It’s said that a picture is worth a thousand words.  As true as that is I would also argue that a few words can paint a thousand pictures.  It’s all in the interpretation.  If I told you I wanted a picture of a “meadow at sunrise”, you’ve probably come up with an idea in your head of what that would look like.  And the odds are good that it’s totally different from my picture.  These simple words are open to interpretation on a wide scale.  Now, let’s constrain the pictures a bit based on the recipient.  Someone that lives in France would have a totally different idea of a meadow than someone that lives in San Francisco.  Each view is totally valid, but the construction of the picture will be dictated by the thought process of the person.  They each know what the meadow looks like, they’re just a bit different.

Painting with SDN

Extend this metaphor to software.  Specifically to networking and storage.  I have a concept that I need to implement.  User A needs guaranteed access to a specific resource for a period of time that should not exceed a given threshold.  A phone may need priority queue access to a slow WAN link.  A backup client may need guaranteed access to a target server without overrunning the link bandwidth.  Two totally different use cases that can be described in the same general language.  Today, I would need two people to implement those instructions on different systems.  Programming a storage array is much different that programming a router or a switch.

But the abstraction of SDN allows my to input a given command and have it executed on dissimilar hardware via automation and integration.  The SDN management system doesn’t care that I want LLQ on a router or strict priority QoS on a backup client.  It knows that there should be an implementation of the given commands on a set of attached systems.  It’s up to the API integration with the systems to determine the syntax needed to execute the commands.  As a network engineer, I don’t need to know the commands to create the QoS construct on the storage array.  I just need to know what I want to accomplish.  The software takes care of the rest.

SDN could usher in a new age for natural language programming.  I often hear my friends in the CCIE community complaining that people only learn the commands to make something happen.  They ignore the basic concepts behind why something works the way it does.  If I can’t explain the concept to my 8-year old son, I don’t know it very well.  I can’t abstract it to the point where a simpler mind can understand.  What if that simple mind had the capability to translate those instructions into actionable programming?  What if I just had to state what I wanted to do?  “Ensure that all traffic on Link A is given priority treatment, except if it is for the backup server.”  The system can create API calls to program the access control lists and storage arrays to take care of all that without needing to fire up a CLI session or a browser.

This is going to require a lot of programming on the front end.  The management system needs to be able to interpret keywords in the instruction set to pick out the right execution items.  Then the system needs to know where to dispatch the commands to make them land on the right APIs.  In the case of the above example, the system would need to know to send two different sets of commands – one to the storage array to provide metered access during backup windows and another set to the networking gear to ensure the right QoS policies were enforced during the given time window.  Oh, and you’re going to want to have the system go back and clean up those ACLs when it’s time for production hours to start again.


Tom’s Take

This isn’t going to be easy by any means.  But adding all the value into the front end management system means that any other system attached on the backend via API is going to benefit.  I’d rather be doing the work in the right places as opposed to spending all our time on the backend and neglecting the interface to the whole system.  Engineers are notorious for writing terrible GUIs.  Let’s take the time to abstract the bad things away and instead make something that’s really going to change the way we program systems.