Don’t Build Big Data With Bad Data

I was at Pure Accelerate 2017 this week and I saw some very interesting things around big data and the impact that high speed flash storage is going to have. Storage vendors serving that market are starting to include analytics capabilities on the box in an effort to provide extra value. But what happens when these advances cause issues in the training of algorithms?

Garbage In, Garbage Out

One story that came out of a conversation was about training a system to recognize people. In the process of training the system, the users imported a large number of faces in order to help the system start the process of differentiating individuals. The data set they started with? A collection of male headshots from the Screen Actors Guild. By the time the users caught the mistake, the algorithm had already proven that it had issues telling the difference between test subjects of particular ethnicities. After scrapping the data set and using some different diverse data sources, the system started performing much better.

This started me thinking about the quality of the data that we are importing into machine learning and artificial intelligence systems. The old computer adage of “garbage in, garbage out” is never more apt today than it has been in history. Before, bad inputs caused data to be suspect when extracted. Now, inputting bad data into a system designed to make decisions can have even more far-reaching consequences.

Look at all the systems that we’re programming today to be more AI-like. We’ve got self-driving cars that need massive data inputs to help navigate roads at speed. We have network monitoring systems that take analytics data and use it to predict things like component failures. We even have these systems running the background of popular apps that provide us news and other crucial information.

What if the inputs into the system cause it to become corrupted or somehow compromised? You’ve probably heard the story about how importing UrbanDictionary into Watson caused it to start cursing constantly. These kinds of stories highlight how important the quality of data being used for the basis of AI/ML systems can be.

Think of a future when self-driving cars are being programmed with failsafes to avoid living things in the roadway. Suppose that the car has been programmed to avoid humans and other large animals like horses and cows. But, during the import of the small animal data set, the table for dogs isn’t imported for some reason. Now, what would happen if the car encountered a dog in the road? Would it make the right decision to avoid the animal? Would the outline of the dog trigger a subroutine that helped it make the right decision? Or would the car not be able to tell what a dog was and do something horrible?

Do You See What I See?

After some chatting with my friend Ryan Adzima, he taught me a bit about how facial recognition systems work. I had always assumed that these systems could differentiate on things like colors. So it could tell a blond woman from a brunette, for instance. But Ryan told me that it’s actually very difficult for a system to tell fine colors apart.

Instead, systems try to create contrast in the colors of the picture so that certain features stand out. Those features have a grid overlaid on them and then those grids are compared and contrasted. That’s the fastest way for a system to discern between individuals. It makes sense considering how CPU-bound things are today and the lack of high definition cameras to capture information for the system.

But, we also must realize that we have to improve data collection for our AI/ML systems in order to ensure that the systems are receiving good data to make decisions. We need to build validation models into our systems and checks to make sure the data looks and sounds sane at the point of input. These are the kinds of things that take time and careful consideration when planning to ensure they don’t become a hinderance to the system. If the very safeguards we put in place to keep data correct end up causing problems, we’re going to create a system that falls apart before it can do what it was designed to do.

Tom’s Take

I thought the story about the AI training was a bit humorous, but it does belie a huge issue with computer systems going forward. We need to be absolutely sure of the veracity of our data as we begin using it to train systems to think for themselves. Sure, teaching a Jeopardy-winning system to curse is one thing. But if we teach a system to be racist or murderous because of what information we give it to make decisions, we will have programmed a new life form to exhibit the worst of us instead of the best.

Networking Needs Information, Not Data


Networking Field Day 12 starts today. There are a lot of great presenters lined up. As I talk to more and more networking companies, it’s becoming obvious that simply moving packets is not the way to go now. Instead, the real sizzle is in telling you all about those packets instead. Not packet inspection but analytics.

Tell Me More, Tell Me More

Ask any networking professional and they’ll tell you that the systems they manage have a wealth of information. SNMP can give you monitoring data for a set of points defined in database files. Other protocols like NetFlow or sFlow can give you more granular data about a particular packet group of data flow in your network. Even more advanced projects like Intel’s Snap are building on the idea of using telemetry to collect disparate data sources and build collection methodologies to do something with them.

The concern that becomes quickly apparent is the overwhelming amount of data being received from all these sources. It reminds me a bit of this scene:

How can you drink from this firehose? Maybe you should be asking if you should instead?

Order From Chaos

Data is useless. We need to perform analysis on it to get information. That’s where a new wave of companies is coming into the networking market. They are building on the frameworks and systems that are aggregating data and presenting it in a way that makes it useful information. Instead of random data points about NetFlow, these solutions tell you that you’ve got a huge problem with outbound traffic of a specific type that is sent at a specific time with a specific payload. The difference is that instead of sorting through data to make sense of it, you’ve got a tool delivering the analysis instead of the raw data.

Sometimes it’s as simple as color-coding lines of Wireshark captures. Resets are bad, so they show up red. Properly torn down connections are good so they are green. You can instantly figure out how good things are going by looking for the colors. That’s analysis from raw data. The real trick in modern networking monitoring is to find a way to analyze and provide context for massive amounts of data that may not have an immediate correlation.

Networking professionals are smart people. They can intuit a lot of potential issues from a given data set. They can make the logical leap to a specific issue given time. What reduces that ability is the sheer amount of things that can go wrong with a particular system and the speed at which those problems must be fixed, especially at scale. A hiccup on one end of the network can be catastrophic on the others if allowed to persist.

Analytics can give us the context we need. It can provide confidence levels for common problems. It can ensure that symptoms are indeed happening above a given baseline or threshold. It can help us narrow the symptoms and potential issues before we even look at the data. Analytics can exclude the impossible while highlighting the more probably causes and outcomes. Analytics can give us peace of mind.

Tom’s Take

Analytics isn’t doing our job for us. Instead, it’s giving us the ability to concentrate. Anyone that spends their time sifting through data to try and find patterns is losing the signal in the noise. Patterns are things that software can find easily. We need to leverage the work being put into network analytics systems to help us track down the issues before they blow up into full problems. We need to apply the thing that makes network professionals the best suited to look at the best information we can gather about a situation. Our concentration on what matters is where our job will be in five years. Let’s take the knowledge we have and apply it.

Gathering No MOS


If you work in the voice or video world, you’ve undoubtedly heard about Mean Opinion Scores (MOS). MOS is a rough way of ranking the quality of the sound on a call. It’s widely used to determine the over experience for the user on the other end of the phone. MOS represents something important in the grand scheme of communications. However, MOS is quickly becoming a crutch that needs some explanation.

That’s Just Like Your Opinion

The first think to keep in mind when you look at MOS data is that the second word in the term is opinion. Originally, MOS was derived by having selected people listen to calls and rank them on a scale of 1 (I can’t hear you) to 5 (We’re sitting next to each other). The idea was to see if listeners could distinguish when certain aspects of the call were changed, such as pathing or exchange equipment. It was an all-or-nothing ranking. Good calls got a 4 or even rarely a 5. Most terrible calls got 2 or 3. You take the average of all your subjects and that gives your the overall MOS for your system.


When digital systems came along, MOS took on an entirely different meaning. Rather than being used to subjectively rank call quality, MOS became a yardstick for tweaking the codec used to digitally transform analog speech to digital packets. Since this has to happen in order for the data to be sent, all digital calls must have a codec somewhere. The first codecs were trying to approximate the quality of a long distance phone call, which was the gold standard for quality. After that target was reached, providers started messing around the codecs in question to reduce bandwidth usage.

G.711 is considered the baseline level of call quality from which all others are measure. It has a relative MOS of 4.1, which means very good voice quality. It also uses around 64 kbps of bandwidth. As developers started playing with encoding schemes and other factors, they started developing codecs which used significantly less bandwidth and had almost equivalent quality. G.729 uses only 8 kbps of bandwidth but has a MOS of 3.9. It’s almost as good as G.711 in most cases but uses an eighth of the resources.

MOS has always been subjective. That was until VoIP system providers found that certain network metrics have an impact on the quality of a call. Things like packet loss, delay, and jitter all have negative impacts on call quality. By measuring these values a system could give an approximation of MOS for an admin without needing to go through the hassle of making people actually listen to the calls. That data could then be provided through analytics dashboards as an input into the overall health of the system.

Like A Rolling Stone

The problem with MOS is that it has always been subjective. Two identical calls may have different MOS scores based on the listener. Two radically different codecs could have similar MOS scores because of simple factors like tonality or speech isolation. Using a subjective ranking matrix to display empirical data is unwieldy at best. The only reason to use MOS as a yardstick is because everyone understands what MOS is.

Enter R-values. R-values take inputs from the same monitoring systems that produce MOS and rank those inputs on a scale of 1 – 100. Those scores can then be ranked with more precision to determine call quality and VoIP network health. A call in the 90s is a great call. If things dip in the 70s or the 60s, there are major issues to identify. R-values solve the problem of trying to bolt empirical data onto a subjective system.

Now that communications is becoming more and more focused on things like video, the need for analytics around them is becoming more pronounced. People want to track the same kinds of metrics – codec quality, packet loss, delay, and jitter. But there isn’t a unified score that can be presented in green, yellow, and red to let people know when things are hitting the fan.

It has been suggested that MOS be adapted to reference video in addition to audio. While the idea behind using a traditional yardstick like MOS sounds good on the surface, the reality is that video is a much more complicated thing that can’t be encompassed by a 50-year-old ranking method like MOS.

Video calls can look horrible and sound great. They can have horrible sound and be crystal clear from a picture perspective. There are many, many subjective pieces that can go into ranking a video call. Trying to shoehorn that into a simple scale of 5 values is doing a real disservice to video codec manufacturers, not to mention the network operators that try and keep things running smoothly for video users.

R-value seems to be a much better way to classify analytics for video. It’s much more nuanced and capable of offering insight into different aspects of call and picture quality. It can still provide a ranked score for threshold measuring, but that rank is much more likely to mean something important for each number as opposed to the absolute values present in MOS.

Tom’s Take

MOS is an old fashioned idea that tries valiantly to tie the telecom of old to the digital age. People who understood subjective quality tried to pair it with objective analytics in an effort to keep the old world and the new world matched. But even communications is starting to eclipse these bounds. Phone calls have given way to email, texting, and video chats. Two of those are asynchronous and require no network reliability beyond up or down. Video, and all the other real-time digital communications, needs to have the right metrics and analytics to provide good feedback about how to improve the experience for users. And whatever we end up calling that composite metric or ranked algorithmic score, it shouldn’t be called MOS. Time to let that term grow some moss in the retirement bin.


Replacing Nielsen With Big Data

I’m a fan of television. I like watching interesting programs. It’s been a few years since one has caught my attention long enough to keep my watching for multiple seasons. Part of that reason is due to my fear that an awesome program is going to fail to reach a “target market” and will end up canceled just when it’s getting good. It’s happened to several programs that I liked in the past.

Sampling the Goods

Part of the issue with tracking the popularity of television programs comes from the archaic method in which programs are measured. Almost everyone has heard of the Nielsen ratings system. This sampling method was created in the 1930s as a way to measure radio advertising reach. In the 50s, it was adapted for use in television.

Nielsen selects target audiences that represent the greater whole. They ask users to keep written diaries of their television watching habits. They also have the ability to install a device called a set meter which allows viewers to punch in a code to identify themselves via age groups and lock in a view for a program. The set meter can tell the instant a channel is changed or the TV is powered off.

In theory, the sampling methodology is sound. In practice, it’s a bit shaky. View diaries are unreliable because people tend to overreport their view habits. If they feel guilty that they haven’t been writing anything down, they tend to look up something that was on TV and write it down. Diaries also can’t determine if a viewer watched the entire program or changed the channel in the middle. Set meters aren’t much better. The reliance on PIN codes to identify users can lead to misreported results. Adults in a hurry will sometimes punch in an easier code assigned to their children, leading to skewed age results.

Both the diary and the set meter fail to take into account the shift in viewing habits in modern households. Most of the TV viewing in my house takes place through time-shifted DVR recordings. My kids tend to monopolize the TV during the day, but even they are moving to using services like Netflix to watch all episodes of their favorite cartoons in one sitting. Neither of these viewing habits are easily tracked by Nielsen.

How can we find a happy medium? Sample sizes have been reduced signifcantly due to cord-cutting households moving to Internet distribution models. People tend to exaggerate or manipulate self-reported viewing results. Even “modern” Nielsen technology can’t keep up. What’s the answer?

Big Data

I know what you’re saying: “We’ve already got that with Nielsen, right?” Not quite. TV viewing habits have shifted in the past few years. So has TV technology. Thanks to the shift from analog broadcast signals to digital and the explosion of set top boxes for cable decryption and movie service usage, we now have a huge portal into the living room of every TV watcher in the world.

Think about for a moment. The idea of a sample size works provided it’s a good representative sample. But tracking this data is problematic. If we have access to a way to crunch the actual data instead of extrapolating from incomplete sets shouldn’t we use that instead? I’d rather believe the real numbers instead of trying to guess from unreliable sources.

This also fixes the issue of time-shifted viewing. Those same set top boxes are often responsible for recording the programs. They could provide information such as number of shows recorded versus viewed and whether or not viewers skip through commercials. For those that view on mobile devices that data could be compiled as well through integration with the set top box. User logins are required for mobile apps as it is today. It’s just a small step to integrating the whole package.

It would require a bit of technical upgrading on the client side. We would have to enable the set top boxes to report data back to a service. We could anonymize the data to a point to be sure that people aren’t being unnecessarily exposed. It will also have to be configured as an opt-out setting to ensure that the majority is represented. Opt-in won’t work because those checkboxes never get checked.

Advertisers are going to demand the most specific information about people that they can. The ratings service exists to work for the advertisers. If this plan is going to work, a new company will have to be created to collect and analyze this data. This way the analysis company can ensure that the data is specific enough to be of use to the advertisers while at the same time ensuring the protection of the viewers.

Tom’s Take

Every year, promising new TV shows are yanked off the airwaves because advertisers don’t see any revenue. Shows that have a great premise can’t get up to steam because of ratings. We need to fix this system. In the old days, the deluge of data would have drown Nielsen. Today, we have the technology to collect, analyze, and store that data for eternity. We can finally get the real statistics on how many people watched Jericho or After MASH. Armed with real numbers, we can make intelligent decisions about what to keep on TV and what to jettison. And that’s a big data project I’d be willing to watch.