Don’t Build Big Data With Bad Data


I was at Pure Accelerate 2017 this week and I saw some very interesting things around big data and the impact that high speed flash storage is going to have. Storage vendors serving that market are starting to include analytics capabilities on the box in an effort to provide extra value. But what happens when these advances cause issues in the training of algorithms?

Garbage In, Garbage Out

One story that came out of a conversation was about training a system to recognize people. In the process of training the system, the users imported a large number of faces in order to help the system start the process of differentiating individuals. The data set they started with? A collection of male headshots from the Screen Actors Guild. By the time the users caught the mistake, the algorithm had already proven that it had issues telling the difference between test subjects of particular ethnicities. After scrapping the data set and starting over with more diverse data sources, the system started performing much better.

This started me thinking about the quality of the data that we are importing into machine learning and artificial intelligence systems. The old computer adage of “garbage in, garbage out” has never been more apt than it is today. Before, bad inputs caused data to be suspect when extracted. Now, inputting bad data into a system designed to make decisions can have even more far-reaching consequences.

Look at all the systems that we’re programming today to be more AI-like. We’ve got self-driving cars that need massive data inputs to help navigate roads at speed. We have network monitoring systems that take analytics data and use it to predict things like component failures. We even have these systems running in the background of popular apps that provide us news and other crucial information.

What if the inputs into the system cause it to become corrupted or somehow compromised? You’ve probably heard the story about how importing UrbanDictionary into Watson caused it to start cursing constantly. These kinds of stories highlight how important the quality of data being used for the basis of AI/ML systems can be.

Think of a future when self-driving cars are being programmed with failsafes to avoid living things in the roadway. Suppose that the car has been programmed to avoid humans and other large animals like horses and cows. But, during the import of the small animal data set, the table for dogs isn’t imported for some reason. Now, what would happen if the car encountered a dog in the road? Would it make the right decision to avoid the animal? Would the outline of the dog trigger a subroutine that helped it make the right decision? Or would the car not be able to tell what a dog was and do something horrible?
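To make the missing dog table concrete, here is a minimal sketch of a completeness check that could run right after an import. Everything in it is an assumption for illustration: the category names, the idea that each category arrives as a named table of rows, and the helper names `missing_categories` and `check_import`. It is not how any real self-driving stack loads its data.

```python
# Hypothetical list of categories the system is expected to know about.
EXPECTED_CATEGORIES = {"human", "horse", "cow", "dog", "cat"}


def missing_categories(imported_tables, expected=EXPECTED_CATEGORIES):
    """Return the expected categories that are absent or empty after an import.

    imported_tables is assumed to map a category name to its list of rows.
    """
    present = {name for name, rows in imported_tables.items() if len(rows) > 0}
    return sorted(set(expected) - present)


def check_import(imported_tables, expected=EXPECTED_CATEGORIES):
    """Raise instead of silently continuing with an incomplete data set."""
    missing = missing_categories(imported_tables, expected)
    if missing:
        raise ValueError("import incomplete, missing categories: " + ", ".join(missing))
    return True
```

The point of raising an error is that a failed import becomes loud at load time, rather than showing up later as a car that has no idea what a dog is.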

Do You See What I See?

After some chatting with my friend Ryan Adzima, I learned a bit about how facial recognition systems work. I had always assumed that these systems could differentiate on things like colors, so they could tell a blond woman from a brunette, for instance. But Ryan told me that it’s actually very difficult for a system to tell fine colors apart.

Instead, systems try to create contrast in the colors of the picture so that certain features stand out. Those features have a grid overlaid on them and then those grids are compared and contrasted. That’s the fastest way for a system to discern between individuals. It makes sense considering how CPU-bound things are today and the lack of high definition cameras to capture information for the system.
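As a rough illustration of that idea, and not a description of any real facial recognition product, the toy sketch below stretches the contrast of a grayscale image, overlays a coarse grid, and compares two images cell by cell. The function names, the 2x2 default grid, and the use of plain nested lists for pixels are all assumptions made for the example.

```python
def stretch_contrast(image):
    """Rescale grayscale pixels so the darkest becomes 0 and the brightest 255."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat image has no contrast to stretch.
        return [[0 for _ in row] for row in image]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in image]


def grid_signature(image, cells=2):
    """Average brightness of each cell in a cells x cells grid over the image.

    Assumes the image is at least cells pixels wide and tall.
    """
    height, width = len(image), len(image[0])
    signature = []
    for gy in range(cells):
        for gx in range(cells):
            rows = range(gy * height // cells, (gy + 1) * height // cells)
            cols = range(gx * width // cells, (gx + 1) * width // cells)
            values = [image[y][x] for y in rows for x in cols]
            signature.append(sum(values) / len(values))
    return signature


def grid_distance(image_a, image_b, cells=2):
    """Mean absolute difference between the two grid signatures."""
    sig_a = grid_signature(stretch_contrast(image_a), cells)
    sig_b = grid_signature(stretch_contrast(image_b), cells)
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Because the comparison happens after the contrast stretch, the same pattern shot in slightly brighter light produces the same signature, while a different pattern does not. That is cheap to compute, which fits the CPU constraints mentioned above, but it also shows how much detail gets thrown away.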

But, we also must realize that we have to improve data collection for our AI/ML systems in order to ensure that the systems are receiving good data to make decisions. We need to build validation models into our systems and checks to make sure the data looks and sounds sane at the point of input. These are the kinds of things that take time and careful consideration when planning to ensure they don’t become a hindrance to the system. If the very safeguards we put in place to keep data correct end up causing problems, we’re going to create a system that falls apart before it can do what it was designed to do.
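A sanity check at the point of input does not have to be elaborate. The sketch below is one possible shape for it, using the network monitoring example from earlier. The field names, the numeric bounds, and the choice to quarantine bad records are all hypothetical and would need to be replaced with whatever actually makes sense for the data being collected.

```python
import math

# Hypothetical bounds for readings from a piece of network hardware.
RULES = {
    "temperature_c": (-40.0, 125.0),
    "fan_rpm": (0.0, 20000.0),
}


def validate_record(record, rules=RULES):
    """Return a list of problems found in one record; an empty list means sane."""
    problems = []
    for field, (low, high) in rules.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: not a number")
        elif math.isnan(value) or not low <= value <= high:
            problems.append(f"{field}: out of range")
    return problems


def split_records(records, rules=RULES):
    """Separate sane records from suspect ones without stopping the pipeline."""
    accepted, quarantined = [], []
    for record in records:
        problems = validate_record(record, rules)
        if problems:
            quarantined.append((record, problems))
        else:
            accepted.append(record)
    return accepted, quarantined
```

Setting suspect records aside, rather than halting on the first bad one, is one way to keep the safeguard from becoming the thing that breaks the system. The checks are simple comparisons, so they add very little work per record.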


Tom’s Take

I thought the story about the AI training was a bit humorous, but it does point to a huge issue with computer systems going forward. We need to be absolutely sure of the veracity of our data as we begin using it to train systems to think for themselves. Sure, teaching a Jeopardy-winning system to curse is one thing. But if we teach a system to be racist or murderous because of what information we give it to make decisions, we will have programmed a new life form to exhibit the worst of us instead of the best.
