Devaluing Data Exposures

I had a great time this week recording the first episode of a new series with my co-worker Rich Stroffolino. The Gestalt IT Rundown is hopefully the start of some fun news stories with a hint of snark and humor thrown in.

One of the things I discussed in this episode was my belief that no data is truly secure any more. Thanks to recent attacks like WannaCry and Bad Rabbit and the rise of other state-sponsored hacking and malware attacks, I’m totally behind the idea that soon everyone will know everything about me and there’s nothing that anyone can do about it.

Just Pick Up The Phone

Personal data is important. Some pieces of personal data are sacrificed for the greater good. Anyone who is in IT or works in an area where they deal with spam emails and robocalls has probably paused for a moment before putting contact information down on a form. I have an old Hotmail address I use to catch spam if I’m relatively certain that something looks shady. I give out my home phone number freely because I never answer it. These pieces of personal data have been sacrificed in order to provide me a modicum of privacy.

But what about other things that we guard jealously? How about our mobile phone number? When I worked for a VAR, that was the single most secretive piece of information I owned. No one, aside from my coworkers, had my mobile number. In part, it’s because I wanted to make sure that it got used properly. But also because I knew that as soon as one person at the customer site had it, soon everyone would. I would be spending my time answering phone calls instead of working on tickets.

That’s the world we live in today. So many pieces of information about us are being stored. Our Social Security Number, which has truthfully been misappropriated as an identification number. US Driver’s Licenses, which are also used as identification. Passport numbers, credit ratings, mother’s maiden name (which is very handy for opening accounts in your name). The list could be a blog post in and of itself. But why is all of this data being stored?

Data Is The New Oil

The first time I heard someone in a keynote use the phrase “big data is the new oil”, I almost puked. Not because it’s a platitude that underscores the value of data. I lost it because I know what people do with vital resources like oil, gold, and diamonds. They hoard them. Stockpiling the resources until they can be refined. Until every ounce of value can be extracted. Then the shell is discarded until it becomes a hazard.

Don’t believe me? I live in a state that is legally required to run radio and television advertisements telling children not to play around old oilfield equipment that hasn’t been operational in decades. It’s cheaper for them to buy commercials than it is to clean up their mess. And that precious resource? It’s old news. Companies that extract resources just move on to the next easy source instead of cleaning up their leftovers.

Why does that matter to you? Think about all the pieces of data that are stored somewhere that could possibly leak out about you. Phone numbers, date of birth, names of children or spouses. And those are the easy ones. Imagine how many places your SSN is currently stored. Now, imagine half of those companies go out of business in the next three years. What happens to your data then? You’d better believe that it’s not going to get destroyed or encrypted in such a way as to prevent exposure. It’s going to lie fallow on some forgotten server until someone finds it and plunders it. Your only real hope is that it was being stored on a cloud provider that destroys the storage buckets after the bill isn’t paid for six months.

Devaluing Data

How do we fix all this? Can this be fixed? Well, it might be able to be done, but it’s not going to be fun, cheap, or easy. It all starts by making discrete data less valuable. An SSN is worthless without a name attached to it, for instance. If all I have are 9 random numbers with no context I can’t tell what they’re supposed to be. The value only comes when those 9 numbers can be matched to a name.

We’ve got to stop using SSN as a unique identifier for a person. It was never designed for that purpose. In fact, storing SSN at all is a really bad idea. Users should be assigned a new, random ID number when creating an account or filling out a form. SSN shouldn’t be stored unless absolutely necessary. And when it is, it should be treated like a nuclear launch code. It should take special authority to query it, and the database that stores it shouldn’t be directly attached to anything else.
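Here is a minimal sketch of what that separation could look like in Python. The `AccountStore` and `SsnVault` classes are hypothetical names invented for illustration, and the `authorized` flag is only a stand-in for whatever real credential check and audit trail a production system would need.

```python
import secrets


class SsnVault:
    """Hypothetical stand-in for a separately hosted, tightly controlled store."""

    def __init__(self):
        self._by_account = {}

    def put(self, account_id, ssn):
        self._by_account[account_id] = ssn

    def get(self, account_id, *, authorized=False):
        # Illustration only: a real system would verify credentials and log the read.
        if not authorized:
            raise PermissionError("SSN lookup requires special authority")
        return self._by_account[account_id]


class AccountStore:
    """Everyday account data, keyed by a random ID rather than the SSN."""

    def __init__(self, vault):
        self._vault = vault
        self.accounts = {}

    def create(self, name, ssn=None):
        # The ID is random, so it reveals nothing about the person if it leaks.
        account_id = secrets.token_hex(16)
        self.accounts[account_id] = {"name": name}
        if ssn is not None:
            # The SSN never lands in the everyday account record.
            self._vault.put(account_id, ssn)
        return account_id
```

Someone who walks off with `accounts` gets names and random IDs, but no SSNs. Getting those takes a second, deliberate step against a different store.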

Critical data should be stored in a vault that can only be accessed in certain ways and never exposed. A prime example is the trusted enclave in an iPhone. This enclave, when used for TouchID or FaceID, stores your fingerprints and your face map. Pretty important stuff, yes? However, even with biometric ID systems becoming more prevalent there isn’t any way to extract that data from the enclave. It’s stored in such a way that it can only be queried in a specific manner and a result of yes/no returned from the query. If you stole my iPhone tomorrow, there’s no way for you to reconstruct my fingerprints from it. That’s the template we need to use going forward to protect our data.
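The yes/no query pattern can be sketched in a few lines of Python. This is not how the iPhone enclave works internally (biometric matching is fuzzy and hardware-backed). It is a simplified illustration using an exact-match secret, and `MatchOnlyVault` is a hypothetical name.

```python
import hashlib
import hmac
import os


class MatchOnlyVault:
    """Keeps a salted digest of a secret and only ever answers yes or no."""

    def __init__(self, secret: bytes):
        self._salt = os.urandom(16)
        # Only the derived digest is kept; the secret itself is never stored.
        self._digest = self._derive(secret)

    def _derive(self, candidate: bytes) -> bytes:
        return hashlib.pbkdf2_hmac("sha256", candidate, self._salt, 100_000)

    def matches(self, candidate: bytes) -> bool:
        # Constant-time comparison; the caller learns nothing beyond True/False.
        return hmac.compare_digest(self._digest, self._derive(candidate))
```

There is no method that returns the stored value, which is the point. The only question the vault will answer is "does this match?"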


Tom’s Take

I’m getting tired of being told that my data is being spread to the four winds thanks to it lying around waiting to be used for both legitimate and nefarious purposes. We can’t build fences high enough around critical data to keep it from being broken into. We can’t keep people out, so we need to start making the data less valuable. Instead of keeping it all together where it can be reconstructed into something of immense value, we need to make it hard to get all the pieces together at any one time. That means it’s going to be tough for us to build systems that put it all together too. But wouldn’t you rather spend your time solving a fun problem like that rather than making phone calls telling people your SSN got exposed on the open market?

Don’t Build Big Data With Bad Data

I was at Pure Accelerate 2017 this week and I saw some very interesting things around big data and the impact that high speed flash storage is going to have. Storage vendors serving that market are starting to include analytics capabilities on the box in an effort to provide extra value. But what happens when these advances cause issues in the training of algorithms?

Garbage In, Garbage Out

One story that came out of a conversation was about training a system to recognize people. In the process of training the system, the users imported a large number of faces in order to help the system start the process of differentiating individuals. The data set they started with? A collection of male headshots from the Screen Actors Guild. By the time the users caught the mistake, the algorithm had already proven that it had issues telling the difference between test subjects of particular ethnicities. After scrapping the data set and using some different diverse data sources, the system started performing much better.

This started me thinking about the quality of the data that we are importing into machine learning and artificial intelligence systems. The old computer adage of “garbage in, garbage out” has never been more apt than it is today. Before, bad inputs caused data to be suspect when extracted. Now, inputting bad data into a system designed to make decisions can have even more far-reaching consequences.

Look at all the systems that we’re programming today to be more AI-like. We’ve got self-driving cars that need massive data inputs to help navigate roads at speed. We have network monitoring systems that take analytics data and use it to predict things like component failures. We even have these systems running in the background of popular apps that provide us news and other crucial information.

What if the inputs into the system cause it to become corrupted or somehow compromised? You’ve probably heard the story about how importing UrbanDictionary into Watson caused it to start cursing constantly. These kinds of stories highlight how important the quality of data being used for the basis of AI/ML systems can be.

Think of a future when self-driving cars are being programmed with failsafes to avoid living things in the roadway. Suppose that the car has been programmed to avoid humans and other large animals like horses and cows. But, during the import of the small animal data set, the table for dogs isn’t imported for some reason. Now, what would happen if the car encountered a dog in the road? Would it make the right decision to avoid the animal? Would the outline of the dog trigger a subroutine that helped it make the right decision? Or would the car not be able to tell what a dog was and do something horrible?

Do You See What I See?

After some chatting with my friend Ryan Adzima, he taught me a bit about how facial recognition systems work. I had always assumed that these systems could differentiate on things like colors. So it could tell a blond woman from a brunette, for instance. But Ryan told me that it’s actually very difficult for a system to tell fine colors apart.

Instead, systems try to create contrast in the colors of the picture so that certain features stand out. Those features have a grid overlaid on them and then those grids are compared and contrasted. That’s the fastest way for a system to discern between individuals. It makes sense considering how CPU-bound things are today and the lack of high definition cameras to capture information for the system.

But we also must realize that we have to improve data collection for our AI/ML systems in order to ensure that the systems are receiving good data to make decisions. We need to build validation models into our systems and checks to make sure the data looks and sounds sane at the point of input. These are the kinds of things that take time and careful consideration when planning to ensure they don’t become a hindrance to the system. If the very safeguards we put in place to keep data correct end up causing problems, we’re going to create a system that falls apart before it can do what it was designed to do.
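As a rough sketch of a sanity check at the point of input, the Python below refuses to call a training set complete when an expected class is missing or badly underrepresented. The function name, the class labels, and the `min_per_class` threshold are all assumptions made up for the example.

```python
from collections import Counter


def validate_labels(labels, required_classes, min_per_class=1):
    """Return a list of problems found in a labeled data set (empty if it looks sane)."""
    counts = Counter(labels)
    problems = []
    for cls in sorted(required_classes):
        # Counter returns 0 for labels that never appear, so missing classes are caught too.
        if counts[cls] < min_per_class:
            problems.append(
                f"class '{cls}' has {counts[cls]} samples, need at least {min_per_class}"
            )
    return problems


if __name__ == "__main__":
    imported = ["human", "horse", "cow", "human"]
    for problem in validate_labels(imported, {"human", "horse", "cow", "dog"}):
        print(problem)
```

A check like this is cheap to run before training starts, and it would flag the missing dog table from the self-driving car example before the car ever hit the road.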


Tom’s Take

I thought the story about the AI training was a bit humorous, but it does point to a huge issue with computer systems going forward. We need to be absolutely sure of the veracity of our data as we begin using it to train systems to think for themselves. Sure, teaching a Jeopardy-winning system to curse is one thing. But if we teach a system to be racist or murderous because of what information we give it to make decisions, we will have programmed a new life form to exhibit the worst of us instead of the best.

AI, Machine Learning, and The Hitchhiker’s Guide

I had a great conversation with Ed Horley (@EHorley) and Patrick Hubbard (@FerventGeek) last night around new technologies. We were waxing intellectual about all things related to advances in analytics and intelligence. There’s been more than a few questions here at VMworld 2016 about the roles that machine learning and artificial intelligence will play in the future of IT. But during the conversation with Ed and Patrick, I finally hit on the perfect analogy for machine learning and artificial intelligence (AI). It’s pretty easy to follow along, so don’t panic.

The Answer

Machine learning is an amazing technology. It can extrapolate patterns in large data sets and provide insight from seemingly random things. It can also teach machines to think about problems and find solutions. Rather than go back to the tired Target big data example, I much prefer this example of a computer learning to play Super Mario World:

You can see how the algorithms learn how to play the game and find newer, better paths throughout the level. One of the things that’s always struck me about the computer’s decision skills is how early it learned that spin jumps provide more benefit than regular jumps for a given input. You can see the point in the video when this is figured out by the system, whereafter all jumps become spinning for maximum effect.

Machine learning appears to be insightful and amazing. But the weakness of machine learning is exemplified by Deep Thought from The Hitchhiker’s Guide to the Galaxy. Deep Thought was created to find the answer to the ultimate question of life, the universe, and everything. It was programmed with an enormous dataset – literally every piece of knowledge in the known universe. After seven and a half million years, it finally produces The Answer (42, if you’re curious). Which leads to the plot of the book and other hijinks.

Machine learning is capable of great leaps of logic, but it operates on a fundamental truth: all inputs are a bounded data set. Whether you are performing a simple test on a small data set or combing all the information in the universe for answers you are still operating on finite information. Machine learning can do things very fast and find pieces of data that are not immediately obvious. But it can’t operate outside the bounds of the data set without additional input. Even the largest analytics clusters won’t produce additional output without more data being ingested. Machine learning is capable of doing amazing things. But it won’t ever create new information outside of what it is operating on.

The Question

Artificial Intelligence (AI), on the other hand, is more like the question in The Hitchhiker’s Guide. Deep Thought admonishes the users of the system that rather than looking for the answer to Life, The Universe, and Everything, they should have been looking for The Question instead. That involves creating a completely different computer to find the Question that matches the Answer that Deep Thought has already provided.

AI can provide insight above and beyond a given data set input. It can provide context where none exists. It can make leaps of logic similar to those that humans are capable of making. AI doesn’t simply stop when it faces an incomplete data set. Even though we are seeing AI in infancy today, the most advanced systems are capable of “filling in the blanks” to cover missing information. As the algorithms learn more and more how to extrapolate they’ll become better at making decisions from incomplete data.

The reason why computers are so good at making quick decisions is because they don’t operate outside the bounds of the possible. If the entire universe for a decision is a data set, they won’t try to look around that. That ability to look beyond and try to create new data where none exists is the hallmark of intelligence. Using tools to create is a uniquely biologic function. Computers can create subsets of data with tools but they can’t do a complete transformation.

AI is pushing those boundaries. Given enough time and the proper input, AI can make the leaps outside of bounds to come up with new ideas. Today it’s all about context. Tomorrow may find AI providing true creativity. AI will eventually pass a Turing Test because it can look outside the script and provide the pseudorandom type of conversation that people are capable of.


Tom’s Take

Computers are smart. They think faster than we do. They can do math better than we can. They can produce results in a fraction of the time at a scale that boggles the mind. But machine learning is still working from a known set. No matter how much work we pour into that aspect of things, we are still going to hit a limit of the ability of the system to break out of its bounds.

True AI will behave like we do. It will look for answers when there are none. It will imagine when confronted with the impossible. It will learn from mistakes and create solutions to them. That’s the power of intelligence versus learning. Even the most powerful computer in the universe couldn’t break out of its programming. It needed something more to question the answers it created. Just like us, it needed to sit back and let its mind wander.

Networking Needs Information, Not Data

Networking Field Day 12 starts today. There are a lot of great presenters lined up. As I talk to more and more networking companies, it’s becoming obvious that simply moving packets is not the way to go now. Instead, the real sizzle is in telling you all about those packets. Not packet inspection but analytics.

Tell Me More, Tell Me More

Ask any networking professional and they’ll tell you that the systems they manage have a wealth of information. SNMP can give you monitoring data for a set of points defined in database files. Other protocols like NetFlow or sFlow can give you more granular data about a particular packet group or data flow in your network. Even more advanced projects like Intel’s Snap are building on the idea of using telemetry to collect disparate data sources and build collection methodologies to do something with them.

The concern that becomes quickly apparent is the overwhelming amount of data being received from all these sources. It reminds me a bit of this scene:

How can you drink from this firehose? Maybe you should be asking if you should instead?

Order From Chaos

Data is useless. We need to perform analysis on it to get information. That’s where a new wave of companies is coming into the networking market. They are building on the frameworks and systems that are aggregating data and presenting it in a way that makes it useful information. Instead of random data points about NetFlow, these solutions tell you that you’ve got a huge problem with outbound traffic of a specific type that is sent at a specific time with a specific payload. The difference is that instead of sorting through data to make sense of it, you’ve got a tool delivering the analysis instead of the raw data.

Sometimes it’s as simple as color-coding lines of Wireshark captures. Resets are bad, so they show up red. Properly torn down connections are good so they are green. You can instantly figure out how well things are going by looking for the colors. That’s analysis from raw data. The real trick in modern network monitoring is to find a way to analyze and provide context for massive amounts of data that may not have an immediate correlation.
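To make the color-coding idea concrete, here is a small Python sketch that sorts TCP segments by their flags byte. The red/green mapping just mirrors the description above and is not Wireshark’s actual rule set. The function name is made up for the example, while the flag bit values are the standard TCP header bits.

```python
# Standard TCP header flag bits.
FIN = 0x01
RST = 0x04


def color_for_flags(flags: int) -> str:
    """Map a TCP flags byte to a display color (illustrative mapping only)."""
    if flags & RST:
        return "red"    # connection was reset: something went wrong
    if flags & FIN:
        return "green"  # orderly teardown: connection closed properly
    return "none"       # nothing worth highlighting


if __name__ == "__main__":
    for name, flags in [("RST+ACK", 0x14), ("FIN+ACK", 0x11), ("ACK", 0x10)]:
        print(name, color_for_flags(flags))
```

The logic is trivial, and that’s the point: a two-line rule turns thousands of raw packets into something you can read at a glance.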

Networking professionals are smart people. They can intuit a lot of potential issues from a given data set. They can make the logical leap to a specific issue given time. What reduces that ability is the sheer amount of things that can go wrong with a particular system and the speed at which those problems must be fixed, especially at scale. A hiccup on one end of the network can be catastrophic on the others if allowed to persist.

Analytics can give us the context we need. It can provide confidence levels for common problems. It can ensure that symptoms are indeed happening above a given baseline or threshold. It can help us narrow the symptoms and potential issues before we even look at the data. Analytics can exclude the impossible while highlighting the more probable causes and outcomes. Analytics can give us peace of mind.
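A baseline check can be as simple as the Python sketch below, which flags a reading only when it sits well outside what the recent history says is normal. The three-sigma default and the function name are assumptions for illustration. Real analytics platforms use far more sophisticated models.

```python
from statistics import mean, stdev


def exceeds_baseline(history, current, sigmas=3.0):
    """Return True if `current` is more than `sigmas` standard deviations from the baseline.

    `history` needs at least two samples so a standard deviation can be computed.
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) > sigmas * spread


if __name__ == "__main__":
    # Hypothetical samples, e.g. interface utilization in Mbps.
    recent = [100, 102, 98, 101, 99]
    print(exceeds_baseline(recent, 103))
    print(exceeds_baseline(recent, 150))
```

With a check like this in front of the data, the engineer only gets pulled in when a symptom is actually above the baseline.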


Tom’s Take

Analytics isn’t doing our job for us. Instead, it’s giving us the ability to concentrate. Anyone that spends their time sifting through data to try and find patterns is losing the signal in the noise. Patterns are things that software can find easily. We need to leverage the work being put into network analytics systems to help us track down the issues before they blow up into full problems. We need to apply the thing that makes network professionals the best suited to look at the best information we can gather about a situation. Our concentration on what matters is where our job will be in five years. Let’s take the knowledge we have and apply it.

Drowning in the Data of Things

If you saw the news coming out of Cisco Live Berlin, you probably noticed that Internet of Things (IoT) was in every other announcement. I wrote about the impact of the new Digital Ceiling initiative already, but I think that IoT is a bit deeper than that. The other thing that seems to go hand in hand with discussion of IoT is big data. And for most of us, that big data is going to be a big problem.

Seen And Not Heard

Internet of Things is about dumb devices getting smart. Think Flowers for Algernon. Only now, instead of them just being smarter they are also going to be very talkative too. The amount of data that these devices used to hold captive will be unleashed on something. We assume that the data is going to be sent to a central collection point or polled from the device by an API call or a program that is mining the data for another party. But do you know who isn’t going to be getting that data? Us.

IoT devices are going to be talking to providers and data collection systems and, in a lot of cases, each other. But they aren’t going to be talking directly to end users or IT staff. That’s because IoT is about making devices intelligent enough to start making their own decisions about things. Remember when SDN came out and everyone started talking about networks making determinations about forwarding paths and topology changes without human inputs? Remember David Meyer talking about network fragility?

Now imagine that’s not the network any more. Imagine it’s everything. Devices talking to other devices and making decisions without human inputs. IoT gives machines the ability to make a limited number of decisions based on data inputs. Factory floor running a bit too hot for the milling machines to keep up? Talk to the environmental controls and tell them to lower the temperature by two degrees for the next four hours. Is the shelf in the fridge where the milk is stored getting close to the empty milk jug weight? Order more milk. Did a new movie come out on Netflix that meets your viewing guidelines? Add that movie to the queue and have the TV turn on when your phone enters the house geofence.

Think about those processes for a moment. All of them are fairly easy conditional statements. If this, then do that. But conditional statements aren’t cut and dried. They require knowledge of constraints and combinations. And all that knowledge comes from data.
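Here is the factory floor example as a Python sketch. Even this one "if this, then do that" rule needs constraints that have to come from somewhere: how hot is too hot, and how low the setpoint is allowed to go. Every number and name below is a hypothetical placeholder.

```python
def cooling_adjustment(floor_temp, setpoint, *, max_temp=30.0, min_setpoint=16.0, step=2.0):
    """Return the new environmental setpoint given the current floor temperature.

    All thresholds are illustrative; a real rule would pull them from operational data.
    """
    if floor_temp <= max_temp:
        # Floor is within tolerance for the milling machines: change nothing.
        return setpoint
    # Too hot: lower the setpoint by one step, but never below the allowed minimum.
    return max(min_setpoint, setpoint - step)


if __name__ == "__main__":
    print(cooling_adjustment(32.0, 20.0))  # too hot, lower by two degrees
    print(cooling_adjustment(25.0, 20.0))  # fine, leave it alone
```

The `if` is the easy part. Knowing the right values for `max_temp` and `min_setpoint` is where the data comes in.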

More Data, More Problems

All of that data needs to be collected somehow. That means transport networks are going to be stressed now that there are ten times more devices chatting on them. And a good chunk of those devices, especially in the consumer space, are going to be wireless. Hope your wireless network is ready for that challenge. That data is going to be transported to some data sink somewhere. As much as we would like to hope that it’s a collector on our network, the odds are much better that it’s an off-site collector. That means your WAN is about to be stressed too.

How about storing that data? If you are lucky enough to have an onsite collection system you’d better start buying drives for it now. This is a huge amount of data. Nimble Storage has been collecting analytics data from their storage arrays for a while now. Every four hours they collect more data than there are stars in the Milky Way. Makes you wonder where they keep it? And how long are they going to keep that data? Just like the crap in your attic that you swear you’re going to get around to using one day, big data and analytics platforms will keep every shred of information you want to keep for as long as you want to have it taking up drive space.

And what about security? Yeah, that’s an even scarier thought. Realize that many of the breaches we’ve read about in the past months have been hackers having access to systems for extended periods of time and only getting caught after they have exfiltrated data from the system. Think about what might happen if a huge data sink is sitting around unprotected. Sure, terabytes worth of data may be noticed if someone tries to smuggle it out of the DLP device. But all it takes is a quick SQL query against the users tables for social security numbers, a program to transpose those numbers into letters to evade the DLP scanner, and you can just email the file to yourself. Script a change from letters back to numbers and you’ve got a gold mine that someone left unlocked and lying around. We may be concentrating on securing the data in flight right now, but even the best armored car does no good if you leave the bank vault door open.


Tom’s Take

This whole thing isn’t rain clouds and doom and gloom. IoT and Big Data represent a huge challenge for modern systems planning. We have the ability to unlock insight from devices that couldn’t tell us their secrets before. But we have to know how deep that pool will be before we dive in. We have to understand what these devices represent before we connect them. We don’t want our thermostats DDoSing our home networks any more than we want the milling machines on the factory floor coming to life and trying to find Sarah Connor. But the challenges we have with transporting, storing, and securing the data from IoT devices are no different than trying to program on punch cards or figure out how to download emails from across the country. Technology will give us the key to solve those challenges. Assuming we can keep our heads above water.