Data Is Not The New Oil, It’s Nuclear Power

Big Data. I believe that one phrase could get millions in venture capital funding. I don’t even have to put a product with it. Just say it. And make no mistake about it: the rest of the world thinks so too. Data is “the new oil”. At least, according to some pundits. It’s a great headline-making analogy that describes how data is driving business and how controlling it can lead to an empire. But data isn’t really oil. It’s nuclear power.

Black Gold, Texas Tea

Crude oil is a popular resource. Prized for a variety of uses, it is traded and sold as a commodity and refined into plastics, gasoline, and other essential items of modern convenience. Oil creates empires and causes global commerce to hinge on every turn of the market. Living in a state that is a big oil producer, I see firsthand the impact that oil exploration and refining can have.

However, when compared to Big Data, oil isn’t the right metaphor. Much like oil, data needs to be refined before use. But oil can be refined into many distinct things. Data can only be turned into information. Oil burns up when consumed. Aside from some smoke and a small amount of residuals, oil all but disappears after it expends the energy trapped within. Data doesn’t disappear after being turned into information.

In fact, the biggest issue that I have with the entire “Data as Oil” argument is that oil doesn’t stick around. We don’t see massive pools of oil on the side of the road from spills. We don’t hear about massive issues with oil disposal or about securing our spent oil to prevent theft. People who treat data like oil are only looking at the refined product as the final form. They tend to forget that the raw form of data sticks around after the transformation. Most people will tell you that’s a good thing, because you can run analytics and machine learning against static datasets and continue to derive value from them. But that doesn’t make me think of oil at all.

Welcome To The Nuclear Age

In fact, Big Data reminds me most of a nuclear power plant. Much like oil, the initial form of radioactive material isn’t very useful. It radiates and creates a small amount of heat, but not enough to run the steam generators in a power plant. Instead, you must bombard the uranium-235 pellets with neutrons to start a fission reaction. Once you have a sustained, controllable reaction, the amount of generated heat rises and creates the resource you need to power the rest of your machinery.

Much like data, nuclear fission reactions don’t do much without the proper infrastructure to harness them. Even after you transform your data into information, you need to parse, categorize, and analyze it. The byproduct of the transformation is the critical part of the whole process.

Much like nuclear fuel rods, data stays in place for years. It continues to produce the resource after being modified and transformed. It sits around in the hope that it can be useful until the day it no longer serves its purpose. Data that is past its useful shelf life goes into a data warehouse, where it will eventually be forgotten. Spent nuclear fuel rods are also eventually removed and placed somewhere where they can’t affect other things. Maybe they’re buried deep underground. Or shot into space. Or placed under enough concrete that they will never be found again.

The danger in data and in nuclear power is not what happens when everything goes right. Instead, it’s what happens when everything goes wrong. With nuclear power, wrong is a chain reaction meltdown. Or wrong could be improper disposal of waste. It could be a disaster at the plant or even a theft of fissile nuclear material from the plant. The fuel rods themselves are simultaneously our source of power and the source of our potential disaster.

Likewise, data that is just sitting around and stored improperly can lead to huge disasters. We’re only four years removed from Target’s huge data breach. And how many more are waiting out there to happen? It seems the time between data leaks is shrinking as more and more bad actors are finding ways to steal, manipulate, and appropriate data for their own ends. And, much like nuclear fuel rods, the methods of protecting the data are few compared to other fuel sources.

Data isn’t something that can easily be hidden or compacted. It needs to be readable to be useful. It needs to be fast to be useful. All the things that make it easy to use also make it easy to exploit. And once we’re done with it, the way that it is stored in perpetuity only increases the likelihood of it being used improperly. Unless we’re willing to bury it under metaphorical concrete, we’re in for a bad future if we forget how to handle spent data.


Tom’s Take

Data as Oil is a stupid metaphor. It’s meant to impress upon finance CEOs and Wall Street wonks how important it is for data to be taken seriously. Data as Oil is something a data scientist would say to get a job. By drawing a bad comparison, you make data seem like a commodity to be traded and used as collateral for empire building. It’s not. It’s a ticking bomb of disastrous proportions when not handled correctly. Rather than coming up with a pithy metaphor for cable news consumption and page views, let’s treat data with the respect it deserves and make sure we plan for how we’re going to deal with something that won’t burn up into smoke whenever it’s convenient.

Devaluing Data Exposures

I had a great time this week recording the first episode of a new series with my co-worker Rich Stroffolino. The Gestalt IT Rundown is hopefully the start of some fun news stories with a hint of snark and humor thrown in.

One of the things I discussed in this episode was my belief that no data is truly secure any more. Thanks to recent attacks like WannaCry and Bad Rabbit and the rise of other state-sponsored hacking and malware attacks, I’m totally behind the idea that soon everyone will know everything about me and there’s nothing that anyone can do about it.

Just Pick Up The Phone

Personal data is important. Some pieces of personal data are sacrificed for the greater good. Anyone who is in IT or works in an area where they deal with spam emails and robocalls has probably paused for a moment before putting contact information down on a form. I have an old Hotmail address I use to catch spam if I’m relatively certain that something looks shady. I give out my home phone number freely because I never answer it. These pieces of personal data have been sacrificed in order to provide me a modicum of privacy.

But what about other things that we guard jealously? How about our mobile phone numbers? When I worked for a VAR, mine was the single most closely guarded piece of information I owned. No one, aside from my coworkers, had my mobile number. In part, it’s because I wanted to make sure that it got used properly. But also because I knew that as soon as one person at the customer site had it, soon everyone would. I would be spending my time answering phone calls instead of working on tickets.

That’s the world we live in today. So many pieces of information about us are being stored. Our Social Security Number, which has essentially been misappropriated as an identification number. US driver’s licenses, which are also used as identification. Passport numbers, credit ratings, mother’s maiden name (which is very handy for opening accounts in your name). The list could be a blog post in and of itself. But why is all of this data being stored?

Data Is The New Oil

The first time I heard someone in a keynote use the phrase “big data is the new oil”, I almost puked. Not because it’s a platitude that underscores the value of data. I lost it because I know what people do with vital resources like oil, gold, and diamonds. They hoard them. Stockpiling the resources until they can be refined. Until every ounce of value can be extracted. Then the shell is discarded until it becomes a hazard.

Don’t believe me? I live in a state that is legally required to run radio and television advertisements telling children not to play around old oilfield equipment that hasn’t been operational in decades. It’s cheaper for them to buy commercials than it is to clean up their mess. And that precious resource? It’s old news. Companies that extract resources just move on to the next easy source instead of cleaning up their leftovers.

Why does that matter to you? Think about all the pieces of data that are stored somewhere that could possibly leak out about you. Phone numbers, date of birth, names of children or spouses. And those are the easy ones. Imagine how many places your SSN is currently stored. Now, imagine half of those companies go out of business in the next three years. What happens to your data then? You’d better believe that it’s not going to get destroyed or encrypted in such a way as to prevent exposure. It’s going to lie fallow on some forgotten server until someone finds it and plunders it. Your only real hope is that it was being stored on a cloud provider that destroys the storage buckets after the bill isn’t paid for six months.

Devaluing Data

How do we fix all this? Can this be fixed? Well, it might be possible, but it’s not going to be fun, cheap, or easy. It all starts by making discrete data less valuable. An SSN is worthless without a name attached to it, for instance. If all I have are 9 random numbers with no context, I can’t tell what they’re supposed to be. The value only comes when those 9 numbers can be matched to a name.

We’ve got to stop using the SSN as a unique identifier for a person. It was never designed for that purpose. In fact, storing SSNs at all is a really bad idea. Users should be assigned a new, random ID number when creating an account or filling out a form. SSNs shouldn’t be stored unless absolutely necessary. And when they are, they should be treated like nuclear launch codes. It should take special authority to query them, and the database that stores them shouldn’t be directly attached to anything else.
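
To make that concrete, here’s a minimal sketch of the idea in Python. Everything in it is hypothetical: the in-memory dictionaries stand in for two separate, isolated databases, and a real deployment would use a hardened vault with its own auditing and access controls.

```python
import secrets

# Hypothetical stand-ins for two separate databases: the application only ever
# sees the random account ID, and the SSN vault is isolated behind its own check.
user_records = {}   # random_id -> profile data the application uses day to day
ssn_vault = {}      # random_id -> SSN, queryable only with special authority


def create_account(name, ssn=None):
    """Assign a random identifier and keep the SSN out of the main record."""
    random_id = secrets.token_hex(16)          # 128-bit random account ID
    user_records[random_id] = {"name": name}   # no SSN stored here
    if ssn is not None:                        # store it only if absolutely necessary
        ssn_vault[random_id] = ssn
    return random_id


def lookup_ssn(random_id, caller_is_authorized=False):
    """Retrieving the SSN takes explicit authority; everything else uses the ID."""
    if not caller_is_authorized:
        raise PermissionError("SSN access requires special authority")
    return ssn_vault[random_id]


account_id = create_account("Jane Doe", "123-45-6789")   # made-up example values
print(account_id, user_records[account_id])              # the SSN never appears here
```

The specific code doesn’t matter; the point is that the random ID is the thing that gets copied everywhere, while the SSN lives in exactly one place that is deliberately hard to query.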

Critical data should be stored in a vault that can only be accessed in certain ways and never exposed. A prime example is the Secure Enclave in an iPhone. This enclave, when used for TouchID or FaceID, stores your fingerprints and your face map. Pretty important stuff, yes? However, even as biometric ID systems become more prevalent, there isn’t any way to extract that data from the enclave. It’s stored in such a way that it can only be queried in a specific manner, with nothing but a yes/no result returned from the query. If you stole my iPhone tomorrow, there’s no way for you to reconstruct my fingerprints from it. That’s the template we need to use going forward to protect our data.
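
I’m not privy to how Apple’s enclave actually works internally, but the shape of the interface is easy to sketch. In this hypothetical Python class, the enrolled template can never be read back out, and real biometric matching is fuzzy rather than the exact comparison shown here; the only thing I’m illustrating is that callers ever get nothing more than a yes or a no.

```python
import hashlib
import hmac


class BiometricVault:
    """Toy enclave-style interface: the template goes in once, never comes out,
    and the only thing a caller can learn is whether a candidate matches."""

    def __init__(self, enrolled_template: bytes):
        # Store only a keyed hash of the template, never the raw biometric data.
        self._key = b"example-device-key"  # hypothetical; real hardware uses fused keys
        self._digest = hmac.new(self._key, enrolled_template, hashlib.sha256).digest()

    def matches(self, candidate_template: bytes) -> bool:
        candidate = hmac.new(self._key, candidate_template, hashlib.sha256).digest()
        return hmac.compare_digest(self._digest, candidate)

    # Deliberately, no method exists to export or reconstruct the stored template.


vault = BiometricVault(b"fingerprint-feature-vector")
print(vault.matches(b"fingerprint-feature-vector"))  # True
print(vault.matches(b"someone-else-entirely"))       # False
```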


Tom’s Take

I’m getting tired of being told that my data is being spread to the four winds thanks to it lying around waiting to be used for both legitimate and nefarious purposes. We can’t build fences high enough around critical data to keep it from being broken into. We can’t keep people out, so we need to start making the data less valuable. Instead of keeping it all together where it can be reconstructed into something of immense value, we need to make it hard to get all the pieces together at any one time. That means it’s going to be tough for us to build systems that put it all together too. But wouldn’t you rather spend your time solving a fun problem like that rather than making phone calls telling people your SSN got exposed on the open market?

Don’t Build Big Data With Bad Data

I was at Pure Accelerate 2017 this week and I saw some very interesting things around big data and the impact that high speed flash storage is going to have. Storage vendors serving that market are starting to include analytics capabilities on the box in an effort to provide extra value. But what happens when these advances cause issues in the training of algorithms?

Garbage In, Garbage Out

One story that came out of a conversation was about training a system to recognize people. In the process of training the system, the users imported a large number of faces in order to help the system start the process of differentiating individuals. The data set they started with? A collection of male headshots from the Screen Actors Guild. By the time the users caught the mistake, the algorithm had already proven that it had issues telling the difference between test subjects of particular ethnicities. After scrapping the data set and using some different, more diverse data sources, the system started performing much better.

This started me thinking about the quality of the data that we are importing into machine learning and artificial intelligence systems. The old computer adage of “garbage in, garbage out” has never been more apt than it is today. Before, bad inputs merely made the data suspect when it was extracted. Now, inputting bad data into a system designed to make decisions can have even more far-reaching consequences.

Look at all the systems that we’re programming today to be more AI-like. We’ve got self-driving cars that need massive data inputs to help navigate roads at speed. We have network monitoring systems that take analytics data and use it to predict things like component failures. We even have these systems running the background of popular apps that provide us news and other crucial information.

What if the inputs into the system cause it to become corrupted or somehow compromised? You’ve probably heard the story about how importing UrbanDictionary into Watson caused it to start cursing constantly. These kinds of stories highlight how important the quality of data being used for the basis of AI/ML systems can be.

Think of a future when self-driving cars are being programmed with failsafes to avoid living things in the roadway. Suppose that the car has been programmed to avoid humans and other large animals like horses and cows. But, during the import of the small animal data set, the table for dogs isn’t imported for some reason. Now, what would happen if the car encountered a dog in the road? Would it make the right decision to avoid the animal? Would the outline of the dog trigger a subroutine that helped it make the right decision? Or would the car not be able to tell what a dog was and do something horrible?

Do You See What I See?

After some chatting with my friend Ryan Adzima, he taught me a bit about how facial recognition systems work. I had always assumed that these systems could differentiate on things like colors. So it could tell a blond woman from a brunette, for instance. But Ryan told me that it’s actually very difficult for a system to tell fine colors apart.

Instead, systems try to create contrast in the colors of the picture so that certain features stand out. Those features have a grid overlaid on them, and then those grids are compared and contrasted. That’s the fastest way for a system to distinguish between individuals. It makes sense considering how CPU-bound things are today and the lack of high-definition cameras to capture information for the system.
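
Here’s a rough numpy sketch of that grid idea, not any real product’s algorithm: stretch the contrast, collapse the image into a coarse grid of average intensities, and compare the grids. The image sizes and grid dimensions are arbitrary choices for illustration.

```python
import numpy as np


def to_grid(image, grid=8):
    """Stretch contrast and reduce a grayscale image to a coarse grid of
    average intensities -- a crude stand-in for the feature grid described above."""
    img = image.astype(float)
    img = (img - img.min()) / (img.max() - img.min() + 1e-9)   # contrast stretch to 0..1
    h, w = img.shape
    h_cut, w_cut = h - h % grid, w - w % grid                  # crop so the grid divides evenly
    img = img[:h_cut, :w_cut]
    return img.reshape(grid, h_cut // grid, grid, w_cut // grid).mean(axis=(1, 3))


def grid_distance(a, b):
    """Smaller distance means more similar faces in this toy model."""
    return float(np.abs(to_grid(a) - to_grid(b)).sum())


face_a = np.random.rand(64, 64)   # placeholder grayscale "faces"
face_b = np.random.rand(64, 64)
print(grid_distance(face_a, face_a))   # 0.0 for identical images
print(grid_distance(face_a, face_b))   # larger for different images
```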

But we must also realize that we have to improve data collection for our AI/ML systems in order to ensure that the systems are receiving good data to make decisions. We need to build validation models into our systems and checks to make sure the data looks and sounds sane at the point of input. These are the kinds of things that take time and careful consideration when planning to ensure they don’t become a hindrance to the system. If the very safeguards we put in place to keep data correct end up causing problems, we’re going to create a system that falls apart before it can do what it was designed to do.
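
As one hedged example of what “sane at the point of input” could mean in practice, this sketch checks a hypothetical training batch for incomplete records and underrepresented labels before anything gets fed to a model. The field names and the threshold are invented for illustration.

```python
from collections import Counter


def validate_training_batch(records, required_fields, min_class_fraction=0.05):
    """Hypothetical sanity checks run at the point of input, before any training
    happens: reject incomplete records and flag labels that barely show up."""
    clean = [r for r in records if all(r.get(f) is not None for f in required_fields)]
    dropped = len(records) - len(clean)

    label_counts = Counter(r["label"] for r in clean)
    total = sum(label_counts.values()) or 1
    underrepresented = [label for label, count in label_counts.items()
                        if count / total < min_class_fraction]

    return {
        "clean_records": clean,
        "dropped_incomplete": dropped,
        "underrepresented_labels": underrepresented,   # e.g. a missing "dog" class
    }


batch = [
    {"label": "human", "image": "img001.png"},
    {"label": "human", "image": "img002.png"},
    {"label": "dog", "image": None},                   # incomplete record, gets dropped
]
report = validate_training_batch(batch, required_fields=["label", "image"])
print(report["dropped_incomplete"], report["underrepresented_labels"])   # 1 []
```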


Tom’s Take

I thought the story about the AI training was a bit humorous, but it does point to a huge issue with computer systems going forward. We need to be absolutely sure of the veracity of our data as we begin using it to train systems to think for themselves. Sure, teaching a Jeopardy-winning system to curse is one thing. But if we teach a system to be racist or murderous because of what information we give it to make decisions, we will have programmed a new life form to exhibit the worst of us instead of the best.

Drowning in the Data of Things


If you saw the news coming out of Cisco Live Berlin, you probably noticed that Internet of Things (IoT) was in every other announcement. I wrote about the impact of the new Digital Ceiling initiative already, but I think that IoT is a bit deeper than that. The other thing that seems to go hand in hand with discussion of IoT is big data. And for most of us, that big data is going to be a big problem.

Seen And Not Heard

Internet of Things is about dumb devices getting smart. Think Flowers for Algernon. Only now, instead of just getting smarter, they are also going to be very talkative. The amount of data that these devices used to hold captive will be unleashed on something. We assume that the data is going to be sent to a central collection point or polled from the device by an API call or a program that is mining the data for another party. But do you know who isn’t going to be getting that data? Us.

IoT devices are going to be talking to providers and data collection systems and, in a lot of cases, each other. But they aren’t going to be talking directly to end users or IT staff. That’s because IoT is about making devices intelligent enough to start making their own decisions about things. Remember when SDN came out and everyone started talking about networks making determinations about forwarding paths and topology changes without human inputs? Remember David Meyer talking about network fragility?

Now imagine that’s not just the network anymore. Imagine it’s everything. Devices talking to other devices and making decisions without human inputs. IoT gives machines the ability to make a limited amount of decisions based on data inputs. Factory floor running a bit too hot for the milling machines to keep up? Talk to the environmental controls and tell them to lower the temperature by two degrees for the next four hours. Is the shelf in the fridge where the milk is stored getting close to the empty milk jug weight? Order more milk. Did a new movie come out on Netflix that meets your viewing guidelines? Add that movie to the queue and have the TV turn on when your phone enters the house geofence.

Think about those processes for a moment. All of them are fairly easy conditional statements. If this, then do that. But conditional statements aren’t cut and dried. They require knowledge of constraints and combinations. And all that knowledge comes from data.
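
A sketch of what one of those “easy” conditionals might look like once the constraints are written down. The thresholds and names here are made up; the point is that even a two-line rule leans on data about load, temperature, and safe limits.

```python
def new_hvac_setpoint(floor_temp_c, milling_load_pct, current_setpoint_c):
    """Hypothetical 'if this, then do that' rule with its constraints spelled out."""
    TOO_HOT_C = 26.0     # above this, the milling machines start falling behind
    BUSY_PCT = 80        # only react when the machines are actually under load
    MAX_STEP_C = 2.0     # never swing the environment more than two degrees at once

    if floor_temp_c > TOO_HOT_C and milling_load_pct >= BUSY_PCT:
        return current_setpoint_c - MAX_STEP_C   # ask environmental controls to cool
    return current_setpoint_c                    # otherwise leave things alone


print(new_hvac_setpoint(floor_temp_c=27.5, milling_load_pct=92, current_setpoint_c=25.0))  # 23.0
```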

More Data, More Problems

All of that data needs to be collected somehow. That means transport networks are going to be stressed now that there are ten times more devices chatting on them. And a good chunk of those devices, especially in the consumer space, are going to be wireless. Hope your wireless network is ready for that challenge. That data is going to be transported to some data sink somewhere. As much as we would like to hope that it’s a collector on our network, the odds are much better that it’s an off-site collector. That means your WAN is about to be stressed too.

How about storing that data? If you are lucky enough to have an onsite collection system, you’d better start buying drives for it now. This is a huge amount of data. Nimble Storage has been collecting analytics data from their storage arrays for a while now. Every four hours they collect more data points than there are stars in the Milky Way. Makes you wonder where they keep it all. And how long are they going to keep that data? Just like the crap in your attic that you swear you’re going to get around to using one day, big data and analytics platforms will keep every shred of information you want to keep, for as long as you want to keep it, taking up drive space.

And what about security? Yeah, that’s an even scarier thought. Realize that many of the breaches we’ve read about in the past months have involved hackers having access to systems for extended periods of time and only getting caught after they have exfiltrated data from the system. Think about what might happen if a huge data sink is sitting around unprotected. Sure, terabytes worth of data may be noticed if someone tries to smuggle it out past the DLP device. But all it takes is a quick SQL query against the users table for social security numbers, a program to transpose those numbers into letters to evade the DLP scanner, and you can just email the file to yourself. Script a change from letters back to numbers and you’ve got a gold mine that someone left unlocked and lying around. We may be concentrating on securing the data in flight right now, but even the best armored car does no good if you leave the bank vault door open.


Tom’s Take

This whole thing isn’t all rain clouds and doom and gloom. IoT and Big Data represent a huge challenge for modern systems planning. We have the ability to unlock insight from devices that couldn’t tell us their secrets before. But we have to know how deep that pool will be before we dive in. We have to understand what these devices represent before we connect them. We don’t want our thermostats DDoSing our home networks any more than we want the milling machines on the factory floor coming to life and trying to find Sarah Connor. But the challenges we have with transporting, storing, and securing the data from IoT devices are no different from trying to program on punch cards or figuring out how to download emails from across the country. Technology will give us the key to solve those challenges. Assuming we can keep our head above water.

 

Time For A Data Diet?


I’m running out of drive space. Not just on my laptop SSD or my desktop HDD. But everywhere. The amount of data that I’m storing now is climbing at an alarming rate. What’s worse is that I often forget I have some of it until I go spelunking back through my drive to figure out what’s taking up all that room. And it’s a problem that the industry is facing too.

The Data Junkyard

Data is accumulating. You can’t deny that. Two factors have led to this. The first is that we now log more data from things than ever before. In this recent post from Chris Evans (@ChrisMEvans), he mentions that Virgin Atlantic 787s are generating 500GB of data per flight. I’m sure that includes telemetry, aircraft performance, and other debugging information that someone at some point deemed crucial. In another recent article from Jacques Mattheij (@JMattheij), he mentions app developers who left debug logging turned on, generating enormous data files as the system was in operation.

The second factor is storage itself. Years ago we didn’t have the space to store that much data. We had to be very specific about what needed to be captured and stored for long periods of time. I can remember having a 100MB hard drive in my first computer. I can also remember uninstalling and deleting several things in order to put a new program on. Now there is so much storage space that we don’t worry about running out unless a new application makes outrageous demands.

You Want To Use The Data?

The worst part about all this data accumulation is that once it’s been stored, no one ever looks at it again. This isn’t something that’s specific to electronic data, though. I can remember seeing legal offices with storage closets dedicated to boxes full of files. Hospitals have services that deal with medical record storage. In the old days, casinos hired vans to shuffle video tapes back and forth between vaults and security offices. All that infrastructure just on the off-chance that you might need the data one day.

With Big Data being a huge funding target and buzzword source today, you can imagine that every other startup in the market is offering to give you some insight into all that data that you’re storing. I’ve talked before about the drive for analysis of data. It’s the end result of companies trying to make sense of the chaos. But what about the stored data?

Odds are good that it’s going to just sit there in perpetuity. Once the analysis is done on all this data, it will either collect dust in a virtual file box until it is needed again (perhaps) in the far future or it will survive until the next SAN outage and never be reconstructed from backup. The funny thing about this collected data cruft is that no one misses it until the subpoena comes.

Getting Back To Fighting Weight

The solution to the problem isn’t doing more analysis on data. Instead, we need to start being careful about what data we’re storing in the first place. When you look at personal systems like Getting Things Done, they focus on stemming the flow of data quickly to give people more time to look at the important things. In much the same way, instead of capturing every bit coming from a data source and deciding later what to do with it, the decision needs to be made right away. Data Scientists need to start thinking like they’re on a storage budget, not like they’ve been handed the keys to the SAN kingdom.

I would be willing to bet that a few discrete decisions in the data collection process about what to keep and what to throw away would significantly cut down on the amount of data we need to store and process. Less time spent querying and searching through that mess would optimize data retrieval systems and make our infrastructure run much faster. Think of it like spring cleaning for the data garage.
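
As a sketch of what deciding at collection time could look like, this hypothetical ingest function keeps only the fields someone has already decided are worth storing and drops debug chatter entirely, rather than archiving everything and sorting it out later. The field names and rules are invented.

```python
from typing import Optional

# Fields someone has explicitly decided are worth keeping (hypothetical list).
KEEP_FIELDS = {"timestamp", "device_id", "error_code", "latency_ms"}


def ingest(raw_event: dict) -> Optional[dict]:
    """Trim each event down to known-useful fields and drop pure noise outright."""
    if raw_event.get("severity") == "debug":
        return None                                   # debug chatter never gets stored
    return {k: v for k, v in raw_event.items() if k in KEEP_FIELDS}


events = [
    {"timestamp": 1700000000, "device_id": "sw-01", "severity": "debug", "payload": "x" * 512},
    {"timestamp": 1700000005, "device_id": "sw-01", "severity": "info",
     "error_code": 0, "latency_ms": 12, "free_text_log": "link ok"},
]
stored = [e for e in (ingest(ev) for ev in events) if e is not None]
print(stored)   # only the trimmed second event survives
```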


Tom’s Take

I remember a presentation at Networking Field Day a few years ago when Statseeker told us that they could scan data points from years in the past down to the minute. The room collectively gasped. How could you look that far back? How big are the drives in your appliance? The answer was easy: they don’t store every little piece of data coming from the system. They instead look at very specific things that tell them about the network and then record those with an eye to retrieval in the future. They optimize at storage time to reduce the impact of lookups in the future.

Rather than collecting everything in the world in the hopes that it might be useful, we need to get away from the data hoarding mentality and trim down to something more agile. It’s the only way our data growth problem is going to get better in the near future.


If you’d like to hear some more thoughts on the growing data problem, be sure to check out the Tech Talk sponsored by Fusion-io.

 

Replacing Nielsen With Big Data


I’m a fan of television. I like watching interesting programs. It’s been a few years since one has caught my attention long enough to keep me watching for multiple seasons. Part of that reason is due to my fear that an awesome program is going to fail to reach a “target market” and will end up canceled just when it’s getting good. It’s happened to several programs that I liked in the past.

Sampling the Goods

Part of the issue with tracking the popularity of television programs comes from the archaic method in which programs are measured. Almost everyone has heard of the Nielsen ratings system. This sampling method was created in the 1930s as a way to measure radio advertising reach. In the 50s, it was adapted for use in television.

Nielsen selects target audiences that represent the greater whole. They ask users to keep written diaries of their television watching habits. They also have the ability to install a device called a set meter which allows viewers to punch in a code to identify themselves via age groups and lock in a view for a program. The set meter can tell the instant a channel is changed or the TV is powered off.

In theory, the sampling methodology is sound. In practice, it’s a bit shaky. Viewing diaries are unreliable because people tend to overreport their viewing habits. If they feel guilty that they haven’t been writing anything down, they tend to look up something that was on TV and write it down. Diaries also can’t determine whether a viewer watched the entire program or changed the channel in the middle. Set meters aren’t much better. The reliance on PIN codes to identify users can lead to misreported results. Adults in a hurry will sometimes punch in an easier code assigned to their children, leading to skewed age results.

Both the diary and the set meter fail to take into account the shift in viewing habits in modern households. Most of the TV viewing in my house takes place through time-shifted DVR recordings. My kids tend to monopolize the TV during the day, but even they are moving to using services like Netflix to watch all episodes of their favorite cartoons in one sitting. Neither of these viewing habits are easily tracked by Nielsen.

How can we find a happy medium? Sample sizes have been reduced significantly due to cord-cutting households moving to Internet distribution models. People tend to exaggerate or manipulate self-reported viewing results. Even “modern” Nielsen technology can’t keep up. What’s the answer?

Big Data

I know what you’re saying: “We’ve already got that with Nielsen, right?” Not quite. TV viewing habits have shifted in the past few years. So has TV technology. Thanks to the shift from analog broadcast signals to digital and the explosion of set top boxes for cable decryption and movie service usage, we now have a huge portal into the living room of every TV watcher in the world.

Think about that for a moment. The idea of a sample size works provided it’s a good representative sample. But tracking this data is problematic. If we have access to a way to crunch the actual data instead of extrapolating from incomplete sets, shouldn’t we use that instead? I’d rather believe the real numbers instead of trying to guess from unreliable sources.

This also fixes the issue of time-shifted viewing. Those same set top boxes are often responsible for recording the programs. They could provide information such as number of shows recorded versus viewed and whether or not viewers skip through commercials. For those that view on mobile devices that data could be compiled as well through integration with the set top box. User logins are required for mobile apps as it is today. It’s just a small step to integrating the whole package.

It would require a bit of technical upgrading on the client side. We would have to enable the set top boxes to report data back to a service. We could anonymize the data to a point to be sure that people aren’t being unnecessarily exposed. It will also have to be configured as an opt-out setting to ensure that the majority is represented. Opt-in won’t work because those checkboxes never get checked.
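
A sketch of what that set-top box report might look like, under a couple of loud assumptions: the salted hash is only pseudonymous (it hides the household but still links its views together), and a real service would manage the salt and the opt-out flag far more carefully than a hard-coded constant.

```python
import hashlib
import json

REPORTING_SALT = "rotate-me-regularly"   # hypothetical; a real service would manage this


def build_viewing_report(box_id, program, watched_seconds, skipped_ads, opted_out):
    """Household identity is hashed before it leaves the box; nothing is sent
    at all if the viewer has opted out of reporting."""
    if opted_out:
        return None
    anonymous_id = hashlib.sha256((REPORTING_SALT + box_id).encode()).hexdigest()
    return json.dumps({
        "viewer": anonymous_id,        # stable pseudonym, not the household's account
        "program": program,
        "watched_seconds": watched_seconds,
        "skipped_ads": skipped_ads,
    })


print(build_viewing_report("STB-123456", "Jericho", 2700, skipped_ads=True, opted_out=False))
```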

Advertisers are going to demand the most specific information about people that they can. The ratings service exists to work for the advertisers. If this plan is going to work, a new company will have to be created to collect and analyze this data. This way the analysis company can ensure that the data is specific enough to be of use to the advertisers while at the same time ensuring the protection of the viewers.


Tom’s Take

Every year, promising new TV shows are yanked off the airwaves because advertisers don’t see any revenue. Shows that have a great premise can’t get up to steam because of ratings. We need to fix this system. In the old days, the deluge of data would have drowned Nielsen. Today, we have the technology to collect, analyze, and store that data for eternity. We can finally get the real statistics on how many people watched Jericho or After MASH. Armed with real numbers, we can make intelligent decisions about what to keep on TV and what to jettison. And that’s a big data project I’d be willing to watch.

Big Data? Or Big Analysis?


Unless you’ve been living under a rock for the past few years, you’ve no doubt heard all about the problem that we have with big data.  When you start crunching the numbers on data sets in the terabyte range, the amount of compute power and storage space that you have to dedicate to the endeavor is staggering.  Even at Dell Enterprise Forum, some of the talk in the keynote addresses focused on the need to split the processing of big data down into more manageable parallel sets via use of new products such as the VRTX.  That’s all well and good.  That is, it’s good if you actually believe the problem is with the data in the first place.

Data Vs. Information

Data is just description.  It’s a raw material.  It’s no more useful to the average person than cotton plants or iron ore.  Data is just a singular point on a graph with no axes.  Nothing can be inferred from that data point unless you process it somehow.  That’s where we start talking about information.

Information is the processed form of data.  It’s digestible and coherent.  It’s a collection of data points that tell a story or support a hypothesis.  Information is actionable data.  When I have information on something, I can make a decision or present my findings to someone to make a decision.  The key is that it is a second-order product.  Information can’t exist without data upon which to perform some kind of analysis.  And therein lies the problem in our growing “big data” conundrum.

Big Analysis

Data is very sedentary.  It doesn’t really do much after it’s collected.  It may sit around in a database for a few days until someone needs to generate information from it.  That’s where analysis comes into play.  A table is just a table.  It has a height and a width.  It has a composition.  That’s data.  But when we analyze that table, we start generating all kinds of additional information about it.  Is it comfortable to sit at the table?  What color lamp goes best with it?  Is it hard to move across the floor?  Would it break if I stood on it?  All of that analysis is generated from the data at hand.  The data didn’t go anywhere or do anything.  I created all that additional information solely from the data.

Look at the Wikipedia entry for big data.  The image at the top of that entry is one of the better examples of information spiraling out of control from analysis of a data set.  The picture is a visualization of Wikipedia edits.  Note that it doesn’t have anything to do with the data contained in a particular entry.  They’re just tracking what people did to describe that data or how they analyzed it.  We’ve generated terabytes of information just doing change tracking on a data set.  All that data needs to be stored somewhere.  That’s what has people in IT sales salivating.

Guilt By Association

If you want to send a DBA screaming into the night, just mention the words associative entity (or junction table).  In another lifetime, I was in college to become a DBA.  I went through Intro to Databases and learned about all the constructs that we use to contain data sets.  I might have even learned a little SQL by accident.  What I remember most was about entities.  Standard entities are regular data.  They have a primary key that describes a row of data, such as a person or a vehicle.  That data is pretty static and doesn’t change often.  Case in point – how accurate is the height and weight entry on your driver’s license?

Associative entities, on the other hand, represent borderline chaos.  These are analysis nodes.  They reference the primary keys of at least two other tables in a database.  They are created when you are trying to perform some kind of analysis on those tables.  They can be ephemeral and are usually generated on demand by things like SQL queries.  This is the heart of my big data / big analysis issue.  We don’t really care about the standard data entities.  We only want the analysis and information that we get from the associative entities.  The more information and analysis we desire, the more of these associative entities we create.  Containing these descriptive sets is causing the explosion in storage and compute costs.  The data hasn’t really grown.  It’s our take on the data that has.
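
For anyone who hasn’t had the pleasure, here’s a small sqlite3 sketch of the difference.  The person and product tables are the boring standard entities; the purchase junction table is where the analysis (and the growth) happens.  The table names and data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two standard entities: mostly static data with simple primary keys.
cur.execute("CREATE TABLE person  (person_id  INTEGER PRIMARY KEY, name  TEXT)")
cur.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, title TEXT)")

# The associative entity (junction table): its key is built from foreign keys,
# and rows pile up every time we want to analyze who bought what and when.
cur.execute("""CREATE TABLE purchase (
                   person_id  INTEGER REFERENCES person(person_id),
                   product_id INTEGER REFERENCES product(product_id),
                   bought_at  TEXT,
                   PRIMARY KEY (person_id, product_id, bought_at))""")

cur.execute("INSERT INTO person  VALUES (1, 'Alice'), (2, 'Bob')")
cur.execute("INSERT INTO product VALUES (10, 'Lamp'), (11, 'Table')")
cur.executemany("INSERT INTO purchase VALUES (?, ?, ?)",
                [(1, 11, "2013-06-01"), (2, 11, "2013-06-02"), (2, 10, "2013-06-02")])

# The information lives in the join, not in either standard entity on its own.
cur.execute("""SELECT p.title, COUNT(*) AS buyers
               FROM purchase pu JOIN product p ON p.product_id = pu.product_id
               GROUP BY p.title ORDER BY buyers DESC""")
print(cur.fetchall())   # [('Table', 2), ('Lamp', 1)]
conn.close()
```

Every new question we want to ask tends to mean another table shaped like purchase, which is exactly where the storage growth comes from.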

Crunch Time

What can we do?  Sadly, not much.  Our brains are hard-wired to try and make patterns out of seemingly unconnected things.  It is a natural reaction that we try to bring order to chaos.  Given all of the data in the world, the first thing we are going to want to do with it is try and make sense of it.  Sure, we’ve found some very interesting underlying patterns through analysis, such as the well-worn story from last year of Target determining a girl was pregnant before her family knew.  The purpose of all that analysis was pretty simple – Target wanted to know how to better pitch products to specific focus groups of people.  They spent years of processing time and terabytes of storage all for the lofty goal of trying to figure out what 18-24 year old males are more likely to buy during the hours of 6 p.m. to 10 p.m. on weekday evenings.  It’s a key to the business models of the future.  Rather than guessing what people want, we have magical reports that tell us exactly what they want.  Why do you think Facebook is so attached to the idea of “liking” things?  That’s an advertiser’s dream.  Getting your hands on a second-order analysis of Facebook’s Like database would be the equivalent of the advertising Holy Grail.


Tom’s Take

We are never going to stop creating analysis of data.  Sure, we may run out of things to invent or see or do, but we will never run out of ways to ask questions about them.  As long as pivot tables exist in Excel or inner joins happen in an Oracle database, people are going to be generating analysis of data for the sake of information.  We may reach a day where all that information finally buries us under a mountain of ones and zeroes.  We brought it on ourselves because we couldn’t stop asking questions about buying patterns or traffic behaviors.  Maybe that’s the secret to Zen philosophy after all.  Instead of concentrating on the analysis of everything, just let the data be.  Sometimes just existing is enough.