Unless you’ve been living under a rock for the past few years, you’ve no doubt heard all about the problem we have with big data. When you start crunching the numbers on data sets in the terabyte range, the amount of compute power and storage space you have to dedicate to the endeavor is staggering. Even at Dell Enterprise Forum, some of the keynote addresses focused on the need to break big data processing into more manageable parallel sets using new products such as the VRTX. That’s all well and good. That is, it’s good if you actually believe the problem is with the data in the first place.
Data Vs. Information
Data is just description. It’s a raw material. It’s no more useful to the average person than cotton plants or iron ore. Data is just a singular point on a graph with no axes. Nothing can be inferred from that data point unless you process it somehow. That’s where we start talking about information.
Information is the processed form of data. It’s digestible and coherent. It’s a collection of data points that tell a story or support a hypothesis. Information is actionable data. When I have information on something, I can make a decision or present my findings to someone who can make a decision. The key is that it is a second-order product. Information can’t exist without data upon which to perform some kind of analysis. And therein lies the problem in our growing “big data” conundrum.
Data is very sedentary. It doesn’t really do much after it’s collected. It may sit around in a database for a few days until someone needs to generate information from it. That’s where analysis comes into play. A table is just a table. It has a height and a width. It has a composition. That’s data. But when we analyze that table, we start generating all kinds of additional information about it. Is it comfortable to sit at the table? What color lamp goes best with it? Is it hard to move across the floor? Would it break if I stood on it? All of that analysis is generated from the data at hand. The data didn’t go anywhere or do anything. I created all that additional information solely from the data.
Look at the Wikipedia entry for big data. The image that accompanies it is one of the better examples of information spiraling out of control from the analysis of a data set. The picture is a visualization of Wikipedia edits. Note that it doesn’t have anything to do with the data contained in any particular entry. It just tracks what people did to describe that data or how they analyzed it. We’ve generated terabytes of information just doing change tracking on a data set. All that information needs to be stored somewhere. That’s what has people in IT sales salivating.
Guilt By Association
If you want to send a DBA screaming into the night, just mention the words associative entity (or junction table). In another lifetime, I was in college to become a DBA. I went through Intro to Databases and learned about all the constructs that we use to contain data sets. I might have even learned a little SQL by accident. What I remember most was about entities. Standard entities are regular data. They have a primary key that describes a row of data, such as a person or a vehicle. That data is pretty static and doesn’t change often. Case in point – how accurate is the height and weight entry on your driver’s license?
Associative entities, on the other hand, represent borderline chaos. These are analysis nodes. They contain more than one primary key as a reference to at least two tables in a database. They are created when you are trying to perform some kind of analysis on those tables. They can be ephemeral and are usually generated on demand by things like SQL queries. This is the heart of my big data / big analysis issue. We don’t really care about the standard data entities. We only want the analysis and information that we get from the associative entities. The more information and analysis we desire, the more of these associative entities we create. Storing all of these descriptive sets is what’s causing the explosion in storage and compute costs. The data hasn’t really grown. It’s our take on the data that has.
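To make the distinction concrete, here’s a minimal sketch using Python’s built-in sqlite3 module. The schema and names (students, courses, enrollments) are hypothetical, chosen just for illustration: two standard entities hold the raw data, while the junction table exists only to relate them, and the “information” only appears when a query joins through it.

```python
import sqlite3

# Two standard entities (students, courses) and one associative
# entity (enrollments) whose rows exist only to relate them.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT);

    -- The junction table: its composite primary key is nothing but
    -- references to the two standard entities it associates.
    CREATE TABLE enrollments (
        student_id INTEGER REFERENCES students(id),
        course_id  INTEGER REFERENCES courses(id),
        PRIMARY KEY (student_id, course_id)
    );
""")

cur.executemany("INSERT INTO students VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob")])
cur.executemany("INSERT INTO courses VALUES (?, ?)",
                [(10, "Databases"), (20, "Networks")])
cur.executemany("INSERT INTO enrollments VALUES (?, ?)",
                [(1, 10), (1, 20), (2, 10)])

# The information (who is taking what) only materializes when we
# join through the associative entity.
rows = cur.execute("""
    SELECT s.name, c.title
    FROM enrollments e
    JOIN students s ON s.id = e.student_id
    JOIN courses  c ON c.id = e.course_id
    ORDER BY s.name, c.title
""").fetchall()

print(rows)
# → [('Alice', 'Databases'), ('Alice', 'Networks'), ('Bob', 'Databases')]
```

Note that the student and course rows never change here; every answer we pull out comes from the relationships we layered on top of them. Multiply that by every question an analyst wants answered and you can see where the growth comes from.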
What can we do? Sadly, not much. Our brains are hard-wired to make patterns out of seemingly unconnected things. It is a natural reaction to try to bring order to chaos. Given all of the data in the world, the first thing we want to do with it is try to make sense of it. Sure, we’ve found some very interesting underlying patterns through analysis, such as the well-worn story from last year of Target determining a girl was pregnant before her family knew. The purpose of all that analysis was pretty simple: Target wanted to know how to better pitch products to specific focus groups of people. They spent years of processing time and terabytes of storage, all for the lofty goal of trying to figure out what 18-24 year old males are more likely to buy between 6 p.m. and 10 p.m. on weekday evenings. It’s a key to the business models of the future. Rather than guessing what people want, we have magical reports that tell us exactly what they want. Why do you think Facebook is so attached to the idea of “liking” things? That’s an advertiser’s dream. Getting your hands on a second-order analysis of Facebook’s Like database would be the equivalent of the advertising Holy Grail.
We are never going to stop analyzing data. Sure, we may run out of things to invent or see or do, but we will never run out of ways to ask questions about them. As long as pivot tables exist in Excel or inner joins happen in an Oracle database, people are going to generate analysis for the sake of information. We may reach a day when all that information finally buries us under a mountain of ones and zeroes. We brought it on ourselves because we couldn’t stop asking questions about buying patterns or traffic behaviors. Maybe that’s the secret to Zen philosophy after all. Instead of concentrating on the analysis of everything, just let the data be. Sometimes just existing is enough.