AI Is Making Data Cost Too Much

Posted on November 7, 2023 by networkingnerd

You may recall that I wrote a piece almost six years ago comparing big data to nuclear power. Part of the purpose of that piece was to knock the wind out of the “data is oil” comparisons that were so popular. Today’s landscape is totally different thanks to the shifts that the IT industry has undergone in the past few years. I now believe that AI is going to cause a massive amount of wealth transfer away from the AI companies and cause startup economics to shift.

Can AI Really Work for Enterprises?

In this episode of Packet Pushers, Greg Ferro and Brad Casemore debate a lot of topics around the future of networking. One of the things that Brad brought up, and Greg highlighted, is that the data being used for AI algorithm training is being stored in the cloud. That massive amount of data is sitting there waiting to be used between training runs and it’s costing some AI startups a fortune in cloud costs.

AI algorithms need to be trained to be useful. When you use ChatGPT to write a term paper or ask nonsensical questions, you’re using the output of the GPT training run. The real work happens when OpenAI is crunching data and feeding their monster. They have to give it a set of parameters and data to analyze in order to come up with the magic that you see in the prompt window. That data doesn’t just come out of nowhere. It has to be compiled and analyzed.

There are a lot of content creators who are angry that their words are being fed into the GPT algorithm runs and then being used in the results without proper credit. That means that OpenAI is scraping the content from the web and feeding it into the algorithm without care for what they’re looking at. It also creates issues where the validity and the accuracy of the data aren’t verified ahead of time.

Now, this focuses on OpenAI and GPT specifically because everyone seems to think that’s AI right now. Much like every solution in the history of IT, GPT-based large language models (LLMs) are just a stepping stone along the way to greater understanding of what AI can do. The real value for organizations, as Greg pointed out in the podcast, can be something as simple as analyzing the trouble ticket a user has submitted and then offering directed questions to help clarify the ticket for the help desk so they spend less time chasing false leads.

No Free Lunchboxes

Where are organizations going to store that data? In the old days it was going to be collected in on-prem storage arrays that weren’t being used for anything else. The opportunity cost of using something you already owned was minimal. After all, you bought the capacity so why not use it? Organizations that took this approach decided to just save every data point they could find in an effort to “mine” for insights later. Hence the references to oil and other natural resources.

Today’s world is different. LLMs need massive resources to run. Unless you’re willing to drop several million dollars to build out your own cluster resources and hire engineers to keep them running at peak performance, you’re probably going to be using a hosted cloud solution. That’s easy enough to set up and run. And you’re only paying for what you use. CPU and GPU times are important so you want the job to complete as fast as possible in order to keep your costs low.

What about the data that you need to feed to the algorithm? Are you going to feed it from your on-prem storage? That’s way too slow, even with super fast WAN links. You need to get the data as close to the processors as possible. That means you need to migrate it into the cloud. You need to keep it there while the magic AI building machine does the work. Are you going to keep that valuable data in the cloud, incurring costs every hour it’s stored there? Or are you going to pay to have it moved back to your enterprise? Either way the sound of a cash register is deafening to your finance department and music to the ears of cloud providers and storage vendors selling them exabytes of data storage.

All those hopes of making tons of money from your AI insights are going to evaporate in a pile of cloud bills. The operations costs of keeping that data are now more than minimal. If you want to have good data to operate on you’re going to need to keep it. And if you can’t keep it locally in your organization you’re going to have to pay someone to keep it for you. That means writing big checks to the cloud providers that have effectively infinite storage, bounded only by the limit on your credit card or purchase order. That kind of wealth transfer makes investors seem a bit hesitant when they aren’t going to get the casino-like payouts they’d been hoping for.
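
To put rough numbers on the keep-it-or-move-it question, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the dataset size, the per-terabyte storage and egress prices, and the function names are hypothetical placeholders, not any provider's actual rates.

```python
# Back-of-the-envelope comparison: leave training data in the cloud between
# runs, or pay to pull it back on-prem.
# All sizes and prices are made-up placeholders, not real provider rates.


def cost_to_keep(size_tb: float, months_idle: float, storage_per_tb_month: float) -> float:
    """Storage charges for leaving the data in place between runs."""
    return size_tb * months_idle * storage_per_tb_month


def cost_to_move_back(size_tb: float, egress_per_tb: float) -> float:
    """One-time egress charge for pulling the data back on-prem."""
    return size_tb * egress_per_tb


def cheaper_option(size_tb: float, months_idle: float,
                   storage_per_tb_month: float, egress_per_tb: float) -> str:
    """Return 'keep' or 'move', whichever costs less for this idle period."""
    keep = cost_to_keep(size_tb, months_idle, storage_per_tb_month)
    move = cost_to_move_back(size_tb, egress_per_tb)
    return "keep" if keep <= move else "move"


if __name__ == "__main__":
    size_tb = 500.0            # hypothetical dataset size
    storage_per_tb = 20.0      # hypothetical $/TB/month
    egress_per_tb = 90.0       # hypothetical $/TB to move out

    for months in (1, 3, 6, 12):
        keep = cost_to_keep(size_tb, months, storage_per_tb)
        move = cost_to_move_back(size_tb, egress_per_tb)
        choice = cheaper_option(size_tb, months, storage_per_tb, egress_per_tb)
        print(f"{months:>2} months idle: keep=${keep:,.0f} move=${move:,.0f} -> {choice}")
```

The sketch ignores the cost of uploading the data again before the next run and the cost of on-prem capacity, so treat it as a starting point for the conversation with finance rather than a model of a real bill. Either branch ends with a check being written.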

The shift will cause AI startups to be very frugal in what they keep. They will either amass data only when they think their algorithm is ready for a run or keep only critical data that they know they’re going to need to feed the monster. That means they’re going to be playing a game with the accuracy of the resulting software as well as giving up chances that some insignificant piece of data ends up being the key to a huge shift. In essence, the software will all start looking and sounding the same after a while and there won’t be enough differentiation to make them competitive because no one will be able to afford it.

Tom’s Take

The relative ease with which data could be stored turned companies into data hoarders. They kept it forever hoping they could get some value out of it and create a return curve that soared to the moon. Instead, the application for that data mining finally came along and everyone realized that getting the value out of the data meant investing even more capital into refining it. That kind of investment makes those curves much flatter and makes investors more reluctant. That kind of shift means more work and less astronomical payout. All because your resources were more costly than you first thought.

Data Is Not The New Oil, It’s Nuclear Power

Posted on December 20, 2017 by networkingnerd

Big Data. I believe that one phrase could get millions in venture capital funding. I don’t even have to put a product with it. Just say it. And make no mistake about it: the rest of the world thinks so too. Data is “the new oil”. At least, according to some pundits. It’s a great headline making analogy that describes how data is driving business and controlling it can lead to an empire. But, data isn’t really oil. It’s nuclear power.

Black Gold, Texas Tea

Crude oil is a popular resource. Prized for a variety of uses, it is traded and sold as a commodity and refined into plastics, gasoline, and other essential items of modern convenience. Oil creates empires and causes global commerce to hinge on every turn of the market. I live in a state that is a big oil producer, where the exploration and refining of oil has a big impact.

However, when compared to Big Data, oil isn’t the right metaphor. Much like oil, data needs to be refined before use. But oil can be refined into many different distinct things. Data can only be turned into information. Oil burns up when consumed. Aside from some smoke and a small amount of residuals oil all but disappears after it expends the energy trapped within. Data doesn’t disappear after being turned into information.

In fact, the biggest issue that I have with the entire “Data as Oil” argument is that oil doesn’t stick around. We don’t see massive pools of oil on the side of the road from spills. We don’t hear about our massive issues with oil disposal or securing our spent oil to prevent theft. People that treat data like oil are only looking at the refined product as the final form. They tend to forget that the raw form of data sticks around after the transformation. Most people will tell you that’s a good thing because you can run analytics and machine learning against static datasets and continue to derive value from them. But that doesn’t make me think of oil at all.

Welcome To The Nuclear Age

In fact, Big Data reminds me most of a nuclear power plant. Much like oil, the initial form of radioactive material isn’t very useful. It radiates and creates a small amount of heat but not enough to run the steam generators in a power plant. Instead, you must bombard the uranium 235 pellets with neutrons to start a fission reaction. Once you have a sustained controllable reaction the amount of generated heat rises and creates the resource you need to power the rest of your machinery.

Much like data, nuclear fission reactions don’t do much without the proper infrastructure to harness them. Even after you transform your data into information you need to parse, categorize, and analyze it. The byproduct of the transformation is the critical part of the whole process.

Much like nuclear fuel rods, data stays in place for years. It continues to produce the resource after being modified and transformed. It sits around in the hopes that it can be useful until the day that it no longer serves its purpose. Data that is past its useful shelf life goes into a data warehouse where it will eventually be forgotten. Spent nuclear fuel rods are also eventually removed and placed somewhere where they can’t affect other things. Maybe they’re buried deep underground. Or shot into space. Or placed under enough concrete that they will never be found again.

The danger in data and in nuclear power is not what happens when everything goes right. Instead, it’s what happens when everything goes wrong. With nuclear power, wrong is a chain reaction meltdown. Or wrong could be improper disposal of waste. It could be a disaster at the plant or even a theft of fissile nuclear material from the plant. The fuel rods themselves are simultaneously our source of power and the source of our potential disaster.

Likewise, data that is just sitting around and stored improperly can lead to huge disasters. We’re only four years removed from Target’s huge data breach. And how many more are waiting out there to happen? It seems the time between data leaks is shrinking as more and more bad actors are finding ways to steal, manipulate, and appropriate data for their own ends. And, much like nuclear fuel rods, the methods of protecting the data are few compared to other fuel sources.

Data isn’t something that can easily be hidden or compacted. It needs to be readable to be useful. It needs to be fast to be useful. All the things that make it easy to use also make it easy to exploit. And once we’re done with it, the way that it is stored in perpetuity only increases the likelihood of it being used improperly. Unless we’re willing to bury it under metaphorical concrete we’re in for a bad future if we forget how to handle spent data.

Tom’s Take

Data as Oil is a stupid metaphor. It’s meant to impress upon finance CEOs and Wall Street wonks how important it is for data to be taken seriously. Data as Oil is something a data scientist would say to get a job. By drawing a bad comparison, you make data seem like a commodity to be traded and used as collateral for empire building. It’s not. It’s a ticking bomb of disastrous proportions when not handled correctly. Rather than coming up with a pithy metaphor for cable news consumption and page views, let’s treat data with the respect it deserves and make sure we plan for how we’re going to deal with something that won’t burn up into smoke whenever it’s convenient.

Devaluing Data Exposures

Posted on October 27, 2017 by networkingnerd

I had a great time this week recording the first episode of a new series with my co-worker Rich Stroffolino. The Gestalt IT Rundown is hopefully the start of some fun news stories with a hint of snark and humor thrown in.

One of the things I discussed in this episode was my belief that no data is truly secure any more. Thanks to recent attacks like WannaCry and Bad Rabbit and the rise of other state-sponsored hacking and malware attacks, I’m totally behind the idea that soon everyone will know everything about me and there’s nothing that anyone can do about it.

Just Pick Up The Phone

Personal data is important. Some pieces of personal data are sacrificed for the greater good. Anyone who is in IT or works in an area where they deal with spam emails and robocalls has probably paused for a moment before putting contact information down on a form. I have an old Hotmail address I use to catch spam if I’m relatively certain that something looks shady. I give out my home phone number freely because I never answer it. These pieces of personal data have been sacrificed in order to provide me a modicum of privacy.

But what about other things that we guard jealously? How about our mobile phone number? When I worked for a VAR that was the single most secretive piece of information I owned. No one, aside from my coworkers, had my mobile number. In part, it’s because I wanted to make sure that it got used properly. But also because I knew that as soon as one person at the customer site had it, soon everyone would. I would be spending my time answering phone calls instead of working on tickets.

That’s the world we live in today. So many pieces of information about us are being stored. Our Social Security Number, which has truthfully been misappropriated as an identification number. US Driver’s Licenses, which are also used as identification. Passport numbers, credit ratings, mother’s maiden name (which is very handy for opening accounts in your name). The list could be a blog post in and of itself. But why is all of this data being stored?

Data Is The New Oil

The first time I heard someone in a keynote use the phrase “big data is the new oil”, I almost puked. Not because it’s a platitude that underscores the value of data. I lost it because I know what people do with vital resources like oil, gold, and diamonds. They hoard them. Stockpiling the resources until they can be refined. Until every ounce of value can be extracted. Then the shell is discarded until it becomes a hazard.

Don’t believe me? I live in a state that is legally required to run radio and television advertisements telling children not to play around old oilfield equipment that hasn’t been operational in decades. It’s cheaper for them to buy commercials than it is to clean up their mess. And that precious resource? It’s old news. Companies that extract resources just move on to the next easy source instead of cleaning up their leftovers.

Why does that matter to you? Think about all the pieces of data that are stored somewhere that could possibly leak out about you. Phone numbers, date of birth, names of children or spouses. And those are the easy ones. Imagine how many places your SSN is currently stored. Now, imagine half of those companies go out of business in the next three years. What happens to your data then? You’d better believe that it’s not going to get destroyed or encrypted in such a way as to prevent exposure. It’s going to lie fallow on some forgotten server until someone finds it and plunders it. Your only real hope is that it was being stored on a cloud provider that destroys the storage buckets after the bill isn’t paid for six months.

Devaluing Data

How do we fix all this? Can this be fixed? Well, it might be able to be done, but it’s not going to be fun, cheap, or easy. It all starts by making discrete data less valuable. An SSN is worthless without a name attached to it, for instance. If all I have are 9 random numbers with no context I can’t tell what they’re supposed to be. The value only comes when those 9 numbers can be matched to a name.

We’ve got to stop using SSN as a unique identifier for a person. It was never designed for that purpose. In fact, storing SSN at all is a really bad idea. Users should be assigned a new, random ID number when creating an account or filling out a form. SSN shouldn’t be stored unless absolutely necessary. And when it is, it should be treated like a nuclear launch code. It should take special authority to query it, and the database that holds it should not be directly attached to anything else.
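
As a minimal sketch of that idea, assuming an in-memory dictionary stands in for the account table and a separate class stands in for the locked-down SSN store: the names `SsnVault`, `new_user_id`, and `create_account` are hypothetical, and the boolean authorization flag is a placeholder for whatever real access control you would use.

```python
import secrets


class SsnVault:
    """Hypothetical stand-in for a separate, tightly controlled SSN store."""

    def __init__(self):
        self._ssn_by_user_id = {}

    def store(self, user_id: str, ssn: str) -> None:
        self._ssn_by_user_id[user_id] = ssn

    def lookup(self, user_id: str, caller_is_authorized: bool) -> str:
        # Placeholder check; a real system would verify the caller's identity
        # and log every query.
        if not caller_is_authorized:
            raise PermissionError("SSN lookup requires special authority")
        return self._ssn_by_user_id[user_id]


def new_user_id() -> str:
    """Random identifier with no relationship to the SSN."""
    return secrets.token_hex(16)


def create_account(name: str, ssn: str, accounts: dict, vault: SsnVault) -> str:
    """Create an account keyed by a random ID; the SSN goes only to the vault."""
    user_id = new_user_id()
    accounts[user_id] = {"name": name}  # no SSN in the main account table
    vault.store(user_id, ssn)
    return user_id
```

Someone who walks off with the account table in this sketch gets names and random IDs. Matching those to an SSN requires a second, separate break-in.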

Critical data should be stored in a vault that can only be accessed in certain ways and never exposed. A prime example is the trusted enclave in an iPhone. This enclave, when used for TouchID or FaceID, stores your fingerprints and your face map. Pretty important stuff, yes? However, even with biometric ID systems becoming more prevalent there isn’t any way to extract that data from the enclave. It’s stored in such a way that it can only be queried in a specific manner and a result of yes/no returned from the query. If you stole my iPhone tomorrow, there’s no way for you to reconstruct my fingerprints from it. That’s the template we need to use going forward to protect our data.
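
The yes/no query pattern can be sketched in a few lines. This is not how the iPhone enclave is implemented, and real biometric matching is fuzzy rather than exact. It is only an illustration of an interface that accepts a candidate and answers with a boolean, without ever handing back what it holds. The class name `MatchOnlyVault` and its methods are hypothetical.

```python
import hashlib
import hmac
import secrets


class MatchOnlyVault:
    """Stores only keyed digests and answers match queries with True/False."""

    def __init__(self):
        # Random key that never leaves the vault object.
        self._key = secrets.token_bytes(32)
        self._digests = {}

    def _digest(self, secret: bytes) -> bytes:
        return hmac.new(self._key, secret, hashlib.sha256).digest()

    def enroll(self, user_id: str, secret: bytes) -> None:
        # Keep the keyed digest, not the secret itself.
        self._digests[user_id] = self._digest(secret)

    def matches(self, user_id: str, candidate: bytes) -> bool:
        stored = self._digests.get(user_id)
        if stored is None:
            return False
        # Constant-time comparison to avoid leaking how close a guess was.
        return hmac.compare_digest(stored, self._digest(candidate))
```

There is deliberately no `get` method. The only question a caller can ask is whether a candidate matches.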

Tom’s Take

I’m getting tired of being told that my data is being spread to the four winds thanks to it lying around waiting to be used for both legitimate and nefarious purposes. We can’t build fences high enough around critical data to keep it from being broken into. We can’t keep people out, so we need to start making the data less valuable. Instead of keeping it all together where it can be reconstructed into something of immense value, we need to make it hard to get all the pieces together at any one time. That means it’s going to be tough for us to build systems that put it all together too. But wouldn’t you rather spend your time solving a fun problem like that rather than making phone calls telling people your SSN got exposed on the open market?

Don’t Build Big Data With Bad Data

Posted on June 16, 2017 by networkingnerd

I was at Pure Accelerate 2017 this week and I saw some very interesting things around big data and the impact that high speed flash storage is going to have. Storage vendors serving that market are starting to include analytics capabilities on the box in an effort to provide extra value. But what happens when these advances cause issues in the training of algorithms?

Garbage In, Garbage Out

One story that came out of a conversation was about training a system to recognize people. In the process of training the system, the users imported a large number of faces in order to help the system start the process of differentiating individuals. The data set they started with? A collection of male headshots from the Screen Actors Guild. By the time the users caught the mistake, the algorithm had already proven that it had issues telling the difference between test subjects of particular ethnicities. After scrapping the data set and using some different diverse data sources, the system started performing much better.

This started me thinking about the quality of the data that we are importing into machine learning and artificial intelligence systems. The old computer adage of “garbage in, garbage out” has never been more apt than it is today. Before, bad inputs caused data to be suspect when extracted. Now, inputting bad data into a system designed to make decisions can have even more far-reaching consequences.

Look at all the systems that we’re programming today to be more AI-like. We’ve got self-driving cars that need massive data inputs to help navigate roads at speed. We have network monitoring systems that take analytics data and use it to predict things like component failures. We even have these systems running in the background of popular apps that provide us news and other crucial information.

What if the inputs into the system cause it to become corrupted or somehow compromised? You’ve probably heard the story about how importing UrbanDictionary into Watson caused it to start cursing constantly. These kinds of stories highlight how important the quality of data being used for the basis of AI/ML systems can be.

Think of a future when self-driving cars are being programmed with failsafes to avoid living things in the roadway. Suppose that the car has been programmed to avoid humans and other large animals like horses and cows. But, during the import of the small animal data set, the table for dogs isn’t imported for some reason. Now, what would happen if the car encountered a dog in the road? Would it make the right decision to avoid the animal? Would the outline of the dog trigger a subroutine that helped it make the right decision? Or would the car not be able to tell what a dog was and do something horrible?

Do You See What I See?

After some chatting with my friend Ryan Adzima, he taught me a bit about how facial recognition systems work. I had always assumed that these systems could differentiate on things like colors. So it could tell a blond woman from a brunette, for instance. But Ryan told me that it’s actually very difficult for a system to tell fine colors apart.

Instead, systems try to create contrast in the colors of the picture so that certain features stand out. Those features have a grid overlaid on them and then those grids are compared and contrasted. That’s the fastest way for a system to discern between individuals. It makes sense considering how CPU-bound things are today and the lack of high definition cameras to capture information for the system.

But, we also must realize that we have to improve data collection for our AI/ML systems in order to ensure that the systems are receiving good data to make decisions. We need to build validation models into our systems and checks to make sure the data looks and sounds sane at the point of input. These are the kinds of things that take time and careful consideration when planning to ensure they don’t become a hindrance to the system. If the very safeguards we put in place to keep data correct end up causing problems, we’re going to create a system that falls apart before it can do what it was designed to do.
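
A sanity check at the point of input does not have to be elaborate. Here is a minimal sketch, assuming each training record is a dictionary with `label` and `image_path` keys. That schema, the function name `validate_training_set`, and the threshold are all hypothetical. It would catch the missing dog table from the self-driving car example before training starts.

```python
def validate_training_set(records, required_labels, min_per_label=1):
    """Return a list of problems found; an empty list means the set looks sane."""
    problems = []
    counts = {label: 0 for label in required_labels}

    for index, record in enumerate(records):
        label = record.get("label")
        if not label:
            problems.append(f"record {index}: missing label")
            continue
        if not record.get("image_path"):
            problems.append(f"record {index}: missing image_path")
            continue
        # Only well-formed records count toward label coverage.
        if label in counts:
            counts[label] += 1

    for label, count in counts.items():
        if count < min_per_label:
            problems.append(
                f"label '{label}': only {count} example(s), need {min_per_label}"
            )
    return problems
```

A check like this only confirms that expected categories are present and records are well formed. It says nothing about whether the examples within a category are representative, which was the actual failure in the headshot story.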

Tom’s Take

I thought the story about the AI training was a bit humorous, but it does point to a huge issue with computer systems going forward. We need to be absolutely sure of the veracity of our data as we begin using it to train systems to think for themselves. Sure, teaching a Jeopardy-winning system to curse is one thing. But if we teach a system to be racist or murderous because of what information we give it to make decisions, we will have programmed a new life form to exhibit the worst of us instead of the best.

Drowning in the Data of Things

Posted on February 24, 2016 by networkingnerd

If you saw the news coming out of Cisco Live Berlin, you probably noticed that Internet of Things (IoT) was in every other announcement. I wrote about the impact of the new Digital Ceiling initiative already, but I think that IoT is a bit deeper than that. The other thing that seems to go hand in hand with discussion of IoT is big data. And for most of us, that big data is going to be a big problem.

Seen And Not Heard

Internet of Things is about dumb devices getting smart. Think Flowers for Algernon. Only now, instead of them just being smarter they are also going to be very talkative too. The amount of data that these devices used to hold captive will be unleashed on something. We assume that the data is going to be sent to a central collection point or polled from the device by an API call or a program that is mining the data for another party. But do you know who isn’t going to be getting that data? Us.

IoT devices are going to be talking to providers and data collection systems and, in a lot of cases, each other. But they aren’t going to be talking directly to end users or IT staff. That’s because IoT is about making devices intelligent enough to start making their own decisions about things. Remember when SDN came out and everyone started talking about networks making determinations about forwarding paths and topology changes without human inputs? Remember David Meyer talking about network fragility?

Now imagine that’s not the network any more. Imagine it’s everything. Devices talking to other devices and making decisions without human inputs. IoT gives machines the ability to make a limited amount of decisions based on data inputs. Factory floor running a bit too hot for the milling machines to keep up? Talk to the environmental controls and tell it to lower the temperature by two degrees for the next four hours. Is the shelf in the fridge where the milk is stored getting close to the empty milk jug weight? Order more milk. Did a new movie come out on Netflix that meets your viewing guidelines? Add that movie to the queue and have the TV turn on when your phone enters the house geofence.

Think about those processes for a moment. All of them are fairly easy conditional statements. If this, then do that. But conditional statements aren’t cut and dried. They require knowledge of constraints and combinations. And all that knowledge comes from data.
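
The factory floor example shows how quickly a simple conditional picks up constraints. In this sketch the trigger temperature, the minimum setpoint, the choice of Celsius, and the `CoolingRule` name are all assumptions added for illustration. Only the two-degree step and the four-hour hold come from the example above.

```python
from dataclasses import dataclass


@dataclass
class CoolingRule:
    trigger_temp_c: float          # act when the floor is hotter than this
    step_c: float = 2.0            # lower the setpoint by two degrees
    hold_hours: int = 4            # for the next four hours
    min_setpoint_c: float = 16.0   # constraint: never cool below this

    def decide(self, floor_temp_c: float, current_setpoint_c: float):
        """Return (new_setpoint_c, hold_hours), or None when no action is needed."""
        if floor_temp_c <= self.trigger_temp_c:
            return None
        new_setpoint = max(current_setpoint_c - self.step_c, self.min_setpoint_c)
        if new_setpoint == current_setpoint_c:
            return None  # already at the lowest allowed setpoint
        return new_setpoint, self.hold_hours
```

Every default in that class is a number somebody had to learn from data: how hot is too hot for the milling machines, and how cold is too cold for everything else in the building.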

More Data, More Problems

All of that data needs to be collected somehow. That means transport networks are going to be stressed now that there are ten times more devices chatting on them. And a good chunk of those devices, especially in the consumer space, are going to be wireless. Hope your wireless network is ready for that challenge. That data is going to be transported to some data sink somewhere. As much as we would like to hope that it’s a collector on our network, the odds are much better that it’s an off-site collector. That means your WAN is about to be stressed too.

How about storing that data? If you are lucky enough to have an onsite collection system you’d better start buying drives for it now. This is a huge amount of data. Nimble Storage has been collecting analytics data from their storage arrays for a while now. Every four hours they collect more data than there are stars in the Milky Way. Makes you wonder where they keep it? And how long are they going to keep that data? Just like the crap in your attic that you swear you’re going to get around to using one day, big data and analytics platforms will keep every shred of information you want to keep for as long as you want to have it taking up drive space.

And what about security? Yeah, that’s an even scarier thought. Realize that many of the breaches we’ve read about in the past months have been hackers having access to systems for extended periods of time and only getting caught after they have exfiltrated data from the system. Think about what might happen if a huge data sink is sitting around unprotected. Sure, terabytes worth of data may be noticed if someone tries to smuggle it out of the DLP device. But all it takes is a quick SQL query against the users tables for social security numbers, a program to transpose those numbers into letters to evade the DLP scanner, and you can just email the file to yourself. Script a change from letters back to numbers and you’ve got a gold mine that someone left unlocked and lying around. We may be concentrating on securing the data in flight right now, but even the best armored car does no good if you leave the bank vault door open.

Tom’s Take

This whole thing isn’t rain clouds and doom and gloom. IoT and Big Data represent a huge challenge for modern systems planning. We have the ability to unlock insight from devices that couldn’t tell us their secrets before. But we have to know how deep that pool will be before we dive in. We have to understand what these devices represent before we connect them. We don’t want our thermostats DDoSing our home networks any more than we want the milling machines on the factory floor coming to life and trying to find Sarah Connor. But the challenges we have with transporting, storing, and securing the data from IoT devices are no different than trying to program on punch cards or figure out how to download emails from across the country. Technology will give us the key to solve those challenges. Assuming we can keep our head above water.

Invalidating Identity Interdiction

Posted on July 21, 2015 by networkingnerd

It used to be that a data breach was a singular event that caused massive shock and concern. Today, data breaches happen regularly and, while still shocking in scope, are starting to dull the senses. Credit card numbers, security clearances, and even illicit dating profiles have been harvested, collated, and provided for everyone to expose. It seems to be an insurmountable problem. But why?

Data Cake

Data is a tantalizing thing. Collecting it makes life easier for customers and providers as well. Having your ordering history allows Amazon to suggest products you might like to buy. Having your address on file allows the pizza place to pull it up without you needing to read your address again. Creating a user account on a site lets you set preferences. All of this leads to a custom experience and lets us feel special and unique.

But, data is just like that slice of cheesecake you think you want for dessert. It looks so delicious and tempting. But you know it’s bad for you. It has calories and sugar and very little nutritional value. In the same manner, all that data you collect is a time bomb waiting to be exposed. The more data you collect, the larger the blowback for your eventual exposure.

Yes, we’ve crossed the line from “I might get hacked one day” to “I will absolutely be hacked in the next 24 months”. The amount of data being stored has increased a hundredfold in the past few years. Every website wants you to sign up. Every department store and restaurant has a preferred customer program. Everyone has a mobile app. And every one of these repositories has data that you don’t want anyone else to have. Hackers used to have to sift through garbage to find sensitive information. Now it can be stolen with a few redirects and no smelly excursions.

Even if a website or app claims they aren’t collecting your data for “nefarious” purposes, you can be sure it will eventually be used against you. And those are the “good guys”. What about sites that won’t let you delete your account? Or worse: the sites that claim they will and then don’t do it?

Having your data lying around in a dormant database is like putting money under a mattress. It doesn’t do the holding company any good. Just like the siren song of the above mentioned cheesecake, if you leave the data lying around long enough a company will decide to do analysis on it. It’s a slippery slope that a company will fall down given enough time.

Identity Isolation

How do we fix the problems of widespread data sprawl? Given that every app and website login is now an attack surface, how can we minimize the amount of leakage even in the event of disaster? Given that the best solution of not collecting the data in the first place isn’t likely to happen, we need a new solution.

Interestingly enough, wireless companies have stumbled onto the solution in the past year. They are using social media sites as a login for wireless access. Sites like Facebook contain all of the information you would need for a user account on a site. Facebook even offers a login option for many sites, tying their database entry to yours.

What needs to happen is that a site like Facebook or OpenID needs to become the de facto repository for identity information. By containing it all in one place and forcing sites to create links to your identity store, the amount of sensitive data being stored is minimized. Apps can still collect custom information above and beyond that which is provided by the identity store, but it would be paired with a GUID value pointing to an external login with very little identifying information. Like having a phone number but no name or address.

I know exactly what you’re thinking right now: What happens if the identity store gets compromised? Won’t they have all of my data? And my associations with other sites? The answer is “yes”. A compromise of the identity store would cause a lot of headache. At the same time, the keeper of the identity store knows that. When is the last time you’ve heard about Facebook having a data breach? Just like banks are serious about security due to public perception of them being robbed, so too would an identity service suffer if it were to be compromised.

Ask yourself this question: would you rather trust your data security to a company like Facebook, who has everything to lose if they are compromised? Or some app developer in a hurry to create a login profile that forgets to use encrypted passwords or salted hashes?
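For reference, salted hashing doesn't take much code, which makes skipping it even harder to excuse. The sketch below uses only the Python standard library. The function names and the iteration count are my own assumptions for illustration, not a recommendation for any particular system.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune for your own hardware


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) using a fresh random salt for every password."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)
```

Because each user gets a different salt, two people with the same password end up with different stored digests, so one precomputed table can't crack them all at once.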

Tom’s Take

The number of times I’ve had replacement credit cards issued because of potential data hacks is becoming annoying. I’m almost certain my data was in the OPM hack even though it was something from over a decade ago. It’s becoming irritating to know that my information is just a brute force attack away from being exposed and I have no control over it.

Making Facebook or similar identity brokers the authoritative database for identity is an imperfect solution. But it’s also the best solution we have today to the escalating problem of data breaches. I’d rather trust that Facebook is doing as much as possible to safeguard my data than believe for an instant that Home Depot cares about their preferred customer database over the credit card repository. Facebook may not be the most upstanding organization when it comes to data analytics, but I trust them not to treat my identity like a steaming pile of toxic waste.

Time For A Data Diet?

Posted on January 6, 2015 by networkingnerd

I’m running out of drive space. Not just on my laptop SSD or my desktop HDD. But everywhere. The amount of data that I’m storing now is climbing at an alarming rate. What’s worse is that I often forget I have some of it until I go spelunking back through my drive to figure out what’s taking up all that room. And it’s a problem that the industry is facing too.

The Data Junkyard

Data is accumulating. You can’t deny that. Two factors have led to this. The first is that we now log more data from things than ever before. In this recent post from Chris Evans (@ChrisMEvans), he mentions that Virgin Atlantic 787s are generating 500GB of data per flight. I’m sure that includes telemetry, aircraft performance, and other debugging information that someone at some point deemed crucial. In another recent article from Jacques Mattheij (@JMattheij), he mentions that app developers left the debug logging turned on, generating enormous data files as the system was in operation.

The second is storage. Years ago we didn’t have the space to store that much data. We had to be very specific about what needed to be captured and stored for long periods of time. I can remember having a 100MB hard drive in my first computer. I can also remember uninstalling and deleting several things in order to put a new program on. Now there is so much storage space that we don’t worry about running out unless a new application makes outrageous demands.

You Want To Use The Data?

The worst part about all this data accumulation is that once it’s been stored, no one ever looks at it again. This isn’t something that’s specific to electronic data, though. I can remember seeing legal offices with storage closets dedicated to boxes full of files. Hospitals have services that deal with medical record storage. In the old days, casinos hired vans to shuffle video tapes back and forth between vaults and security offices. All that infrastructure just on the off-chance that you might need the data one day.

With Big Data being a huge funding target and buzzword source today, you can imagine that every other startup in the market is offering to give you some insight into all that data that you’re storing. I’ve talked before about the drive for analysis of data. It’s the end result of companies trying to make sense of the chaos. But what about the stored data?

Odds are good that it’s going to just sit there in perpetuity. Once the analysis is done on all this data, it will either collect dust in a virtual file box until it is needed again (perhaps) in the far future or it will survive until the next SAN outage and never be reconstructed from backup. The funny thing about this collected data cruft is that no one misses it until the subpoena comes.

Getting Back To Fighting Weight

The solution to the problem isn’t doing more analysis on data. Instead, we need to start being careful about what data we’re storing in the first place. When you look at personal systems like Getting Things Done, they focus on stemming the flow of data quickly to give people more time to look at the important things. In much the same way, instead of capturing every bit coming from a data source and deciding later what to do with it, the decision needs to be made right away. Data Scientists need to start thinking like they’re on a storage budget, not like they’ve been handed the keys to the SAN kingdom.

I would be willing to bet that a few discrete decisions in the data collection process about what to keep and what to throw away would significantly cut down on the amount of data we need to store and process. Less time spent querying and searching through that mess would optimize data retrieval systems and make our infrastructure run much faster. Think of it like spring cleaning for the data garage.
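As a rough sketch of what one of those keep-or-toss decisions might look like, here is a filter that runs at collection time instead of after everything has hit the disk. The record layout and the `KEEP_LEVELS` policy are assumptions made up for this example.

```python
from typing import Iterable, Iterator

KEEP_LEVELS = {"WARNING", "ERROR"}  # assumed retention policy for this example


def keep_at_ingest(records: Iterable[dict]) -> Iterator[dict]:
    """Decide what to store as records arrive, dropping the rest immediately."""
    for record in records:
        if record.get("level") in KEEP_LEVELS:
            # Keep only the fields someone will actually query later.
            yield {"ts": record["ts"], "level": record["level"], "msg": record["msg"]}


incoming = [
    {"ts": 1, "level": "DEBUG", "msg": "cache probe", "payload": "x" * 1000},
    {"ts": 2, "level": "ERROR", "msg": "link down", "payload": "y" * 1000},
    {"ts": 3, "level": "INFO", "msg": "heartbeat", "payload": "z" * 1000},
]
stored = list(keep_at_ingest(incoming))
```

Three records come in and one trimmed record gets stored. The debug noise and the bulky payloads never reach the SAN in the first place.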

Tom’s Take

I remember a presentation at Networking Field Day a few years ago when Statseeker told us that they could scan data points from years in the past down to the minute. The room collectively gasped. How could you look that far back? How big are the drives in your appliance? The answer was easy: they don’t store every little piece of data coming from the system. They instead look at very specific things that tell them about the network and then record those with an eye to retrieval in the future. They optimize at storage time to reduce the cost of lookup in the future.
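I don’t know how Statseeker implements this internally, so treat the following as a generic illustration of optimizing at storage time and not a description of their product. It assumes samples arrive as `(timestamp_in_seconds, value)` pairs and keeps one small summary per minute instead of every raw reading. The function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_per_minute(samples):
    """Collapse raw (timestamp, value) samples into one summary row per minute."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts // 60].append(value)  # group by whole minute
    return {
        minute * 60: {"min": min(values), "max": max(values), "avg": mean(values)}
        for minute, values in sorted(buckets.items())
    }


raw = [(0, 10), (30, 20), (60, 40), (90, 60)]
summary = summarize_per_minute(raw)
```

The trade-off is deliberate: you give up the ability to replay every raw sample in exchange for a data set small enough to keep for years.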

Rather than collecting everything in the world in the hopes that it might be useful, we need to get away from the data hoarding mentality and trim down to something more agile. It’s the only way our data growth problem is going to get better in the near future.

If you’d like to hear some more thoughts on the growing data problem, be sure to check out the Tech Talk sponsored by Fusion-io.

Replacing Nielsen With Big Data

Posted on March 12, 2014 by networkingnerd

I’m a fan of television. I like watching interesting programs. It’s been a few years since one has caught my attention long enough to keep me watching for multiple seasons. Part of the reason is my fear that an awesome program is going to fail to reach a “target market” and will end up canceled just when it’s getting good. It’s happened to several programs that I liked in the past.

Sampling the Goods

Part of the issue with tracking the popularity of television programs comes from the archaic method in which programs are measured. Almost everyone has heard of the Nielsen ratings system. This sampling method was created in the 1930s as a way to measure radio advertising reach. In the 50s, it was adapted for use in television.

Nielsen selects target audiences that represent the greater whole. They ask users to keep written diaries of their television watching habits. They also have the ability to install a device called a set meter which allows viewers to punch in a code to identify themselves via age groups and lock in a view for a program. The set meter can tell the instant a channel is changed or the TV is powered off.

In theory, the sampling methodology is sound. In practice, it’s a bit shaky. View diaries are unreliable because people tend to overreport their view habits. If they feel guilty that they haven’t been writing anything down, they tend to look up something that was on TV and write it down. Diaries also can’t determine if a viewer watched the entire program or changed the channel in the middle. Set meters aren’t much better. The reliance on PIN codes to identify users can lead to misreported results. Adults in a hurry will sometimes punch in an easier code assigned to their children, leading to skewed age results.

Both the diary and the set meter fail to take into account the shift in viewing habits in modern households. Most of the TV viewing in my house takes place through time-shifted DVR recordings. My kids tend to monopolize the TV during the day, but even they are moving to using services like Netflix to watch all episodes of their favorite cartoons in one sitting. Neither of these viewing habits is easily tracked by Nielsen.

How can we find a happy medium? Sample sizes have been reduced significantly due to cord-cutting households moving to Internet distribution models. People tend to exaggerate or manipulate self-reported viewing results. Even “modern” Nielsen technology can’t keep up. What’s the answer?

Big Data

I know what you’re saying: “We’ve already got that with Nielsen, right?” Not quite. TV viewing habits have shifted in the past few years. So has TV technology. Thanks to the shift from analog broadcast signals to digital and the explosion of set top boxes for cable decryption and movie service usage, we now have a huge portal into the living room of every TV watcher in the world.

Think about it for a moment. The idea of a sample size works provided it’s a good representative sample. But tracking this data is problematic. If we have access to a way to crunch the actual data instead of extrapolating from incomplete sets, shouldn’t we use that instead? I’d rather believe the real numbers instead of trying to guess from unreliable sources.

This also fixes the issue of time-shifted viewing. Those same set top boxes are often responsible for recording the programs. They could provide information such as number of shows recorded versus viewed and whether or not viewers skip through commercials. For those that view on mobile devices that data could be compiled as well through integration with the set top box. User logins are required for mobile apps as it is today. It’s just a small step to integrating the whole package.

It would require a bit of technical upgrading on the client side. We would have to enable the set top boxes to report data back to a service. We could anonymize the data to a point to be sure that people aren’t being unnecessarily exposed. It will also have to be configured as an opt-out setting to ensure that the majority is represented. Opt-in won’t work because those checkboxes never get checked.

Advertisers are going to demand the most specific information about people that they can. The ratings service exists to work for the advertisers. If this plan is going to work, a new company will have to be created to collect and analyze this data. This way the analysis company can ensure that the data is specific enough to be of use to the advertisers while at the same time ensuring the protection of the viewers.

Tom’s Take

Every year, promising new TV shows are yanked off the airwaves because advertisers don’t see any revenue. Shows that have a great premise can’t get up to steam because of ratings. We need to fix this system. In the old days, the deluge of data would have drowned Nielsen. Today, we have the technology to collect, analyze, and store that data for eternity. We can finally get the real statistics on how many people watched Jericho or After MASH. Armed with real numbers, we can make intelligent decisions about what to keep on TV and what to jettison. And that’s a big data project I’d be willing to watch.

Big Data? Or Big Analysis?

Posted on June 20, 2013 by networkingnerd

Unless you’ve been living under a rock for the past few years, you’ve no doubt heard all about the problem that we have with big data. When you start crunching the numbers on data sets in the terabyte range the amount of compute power and storage space that you have to dedicate to the endeavor is staggering. Even at Dell Enterprise Forum some of the talk in the keynote addresses focused on the need to split the processing of big data down into more manageable parallel sets via use of new products such as the VRTX. That’s all well and good. That is, it’s good if you actually believe the problem is with the data in the first place.

Data Vs. Information

Data is just description. It’s a raw material. It’s no more useful to the average person than cotton plants or iron ore. Data is just a singular point on a graph with no axes. Nothing can be inferred from that data point unless you process it somehow. That’s where we start talking about information.

Information is the processed form of data. It’s digestible and coherent. It’s a collection of data points that tell a story or support a hypothesis. Information is actionable data. When I have information on something, I can make a decision or present my findings to someone to make a decision. The key is that it is a second-order product. Information can’t exist without data upon which to perform some kind of analysis. And therein lies the problem in our growing “big data” conundrum.

Big Analysis

Data is very sedentary. It doesn’t really do much after it’s collected. It may sit around in a database for a few days until someone needs to generate information from it. That’s where analysis comes into play. A table is just a table. It has a height and a width. It has a composition. That’s data. But when we analyze that table, we start generating all kinds of additional information about it. Is it comfortable to sit at the table? What color lamp goes best with it? Is it hard to move across the floor? Would it break if I stood on it? All of that analysis is generated from the data at hand. The data didn’t go anywhere or do anything. I created all that additional information solely from the data.

Look at the above Wikipedia entry for big data. The image on the screen is one of the better examples of information spiraling out of control from analysis of a data set. The picture is a visual example of Wikipedia edits. Note that it doesn’t have anything to do with the data contained in a particular entry. They’re just tracking what people did to describe that data or how they analyzed it. We’ve generated terabytes of information just doing change tracking on a data set. All that data needs to be stored somewhere. That’s what has people in IT sales salivating.

Guilt By Association

If you want to send a DBA screaming into the night, just mention the words associative entity (or junction table). In another lifetime, I was in college to become a DBA. I went through Intro to Databases and learned about all the constructs that we use to contain data sets. I might have even learned a little SQL by accident. What I remember most was about entities. Standard entities are regular data. They have a primary key that describes a row of data, such as a person or a vehicle. That data is pretty static and doesn’t change often. Case in point – how accurate is the height and weight entry on your driver’s license?

Associative entities, on the other hand, represent borderline chaos. These are analysis nodes. They contain foreign keys that reference the primary keys of at least two tables in a database. They are created when you are trying to perform some kind of analysis on those tables. They can be ephemeral and usually are generated on demand by things like SQL queries. This is the heart of my big data / big analysis issue. We don’t really care about the standard data entities. We only want the analysis and information that we get from the associative entities. The more information and analysis we desire, the more of these associative entities we create. Containing these descriptive sets is causing the explosion in storage and compute costs. The data hasn’t really grown. It’s our take on the data that has.
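If it has been a while since your own Intro to Databases class, here is a small refresher using Python’s built-in `sqlite3` module. The `person`, `vehicle`, and `ownership` tables and their contents are invented for illustration. The `ownership` table is the associative entity: it holds no data of its own, only references to rows in the other two tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person  (person_id  INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE vehicle (vehicle_id INTEGER PRIMARY KEY, model TEXT);

-- Associative entity: each row pairs one person with one vehicle.
CREATE TABLE ownership (
    person_id  INTEGER REFERENCES person(person_id),
    vehicle_id INTEGER REFERENCES vehicle(vehicle_id),
    PRIMARY KEY (person_id, vehicle_id)
);

INSERT INTO person  VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO vehicle VALUES (10, 'Truck'), (20, 'Sedan');
INSERT INTO ownership VALUES (1, 10), (1, 20), (2, 20);
""")

# Joining through the associative entity produces the "information".
rows = conn.execute("""
SELECT p.name, v.model
FROM ownership AS o
JOIN person  AS p ON p.person_id  = o.person_id
JOIN vehicle AS v ON v.vehicle_id = o.vehicle_id
ORDER BY p.name, v.model
""").fetchall()
```

Two people and two vehicles already yield three ownership rows. Every new question about how the standard entities relate adds another table like this one, and that is where the growth comes from.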

Crunch Time

What can we do? Sadly, not much. Our brains are hard-wired to try and make patterns out of seemingly unconnected things. It is a natural reaction that we try to bring order to chaos. Given all of the data in the world, the first thing we are going to want to do with it is try and make sense of it. Sure, we’ve found some very interesting underlying patterns through analysis, such as the well-worn story from last year of Target determining a girl was pregnant before her family knew. The purpose of all that analysis was pretty simple: Target wanted to know how to better pitch products to specific focus groups of people. They spent years of processing time and terabytes of storage all for the lofty goal of trying to figure out what 18-24 year old males are more likely to buy during the hours of 6 p.m. to 10 p.m. on weekday evenings. It’s a key to the business models of the future. Rather than guessing what people want, we have magical reports that tell us exactly what they want. Why do you think Facebook is so attached to the idea of “liking” things? That’s an advertiser’s dream. Getting your hands on a second-order analysis of Facebook’s Like database would be the equivalent of the advertising Holy Grail.

Tom’s Take

We are never going to stop creating analysis of data. Sure, we may run out of things to invent or see or do, but we will never run out of ways to ask questions about them. As long as pivot tables exist in Excel or inner joins happen in an Oracle database people are going to be generating analysis of data for the sake of information. We may reach a day where all that information finally buries us under a mountain of ones and zeroes. We brought it on ourselves because we couldn’t stop asking questions about buying patterns or traffic behaviors. Maybe that’s the secret to Zen philosophy after all. Instead of concentrating on the analysis of everything, just let the data be. Sometimes just existing is enough.

The Networking Nerd

Networking With A Side of Snark

Tag Archives: Big data
