I’m running out of drive space. Not just on my laptop SSD or my desktop HDD. But everywhere. The amount of data that I’m storing now is climbing at an alarming rate. What’s worse is that I often forget I have some of it until I go spelunking back through my drive to figure out what’s taking up all that room. And it’s a problem that the industry is facing too.
The Data Junkyard
Data is accumulating. You can’t deny that. Two factors have lead to this. The first is that we now log more data from things than ever before. In this recent post from Chris Evans (@ChrisMEvans), he mentions that Virgin Atlantic 787s are generating 500GB of data per flight. I’m sure that includes telemetry, aircraft performance, and other debugging information that someone at some point deemed crucial. In another recent article from Jacques Mattheij (@JMattheij), he mentions that app developers left the debug logging turned on, generating enormous data files as the system was in operation.
Years ago we didn’t have the space to store that much data. We had to be very specific about what needed to be capture and stored for long periods of time. I can remember having a 100MB hard drive in my first computer. I can also remember uninstalling and deleting several things in order to put a new program on. Now there is so much storage space that we don’t worry about running out unless a new application makes outrageous demands.
You Want To Use The Data?
The worst part about all this data accumulation is that once it’s been stored, no one ever looks at it again. This isn’t something that’s specific to electronic data, though. I can remember seeing legal offices with storage closets dedicated to boxes full of files. Hospitals have services that deal with medical record storage. In the old days, casinos hired vans to shuffle video tapes back and forth between vaults and security offices. All that infrastructure just on the off-chance that you might need the data one day.
With Big Data being a huge funding target and buzzword source today, you can imagine that every other startup in the market is offering to give you some insight into all that data that you’re storing. I’ve talked before about the drive for analysis of data. It’s the end result of companies trying to make sense of the chaos. But what about the stored data?
Odds are good that it’s going to just sit there in perpetuity. Once the analysis is done on all this data, it will either collect dust in a virtual file box until it is needed again (perhaps) in the far future or it will survive until the next SAN outage and never be reconstructed from backup. The funny thing about this collected data cruft is that no one misses it until the subpoena comes.
Getting Back To Fighting Weight
The solution to the problem isn’t doing more analysis on data. Instead, we need to start being careful about what data we’re storing in the first place. When you look at personal systems like Getting Things Done, they focus on stemming the flow of data quickly to give people more time to look at the important things. In much the same way, instead of capturing every bit coming from a data source and deciding later what to do with it, the decision needs to be made right away. Data Scientists need to start thinking like they’re on a storage budget, not like they’ve been handed the keys to the SAN kingdom.
I would be willing to bet that a few discrete decisions in the data collection process about what to keep and what to throw away would significantly cut down on the amount data we need to store and process. Less time spent querying and searching through that mess would optimize data retrieval systems and make our infrastructure run much faster. Think of it like spring cleaning for the data garage.
I remember a presentation at Networking Field Day a few years ago when Statseeker told us that they could scan data points from years in the past down to the minute. The room collectively gasped. How could you look that far back? How big are the drives in your appliance? The answer was easy: they don’t store every little piece of data coming from the system. They instead look at very specific things that tell them about the network and then record those with an eye to retrieval in the future. They optimize at storage time to help the impact of lookup in the future.
Rather than collecting everything in the world in the hopes that it might be useful, we need to get away from the data hoarding mentality and trim down to something more agile. It’s the only way our data growth problem is going to get better in the near future.
If you’d like to hear some more thoughts on the growing data problem, be sure to check out the Tech Talk sponsored by Fusion-io.