Who Wants To Save Forever?


At the recent SpectraLogic summit in Boulder, much of the discussion centered on the idea of storing data and media in perpetuity. Technology has reached the point where it is actually cheaper to keep something tucked away than to figure out whether it should be kept at all. That is leading to a huge influx of media resources being retained everywhere. The question now shifts away from storage and toward retrieval. Can you really save something forever?

Another One Bites The Dust

Look around your desk. See if you can put your hands on each of the following:

* A USB Flash drive
* A DVD-RW
* A CD-ROM
* A Floppy Disk (bonus points for 5.25")

Odds are good that you can find at least three of those four items. Each of them represents a common way of saving files in a removable format. I'm not even trying to cover every format that has been used (I'm looking at you, ZIP drives). Each of these has been tucked away in a backpack or handed to a colleague at some point to pass files back and forth.

Yet each of these formats has sooner or later been superseded by something better. Floppies were ultraportable but held very little. CD-ROMs held far more, but couldn't be rewritten without effort. DVD media never really got the chance to take off before network bandwidth eclipsed the capacity of a single disc. And USB drives, while the removable media du jour, are mainly used when you can't connect wirelessly.

Now, with cloud connectivity, the idea of having removable media to share files seems antiquated. Instead of copying files to a device and passing it around between machines, you simply copy those files to a central location and have your systems look there. And capacity is very rarely an issue. So long as you can bring new systems online to augment existing storage space, you can effectively store unlimited amounts of data forever.

But how do we extract data from old devices to keep in this new magical cloud? Saving media isn’t that hard. But getting it off the source is proving to be harder than one might think.

Take video, for instance. How can you extract data from an old 8mm video camera? It's not a standard size to convert to VHS (unless you can find an old converter at a junk store). There are a myriad of ways to extract the data once you get it hooked up to an input device. But what happens if the source device doesn't work any longer? If your 8mm camera is broken, you probably can't extract your media. Maybe there is a service that can do it, but you're going to pay for that privilege.

I Want To Break Free

Assuming you can even extract the source media files for storage, we start running into another issue. Once I’ve saved those files, how can I be sure that I can read them fifty years from now? Can I even be sure I can read them five years from now?

Data storage formats are a constantly-evolving discussion. All you have to do is look at Microsoft Office. Office is the most popular workgroup suite in the entire world. All of those files have to be stored in a format that allows them to be read. One might be forgiven for assuming that Microsoft Word document formats are all the same or at least similar enough to be backwards compatible across all versions.

They aren't. Each new version of the format includes a few new pieces that break backwards compatibility. Instead of leveraging new features like smaller file sizes or increased readability, we are forced to keep using old formats like Word 97-2003 to ensure the file can be read by whomever we send it to for review.

Even the most portable of formats suffers from this malady. The Portable Document Format (PDF) was designed by Adobe to be an application-independent way to display files using a page description language. This means that saving a file as a PDF on one system makes it readable on a wide variety of systems. PDF has become the de facto way to share files back and forth.

Yet it can suffer from format issues as well. PDF creation software like Adobe Acrobat isn’t immune from causing formatting problems. Files saved with certain attributes can only be read by updated versions of reader software that can understand them. The idea of a portable format only works when you restrict the descriptors available to the lowest common denominator so that all readers can display the format.

Part of this issue comes from the idea that companies feel the need to constantly “improve” things and force users to keep upgrading software to be able to read the new formats. While Adobe has offered the PDF format to ISO for standardization, adding new features through that process takes time and effort. Adobe would rather have you keep buying Acrobat to make PDFs and downloading new versions of Reader to decode those new files. It's a win for them and not as much of one for the consumers of the format.


Tom’s Take

I find it ironic that we have spent years and millions of dollars trying to find ways to convert data away from paper and into electronic formats, yet the papers we converted years ago are more readable than the data we have since stored in the cloud. The only limitation of paper is how long the physical medium lasts before it is obliterated.

Think of the Rosetta Stone or the Code of Hammurabi. We know about these things because they were etched into stone. Literally. Yet even the Rosetta Stone ran into file format issues: it wasn't until the hieroglyphs could be matched against the parallel Greek text that we were able to read them. If you want your data to stand the test of time, you need to think about more than the cloud. You need to make sure you can retrieve and read it as well.

Time For A Data Diet?


I’m running out of drive space. Not just on my laptop SSD or my desktop HDD. But everywhere. The amount of data that I’m storing now is climbing at an alarming rate. What’s worse is that I often forget I have some of it until I go spelunking back through my drive to figure out what’s taking up all that room. And it’s a problem that the industry is facing too.

The Data Junkyard

Data is accumulating. You can't deny that. Two factors have led to this. The first is that we now log more data from things than ever before. In a recent post, Chris Evans (@ChrisMEvans) mentions that Virgin Atlantic 787s generate 500GB of data per flight. I'm sure that includes telemetry, aircraft performance, and other debugging information that someone at some point deemed crucial. In another recent article, Jacques Mattheij (@JMattheij) mentions app developers who left debug logging turned on, generating enormous data files while the system was in operation.

The second is that storage has become cheap and plentiful. Years ago we didn't have the space to hold that much data. We had to be very specific about what needed to be captured and stored for long periods of time. I can remember having a 100MB hard drive in my first computer. I can also remember uninstalling and deleting several things just to make room for a new program. Now there is so much storage space that we don't worry about running out unless a new application makes outrageous demands.

You Want To Use The Data?

The worst part about all this data accumulation is that once it’s been stored, no one ever looks at it again. This isn’t something that’s specific to electronic data, though. I can remember seeing legal offices with storage closets dedicated to boxes full of files. Hospitals have services that deal with medical record storage. In the old days, casinos hired vans to shuffle video tapes back and forth between vaults and security offices. All that infrastructure just on the off-chance that you might need the data one day.

With Big Data being a huge funding target and buzzword source today, you can imagine that every other startup in the market is offering to give you some insight into all that data that you’re storing. I’ve talked before about the drive for analysis of data. It’s the end result of companies trying to make sense of the chaos. But what about the stored data?

Odds are good that it’s going to just sit there in perpetuity. Once the analysis is done on all this data, it will either collect dust in a virtual file box until it is needed again (perhaps) in the far future or it will survive until the next SAN outage and never be reconstructed from backup. The funny thing about this collected data cruft is that no one misses it until the subpoena comes.

Getting Back To Fighting Weight

The solution to the problem isn't doing more analysis on data. Instead, we need to start being careful about what data we're storing in the first place. Personal systems like Getting Things Done focus on processing the inflow of information quickly so people have more time for the important things. In much the same way, instead of capturing every bit coming from a data source and deciding later what to do with it, we need to make that decision right away. Data scientists need to start thinking like they're on a storage budget, not like they've been handed the keys to the SAN kingdom.
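As a rough sketch of what deciding at collection time could look like, here's a minimal Python example. The field names and the one-minute rollup are hypothetical, chosen only to show the shape of the idea: trim each record to the fields someone has committed to using, then store a per-minute summary instead of the raw firehose.

```python
from statistics import mean

# Fields someone has actually committed to using downstream (hypothetical names).
KEEP_FIELDS = {"timestamp", "engine_temp", "fuel_flow", "altitude"}

def trim_record(raw: dict) -> dict:
    """Drop everything at collection time except the fields we know we need."""
    return {k: v for k, v in raw.items() if k in KEEP_FIELDS}

def summarize_minute(samples: list) -> dict:
    """Collapse a minute's worth of trimmed samples into a single stored row
    instead of keeping every raw reading forever."""
    return {
        "timestamp": samples[0]["timestamp"],
        "engine_temp_avg": mean(s["engine_temp"] for s in samples),
        "fuel_flow_avg": mean(s["fuel_flow"] for s in samples),
        "altitude_max": max(s["altitude"] for s in samples),
    }
```

The specific math doesn't matter; what matters is that the keep-or-discard decision happens before anything lands on disk.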

I would be willing to bet that a few discrete decisions in the data collection process about what to keep and what to throw away would significantly cut down on the amount of data we need to store and process. Less time spent querying and searching through that mess would optimize data retrieval systems and make our infrastructure run much faster. Think of it like spring cleaning for the data garage.


Tom’s Take

I remember a presentation at Networking Field Day a few years ago when Statseeker told us that they could pull data points from years in the past down to the minute. The room collectively gasped. How could you look that far back? How big are the drives in your appliance? The answer was easy: they don't store every little piece of data coming from the system. They instead look at very specific things that tell them about the network and then record those with an eye to retrieval in the future. They optimize at storage time to ease lookups in the future.
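To make that principle concrete, here's a minimal sketch along the same lines. This is not Statseeker's actual implementation; the metric names and schema are invented for illustration. The idea is to keep a small, deliberate set of counters at one-minute granularity, with a storage key chosen so that looking back years later is a cheap indexed scan rather than a trawl through everything ever collected.

```python
import sqlite3
import time

# A fixed, deliberate set of metrics (hypothetical names) rather than
# everything the devices could possibly emit.
TRACKED_METRICS = ("if_in_octets", "if_out_octets", "if_errors")

conn = sqlite3.connect("metrics.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS samples (
           device TEXT, metric TEXT, minute INTEGER, value REAL,
           PRIMARY KEY (device, metric, minute)
       )"""
)

def record(device: str, metric: str, value: float) -> None:
    """Store one value per device/metric/minute; the primary key doubles as
    the retrieval index, so a lookup years later is a simple range scan."""
    if metric not in TRACKED_METRICS:
        return  # decided at collection time: this doesn't get kept
    minute = int(time.time() // 60)
    conn.execute(
        "INSERT OR REPLACE INTO samples VALUES (?, ?, ?, ?)",
        (device, metric, minute, value),
    )
    conn.commit()
```

The schema is the whole trick: the decision about what to keep, and how it will be looked up later, is baked in before the first byte is written.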

Rather than collecting everything in the world in the hopes that it might be useful, we need to get away from the data hoarding mentality and trim down to something more agile. It’s the only way our data growth problem is going to get better in the near future.


If you’d like to hear some more thoughts on the growing data problem, be sure to check out the Tech Talk sponsored by Fusion-io.