At the recent SpectraLogic summit in Boulder, much of the discussion centered on the idea of storing data and media in perpetuity. Technology has arrived at the point where it is actually cheaper to keep something tucked away than to figure out whether or not it should be kept. The result is a huge influx of stored media, available everywhere. The question now shifts away from storage and toward retrieval. Can you really save something forever?
Another One Bites The Dust
Look around your desk. See if you can put your hands on each of the following:
* A USB Flash drive
* A DVD-RW
* A CD-ROM
* A Floppy Disk (bonus points for 5.25")
Odds are good that you can find at least three of those four items. Each one represents a common way of saving files in a removable format. I’m not even trying to cover all of the formats that have been used (I’m looking at you, ZIP drives). Each of these formats has been tucked away in a backpack or handed to a colleague at some point to pass files back and forth.
Yet each of these formats has been superseded sooner or later by something better. Floppies were ultraportable but held very little. CD-ROMs held much more, but couldn’t be re-written without effort. DVD media never really got the chance to take off before bandwidth eclipsed the capacity of a single disc. And USB drives, while the removable media du jour, are mainly used when you can’t connect wirelessly.
Now, with cloud connectivity, the idea of using removable media to share files seems antiquated. Instead of copying files to a device and passing it around between machines, you simply copy those files to a central location and have your systems look there. And capacity is very rarely an issue. So long as you can bring new systems online to augment existing storage space, you can effectively store unlimited amounts of data forever.
But how do we extract data from old devices to keep in this new magical cloud? Saving media isn’t that hard. But getting it off the source is proving to be harder than one might think.
Take video, for instance. How can you extract data from an old 8mm video camera? It’s not a standard size to convert to VHS (unless you can find an old converter at a junk store). There are myriad ways to extract the data once you get it hooked up to an input device. But what happens if the source device doesn’t work any longer? If your 8mm camera is broken, you probably can’t extract your media. Maybe there is a service that can do it, but you’re going to pay for that privilege.
I Want To Break Free
Assuming you can even extract the source media files for storage, we start running into another issue. Once I’ve saved those files, how can I be sure that I can read them fifty years from now? Can I even be sure I can read them five years from now?
Data storage formats are constantly evolving. All you have to do is look at Microsoft Office. Office is the most popular productivity suite in the entire world. All of those files have to be stored in a format that allows them to be read. One might be forgiven for assuming that Microsoft Word document formats are all the same, or at least similar enough to be backwards compatible across all versions.
They aren’t. Each new version of the format includes a few new pieces that break backwards compatibility. Instead of taking advantage of new features like smaller file sizes or increased readability, users are forced to keep saving in old formats like Word 97-2003 to ensure that the file can be read by whomever they send it to for review.
Even the most portable of formats suffers from this malady. Portable Document Format (PDF) was designed by Adobe to be an application-independent way to display files using a page description language. This means that saving a file as a PDF on one system makes it readable on a wide variety of systems. PDF has become the de facto way to share files back and forth.
Yet it can suffer from format issues as well. PDF creation software like Adobe Acrobat can cause formatting problems of its own. Files saved with certain attributes can only be read by updated versions of reader software that understand them. The idea of a portable format only works when you restrict the descriptors available to the lowest common denominator so that all readers can display the format.
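If you’re curious which version of the format a given PDF claims to need, the file itself will tell you: PDFs begin with a header like `%PDF-1.4`. Here is a minimal Python sketch that reads it. The function names and the `reader_max` cutoff are my own inventions for illustration, not part of any Adobe tool, and the header is only a first hint, since newer files can override it with a version entry inside the document.

```python
import re
from typing import Optional, Tuple

# PDF files start with a header such as "%PDF-1.4"; the digits are the
# version of the PDF specification the writing software claims to follow.
_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_header_version(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the PDF header, or None if there is none."""
    # Only look near the start of the file, where the header is expected.
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def needs_newer_reader(data: bytes, reader_max: Tuple[int, int] = (1, 4)) -> bool:
    """True if the header declares a version above a hypothetical reader's limit.

    reader_max is an assumed cutoff for illustration, not a real product's limit.
    """
    version = pdf_header_version(data)
    if version is None:
        raise ValueError("no PDF header found")
    return version > reader_max
```

To try it on a real file, read the first kilobyte with `open(path, "rb").read(1024)` and pass the bytes in. A file that reports a higher version than your oldest reader supports is one to re-save in a more conservative format before you archive it.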
Part of this issue comes from the idea that companies feel the need to constantly “improve” things and force users to keep upgrading software to be able to read the new formats. While Adobe has offered the PDF format to ISO for standardization, adding new features through that process takes time and effort. Adobe would rather have you keep buying Acrobat to make PDFs and downloading new versions of Reader to decode those new files. It’s a win-win situation for them and not as much of one for the consumers of the format.
I find it ironic that we have spent years and millions of dollars trying to find ways to convert data away from paper and into electronic formats. The papers that we converted years ago are more readable than the data that we stored in the cloud. The only limitation of paper is how long the actual paper can last before being obliterated.
Think of the Rosetta Stone or the Code of Hammurabi. We know about these things because they were etched into stone. Literally. Yet, in the case of the Rosetta Stone, we ran into file format issues. It wasn’t until we could match the Egyptian hieroglyphs against the Greek text carved on the same stone that we were able to read them. If you want your data to stand the test of time, you need to think about more than the cloud. You need to make sure that you can retrieve and read it as well.