Time For A Data Diet?

I’m running out of drive space. Not just on my laptop SSD or my desktop HDD. But everywhere. The amount of data that I’m storing now is climbing at an alarming rate. What’s worse is that I often forget I have some of it until I go spelunking back through my drive to figure out what’s taking up all that room. And it’s a problem that the industry is facing too.

The Data Junkyard

Data is accumulating. You can’t deny that. Two factors have led to this. The first is that we now log more data from things than ever before. In this recent post from Chris Evans (@ChrisMEvans), he mentions that Virgin Atlantic 787s are generating 500GB of data per flight. I’m sure that includes telemetry, aircraft performance, and other debugging information that someone at some point deemed crucial. In another recent article from Jacques Mattheij (@JMattheij), he mentions that app developers left debug logging turned on, generating enormous data files while the system was in operation.

Years ago we didn’t have the space to store that much data. We had to be very specific about what needed to be captured and stored for long periods of time. I can remember having a 100MB hard drive in my first computer. I can also remember uninstalling and deleting several things just to make room for a new program. Now there is so much storage space that we don’t worry about running out unless a new application makes outrageous demands.

You Want To Use The Data?

The worst part about all this data accumulation is that once it’s been stored, no one ever looks at it again. This isn’t something that’s specific to electronic data, though. I can remember seeing legal offices with storage closets dedicated to boxes full of files. Hospitals have services that deal with medical record storage. In the old days, casinos hired vans to shuffle video tapes back and forth between vaults and security offices. All that infrastructure just on the off-chance that you might need the data one day.

With Big Data being a huge funding target and buzzword source today, you can imagine that every other startup in the market is offering to give you some insight into all that data that you’re storing. I’ve talked before about the drive for analysis of data. It’s the end result of companies trying to make sense of the chaos. But what about the stored data?

Odds are good that it’s going to just sit there in perpetuity. Once the analysis is done on all this data, it will either collect dust in a virtual file box until it is needed again (perhaps) in the far future or it will survive until the next SAN outage and never be reconstructed from backup. The funny thing about this collected data cruft is that no one misses it until the subpoena comes.

Getting Back To Fighting Weight

The solution to the problem isn’t doing more analysis on data. Instead, we need to start being careful about what data we’re storing in the first place. When you look at personal systems like Getting Things Done, they focus on stemming the flow of data quickly to give people more time to look at the important things. In much the same way, instead of capturing every bit coming from a data source and deciding later what to do with it, the decision needs to be made right away. Data Scientists need to start thinking like they’re on a storage budget, not like they’ve been handed the keys to the SAN kingdom.

I would be willing to bet that a few discrete decisions in the data collection process about what to keep and what to throw away would significantly cut down on the amount of data we need to store and process. With less of that mess to query and search through, data retrieval systems would run leaner and our infrastructure would be much faster. Think of it like spring cleaning for the data garage.
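To make that concrete, here’s a minimal sketch of what deciding at collection time might look like: a filter that runs at ingest and only lets whitelisted fields through, dropping debug-level noise entirely. The field names and rules are hypothetical, just an illustration of the idea rather than anyone’s real pipeline.

```python
# A minimal sketch of filtering at collection time instead of archiving
# everything. Field names and retention rules are hypothetical examples.

KEEP_FIELDS = {"timestamp", "source", "metric", "value"}  # the data worth keeping
DROP_LEVELS = {"debug", "trace"}                          # never archive this noise

def filter_at_ingest(record):
    """Return a slimmed-down record, or None if it should be thrown away."""
    if record.get("level", "info") in DROP_LEVELS:
        return None  # debug logging never hits the disk
    return {k: v for k, v in record.items() if k in KEEP_FIELDS}

# Only the records that survive the filter ever get written to storage.
incoming = [
    {"timestamp": 1, "source": "engine-1", "metric": "egt", "value": 712, "level": "info"},
    {"timestamp": 1, "source": "engine-1", "metric": "heap_dump", "value": "...", "level": "debug"},
]
to_store = [r for r in (filter_at_ingest(rec) for rec in incoming) if r is not None]
print(to_store)  # one record kept, one discarded before it ever took up space
```

The point isn’t the particular fields; it’s that the keep-or-toss decision happens once, up front, instead of being deferred to some future archaeology project.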


Tom’s Take

I remember a presentation at Networking Field Day a few years ago when Statseeker told us that they could scan data points from years in the past down to the minute. The room collectively gasped. How could you look that far back? How big are the drives in your appliance? The answer was easy: they don’t store every little piece of data coming from the system. They instead look at very specific things that tell them about the network and then record those with an eye toward future retrieval. They optimize at storage time to reduce the cost of lookups later.

Rather than collecting everything in the world in the hopes that it might be useful, we need to get away from the data hoarding mentality and trim down to something more agile. It’s the only way our data growth problem is going to get better in the near future.


If you’d like to hear some more thoughts on the growing data problem, be sure to check out the Tech Talk sponsored by Fusion-io.

Statseeker – Information Is Ammunition

The first presenter at Network Field Day 4 came to us from another time and place. Stewart Reed came all the way from Brisbane, Australia to talk to us about his network monitoring software from Statseeker. I’ve seen Statseeker before at Cisco Live, and you likely have too if you’ve been. They’re the group that always gives away a Statseeker-themed Mini on the show floor. They’ve also recently done a podcast with the Packet Pushers.

We got into the room with Stewart and he gave us a great overview of who Statseeker is and what they do:

He’s a great presenter and really hits on the points that differentiate Statseeker. I was amazed by the fact that they said they can keep historical data for a very long period of time. I managed to crash a network monitoring system years ago by trying to monitor too many switch ports. Keeping up with all that information was like drinking from a firehose. Trying to keep that data for long periods of time was a fantasy. Statseeker, on the other hand, has managed to find a way to not only keep up with all that information but keep it around for later use. Stewart said one of my new favorite quotes during the presentation: “Whoever has the best notes wins.” Not only do they have notes that go back a long time, but their notes don’t suffer from averaging abstraction. When most systems say that they keep data for long periods of time, what they really mean is that they keep the 15 or 30 minute average data for a while. I’ve even seen some go to daily or weekly data points in order to reduce the amount of stored data. Statseeker takes one minute data polls and keeps those one minute polls for the life of the data. I can drill into the interface stats at 8:37 on June 10th, 2008 if I want. Do you think anyone really wants to argue with someone that keeps notes like that?
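To put those one minute polls in perspective, here’s a rough back-of-the-envelope sketch of the storage cost of full-resolution retention versus 30-minute rollups. The bytes-per-sample figure is my own assumption for illustration, not a number Statseeker gave us.

```python
# Rough storage math: raw one-minute polls versus 30-minute averages.
# The 8-bytes-per-sample figure is an assumption, not a Statseeker number.

BYTES_PER_SAMPLE = 8               # assumed fixed-size counter sample
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 one-minute polls per year

def per_metric_per_year(poll_interval_minutes):
    """Bytes needed to keep one metric on one port for a full year."""
    samples = MINUTES_PER_YEAR // poll_interval_minutes
    return samples * BYTES_PER_SAMPLE

print(per_metric_per_year(1))    # ~4.2 MB per metric per port per year
print(per_metric_per_year(30))   # ~140 KB per metric per port per year
```

Even at full one-minute resolution, a single counter only costs a few megabytes per port per year, which is why keeping every poll forever can work as long as the set of things being polled stays small.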

Of course, what would Network Field Day be without questions:

One of the big things that comes right out in this discussion is the idea that Statseeker doesn’t allow for custom SNMP monitoring. By restricting the number of OIDs that can be monitored to a smaller subset, they enable the large-scale port monitoring and long term data storage that Statseeker provides. I mean, when you get right down to it, how many times have you had to write your own custom SNMP query for an odd OID? The majority of Statseeker’s customers are likely going to have something like 90% overlap in what they want to look at. Restricting the ability to get crazy with monitoring makes this product simple to install and easy to manage. At the risk of overusing a cliche, this is more in line with the Apple model of restriction with a focus on ease of use. Of course, if Statseeker wants to start referring to themselves as the Apple of Network Monitoring, by all means go right ahead.
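As a rough illustration of that restriction, here’s a hypothetical sketch of what a fixed polling list could look like: a handful of standard IF-MIB counters, with anything outside the list rejected before it ever gets polled. The OIDs are the standard IF-MIB ones, but the code is purely my own illustration and not Statseeker’s implementation.

```python
# Hypothetical illustration of a restricted OID set: only a small list of
# standard IF-MIB counters can be polled, so every sample has the same shape
# and long-term storage stays predictable. Not Statseeker's actual design.

SUPPORTED_OIDS = {
    "ifHCInOctets":  "1.3.6.1.2.1.31.1.1.1.6",
    "ifHCOutOctets": "1.3.6.1.2.1.31.1.1.1.10",
    "ifInErrors":    "1.3.6.1.2.1.2.2.1.14",
    "ifOutErrors":   "1.3.6.1.2.1.2.2.1.20",
    "ifOperStatus":  "1.3.6.1.2.1.2.2.1.8",
}

def build_poll_list(requested):
    """Map requested metric names to OIDs, refusing anything off the list."""
    unknown = [name for name in requested if name not in SUPPORTED_OIDS]
    if unknown:
        raise ValueError(f"unsupported metrics (no custom OIDs here): {unknown}")
    return [SUPPORTED_OIDS[name] for name in requested]

# A request for a vendor-specific OID fails up front instead of being polled.
oids = build_poll_list(["ifHCInOctets", "ifHCOutOctets", "ifOperStatus"])
```

Keeping the poll list small and uniform is what lets the per-port math above stay bounded no matter how many ports you throw at the box.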

The other piece from this second video that I liked was the mention that the minimum Statseeker license is 1000 units. Stewart admits that below that point, the argument for Statseeker begins to break down somewhat. This kind of admission is refreshing in the networking world. You can’t be everything to everyone. By focusing on long term data storage and quick polling intervals, you obviously have to scale your system to hit a specific port count target. If you really want to push that same product down into an environment that only monitors around 200 ports, you are going to have to make some concessions. You also have to compete with smaller, cheaper tools like MRTG and Cacti. I love that they know where they compete best and don’t worry about trying to sell to everyone.

Of course, a live demo never hurts:

If you’d like to learn more about Statseeker, you can head over to their website at http://www.statseeker.com/. You can also follow them on Twitter as @statseeker. Be sure to tell them to change their avatar and tweet more. You can also hear about Statseeker’s presentation in Packet Pushers Priority Queue Show 14.


Tom’s Take

Statseeker has some amazing data gathering capabilities.  I personally have never needed to go back three years to win an argument about network performance, but knowing that I can is always nice.  Add in the fact that I can monitor every port on the network and you can see the appeal.  I don’t know if Statseeker really fits into the size of environment that I typically work in, but it’s nice to know that it’s there in case I need it.  I expect to see some great things from them in the future and I might even put my name in the hat for the car at Cisco Live next year.

Tech Field Day Disclaimer

Statseeker was a sponsor of Network Field Day 4. As such, they were responsible for covering a portion of my travel and lodging expenses while attending Network Field Day 4. They did not ask for, nor were they promised, any kind of consideration in the writing of this review. The opinions and analysis provided within are my own and any errors or omissions are mine and mine alone.

Additional Network Field Day 4 Coverage:

Statseeker – The Lone Sysadmin

Statseeker – Keeping An Eye On The Little Things – Lamejournal