AI Is Making Data Cost Too Much


You may recall that I wrote a piece almost six years ago comparing big data to nuclear power. Part of the purpose of that piece was to knock the wind out of the “data is oil” comparisons that were so popular at the time. Today’s landscape is totally different thanks to the shifts the IT industry has undergone in the past few years. I now believe that AI is going to cause a massive transfer of wealth away from the AI companies and shift startup economics along the way.

Can AI Really Work for Enterprises?

In this episode of Packet Pushers, Greg Ferro and Brad Casemore debate a number of topics around the future of networking. One of the things Brad brought up, which Greg underlined, is that the data being used for AI algorithm training is being stored in the cloud. That massive amount of data sits there waiting to be used between training runs, and it’s costing some AI startups a fortune in cloud bills.

AI algorithms need to be trained to be useful. When you use ChatGPT to write a term paper or ask it nonsensical questions, you’re consuming the output of a GPT training run. The real work happens when OpenAI is crunching data and feeding their monster. They have to give it a set of parameters and data to analyze in order to come up with the magic that you see in the prompt window. That data doesn’t come out of nowhere. It has to be compiled and analyzed.

A lot of content creators are angry that their words are being fed into GPT training runs and then showing up in the results without proper credit. That means OpenAI is scraping content from the web and feeding it into the algorithm without much care for what they’re taking. It also creates issues because the validity and accuracy of the data aren’t verified ahead of time.

Now, this focuses on OpenAI and GPT specifically because everyone seems to think that’s what AI is right now. Much like every solution in the history of IT, GPT-based large language models (LLMs) are just a stepping stone along the way to a greater understanding of what AI can do. The real value for organizations, as Greg pointed out in the podcast, can be something as simple as analyzing the trouble ticket a user has submitted and then offering directed questions to clarify it for the help desk, so they spend less time chasing false leads.
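To make that concrete, here’s a minimal sketch of what that kind of ticket triage could look like. Everything in it is an assumption for illustration: ask_model() is a hypothetical stand-in for whatever hosted LLM API you’d actually call, and the prompt wording is mine, not any product’s.

```python
# Minimal sketch of LLM-assisted ticket triage, assuming a hosted model.
# ask_model() is a hypothetical stand-in for your provider's SDK call;
# it returns a canned answer here so the sketch runs on its own.

def ask_model(prompt: str) -> str:
    """Placeholder for a call to a hosted LLM endpoint."""
    return ("1. Which device and application are affected?\n"
            "2. Wired or wireless connection?\n"
            "3. When did the problem start, and has anything changed?")

def clarify_ticket(ticket_text: str) -> str:
    # Ask the model for directed questions rather than a guess at the fix,
    # so the help desk spends less time chasing false leads.
    prompt = (
        "Read this trouble ticket and list the clarifying questions a "
        f"technician should ask before troubleshooting:\n\n{ticket_text}"
    )
    return ask_model(prompt)

print(clarify_ticket("My internet is broken again. Nothing works."))
```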

No Free Lunchboxes

Where are organizations going to store all that data? In the old days it would have been collected on on-prem storage arrays that weren’t being used for anything else. The opportunity cost of using something you already owned was minimal. After all, you bought the capacity, so why not use it? Organizations that took this approach decided to save every data point they could find in an effort to “mine” for insights later. Hence the references to oil and other natural resources.

Today’s world is different. LLMs need massive resources to run. Unless you’re willing to drop several million dollars to build out your own cluster and hire engineers to keep it running at peak performance, you’re probably going to use a hosted cloud solution. That’s easy enough to set up and run, and you’re only paying for what you use. CPU and GPU time is expensive, so you want the job to complete as fast as possible in order to keep your costs low.
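Some napkin math shows why run time matters so much. The rates below are placeholder assumptions, not any provider’s actual price sheet:

```python
# Rough training-run cost estimator. All rates are illustrative
# placeholders -- check your cloud provider's real pricing.

GPU_HOURLY_RATE = 3.00   # assumed $/hour for one cloud GPU instance
NUM_GPUS = 64            # assumed size of the training cluster
RUN_HOURS = 24 * 14      # a hypothetical two-week training run

compute_cost = GPU_HOURLY_RATE * NUM_GPUS * RUN_HOURS
print(f"Compute for one run: ${compute_cost:,.2f}")
# Halving the run time halves this number, which is why finishing
# the job faster translates directly into a smaller bill.
```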

What about the data you need to feed to the algorithm? Are you going to feed it from your on-prem storage? That’s way too slow, even with super-fast WAN links. You need to get the data as close to the processors as possible, which means migrating it into the cloud. You need to keep it there while the magic AI-building machine does the work. Are you going to keep that valuable data in the cloud, incurring costs every hour it’s stored there? Or are you going to pay to have it moved back to your enterprise? Either way, the sound of the cash register is deafening to your finance department and music to the ears of the cloud providers and the storage vendors selling them exabytes of capacity.
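To put rough numbers on that choice, here’s a sketch comparing the two options. The dataset size and per-GB rates are assumptions for illustration; verify them against your own provider’s pricing:

```python
# Keep the training data in the cloud, or pull it back on-prem?
# All figures below are rough assumptions for illustration only.

DATASET_TB = 500                 # hypothetical training corpus size
STORAGE_PER_GB_MONTH = 0.023     # assumed object-storage rate, $/GB/month
EGRESS_PER_GB = 0.09             # assumed data-transfer-out rate, $/GB

gb = DATASET_TB * 1024
monthly_storage = gb * STORAGE_PER_GB_MONTH
one_time_egress = gb * EGRESS_PER_GB

print(f"Keep it in the cloud: ${monthly_storage:,.0f} every month")
print(f"Move it back home:    ${one_time_egress:,.0f} once, plus on-prem costs")
# Either way the meter runs: park the data and pay monthly,
# or pay the egress toll to bring it back.
```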

All those hopes of making tons of money from your AI insights are going to evaporate in a pile of cloud bills. The operational cost of keeping that data is now far from minimal. If you want good data to operate on, you’re going to need to keep it. And if you can’t keep it locally in your organization, you’re going to have to pay someone to keep it for you. That means writing big checks to cloud providers with effectively infinite storage, bounded only by the limit on your credit card or purchase order. That kind of wealth transfer makes investors hesitant when they aren’t going to get the casino-like payouts they’d been hoping for.

The shift will force AI startups to be very frugal about what they keep. They will either amass data only when they think their algorithm is ready for a run, or keep only the critical data they know they’re going to need to feed the monster. That means they’re going to be gambling with the accuracy of the resulting software, as well as giving up the chance that some seemingly insignificant piece of data ends up being the key to a huge shift. In essence, the software will all start looking and sounding the same after a while, and there won’t be enough differentiation to make it competitive because no one will be able to afford the data that would set it apart.


Tom’s Take

The relative ease with which data could be stored turned companies into data hoarders. They kept it forever, hoping they could get some value out of it and create a return curve that soared to the moon. Instead, the application for all that data mining finally came along, and everyone realized that getting the value out of the data meant investing even more capital into refining it. That kind of investment flattens those curves and makes investors more reluctant. The shift means more work and a less astronomical payout. All because your resources were more costly than you first thought.
