AI Is Making Data Cost Too Much

Posted on November 7, 2023 by networkingnerd

You may recall that I wrote a piece almost six years ago comparing big data to nuclear power. Part of the purpose of that piece was to knock the wind out of the “data is oil” comparisons that were so popular. Today’s landscape is totally different now thanks to the shifts that the IT industry has undergone in the past few years. I now believe that AI is going to cause a massive amount of wealth transfer away from the AI companies and cause startup economics to shift.

Can AI Really Work for Enterprises?

In this episode of Packet Pushers, Greg Ferro and Brad Casemore debate a lot of topics around the future of networking. One of the things that Brad brought up that Greg pointed out is that data being used for AI algorithm training is being stored in the cloud. That massive amount of data is sitting there waiting to be used between training runs and it’s costing some AI startups a fortune in cloud costs.

AI algorithms need to be trained to be useful. When someone uses ChatGPT to write a term paper or ask nonsensical questions you’re using the output of the GPT training run. The real work happens when OpenAI is crunching data and feeding their monster. They have to give it a set of parameters and data to analyze in order to come up with the magic that you see in the prompt window. That data doesn’t just come out of nowhere. It has to be compiled and analyzed.

There are a lot of creators of content that are angry that their words are being fed into the GPT algorithm runs and then being used in the results without giving proper credit. That means that OpenAI is scraping the content from the web and feeding it into the algorithm without care for what they’re looking it. It also creates issues where the validity and the accuracy of the data isn’t verified ahead of time.

Now, this focuses on OpenAI and GPT specifically because everyone seems to think that’s AI right now. Much like every solution in the history of IT, GPT-based large language models (LLMs) are just a stepping stone along the way to greater understanding of what AI can do. The real value for organizations, as Greg pointed out in the podcast, can be something as simple as analyzing the trouble ticket a user has submitted and then offering directed questions to help clarify the ticket for the help desk so they spend less time chasing false leads.

No Free Lunchboxes

Where are organizations going to store that data? In the old days it was going to be collected in on-prem storage arrays that weren’t being used for anything else. The opportunity cost of using something you already owned was minimal. After all, you bought the capacity so why not use it? Organizations that took this approach decided to just save every data point they could find in an effort to “mine” for insights later. Hence the references to oil and other natural resources.

Today’s world is different. LLMs need massive resources to run. Unless you’re willing to drop several million dollars to build out your own cluster resources and hire engineers to keep the running at peak performance you’re probably going to be using a hosted cloud solution. That’s easy enough to set up and run. And you’re only paying for what you use. CPU and GPU times are important so you want the job to complete as fast as possible in order to keep your costs low.

What about the data that you need to feed to the algorithm? Are you going to feed it from your on-prem storage? That’s way too slow, even with super fast WAN links. You need to get the data as close to the processors as possible. That means you need to migrate it into the cloud. You need to keep it there while the magic AI building machine does the work. Are you going to keep that valuable data in the cloud, incurring costs every hour it’s stored there? Or are you going to pay to have it moved back to your enterprise? Either way the sound of a cash register is deafening to your finance department and music to the ears of cloud providers and storage vendors selling them exabytes of data storage.

All those hopes of making tons of money from your AI insights are going to evaporate in a pile of cloud bills. The operations costs of keeping that data are now more than minimal. If you want to have good data to operate on you’re going to need to keep it. And if you can’t keep it locally in your organization you’re going to have to pay someone to keep it for you. That means writing big checks to the cloud providers that have effectively infinite storage, bounded only by the limit on your credit card or purchase order. That kind of wealth transfer makes investors seem a bit hesitant when they aren’t going to get the casino-like payouts they’d been hoping for.

The shift will cause AI startups to be very frugal in what they keep. They will either amass data only when they think their algorithm is ready for a run or keep only critical data that they know they’re going to need to feed the monster. That means they’re going to be playing a game with the accuracy of the resulting software as well as giving up chances that some insignificant piece of data ends up being the key to a huge shift. In essence, the software will all start looking and sounding the same after a while and there won’t be enough differentiation to make they competitive because no one will be able to afford it.

Tom’s Take

The relative ease with which data could be stored turned companies into data hoarders. They kept it forever hoping they could get some value out of it and create a return curve that soared to the moon. Instead, the application for that data mining finally came along and everyone realized that getting the value out of the data meant investing even more capital into refining it. That kind of investment makes those curves much flatter and makes investors more reluctant. That kind of shift means more work and less astronomical payout. All because your resources were more costly than you first thought.

The Dangers of Knowing Everything

Posted on March 10, 2023 by networkingnerd

By now I’m sure you’ve heard that the Internet is obsessed with ChatGPT. I’ve been watching from the sidelines as people find more and more uses for our current favorite large language model (LLM) toy. Why a toy and not a full-blown solution to all our ills? Because ChatGPT has one glaring flaw that I can see right now that belies its immaturity. ChatGPT knows everything. Or at least it thinks it does.

Unknown Unknowns

If I asked you the answer to a basic trivia question you could probably recall it quickly. Like “who was the first president of the United States?” These are answers we have memorized over the years to things we are expected to know. History, math, and even written communication has questions and answers like this. Even in an age of access to search engines we’re still expected to know basic things and have near-instant recall.

What if I asked you a trivia question you didn’t know the answer to? Like “what is the name of the metal cap at the end of a pencil?” You’d likely go look it up on a search engine or on some form of encyclopedia. You don’t know the answer so you’re going to find it out. That’s still a form of recall. Once you learn that it’s called a ferrule you’ll file it away in the same place as George Washington, 2+2, and the aglet as “things I just know”.

Now, what if I asked you a question that required you to think a little more than just recalling info? Such as “Who would have been the first president if George Washington refused the office?” Now we’re getting into more murky territory. Instead of being able to instantly recall information you’re going to have analyze what you know about the situation. For most people that aren’t history buffs they might recall who Washington’s vice president was and answer with that. History buffs might take more specialized knowledge about matters would apply additional facts and infer a different answer, such as Jefferson or even Samuel Adams. They’re adding more information to the puzzle to come up with a better answer.

Now, for completeness sake, what if I asked you “Who would have become the Grand Vizier of the Galactic Republic if Washington hadn’t been assassinated by the separatists?” You’d probably look at me like I was crazy and say you couldn’t answer a question like that because I made up most of that information or I’m trying to confuse you. You may not know exactly what I’m talking about but you know, based on your knowledge of elementary school history, that there is no Galactic Republic and George Washington was definitely not assassinated. Hold on to this because we’ll come back to it later.

Spinning AI Yarns

How does this all apply to a LLM? The first thing to realize is that LLMs are not replacements for search engines. I’ve heard of many people asking ChatGPT basic trivia and recall type questions. That’s not what LLMs are best at. We have a multitude of ways to learn trivia and none of them need the power of a cloud-scale computing cluster interpreting inputs. Even asking that trivia question to a smart assistant from Apple or Amazon is a better way to learn.

So what does an LLM excel at doing? Nvidia will tell you that it is “a deep learning algorithm that can recognize, summarize, translate, predict and generate text and other content based on knowledge gained from massive datasets”. In essence it can take a huge amount of input, recognize certain aspects of it, and produce content based on the requirements. That’s why ChatGPT can “write” things in the style of something else. It knows what that style is supposed to look and sound like and can produce an output based on that. It analyzes the database and comes up with the results using predictive analysis to create grammatically correct output. Think of it like Advanced Predictive Autocorrect.

If you think I’m oversimplifying what LLMs like ChatGPT can bring to the table then I challenge you to ask it a question that doesn’t have an answer. If you really want to see it work some magic ask it something oddly specific about something that doesn’t exist, especially if that process involves steps or can be broken down into parts. I’d bet you get an answer at least as many times as you get something back that is an error message.

To me, the problem with ChatGPT is that the model is designed to produce an answer unless it has specifically been programmed not to do so. There are a variety of answers that the developers have overridden in the algorithm, usually something racially or politically sensitive. Otherwise ChatGPT is happy to spit out lots of stuff that looks and sounds correct. Case in point? This gem of a post from Joy Larkin of ZeroTier:

https://mastodon.social/@joy/109859024438664366

Short version: ChatGPT gave a user instructions for a product that didn’t exist and the customer was very frustrated when they couldn’t find the software to download on the ZeroTier site. The LLM just made up a convincing answer to a question that involved creating something that doesn’t exist. Just to satisfy the prompt.

Does that sound like a creative writing exercise to you? “Imagine what a bird would look like with elephant feet.” Or “picture a world where people only communicated with dance.” You’ve probably gone through these exercises before in school. You stretch your imagination to take specific inputs and produce outputs based on your knowledge. It’s like the above mention of applied history. You take inputs and produce a logical outcome based on facts and reality.

ChatGPT is immature enough to not realize that some things shouldn’t be answered. If you use a search engine to find the steps to configure a feature on a product the search algorithm will return a page that has the steps listed. Are the correct? Maybe. Depends on how popular the result is. But the results will include a real product. If you search for nonexistent functionality or a software package that doesn’t exist your search won’t have many results.

ChatGPT doesn’t have a search algorithm to rely on. It’s based on language. It’s designed to approximate writing when given a prompt. That means, aside from things it’s been programmed not to answer, it’s going to give you an answer. Is it correct? You won’t know. You’d have to take the output and send it to a search engine to determine if that even exists.

The danger here is that LLMs aren’t smart enough to realize they are creating fabricated answers. If someone asked me how to do something that I didn’t know I would preface my answer with “I’m not quite sure but this is how I think you would do it…” I’ve created a frame of reference that I’m not familiar with the specific scenario and that I’m drawing from inferred knowledge to complete the task. Or I could just answer “I don’t know” and be done with it. ChatGPT doesn’t understand “I don’t know” and will respond with answers that look right according to the model but may not be correct.

Tom’s Take

What’s funny is that ChatGPT has managed to create an approximation of another human behavior. For anyone that has ever worked in sales you know one of the maxims is “never tell the customer ‘no'”. In a way, ChatGPT is like a salesperson. No matter what you ask it the answer is always yes, even if it has to make something up to answer the question. Sci-fi fans know that in fiction we’ve built guardrails for robots to save our society from being harmed by functions. AI, no matter how advanced, needs protections from approximating bad behaviors. It’s time for ChatGPT and future LLMs to learn that they don’t know everything.

Data Is The New Solar Energy

Posted on June 25, 2020 by networkingnerd

You’ve probably been hearing a lot about analytics and artificial intelligence in the past couple of years. Every software platform under the sun is looking to increase their visibility into the way that networks and systems behave. They can then take that data and plug it into a model and make recommendations about the way things need to be configured or designed.

Using analytics to aid troubleshooting is nothing new. We used to be able to tell when hard disks were about to go bad because of SMART reporting. Today we can use predictive analysis to determine when the disk has passed the point of no return and should be replaced well ahead of the actual failure. We can even plug that data into an AI algorithm to determine which drives on which devices need to be examined first based on a chart of performance data.

The power of this kind of data-driven network and systems operation does help our beleaguered IT departments feel as though they have a handle on things. And the power that data can offer to us has it being tracked like a precious natural resource. More than a few times I’ve heard data referred to as “the new oil”. I’d like to turn that on its head though. Data isn’t oil. It’s solar energy.

As Sure As The Sun Comes Up

Oil is created over millions of years. It’s a natural process of layering organic materials with pressure and time to create a new output. Sounds an awful lot like data, right? We create data through the interactions we have with systems. Now, let me ask you the standard Zen kōan, “If a tree falls in the forest and no one is there to hear it, does it make a sound?” More appropriate for this conversation, “If two systems exist without user interaction, do they create data?”

The fact is in today’s technology-driven world that systems are creating data whether we want them to or not. There is output no matter what happens with our interactions. That makes the data ever-present. Like our glorious stellar neighbor. The sun is going to shine no matter what we do. Our planet is going to be bathed in energy no matter if we log on to our email client today or decide to go fishing. Data is going to be generated. What we choose to do with that data determines how we can utilize it.

In order to use oil, it must be processed and refined. It also must be found, drilled out of the ground, and transported to stations where it can be converted to different products. That’s a fairly common way to look at the process of turning data into the more valuable information product we need to make decisions. But in the world of ever-present data do we really need to go looking for it? Honestly, all you need to do is look around and you’ll see it! Kind of like going outside and looking up to find the sun shining down on us.

Let’s get to the processing part. Both forms of energy must be harnessed and concentrated to be useful. Oil requires refineries. Solar power requires the use of plants to consolidate and refine the collected energy from solar panels that generate electricity or heat energy that is converted into steam-powered electricity. In both cases there is infrastructure needed to convert the rew data to information. The key is how we do it.

Our existing infrastructure is based on the petroleum economy of refinement. Our standard consumers of oil are things like cars and trucks and other oil-powered fuel consumers. But the world is changing. Electrically powered vehicles and other devices don’t need the stopgap of oil or petroleum to consume energy. They can get it directly from the electrical grid that can be fed by solar energy. As we’ve adapted our consumption models of energy, we have found better, cleaner, more efficient ways to feed it with less infrastructure. Kind of like how we’ve finally dumped clunky methods for data collection like SNMP or WMI in favor of things like telemetry and open standard models that give us more info in better fashion than traps or alerts. Even the way we handle syslog data today is leaps and bounds better than it used to be.

Lastly, the benefits to standardization on this kind of collection are legion. With solar, the sun isn’t going away for a few billion more years. It’s going to stick around and continue to cause amazing sunrises and give me sunburns for the rest of my days. We’re not going to use up the energy output of the sun even if we tried. Oil has a limited shelf life at best. If we triple the amount of oil we use for the next 30 years we are going to run out completely until more can be made over the next few million years. If we increase the amount of solar energy we use by 10x over the next hundred years we won’t even put a dent in the output of the sun that we received in a minute during that time.

Likewise, with data, moving away from the old methods of collection and reporting mean we can standardize on new systems that give us better capabilities and don’t require us to maintain old standards forever. Anyone that’s ever tried to add a new entry to an archaic old MIB database will know why we need to get more modern. And if that means cleaner data collection all around then so be it.

Tom’s Take

Generally, I despise the allusions to data being some other kind of resource. They’re designed to capture the attention of senior executives that can’t imagine anything that isn’t expensive or in a TV series like Dallas. Instead, we need to help everyone understand the merits of why these kinds of transitions and shifts matter. It’s also important to help executives understand that data needs time and effort to be effective. We can’t just pick up data and shove it into the computers to get results any more than we can shove raw crude oil into a car and expect it to run. Given today’s environmental climate though, I think we need to start relating data to newer, better forms of energy. Just sit back and enjoy the sunshine.

The Cargo Cult of Google Tools

Posted on August 17, 2018 by networkingnerd

You should definitely watch this amazing video from Ben Sigelman of LightStep that was recorded at Cloud Field Day 4. The good stuff comes right up front.

In less than five minutes, he takes apart crazy notions that we have in the world today. I like the observation that you can’t build a system more than three or four orders of magnitude. Yes, you really shouldn’t be using Hadoop for simple things. And Machine Learning is not a magic wand that fixes every problem.

However, my favorite thing was the quick mention of how emulating Google for the sake of using their tools for every solution is folly. Ben should know, because he is an ex-Googler. I think I can sum up this entire discussion in less than a minute of his talk here:

Google’s solutions were built for scale that basically doesn’t exist outside of a maybe a handful of companies with a trillion dollar valuation. It’s foolish to assume that their solutions are better. They’re just more scalable. But they are actually very feature-poor. There’s a tradeoff there. We should not be imitating what Google did without thinking about why they did it. Sometimes the “whys” will apply to us, sometimes they won’t.

Gee, where have I heard something like this before? Oh yeah. How about this post. Or maybe this one on OCP. If I had a microphone I would have handed it to Ben so he could drop it.

Building a Laser Moustrap

We’ve reached the point in networking and other IT disciplines where we have built cargo cults around Facebook and Google. We practically worship every tool they release into the wild and try to emulate that style in our own networks. And it’s not just the tools we use, either. We also keep trying to emulate the service provider style of Facebook and Google where they treated their primary users and consumers of services like your ISP treats you. That architectural style is being lauded by so many analysts and forward-thinking firms that you’re probably sick of hearing about it.

Guess what? You are not Google. Or Facebook. Or LinkedIn. You are not solving massive problems at the scale that they are solving them. Your 50-person office does not need Cassandra or Hadoop or TensorFlow. Why?

Google Has Massive Scale – Ben mentioned it in the video above. The published scale of Google is massive, and even it’s on the low side of the number. The real numbers could even be an order of magnitude higher than what we realize. When you have to start quoting throughput numbers in “Library of Congress” numbers to make sense to normal people, you’re in a class by yourself.
Google Builds Solutions For Their Problems – It’s all well and good that Google has built a ton of tools to solve their issues. It’s even nice of them to have shared those tools with the community through open source. But realistically speaking, when are you really going to use Cassandra to solve all but the most complicated and complex database issues? It’s like a guy that goes out to buy a pneumatic impact wrench to fix the training wheels on his daughter’s bike. Sure, it will get the job done. But it’s going to be way overpowered and cause more problems than it solves.
Google’s Tools Don’t Solve Your Problems – This is the crux of Ben’s argument above. Google’s tools aren’t designed to solve a small flow issue in an SME network. They’re designed to keep the lights on in an organization that maps the world and provides video content to billions of people. Google tools are purpose built. And they aren’t flexible outside that purpose. They are built to be scalable, not flexible.

Down To Earth

Since Google’s scale numbers are hard to comprehend, let’s look at a better example from days gone by. I’m talking about the Cisco Aironet-to-LWAPP Upgrade Tool:

I used this a lot back in the day to upgrade autonomous APs to LWAPP controller-based APs. It was a very simple tool. It did exactly what it said in the title. And it didn’t do much more than that. You fed it an image and pointed it at an AP and it did the rest. There was some magic on the backend of removing and installing certificates and other necessary things to pave the way for the upgrade, but it was essentially a batch TFTP server.

It was simple. It didn’t check that you had the right image for the AP. It didn’t throw out good error codes when you blew something up. It only ran on a maximum of 5 APs at a time. And you had to close the tool every three or four uses because it had a memory leak! But, it was a still a better choice than trying to upgrade those APs by hand through the CLI.

This tool is over ten years old at this point and is still available for download on Cisco’s site. Why? Because you may still need it. It doesn’t scale to 1,000 APs. It doesn’t give you any other functionality other than upgrading 5 Aironet APs at a time to LWAPP (or CAPWAP) images. That’s it. That’s the purpose of the tool. And it’s still useful.

Tools like this aren’t built to be the ultimate solution to every problem. They don’t try to pack in every possible feature to be a “single pane of glass” problem solver. Instead, they focus on one problem and solve it better than anything else. Now, imagine that tool running at a scale your mind can’t comprehend. And you’ll know now why Google builds their tools the way they do.

Tom’s Take

I have a constant discussion on Twitter about the phrase “begs the question”. Begging the question is a logical fallacy. Almost every time the speaker really means “raises the question”. Likewise, every time you think you need to use a Google tool to solve a problem, you’re almost always wrong. You’re not operating at the scale necessary to need that solution. Instead, the majority of people looking to implement Google solutions in their networks are like people that put chrome everything on a car. They’re looking to show off instead of get things done. It’s time to retire the Google Cargo Cult and instead ask ourselves what problems we’re really trying to solve, as Ben Sigelman mentions above. I think we’ll end up much happier in the long run and find our work lives much less complicated.

The Privacy Pickle

Posted on July 6, 2018 by networkingnerd

I recorded a fantastic episode of The Network Collective last night with some great friends from the industry. The topic was privacy. Originally I thought we were just going to discuss how NAT both was and wasn’t a form of privacy and how EUI-64 addressing wasn’t the end of days for people worried about being tracked. But as the show wore on, I realized a few things about privacy.

Booming In Peace

My mom is a Baby Boomer. We learn about them as a generation based on some of their characteristics, most notably their rejection of the values of their parents. One of things they hold most dear is their privacy. They grew up in a world where they could be private people. They weren’t living in a 1 or 2 room house with multiple siblings. They had the right of privacy. They could have a room all to themselves if they so chose.

Baby Boomers, like my mom, are intensely private adults. They marvel at the idea that targeted advertisements can work for them. When Amazon shows them an ad for something they just searched for they feel like it’s a form of dark magic. They also aren’t trusting of “new” things. I can still remember how shocked my mother was that I would actively get into someone else’s car instead of a taxi. When I explained that Uber and Lyft do a similar job of vetting their drivers it still took some convincing to make her realize that it was safe.

Likewise, the Boomer generation’s personal privacy doesn’t mesh well with today’s technology. While there are always exceptions to every rule, the number of people in their mid-50s and older that use Twitter and Snapchat are far, far less than the number that is the target demographic for each service. I used to wonder if it was because older people didn’t understand the technology. But over time I started to realize that it was more based on the fact that older people just don’t like sharing that kind of information about themselves. They’re not open books. Instead, Baby Boomers take a lot of studying to understand.

Zee Newest

On the opposite side of the spectrum is my son’s generation, Generation Z. GenZ is the opposite of the Boomer generation when it comes to privacy. They have grown up in a world that has never known anything but the ever-present connectivity of the Internet. They don’t understand that people can live a life without being watched by cameras and having everything they do uploaded to YouTube. Their idea of celebrity isn’t just TV and movie stars but also extends to video game streamers on Twitch or Instagram models.

Likewise, this generation is more open about their privacy. They understand that the world is built on data collection. They sign away their information. But they tend to be crafty about it. Rather than acting like previous generations that would fill out every detail of a form this generation only fills out the necessary pieces. And they have been known to put in totally incorrect information for no other reason than to throw people off.

GenZ realizes that the privacy genie is out of the bottle. They have to deal with the world they were born into, just like the Baby Boomers and the other generations that came before them. But the way that they choose to deal with it is not through legislation but instead through self-regulation. They choose what information they expose so as not to create a trail or a profile that big data consuming companies can use to fingerprint them. And in most cases, they don’t even realize they’re doing it! My son is twelve and he instinctively knows that you don’t share everything about yourself everywhere. He knows how to navigate his virtual neighborhood just a sure as I knew how to ride my bike around my physical one back when I was his age.

Tom’s Take

Where does that leave me and my generation? Well, we’re a weird mashup on Generation X and Generation Y/Millenials. We aren’t as private as our parents and we aren’t as open as our children. We’re cynical. We’re rebelling against what we see as our parent’s generation and their complete privacy. Likewise, just like our parents, we are almost aghast at the idea that our children could be so open. We’re coming to live in a world where Big Data is learning everything about us. And our children are growing up in that world. Their children, the generation after GenZ, will only know a world where everyone knows everything already. Will it be like Minority Report, where advertising works with retinal patterns? Or will it be a generation where we know everything but really know nothing because no one tells the whole truth about who they are?

Data Is Not The New Oil, It’s Nuclear Power

Posted on December 20, 2017 by networkingnerd

Big Data. I believe that one phrase could get millions in venture capital funding. I don’t even have to put a product with it. Just say it. And make no mistake about it: the rest of the world thinks so too. Data is “the new oil”. At least, according to some pundits. It’s a great headline making analogy that describes how data is driving business and controlling it can lead to an empire. But, data isn’t really oil. It’s nuclear power.

Black Gold, Texas Tea

Crude oil is a popular resource. Prized for a variety of uses, it is traded and sold as a commodity and refined into plastics, gasoline, and other essential items of modern convenience. Oil creates empires and causes global commerce to hinge on every turn of the market. Living in a state that is a big oil producer, the exploration and refining of oil has a big impact.

However, when compared to Big Data, oil isn’t the right metaphor. Much like oil, data needs to be refined before use. But oil can be refined into many different distinct things. Data can only be turned into information. Oil burns up when consumed. Aside from some smoke and a small amount of residuals oil all but disappears after it expends the energy trapped within. Data doesn’t disappear after being turned into information.

In fact, the biggest issue that I have with the entire “Data as Oil” argument is that oil doesn’t stick around. We don’t see massive pools of oil on the side of the road from spills. We don’t hear about our massive issues with oil disposal or securing our spent oil to prevent theft. People that treat data like oil are only looking at the refined product as the final form. They tend to forget that the raw form of data sticks around after the transformation. Most people will tell you that’s a good thing because you can run analytics and machine learning against static datasets and continue to derive value from it. But doesn’t make me think of oil at all.

Welcome To The Nuclear Age

In fact, Big Data reminds me most of a nuclear power plant. Much like oil, the initial form of radioactive material isn’t very useful. It radiates and creates a small amount of heat but not enough to run the steam generators in a power plant. Instead, you must bombard the uranium 235 pellets with neutrons to start a fission reaction. Once you have a sustained controllable reaction the amount of generated heat rises and creates the resource you need to power the rest of your machinery.

Much like data, nuclear fission reactions don’t do much without the proper infrastructure to harness them. Even after you transform you data into information you need to parse, categorize, and analyze it. The byproduct of the transformation is the critical part of the whole process.

Much like nuclear fuel rods, data stays in place for years. It continues to produce the resource after being modified and transformed. It sits around in the hopes that it can be useful until the day that it no longer serves its purpose. Data that is past the useful shelf life goes into a data warehouse when it will eventually be forgotten. Spent nuclear fuel rods are also eventually removed and placed somewhere where they can’t affect other things. Maybe it’s buried deep underground. Or shot into space. Or placed under enough concrete that they will never be found again.

The danger in data and in nuclear power is not what happens when everything goes right. Instead, it’s what happens when everything goes wrong. With nuclear power, wrong is a chain reaction meltdown. Or wrong could be improper disposal of waste. It could be a disaster at the plant or even a theft of fissile nuclear material from the plant. The fuel rods themselves are simultaneously our source of power and the source of our potential disaster.

Likewise, data that is just sitting around and stored improperly can lead to huge disasters. We’re only four years removed from Target’s huge data breach. And how many more are waiting out there to happen? It seems the time between data leaks is shrinking as more and more bad actors are finding ways to steal, manipulate, and appropriate data for their own ends. And, much like nuclear fuel rods, the methods of protecting the data are few compared to other fuel sources.

Data isn’t something that can easily be hidden or compacted. It needs to be readable to be useful. It needs to be fast to be useful. All the things that make it easy to use also make it easy to exploit. And once we’re done with it, the way that it is stored in perpetuity only increase the likelihood of it being used improperly. Unless we’re willing to bury it under metaphorical concrete we’re in for a bad future if we forget how to handle spent data.

Tom’s Take

Data as Oil is a stupid metaphor. It’s meant to impress upon finance CEOs and Wall Street wonks how important it is for data to be taken seriously. Data as Oil is something a data scientist would say to get a job. By drawing a bad comparison, you make data seem like a commodity to be traded and used as collateral for empire building. It’s not. It’s a ticking bomb of disastrous proportions when not handled correctly. Rather than coming up with a pithy metaphor for cable news consumption and page views, let’s treat data with the respect it deserves and make sure we plan for how we’re going to deal with something that won’t burn up into smoke whenever it’s convenient.

Devaluing Data Exposures

Posted on October 27, 2017 by networkingnerd

I had a great time this week recording the first episode of a new series with my co-worker Rich Stroffolino. The Gestalt IT Rundown is hopefully the start of some fun news stories with a hint of snark and humor thrown in.

One of the things I discussed in this episode was my belief that no data is truly secure any more. Thanks to recent attacks like WannaCry and Bad Rabbit and the rise of other state-sponsored hacking and malware attacks, I’m totally behind the idea that soon everyone will know everything about me and there’s nothing that anyone can do about it.

Just Pick Up The Phone

Personal data is important. Some pieces of personal data are sacrificed for the greater good. Anyone who is in IT or works in an area where they deal with spam emails and robocalls has probably paused for a moment before putting contact information down on a form. I have an old Hotmail address I use to catch spam if I’m relative certain that something looks shady. I give out my home phone number freely because I never answer it. These pieces of personal data have been sacrificed in order to provide me a modicum of privacy.

But what about other things that we guard jealously? How about our mobile phone number. When I worked for a VAR that was the single most secretive piece of information I owned. No one, aside from my coworkers, had my mobile number. In part, it’s because I wanted to make sure that it got used properly. But also because I knew that as soon as one person at the customer site had it, soon everyone would. I would be spending my time answering phone calls instead of working on tickets.

That’s the world we live in today. So many pieces of information about us are being stored. Our Social Security Number, which has truthfully been misappropriated as an identification number. US Driver’s Licenses, which are also used as identification. Passport numbers, credit ratings, mother’s maiden name (which is very handy for opening accounts in your name). The list could be a blog post in and of itself. But why is all of this data being stored?

Data Is The New Oil

The first time I heard someone in a keynote use the phrase “big data is the new oil”, I almost puked. Not because it’s a platitude the underscores the value of data. I lost it because I know what people do with vital resources like oil, gold, and diamonds. They horde them. Stockpiling the resources until they can be refined. Until every ounce of value can be extracted. Then the shell is discarded until it becomes a hazard.

Don’t believe me? I live in a state that is legally required to run radio and television advertisements telling children not to play around old oilfield equipment that hasn’t been operational in decades. It’s cheaper for them to buy commercials than it is to clean up their mess. And that precious resource? It’s old news. Companies that extract resources just move on to the next easy source instead of cleaning up their leftovers.

Why does that matter to you? Think about all the pieces of data that are stored somewhere that could possibly leak out about you. Phone numbers, date of birth, names of children or spouses. And those are the easy ones. Imagine how many places your SSN is currently stored. Now, imagine half of those companies go out of business in the next three years. What happens to your data then? You can better believe that it’s not going to get destroyed or encrypted in such a way as to prevent exposure. It’s going to lie fallow on some forgotten server until someone finds it and plunders it. Your only real hope is that it was being stored on a cloud provider that destroys the storage buckets after the bill isn’t paid for six months.

Devaluing Data

How do we fix all this? Can this be fixed? Well, it might be able to be done, but it’s not going to be fun, cheap, or easy. It all starts by making discrete data less valuable. An SSN is worthless without a name attached to it, for instance. If all I have are 9 random numbers with no context I can’t tell what they’re supposed to be. The value only comes when those 9 numbers can be matched to a name.

We’ve got to stop using SSN as a unique identifier for a person. It was never designed for that purpose. In fact, storing SSN as all is a really bad idea. Users should be assigned a new, random ID number when creating an account or filling out a form. SSN shouldn’t be stored unless absolutely necessary. And when it is, it should be treated like a nuclear launch code. It should take special authority to query it, and the database that queries it should be directly attached to anything else.

Critical data should be stored in a vault that can only be accessed in certain ways and never exposed. A prime example is the trusted enclave in an iPhone. This enclave, when used for TouchID or FaceID, stores your fingerprints and your face map. Pretty important stuff, yes? However, even with biometric ID systems become more prevalent there isn’t any way to extract that data from the enclave. It’s stored in such a way that it can only be queried in a specific manner and a result of yes/no returned from the query. If you stole my iPhone tomorrow, there’s no way for you to reconstruct my fingerprints from it. That’s the template we need to use going forward to protect our data.

Tom’s Take

I’m getting tired of being told that my data is being spread to the four winds thanks to it lying around waiting to be used for both legitimate and nefarious purposes. We can’t build fences high enough around critical data to keep it from being broken into. We can’t keep people out, so we need to start making the data less valuable. Instead of keeping it all together where it can be reconstructed into something of immense value, we need to make it hard to get all the pieces together at any one time. That means it’s going to be tough for us to build systems that put it all together too. But wouldn’t you rather spend your time solving a fun problem like that rather than making phone calls telling people your SSN got exposed on the open market?

Don’t Build Big Data With Bad Data

Posted on June 16, 2017 by networkingnerd

I was at Pure Accelerate 2017 this week and I saw some very interesting things around big data and the impact that high speed flash storage is going to have. Storage vendors serving that market are starting to include analytics capabilities on the box in an effort to provide extra value. But what happens when these advances cause issues in the training of algorithms?

Garbage In, Garbage Out

One story that came out of a conversation was about training a system to recognize people. In the process of training the system, the users imported a large number of faces in order to help the system start the process of differentiating individuals. The data set they started with? A collection of male headshots from the Screen Actors Guild. By the time the users caught the mistake, the algorithm had already proven that it had issues telling the difference between test subjects of particular ethnicities. After scrapping the data set and using some different diverse data sources, the system started performing much better.

This started me thinking about the quality of the data that we are importing into machine learning and artificial intelligence systems. The old computer adage of “garbage in, garbage out” is never more apt today than it has been in history. Before, bad inputs caused data to be suspect when extracted. Now, inputting bad data into a system designed to make decisions can have even more far-reaching consequences.

Look at all the systems that we’re programming today to be more AI-like. We’ve got self-driving cars that need massive data inputs to help navigate roads at speed. We have network monitoring systems that take analytics data and use it to predict things like component failures. We even have these systems running the background of popular apps that provide us news and other crucial information.

What if the inputs into the system cause it to become corrupted or somehow compromised? You’ve probably heard the story about how importing UrbanDictionary into Watson caused it to start cursing constantly. These kinds of stories highlight how important the quality of data being used for the basis of AI/ML systems can be.

Think of a future when self-driving cars are being programmed with failsafes to avoid living things in the roadway. Suppose that the car has been programmed to avoid humans and other large animals like horses and cows. But, during the import of the small animal data set, the table for dogs isn’t imported for some reason. Now, what would happen if the car encountered a dog in the road? Would it make the right decision to avoid the animal? Would the outline of the dog trigger a subroutine that helped it make the right decision? Or would the car not be able to tell what a dog was and do something horrible?

Do You See What I See?

After some chatting with my friend Ryan Adzima, he taught me a bit about how facial recognition systems work. I had always assumed that these systems could differentiate on things like colors. So it could tell a blond woman from a brunette, for instance. But Ryan told me that it’s actually very difficult for a system to tell fine colors apart.

Instead, systems try to create contrast in the colors of the picture so that certain features stand out. Those features have a grid overlaid on them and then those grids are compared and contrasted. That’s the fastest way for a system to discern between individuals. It makes sense considering how CPU-bound things are today and the lack of high definition cameras to capture information for the system.

But, we also must realize that we have to improve data collection for our AI/ML systems in order to ensure that the systems are receiving good data to make decisions. We need to build validation models into our systems and checks to make sure the data looks and sounds sane at the point of input. These are the kinds of things that take time and careful consideration when planning to ensure they don’t become a hinderance to the system. If the very safeguards we put in place to keep data correct end up causing problems, we’re going to create a system that falls apart before it can do what it was designed to do.

Tom’s Take

I thought the story about the AI training was a bit humorous, but it does belie a huge issue with computer systems going forward. We need to be absolutely sure of the veracity of our data as we begin using it to train systems to think for themselves. Sure, teaching a Jeopardy-winning system to curse is one thing. But if we teach a system to be racist or murderous because of what information we give it to make decisions, we will have programmed a new life form to exhibit the worst of us instead of the best.

AI, Machine Learning, and The Hitchhiker’s Guide

Posted on August 31, 2016 by networkingnerd

Deep_Thought

I had a great conversation with Ed Horley (@EHorley) and Patrick Hubbard (@FerventGeek) last night around new technologies. We were waxing intellectual about all things related to advances in analytics and intelligence. There’s been more than a few questions here at VMworld 2016 about the roles that machine learning and artificial intelligence will play in the future of IT. But during the conversation with Ed and Patrick, I finally hit on the perfect analogy for machine learning and artificial intelligence (AI). It’s pretty easy to follow along, so don’t panic.

The Answer

Machine learning is an amazing technology. It can extrapolate patterns in large data sets and provide insight from seemingly random things. It can also teach machines to think about problems and find solutions. Rather than go back to the tired Target big data example, I much prefer this example of a computer learning to play Super Mario World:

You can see how the algorithms learn how to play the game and find newer, better paths throughout the level. One of the things that’s always struck me about the computer’s decision skills is how early it learned that spin jumps provide more benefit than regular jumps for a given input. You can see the point in the video when this is figured out by the system, whereafter all jumps become spinning for maximum effect.

Machine learning appears to be insightful and amazing. But the weakness of machine learning is be exemplified by Deep Thought from The Hitchhiker’s Guide to the Galaxy. Deep Thought was created to find the answer to the ultimate question of life, the universe, and everything. It was programmed with an enormous dataset – literally every piece of knowledge in the known universe. After seven million years, it finally produces The Answer (42, if you’re curious). Which leads to the plot of the book and other hijinks.

Machine learning is capable of great leaps of logic, but it operates on a fundamental truth: all inputs are a bounded data set. Whether you are performing a simple test on a small data set or combing all the information in the universe for answers you are still operating on finite information. Machine learning can do things very fast and find pieces of data that are not immediately obvious. But it can’t operate outside the bounds of the data set without additional input. Even the largest analytics clusters won’t produce additional output without more data being ingested. Machine learning is capable of doing amazing things. But it won’t ever create new information outside of what it is operating on.

The Question

Artificial Intelligence (AI), on the other hand, is more like the question in The Hitchhiker’s Guide. Deep Thought admonishes the users of the system that rather than looking for the answer to Life, The Universe, and Everything, they should have been looking for The Question instead. That involves creating a completely different computer to find the Question that matches the Answer that Deep Thought has already provided.

AI can provide insight above and beyond a given data set input. It can provide context where none exists. It can make leaps of logic similarly to those that humans are capable of doing. AI doesn’t simply stop when it faces an incomplete data set. Even though we are seeing AI in infancy today, the most advanced systems are capable of “filling in the blanks” to cover missing information. As the algorithms learn more and more how to extrapolate they’ll become better at making incomplete decisions.

The reason why computers are so good at making quick decisions is because they don’t operate outside the bounds of the possible. If the entire universe for a decision is a data set, they won’t try to look around that. That ability to look beyond and try to create new data where none exists is the hallmark of intelligence. Using tools to create is a uniquely biologic function. Computers can create subsets of data with tools but they can’t do a complete transformation.

AI is pushing those boundaries. Given enough time and the proper input, AI can make the leaps outside of bounds to come up with new ideas. Today it’s all about context. Tomorrow may find AI providing true creativity. AI will eventually pass a Turing Test because it can look outside the script and provide the pseudorandom type of conversation that people are capable of.

Tom’s Take

Computers are smart. They think faster than we do. They can do math better than we can. They can produce results in a fraction of the time at a scale that boggles the mind. But machine learning is still working from a known set. No matter how much work we pour into that aspect of things, we are still going to hit a limit of the ability of the system to break out of it’s bounds.

True AI will behave like we do. It will look for answers when their are none. It will imagine when confronted with the impossible. It will learn from mistakes and create solutions to them. That’s the power of intelligence versus learning. Even the most power computer in the universe couldn’t break out of its programming. It needed something more to question the answers it created. Just like us, it needed to sit back and let its mind wander.

Networking Needs Information, Not Data

Posted on August 10, 2016 by networkingnerd

GameAfoot

Networking Field Day 12 starts today. There are a lot of great presenters lined up. As I talk to more and more networking companies, it’s becoming obvious that simply moving packets is not the way to go now. Instead, the real sizzle is in telling you all about those packets instead. Not packet inspection but analytics.

Tell Me More, Tell Me More

Ask any networking professional and they’ll tell you that the systems they manage have a wealth of information. SNMP can give you monitoring data for a set of points defined in database files. Other protocols like NetFlow or sFlow can give you more granular data about a particular packet group of data flow in your network. Even more advanced projects like Intel’s Snap are building on the idea of using telemetry to collect disparate data sources and build collection methodologies to do something with them.

The concern that becomes quickly apparent is the overwhelming amount of data being received from all these sources. It reminds me a bit of this scene:

How can you drink from this firehose? Maybe you should be asking if you should instead?

Order From Chaos

Data is useless. We need to perform analysis on it to get information. That’s where a new wave of companies is coming into the networking market. They are building on the frameworks and systems that are aggregating data and presenting it in a way that makes it useful information. Instead of random data points about NetFlow, these solutions tell you that you’ve got a huge problem with outbound traffic of a specific type that is sent at a specific time with a specific payload. The difference is that instead of sorting through data to make sense of it, you’ve got a tool delivering the analysis instead of the raw data.

Sometimes it’s as simple as color-coding lines of Wireshark captures. Resets are bad, so they show up red. Properly torn down connections are good so they are green. You can instantly figure out how good things are going by looking for the colors. That’s analysis from raw data. The real trick in modern networking monitoring is to find a way to analyze and provide context for massive amounts of data that may not have an immediate correlation.

Networking professionals are smart people. They can intuit a lot of potential issues from a given data set. They can make the logical leap to a specific issue given time. What reduces that ability is the sheer amount of things that can go wrong with a particular system and the speed at which those problems must be fixed, especially at scale. A hiccup on one end of the network can be catastrophic on the others if allowed to persist.

Analytics can give us the context we need. It can provide confidence levels for common problems. It can ensure that symptoms are indeed happening above a given baseline or threshold. It can help us narrow the symptoms and potential issues before we even look at the data. Analytics can exclude the impossible while highlighting the more probably causes and outcomes. Analytics can give us peace of mind.

Tom’s Take

Analytics isn’t doing our job for us. Instead, it’s giving us the ability to concentrate. Anyone that spends their time sifting through data to try and find patterns is losing the signal in the noise. Patterns are things that software can find easily. We need to leverage the work being put into network analytics systems to help us track down the issues before they blow up into full problems. We need to apply the thing that makes network professionals the best suited to look at the best information we can gather about a situation. Our concentration on what matters is where our job will be in five years. Let’s take the knowledge we have and apply it.

The Networking Nerd

Networking With A Side of Snark

Category Archives: Big data

The Dangers of Knowing Everything

Unknown Unknowns

Spinning AI Yarns

Tom’s Take

Data Is The New Solar Energy

As Sure As The Sun Comes Up

Tom’s Take

The Cargo Cult of Google Tools

Building a Laser Moustrap

Down To Earth

Tom’s Take

The Privacy Pickle

Booming In Peace

Zee Newest

Tom’s Take

Data Is Not The New Oil, It’s Nuclear Power

Black Gold, Texas Tea

Welcome To The Nuclear Age

Tom’s Take

Devaluing Data Exposures

Just Pick Up The Phone

Data Is The New Oil

Devaluing Data

Tom’s Take

Don’t Build Big Data With Bad Data

Garbage In, Garbage Out

Do You See What I See?

Tom’s Take

AI, Machine Learning, and The Hitchhiker’s Guide

The Answer

The Question

Tom’s Take

Networking Needs Information, Not Data

Tell Me More, Tell Me More

Order From Chaos

Tom’s Take

Can AI Really Work for Enterprises?

No Free Lunchboxes

Tom’s Take

Share this:

Unknown Unknowns

Spinning AI Yarns

Tom’s Take

Share this:

As Sure As The Sun Comes Up

Tom’s Take

Share this:

Building a Laser Moustrap

Down To Earth

Tom’s Take

Share this:

Booming In Peace

Zee Newest

Tom’s Take

Share this:

Black Gold, Texas Tea

Welcome To The Nuclear Age

Tom’s Take

Share this:

Just Pick Up The Phone

Data Is The New Oil

Devaluing Data

Tom’s Take

Share this:

Garbage In, Garbage Out

Do You See What I See?

Tom’s Take

Share this:

The Answer

The Question

Tom’s Take

Share this:

Tell Me More, Tell Me More

Order From Chaos

Tom’s Take

Share this: