The Network Does Too Much

I’m at Networking Field Day this week and it’s good to be back in person around other brilliant engineers and companies. One of the other fun things that happens at Networking Field Day is that I get to chat with folks that help me think about things in new ways and come up with awesome ideas for networking blog posts.

One of the ones that was discussed quickly this week really got me thinking again about fragility and complexity. Thanks to Carl Fugate for reminding me about it. Essentially, networks are inherently unstable because they are doing far too much heavy lifting.

Swiss Army Design

Have you heard about the AxeSaw Reddit? It’s a page dedicated to finding silly tools that attempt to combine too many things together into one package that makes the overall tool less useful. Like making a combination shovel and axe that isn’t easy to operate because you have to hold on to the shovel scoop as the handle for the axe and so on. It’s a goofy take on a trend of trying to make things too compact at the expense of usability.

Networking has this issue as well. I’ve talked about it before here but nothing has really changed in the intervening time since that post five years ago. The developers that build applications and systems still rely on the network to help solve challenges that shouldn’t be solved in the network. Things like first hop redundancy protocols (FHRP) are perfect examples. Back in the dark ages, systems didn’t know how to handle what happened when a gateway disappeared. They knew they needed to find a new one eventually when the MAC address age timed out. However, for applications that couldn’t wait there needed to be a way to pick up the packets and keep them from timing out.

Great idea in theory, right? But what if the application knew how to handle that? Things like Cisco CallManager have been able to designate backup servers for years. Applications built for the cloud know how to fail over properly and work correctly when a peer disappears or a route to a resource fails. What happened? Why did we suddenly move from a model where you have to find a way to plan for failure with stupid switching tricks to a world where software just works as long as Amazon is online?

The answer is that we removed the ability for those stupid tricks to work without paying a high cost. You want to use Layer 2 tricks to fake an FHRP? If it’s even available in the cloud you’re going to be paying a fortune for it. AWS wants you to use the tools that they optimize for. If you want to do things the way you’ve always done them you can, but you need to pay for that privilege.

With cost being the primary driver for all things, increased costs for stupid switching tricks have now given way to better software development. Instead of paying thousands of dollars a month for a layer 2 connection to run something like HSRP you can instead just make the application start searching for a new server when the old one goes away. You can write it to use DNS instead of IP so you can load balance or load share. You can do many, many exciting and wonderful things to provide a better experience that you wouldn’t have even considered before because you just relied on the networking team to keep you from shooting yourself in the foot.
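As a rough illustration of what that looks like, here is a minimal Python sketch of a client that looks a service up by name and walks the list of published addresses until one answers. The function names resolve and connect_with_failover, the two-second timeout, and the injectable connect parameter are my own assumptions for the example, not anything from a particular product.

```python
import socket
from typing import Callable, Iterable, List, Tuple

Endpoint = Tuple[str, int]


def resolve(name: str, port: int) -> List[Endpoint]:
    """Return every address DNS currently publishes for a service name."""
    infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    endpoints: List[Endpoint] = []
    for *_, sockaddr in infos:
        # sockaddr starts with (host, port) for both IPv4 and IPv6 results.
        endpoint = (sockaddr[0], sockaddr[1])
        if endpoint not in endpoints:
            endpoints.append(endpoint)
    return endpoints


def connect_with_failover(
    endpoints: Iterable[Endpoint],
    connect: Callable = socket.create_connection,
    timeout: float = 2.0,  # hypothetical value, tune for your application
):
    """Try each endpoint in order and return the first connection that works."""
    last_error = None
    for endpoint in endpoints:
        try:
            return connect(endpoint, timeout)
        except OSError as exc:
            # This server is gone or unreachable, so move on to the next one
            # instead of waiting for the network to hide the failure.
            last_error = exc
    raise ConnectionError("no endpoint reachable") from last_error
```

Something like connect_with_failover(resolve("app.example.com", 443)) would try each address in turn, where the hostname is only a placeholder. The point is that the decision about where to go next lives in the application and not in a virtual gateway address.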

Network Complexity Crunch

If the cloud forces people to use their solutions for reliability and such, that means the network is going to go away, right? It’s the apocalypse for engineer jobs. We’re all going to get replaced by a DevOps script. And the hyperbole continues on.

Reality is that networking engineers will still be as needed as highway engineers are even though cars are a hundred times safer now than in the middle of the 20th century. Just because you’ve accommodated for something that you used to be forced to do doesn’t mean you don’t need to build the pathway. You still need roads and networks to connect things that want to communicate.

What it means for the engineering team is an increased focus on providing optimal reliable communications. If we remove the need to deal with crazy ARP tricks and things like that we can focus on optimizing routing to provide multiple paths and ensure systems have better communications. We could even do crazy things like remove our reliance on legacy IP because the applications will survive a transition when they aren’t beholden to an IP address or ARP to prevent failure.

Networking will become a utility. Like electricity or natural gas it won’t be visible unless it’s missing. Likewise, you don’t worry about the utility company solving issues about delivery to your home or business. You don’t ask them to provide backup pipelines or creative hacks to make it work. You are handed a pipeline that has bandwidth and the service is delivered. You don’t feel like the utility companies are outdated or useless because you’re getting what you pay for. And you don’t have to call them every time the heater doesn’t work or you flip a breaker. Because that infrastructure is on your side instead of theirs.


Tom’s Take

I’m ready for a brave new world where the network is featureless and boring. It’s an open highway with no airbags along the shoulder to prevent you from flying off the road. No drones designed to automatically pick you up and put you on the correct path to your destination if you missed your exit. The network is the pathway, not the system that provides all the connection. You need to rely on your systems and your applications to do the heavy lifting. Because we’ve finally found a solution that doesn’t allow the networking team to save the day, we can absolutely build a world where the clients are responsible for their own behavior. The network needs to do what it was designed to do and no more. That’s how you solve complexity and fragility. Fewer features means less to break.

Holiday Networking Thoughts from 2021

It’s the Christmas break for 2021, which means lots of time spent doing very little work-related stuff. I’m currently putting together a Lego set, playing Metroid Dread and working on beating Ocarina of Time again.

As I waited for updates to download on Christmas morning I remembered how many packets must be flying across the wire to update software and operating systems for consoles. Even having done a few of the updates the night before I could see the traffic to those servers started to get a bit congested. It’s like Black Friday but for the latest patches to keep your games running. Add in the content that needs to be installed now in order to make that game disc work, or the download-only consoles for sale, and you can see that network engineers aren’t going to be a dying profession any time soon.

I’m a bit jaded because I come from a time when you didn’t need to be constantly connected to use software or need to download an update every few days. Heck, some of the bugs in Ocarina of Time have been there for over twenty years because those cartridges are not designed to be patched, having been created at a time when you could barely get online with a modem, let alone wirelessly connect a console.

I also am happy that upgrading devices in the house means fewer and fewer older units performing poorly on the wireless network. As more devices require me to connect them to the network for updates or app connectivity, I’m reminded that things like the Xbox 360 need low data rates enabled to work properly and that makes me sad. But I also can’t turn them off for fear that nothing will work and my children will scream. I don’t think spending a ton of money to get rid of an 802.11b client is really that big of a deal but I’m happy to see them go when I get the chance.

Likewise, I’m going to need to upgrade my APs a bit now that I have clients that can actually use 802.11ax (Wi-Fi 6). Even the older clients will get a performance boost. So it’s a matter of catching a good AP on sale and getting it done. Since I don’t use big box APs I just have to look a bit harder.


Tom’s Take

Make sure you give a shoutout to your friendly neighborhood network engineer for all their hard work making sure the services we’re currently consuming stayed up while the skeleton crew was carrying the pager this weekend. We’ve seen a lot of services crash on Christmas morning in recent years because of unexpected load. Also, give yourselves a hand for keeping your own network up long enough to download the latest DLC for a game or ensure that your new smart appliance can talk to the fancy app you need to use to control it. Let’s make it through the rest of the year with the change freeze intact and start 2022 off on the right foot with no outages.

Networking Isn’t Just A Tool

It’s another event week for me at Networking Field Day 25 and I’m continually impressed with the level of technology that we see in the networking world. I think back to how things looked when I was still deploying the networks I built and it seems like a hundred years ago instead of a decade. More software driving better outcomes for users. Easier collection of analytics and telemetry to understand how to tune things and make them faster and better. And, honestly, more need for advanced technical people to tune everything and make it work better.

When you consider that the last year has been done over the Internet for most of us it gets even crazier. Meetings, software productivity, and even food delivery have been driven by apps running in the cloud that we communicate with over the Internet. I can remember a time when I didn’t have a mobile phone in my pocket with Internet capabilities. Today I can barely imagine not having it at my fingertips. When the network is not doing things the way we want we quickly find out how dependent we’ve become on our connectivity.

Generational Differences

My children are amazed that dial-up networking used to be a thing. I remember rebuilding Winsock stacks in Windows 98 for Gateway in order to troubleshoot 56k modems not connecting to AOL at the beginning of the millennium. Today my cable modem gets me where I need to go when I’m home and my 5G phone does the heavy lifting for me when I’m on the road. Need to look up a price on something? Or know the temperature? Or just listen to a song you remember from your childhood? It’s all at your fingertips. I can’t imagine that kind of connectivity back when it took a minute or two for the phone to scream out the song of CompuServe at my house.

My in-laws have a DSL connection that suits them just fine for their needs. It’s painfully slow for me. I couldn’t live with their slow connection and inability to run multiple things at once, like video streaming and Zoom meetings at the same time. They don’t live very far from me either. The difference in their connectivity is shocking. And yet, we just expect to be able to get online any time we want.

Remember ATTWIFI at Starbucks? Remember when your iPhone would automatically connect to give you better speeds when your 3G was overwhelmed? I can recall getting into a situation where Cisco Live in Las Vegas made my phone unusable outside of the conference. Today that situation would be unacceptable. And we’re barely a decade removed from those days.

As I keep seeing technology moving along even faster, including things like silicon photonics promising speeds north of hundreds of gigabits on the uplink side, I wonder how our next generation is going to feel about not being able to watch 8k TV shows in a self-driving car on-demand because there’s not enough bandwidth. I laugh when I remember the need to swap out DVDs on car trips so my eldest son could have entertainment. Today my youngest is happy to binge watch shows on Disney+ without interruption because of the networks we’ve built.

Creating Dependence

What we’ve built has created the world we live in. But we also have made it a world dependent on what we’ve built. I realized that months ago when my network connection kept going out during a winter storm. Without connectivity people feel lost. I had a hard time getting things done offline without being able to look up information or get emails sent out. My kids are beside themselves without access to anything online. Their board games were boring. They couldn’t play video games offline because all the cool features were on the Internet. By the time the connection came back it was almost Lord of the Flies around here as the minutes ticked on.

We no longer have the luxury of shrugging our shoulders when the network goes down. It needs to be treated no differently than the electricity or water in a building. If we neglect it we risk alienating our users and stakeholders. We need to be firm when we need new equipment or better designs to ensure resilience. Instead of making everything cheap and barely usable we need to remind everyone how reliant they’ve become on the network. If it’s necessary it is absolutely worth investing in. Moving to the cloud or becoming more and more reliant on SaaS applications just reinforces those decisions.


Tom’s Take

Either the network is just a tool that doesn’t need investment or it’s a necessary part of your work that needs to be treated as such. While I would never suggest unplugging anything to prove a point I think you can point to specific outages that would do the same thing without the chaos. Every time you tell your stakeholders they need to invest in better switches or new access points and they push back about costs or try to suggest a cheaper alternative, you need to stand firm. In a world where everyone is dependent on Internet connectivity for all manner of their lives you have to treat it as a necessity in every possible way. You can tell your stakeholders to spend their day working from their phone hotspot if they don’t believe you. It’ll be like taking a trip back to the early parts of the millennium when networks weren’t as important.

Who Pays The Price of Redundancy?

No doubt by now you’ve seen the big fire that took out a portion of the OVHcloud data center earlier this week. These kinds of things are difficult to deal with on a good day. This is why data centers have redundant power feeds, fire suppression systems, and the ability to get back up to full capacity. Modern data centers are getting very good at ensuring they can stay up through most events that could impact an on-premises private data center.

One of the issues I saw that was ancillary to the OVHcloud outage was the small group of people that were frustrated that their systems went down when the fire knocked out the racks where their instances lived. More than a couple of comments mentioned that clouds should not go down like this or asked about credit for time spent being offline or some form of complaints about unavailability. By and large, most of those complaining were running non-critical systems or were using the cheapest possible instances for their hosts.

Aside from the myopia that “cloud shouldn’t go down”, how do we deal with this idea that cloud redundancy doesn’t always translate to single instance availability? I think we need to step back and educate people about their responsibilities to their own systems and who ultimately pays for redundancy.

Backup Plans

I mentioned earlier that big accidents or incidents are the reasons why public cloud data centers have redundant systems. They have separate power feeds, generator power backups, extra cabling to prevent cuts from taking down systems, redundant cooling systems to keep things cold, and even redundant network connectivity across a variety of providers.

What is the purpose of all of this redundancy? Is it for the customers or the provider? The reason why all this exists is because the provider needs to ensure that it will take a massive issue to interrupt service to the customer. In a private data center you can be knocked offline when the primary link goes down or a UPS decides to die on you. In a public data center you could knock out ten customers with a dead rack PDU. So it is in their best interests to ensure that they have the right redundancy built in to keep their customers happy.

Public data centers pay for all of this redundancy to keep their customers coming back each month. If your data center gets knocked offline for some simple issue you can better believe you’re going to be shopping for a new partner. You’re paying for their redundancy with your monthly billing cycle. Sure, that massive generator may be sitting there just in case they need it. But they’re recouping the cost of having it around by charging you a few extra cents each cycle.

Extending the Backup Bubble

What about providing redundancy for your applications and servers, though? Like the OVHcloud issue above, why doesn’t the provider just back my stuff up or move it to a different server when everything goes down? I mean, vMotion and DRS do it in my private data center. Can’t they just check a box and make it happen?

There are two main reasons why this doesn’t happen in public cloud right now. The first is pretty easy. Having to back up, restore, replicate, and manage customer availability is going to take more than a few extra hands working on the customer infrastructure. Sure, they could configure vMotion (or something similar) to send your VMs to a different rack if one were to go offline. But who keeps tabs on that to make sure it happens? Who tests the failover to keep it consistent? What happens if there is a split brain scenario? Where is the password for that server stored?

You’re probably answering all of these questions off the top of your head because you’re the IT expert, right? So if the cloud provider is doing this for you, what are you going to be doing? Clicking “next” on the installation prompt? If your tasks are being done by some other engineer in the cloud, what are we paying you for again? Just like automation, having a cloud engineer do your job for you means we don’t need to pay you any longer.

The second reason is liability. Right now, if there is an incident that knocks a cloud provider offline they’re liable for the downtime for all their customers. Most of the contracts have a force majeure clause built into them that exempts liability for extraordinary circumstances, such as fire, weather, or even terrorist activity. That way the provider doesn’t need to pay you back for something there was no way to have foreseen. But if there is some kind of outage caused by some technical issue then they will owe you for that one.

However, if the cloud provider starts managing your equipment and services for you then they are liable if there is an outage. If they screw up the vMotion settings or DRS starts getting too aggressive and migrates a VM to a bad host, who is responsible for the downtime? If it’s you then you get yelled at and the company loses money. If it’s the provider managing for you then the provider gets yelled at, threatened, and possibly sued to recover the lost income. See now why no provider wants to touch your stuff? We used to have a rule when I worked at a VAR that you better be very careful about which problems you decided to fix out of the project scope. Because as soon as you touch those they become your problems and you’re on the hook to fix them.

Lastly, the provider isn’t responsible for your redundancy for one other simple reason: you’re not paying them for it. If Amazon or Microsoft or Google offered a hosting package that included server replication and monitoring and 99.9999% uptime of your application data do you think it would cost the same as the basic instance pricing? You’d better believe it wouldn’t! These companies would be happy to sell you just what you’re looking for but you aren’t going to want to pay the price for it. It’s easy to build in the cost of a generator spread across hundreds or thousands of customers. But if you want someone backing your data up every day and validating it you’re going to be paying the lion’s share of the cost. And most of the people using low-cost providers for non-critical workloads aren’t going to want to pay extra anyway.


Tom’s Take

I get it. Cloud is nice and it’s always there. Cloud is easier to build out than your on-prem data center. Cloud makes life easy. However, cloud is not magic. Just because AWS doesn’t have outages every month doesn’t mean your servers won’t go down if you’re not backing them up. Just because you have availability zones you can use doesn’t mean data is magically going to show up in Oregon because you want it to. You’re going to have to pay the cost to make your infrastructure redundant on top of the provider infrastructure. That means money invested in software, time invested in deployment, and workers invested in making sure it all checks out when you need it to be there to catch your outages. Don’t assume the cloud is going to solve every one of your challenges. Because if it did we’d be paying a lot more for it and IT workers wouldn’t be getting paid at all.

Planning For The Worst Case You Can’t Think Of

Remember that Slack outage earlier this month? The one that happened when we all got back from vacation and tried to jump on to share cat memes and emojis? We all chalked it up to gremlins and went on going through our pile of email until it came back up. The post-mortem came out yesterday and there were two things that were interesting to me. Both of them have implications on reliability planning and how we handle the worst-case scenarios we come up with.

It’s Out of Our Hands

The first thing that came up in the report was that the specific cause for the outage came from an AWS Transit Gateway not being able to scale fast enough to handle the demand spike that came when we all went back to work on the morning of January 4th. What, the cloud can’t scale?

The cloud is practically limitless when it comes to resources. We can create instances with massive CPU resources or storage allocations or even networking pipelines. However, we can’t create them instantly. No matter how much we need, it takes time to do the basic provisioning to get it up and running. It’s the old story of eating an elephant. We do it one bite at a time. Normally we tell the story to talk about breaking a task down into smaller parts. In this case, it’s a reminder that even the biggest thing out there has to be dealt with in small pieces as well.

Slack learned this lesson the hard way. Why? Because they couldn’t foresee a time when their service was so popular that the amount of traffic rushing to their servers crushed the transit gateways. Other companies have had to learn this lesson the hard way too. Disney+ crawled on launch day because of demand. The release of a new game that requires downloading and patching on Day One also puts stress on servers. Even the lines outside of department stores on Black Friday (in less pandemic-driven years) are examples of what happens when capacity planning doesn’t meet demand.

When you plan for your worst case scenario, you have to think the unthinkable. Instead of only asking yourself what might happen if everyone logs on at the same time, you also need to ask what happens if they try and something goes wrong. I spent a lot of time in my former job thinking about simple little exercises like VDI boot storms, where office workers can push a storage system to the breaking point by simply turning all their machines on at the same time. It’s the equivalent of being on a shared network resource like a cable modem during the Super Bowl. There aren’t any resources available for you to use.

When we plan for capacity, we have to realize that even our most optimistic projections of usage are going to be conservative if we take off or go viral. Rather than guessing what that might be, take an hour every six months and readjust your projections. See how fast you’re growing. Plan for that crazy scenario where everyone decides to log on at the same time on a day where no one has had their coffee yet. And be ready for what happens when someone throws a wrench into the middle of the process.
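That periodic check can be as simple as a few lines of arithmetic. The sketch below is only an illustration under a loud assumption: that the growth seen between your last two reviews simply repeats at the same compound rate, which is exactly the kind of projection a viral spike will blow through. The function name and the sample numbers are made up for the example.

```python
import math


def periods_until_full(current_peak, previous_peak, capacity):
    """Estimate how many review periods remain before peak demand hits capacity.

    Assumes the growth factor observed between the last two reviews repeats.
    """
    if min(current_peak, previous_peak, capacity) <= 0:
        raise ValueError("all inputs must be positive")
    if current_peak >= capacity:
        return 0.0  # already out of headroom
    growth = current_peak / previous_peak
    if growth <= 1.0:
        return math.inf  # flat or shrinking demand never reaches capacity
    # Solve current_peak * growth**n = capacity for n.
    return math.log(capacity / current_peak) / math.log(growth)


if __name__ == "__main__":
    # Hypothetical numbers: peak doubled since the last review, 4x headroom left.
    print(periods_until_full(current_peak=200, previous_peak=100, capacity=800))
```

If the answer comes back as fewer periods than it takes you to order, rack, or provision more capacity, that is your cue to start now instead of at the next review.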

What Happens When It Goes Wrong?

The second thing that came up in the Slack post-mortem that is just as worrisome was the behavior of the application when it realized there was a connection timeout. The app started waiting for the pathway to be open again. And guess what happened when AWS was able to scale the transit gateway? Slack clients started hammering the servers with connection requests. The result was something akin to a Distributed Denial-of-Service (DDoS) attack.

Why would the Slack client do this? I think it has something to do with the way the developers coded it and didn’t anticipate every client trying to reconnect all at once. It’s not entirely unsound thinking to be honest. How could we live in a world where every Slack user would be disconnected all at once? Yet, we do and it did and look what happened.

Ethernet figured this part out a long time ago. The CSMA/CD method for detecting collisions on a layer 2 connection has an ingenious solution for what happens when a collision is detected. Once it realizes that there was a problem on the wire it stops what is going on and calculates a random backoff timer based on the number of detected collisions. Once that timer has expired it attempts to transmit again. Because there has to be another station involved in a collision incident both stations do this. The random element of the timer calculation ensures that the likelihood of both stations choosing to transmit again at the same time is very, very low.
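For the curious, the timer calculation described above can be sketched in a few lines of Python. This is my own illustration of truncated binary exponential backoff, not code from any driver, and the constants are the classic 10 Mb/s Ethernet values as I remember them: a 51.2 microsecond slot time, a random window capped at 2^10 slots, and a frame abandoned after 16 attempts. The function names are hypothetical.

```python
import random

SLOT_TIME_US = 51.2   # 512 bit times at 10 Mb/s
BACKOFF_LIMIT = 10    # the random window stops doubling after 10 collisions
MAX_ATTEMPTS = 16     # the frame is dropped after this many failed attempts


def backoff_slots(collisions, rng=random):
    """Pick how many slot times to wait after the Nth collision on a frame."""
    if collisions < 1:
        raise ValueError("backoff only applies after at least one collision")
    if collisions >= MAX_ATTEMPTS:
        raise RuntimeError("excessive collisions, giving up on this frame")
    exponent = min(collisions, BACKOFF_LIMIT)
    # Uniform choice from 0 .. 2**exponent - 1, so the window doubles each time.
    return rng.randrange(2 ** exponent)


def backoff_delay_us(collisions, rng=random):
    """The same choice expressed in microseconds."""
    return backoff_slots(collisions, rng) * SLOT_TIME_US
```

The window starts small and doubles with every repeated collision on the same frame, so the more often two stations keep bumping into each other the less likely they are to pick the same slot again.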

If Ethernet behaved like the Slack client did we would never resolve collisions. If every station on a layer 2 network immediately tried to retransmit without a backoff timer the bus would be jammed constantly. The architects of the protocol figured out that every station needs a cool off period to clear the wire before trying again. And it needs to be different for every station so there is no overlap.

Slack really needs to take this idea into account. Rather than pouncing on a connection as soon as it’s available there needs to be a backoff timer that prevents the servers from being swamped. Even a few hundred milliseconds per client could have prevented this outage from getting as big as it did. Slack didn’t plan beyond the worst case scenario because they never conceived of their worst case scenario coming to pass. How could it get worse than something we couldn’t imagine happening?
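To put a rough shape on that, here is a toy Python simulation. It is not Slack’s client logic, just a sketch with names and numbers I picked for illustration (reconnect_delay, peak_arrivals, a half-second base window, a 60 second cap). It compares every client reconnecting the instant the path comes back against each client waiting a random delay first.

```python
import random
from collections import Counter


def reconnect_delay(attempt, base=0.5, cap=60.0, rng=random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The window doubles with each failed attempt up to `cap`, and the delay is
    chosen uniformly inside that window so clients do not move in lockstep.
    """
    window = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, window)


def peak_arrivals(delays, bucket=0.1):
    """Largest number of clients landing in any single `bucket`-second slice."""
    counts = Counter(int(delay / bucket) for delay in delays)
    return max(counts.values())


if __name__ == "__main__":
    rng = random.Random(2021)
    clients = 10_000
    herd = [0.0] * clients  # everyone pounces the moment the path is back
    spread = [reconnect_delay(0, rng=rng) for _ in range(clients)]
    print("no backoff peak:", peak_arrivals(herd))
    print("jittered peak:  ", peak_arrivals(spread))
```

Without the delay, every client lands in the same slice of time. With it, the same number of connections is smeared across the window, and clients that fail again back off over a wider one.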


Tom’s Take

If you design systems or call yourself a reliability engineer, you need to develop a hobby of coming up with disastrous scenarios. Think of the worst possible way for something to fail. Now, imagine it getting worse. Assume that nothing will work properly when there is a recovery attempt. Plan for things to be so bad that you’re in a room on fire trying to put everything out while you’re also on fire. It sounds very dramatic but that’s how bad it can get. If you’re ready for that then nothing will surprise you. You also need to make sure you’re going back and thinking of new things all the time. You never know which piece is going to fail and how it will impact what you’re working on. But thinking through it sometimes gives you an idea of where to go when it all goes wrong.

The Silver (Peak) Lining For HPE and Cloud

You no doubt saw the news this week that HPE announced that they’re buying Silver Peak for just shy of $1 billion. It’s a good exit for Silver Peak and should provide some great benefits for both companies. There was a bit of interesting discussion around where this fits in the bigger picture for HPE, Aruba, and the cloud. I figured I’d throw my hat in the ring and take a turn discussing it.

Counting Your Chickens

First and foremost, let’s discuss where this acquisition is headed. HPE announced it and they’re the ones holding the purse strings. But the acquisition post was courtesy of Keerti Melkote, who runs the Aruba, a Hewlett Packard Enterprise Company (Aruba) side of the house. Why is that? It’s because HPE “reverse acquired” Aruba and sent all their networking expertise and hardware down to the Arubans to get things done.

I would venture to say that Aruba’s acquisition was the best decision HPE could have made. It gave them immediate expertise in an area they sorely needed help. It gave Aruba a platform to build on and innovate from. And it ultimately allowed HPE to shore up their campus networking story while trying to figure out how they wanted to extend into the data center.

Aruba was one of the last major networking players to announce a strategy based on SD-WAN. We’ve seen a lot of major acquisitions on that front, including Cisco buying Viptela, VMware buying VeloCloud, Palo Alto Networks buying CloudGenix, and Oracle buying Talari. That last one is going to be important later in this story. Aruba didn’t go out and buy their own SD-WAN solution. Instead, they developed it in-house leveraging the expertise they had with ClearPass. Instead of calling it SD-WAN and focusing on connecting islands together, they used the term SD-Branch to denote the connectivity was more about the users in the branch and not the office itself.

I know that SD-Branch isn’t a term that’s en vogue with most analysts. But it’s important to realize that Aruba was trying to say more about the users than anything else. Hardware was an afterthought in this solution. It wasn’t about an edge gateway, although Aruba had those. It wasn’t about connectivity, even though Aruba could help on that front too. Instead, it was about pushing policy down to the edge and sorting out connectivity for devices. That’s the focus that Aruba had for many years with their wireless roots. It only made sense to leverage the tools to get where they wanted to be on the SD-Whatever front.

Coming Home To Roost

The world of SD-WAN isn’t the same as the branch any longer, though. Now, SD-WAN drives cloud on-ramp and edge security. Ironically enough, the drive to include Secure Access Service Edge (SASE) in SD-WAN is way more in line with the original concept of SD-Branch as defined by Aruba a couple of years ago. But you have to have a more well-rounded solution that includes cloud.

Why do you need to worry about the cloud? If you still have to ask that question in 2020 you’re not gonna make it in IT much longer. The cloud operationalizes a lot of things in IT that we have just accepted for years. The way that we do workloads and provisioning has changed on-site as well. If you don’t believe me then listen to HPE. Their biggest focus coming out of HPE Discover this year has been highlighting GreenLake, their on-premises IT-as-a-Service offering. It’s designed to give you cloud-like performance in your data center with cloud-like billing options. And, as marketed, the ability to move workloads back and forth between on-site and in-cloud as needed.

Hopefully, you’re starting to see why Silver Peak was such an important pickup for HPE now. SD-Branch is focused on devices but not services. It’s not designed to provide cloud on-ramp. HPE and Aruba need a solution that gives you the ability to accelerate user adoption of software-as-a-service no matter where it lives, be it in GreenLake or AWS or Azure. Essentially, HPE and Aruba needed Talari for their cloud offerings. And that’s what they’re going to get with Silver Peak.

Silver Peak has focused on cloud intelligence to accelerate SaaS for a few years now. They also have a multi-cloud networking solution. Multi-cloud is the way that you get people working between two different clouds, like AWS and GreenLake for example.

When you tie in Silver Peak’s DC-focused SD-WAN solution with Aruba’s existing SD-Branch capabilities, you see how holistic a solution you have now. And because it’s all based on software you don’t have to worry about clunky integration. It can run side-by-side for now and in the next revision of Aruba Central integration with the new Edge Services Platform (ESP), it’s all going to be seamless to use however you want to use it.


Tom’s Take

I think Silver Peak is a good pickup for HPE and Aruba. Just remember that when you hear HPE and networking in the same sentence, you need to think about Aruba. Ultimately, the integration of Silver Peak into Aruba’s SD-Branch solution is going to benefit everyone from users to cloud to software and back again. And it’s going to help position Aruba as a major player in the SD-WAN market. Which is a silver lining on any cloud.

Clear Skies for IBM and Red Hat

There was a lot of buzz this week when IBM announced they were acquiring Red Hat. A lot has been discussed about this in the past five days, including some coverage that I recorded with the Gestalt IT team on Monday. What I wanted to discuss quickly here is the aspirations that IBM now has for the cloud. Or, more appropriately, what they aren’t going to be doing.

Build Your Own Cloud

It’s funny how many cloud providers started springing from the earth as soon as AWS started turning a profit. Microsoft and Google seem to be doing a good job of challenging for the crown. But the next tier down is littered with people trying to make a go of it. VMware with vCloud Air before they sold it. Oracle. DigitalOcean. IBM. And that doesn’t count the number of companies offering a specific function, like storage, and calling themselves a cloud service provider.

IBM was well positioned to be a contender in the cloud service provider (CSP) market. Except they started the race with a huge disadvantage. IBM was a company that was focused on selling solutions to their customers. Just like Oracle, IBM’s primary customer was external. The people they cared most about wrote them checks and they shipped out eServers and OS/2 Warp boxes.

Compare and contrast that with Amazon. Who is Amazon’s customer? Well, anyone that wants to buy something. But who consumes the products that Amazon builds for IT? Amazon people. Their infrastructure is built to provide a better experience for the people using their site. Amazon is very good at building systems that can handle a high number of users and a heavy load and not buckle.

Now, contrary to the myth, Amazon did not start AWS with spare capacity. While that makes for a good folk tale, Amazon probably didn’t have a lot of spare capacity lying around. Instead, they had a lot of great expertise in building large-scale reliable systems and they parlayed that into a solution that could be used to bring even more people into the Amazon ecosystem. They built AWS with an eye toward selling services, not hardware.

Likewise, Microsoft’s biggest customers are their developers. They are focused on making the best applications and operating systems they can. They don’t sell hardware, aside from the occasional foray into phones or laptops. But they wanted their customers to benefit from the years of work they had put into developing robust systems. That’s where Azure came from.

IBM is focused on customers buying their hardware and their expertise for installing it. AWS and Microsoft want to rent their expertise and software for building platforms. That difference in perspective is why IBM’s cloud aspirations were never going to take off to new heights. They couldn’t challenge for the top three places unless Google suddenly decided to shut down Google Cloud Platform. And no matter how hard they tried, Larry Ellison was always going to be nipping at their heels by pouring money into his cloud offerings to be on top. He may never get there but he is determined to make the best showing he can.

Putting On The Red Hat

Where does that leave IBM after buying Red Hat? Well, Red Hat sells software and services for it. But those services are all focused on integration. Red Hat has never built their own cloud platform. Instead, they work on everyone else’s platform effectively. They can deploy an OS or a container system on Amazon or Azure with no hiccups.

IBM has to realize now that they will never unseat Amazon. The momentum behind this 800-lb gorilla is just too much to challenge. The remaining players are fighting for a small piece of third or fourth place at this point. And yes, while Google has a comfortable hold on third place right now, they do have a tendency to kill projects that aren’t AdWords or the search engine homepage. Anything else lives in a world of uncertainty.

So, how does IBM compete? They need to leverage their expertise. They’ve sold off anything that has blinking lights, save for the mainframe division. They need to embrace their Global Services heritage and shepherd the SMEs that are afraid of the cloud. They need to help enterprises in the mid-range build into AWS and Azure instead of trying to make a huge profit off them and leave them high and dry. The days of making a fortune from Fortune 100 companies with no cloud aspirations are over. Just like the fight for cloud dominance, the battle lines are drawn and the prize isn’t one or two big companies. It’s a bunch of smaller ones.

The irony isn’t lost on me that IBM’s future lies in smaller companies. The days of “No one ever got fired for buying IBM” are long past in the rearview mirror. Instead, companies need the help of smart people to move into the cloud. But they also need to do it natively. They don’t need to keep running their old hybrid infrastructure. They need a trusted advisor that can help them build something that will last. IBM could be that company with the help of Red Hat. They could reinvent themselves all over again and beat the coming collapse of providers of infrastructure. As more companies start to look toward the cloud, IBM can help them along the path. But it’s going to take some realistic looks at what IBM can provide. And the end of IBM’s hope of running their own CSP.


Tom’s Take

I’m an old IBMer. At least, I interned there in 2001. I was around for all the changes that Lou Gerstner was trying to implement. I worked in IBM Global Services where they made the AS/400. As I’m fond of saying over and over again, IBM today is not Tom Watson’s IBM. It’s a different animal that changed with the times at just the right time. IBM is still changing today, but they aren’t as nimble as they were before. Their expertise lies all over the landscape of hot new tech, but people don’t want blockchain-enabled AI for IoT edge computing. They want a trusted partner that can help them with the projects they can’t get done today. That’s how you keep your foot in the door. Red Hat gives IBM that advantage. The key is whether or not IBM can see that the way forward for them isn’t as cloudy as they had first imagined.

The Long And Winding Network Road

How do you see your network? Odds are good it looks like a big collection of devices and protocols that you use to connect everything. It doesn’t matter what those devices are. They’re just another source of packets that you have to deal with. Sometimes those devices are more needy than others. Maybe it’s a phone server that needs QoS. Or a storage device that needs a dedicated transport to guarantee that nothing is lost.

But what does the network look like to developers?

Work Is A Highway

When is the last time you thought about how the network looks to people? Here’s a thought exercise for you:

Think about a highway. Think about all the engineering that goes into building a highway. How many companies are involved in building it. How many resources are required. Now, think of that every time you want to go to the store.

It’s a bit overwhelming. There are dozens, if not hundreds, of companies that are dedicated to building highways and other surface streets. Perhaps they are architects or construction crews or even just maintenance workers. But all of them have a function. All for the sake of letting us drive on roads to get places. To us, the road isn’t the primary thing. It’s just a way to go somewhere that we want to be. In fact, the only time we really notice the road is when it is in disrepair or under construction. We only see the road when it impacts our ability to do the things it enables.

Now, think about the network. Networking professionals spend their entire careers building bigger, faster networks. We take weeks to decide how best to handle routing decisions or agonize over which routing protocols are necessary to make things work the way we want. We make it our mission to build something that will stand the test of time and be the Eighth Wonder of the World. At least until it’s time to refresh it again for slightly faster hardware.

The difference between these two examples is the way that the creators approach their creation. Highway workers may be proud of their creation but they don’t spend hours each day extolling the virtues of the asphalt they used or the creative way they routed a particular curve. They don’t demand respect from drivers every time someone jumps on the highway to go to the department store.

Networking people have a visibility problem. They’re too close to their creation to have the vision to understand that it’s just another road to developers. Developers spend all their time worrying about memory allocation and UI design. They don’t care if the network actually works at 10GbE or 100GbE. They want a service that transports packets back and forth to their destination.

The Old New Network

We’ve had discussions in the last few years about everything under the sun that is designed to make networking easier. VXLAN, NFV, Service Mesh, Analytics, ZTP, and on and on. But these things don’t make networking easier for users. They make networking easier for networking professionals. All of these constructs are really just designed to help us do our jobs a little faster and make things work just a little bit better than they did before.

Imagine all the work that goes into electrical power generation. Imagine the level of automation and monitoring necessary to make sure that power gets from the generation point to your house. It’s impressive. And yet, you don’t know anything about it. It’s all hidden away somewhere unimpressive. You don’t need to describe the operation of a transformer to be able to plug in a toaster. And no matter how much that technology changes it doesn’t impact your life until the power goes out.

Networking needs to be a utility. It needs to move away from the old methods of worrying about how we create VLANs and routing protocols and instead needs to focus on disappearing just like the power grid. We should be proud of what we build. But we shouldn’t make our pride the focus of hubris about what we do. Networking professionals are like highway workers or electrical company employees. We work hard behind the scenes to provide transport for services. The cloud has changed the way we look at the destination for those services. And it’s high time we examine our role in things as well.


Tom’s Take

Telco workers. COBOL programmers. Data entry specialists. All of these people used to be the kings and queens of their field. They were the people with the respect of hundreds. They were the gatekeepers because their technology and job roles were critical. Until they weren’t any more. Networking is getting there quickly. We’ve been so focused on making our job easy to do that we’ve missed the point. We need to be invisible. Just like a well-built road or a functioning electrical grid. We are not the goal of infrastructure. We’re just a part of it. And the sooner we realize that and get out of our own way, the sooner we’re going to find that the world is a much better place for everyone involved in IT, from developers to users.

Legacy IT Is Not A Monument

During Networking Field Day 17, there was a lot of talk about legacy IT constructs, especially as they relate to the cloud. Cloud workloads are much better when they are new things with new applications and new processes. Existing legacy workloads are harder to move to the cloud, especially if they require some specific Java version or special hardware to work properly.

We talk a lot about how painful legacy IT is. So why do we turn it into a monument that stands the test of time?

Keeping Things Around

Most monuments that we have from ancient times are things that we never really intended to keep. Aside from the things that were supposed to be saved from the beginning, most iconic things were never built to last. Even things like the Parthenon or the Eiffel Tower. These buildings were always envisioned to be torn down sooner or later.

Today, we can’t imagine a world without those monuments. We can’t conceive of a time without them. And, depending on the amount of time that elapsed between the creation of a thing and the decision to preserve it, there could be irreparable damage. Yet that just adds to the charm.

Now, apply those factors to legacy IT. We have software that is outdated. We have applications that need old Java versions to run correctly. Or outdated DLLs. Or some other kind of thing that could cause complication or damage to our systems. Yet, we can’t bear to part with our old familiar IT. Maybe it was a UI change. Maybe it never really ran correctly on a new operating system. Or maybe the new version cost so much to upgrade that it was just cheaper to keep hacking the old thing to work slightly better each time.

Rather than examining how we could replicate the workload or find a better, quicker way to do things, we find ourselves building legacy IT into a preserved monument. We freeze the software or hardware and we never update it. We build crazier and more complicated solutions to keep something running that is well past the retirement date.

View Only

The other complication of legacy IT is that we use the programs but we never really do more than look at them. We don’t focus on the data or the process. Instead, we just plug things into a system and run what we need to run. I remember working for IBM and having to enter my weekly timesheet twice. Why? Because the new timesheet system was a Java app that ran on Windows 2000. But the system that paychecks were generated from for hourly employees (like me) was run from an AS/400 terminal window. So, after I spent half an hour entering my time for the week, I had to spend another half hour entering those exact same time entries into a console.

Would it have been easier to replicate the functions of the terminal program in Java and make time entry a single thing? Sure. Would it be easier for everything to integrate and reduce time for employees? Sure. But the people that had been using the console program for the last decade had it down. They could enter their time in a few minutes. In fact, even though the Java program was more precise for time increments most employees hated it. They’d rather use an imprecise and outdated program because it was faster and more familiar. Even though they had to manually edit their timesheet after the fact because the newer reason codes weren’t loaded in the old system.

Read-only legacy IT makes everyone’s life miserable. All kinds of crazy patches and hacks are necessary to make it run correctly with new functions. And no matter what, the replacement solution is something that people will hate. Simply because it’s not the old system.

Hands Off

This, for me, is the hardest part of legacy IT monuments. Once it works, NEVER TOUCH IT AGAIN. You can’t migrate, upgrade, or move a machine. You can’t get newer hardware that’s under support. The number of times that I’ve had to buy parts from eBay to fix broken legacy systems is much, much too high.

Now, we have to worry about what happens when the system never comes back. Or when we push it past the breaking point for some legacy app. Look at the shift from 32-bit applications to 64-bit. We’re in the transition process and yet, still, there are people that have some old application or hardware device that can’t run on newer software. Once we force a cutoff, we have to find a way to build band-aids that people can use to make a decade-old thing work.

This hands-off mentality is also part of the reason why cloud migration projects fail. Even if you can get 90% of the software in your environment to work the way you want, the odds are that the remaining 10% is composed of legacy applications that haven’t been retired because they are mission critical and very finicky. They won’t migrate. They might even be running in VMs that have to emulate old OSes because they can’t ever be upgraded. And those kinds of old familiar pets are the ones that take too much of your time.


Tom’s Take

Monuments are old buildings. We keep them around because they remind us of how things used to be. They might be falling apart. They may take more time to keep up but people love to see them and enjoy them for the nostalgia. Legacy IT is not that. It’s a headache. It’s a pain to have read-only apps that we can never change because we don’t want to or can’t get them running on something new. Rather than building them into a static monument, we need to retire them and find a way to build something new. Because no matter how beautiful they may be, no legacy IT project will ever stand the test of time like the Parthenon.

Linux and the Quest for Underlays


I’m at the OpenStack Summit this week and there’s a lot of talk about building stacks and offering everything needed to get your organization ready for a shift toward service provider models and such. It’s a far cry from the battles over software networking and hardware dominance that I’m so used to seeing in my space. But one thing came to mind that made me think a little harder about architecture and how foundations are important.

Brick By Brick

The foundation for the modern cloud doesn’t live in fancy orchestration software or data modeling. It’s not because a retailer built a self-service system or a search engine giant decided to build a cloud lab. The real reason we have a growing market for cloud providers today is because of Linux. Linux is the underpinning of so much technology today that it’s become nothing short of ubiquitous. Servers are built on it. Mobile operating systems use it. But no one knows that’s what they are using. It’s all just something running under the surface to enable applications to be processed on top.

Linux is the vodka of operating systems. It can run in a stripped down manner on a variety of systems and leave very little trace behind. BSD is similar in this regard but doesn’t have the driver support from manufacturers or the ability to strip out pieces down to the core kernel and few modifications. Linux gives vendors and operators the flexibility to create a software environment that boots and gets basic hardware working. The rest is up to the creativity of the people writing the applications on top.

Linux is the perfect underlay. It’s a foundation that is built upon without getting in the way of things running above it. It gives you predictable performance and a familiar environment. That’s one of the reasons why Cumulus Networks and Dell have embraced Linux as a way to create switch operating systems that get out of the way of packet processing and let you build on top of them as your needs grow and change.

Break The Walls Down

The key to building a good environment is a solid underlay, whether it be in systems or in networking. With reliable transport and operations taken care of, amazing things can be built. But that doesn’t mean that you need to build a silo around your particular area of organization.

The shift to clouds and stacks and “new” forms of IT management isn’t going to happen if someone has built up a massive blockade. It will work when you build a system that has common parts and themes and allows tools to work easily on multiple parts of the infrastructure.

That’s what’s made Linux such a lightning rod. If your monitoring tools can monitor servers, SANs, and switches with little to no modification you can concentrate your time on building on those pieces instead of writing and rewriting software to get you back to where you started in the first place. That’s how systems can be extensible and handle changes quickly and efficiently. That’s how you build a platform for other things.


Tom’s Take

I like building Lego sets. But I really like building them with the old fashioned basic bricks. Not the fancy new ones from licensed sets. Because the old bricks were only limited by your creativity. You could move them around and put them anywhere because they were all the same. You could build amazing things with the right basic pieces.

Clouds and stacks aren’t all that dissimilar. We need to focus on building underlays of networking and compute systems with the same kinds of basic blocks if we ever hope to have something that we can build upon for the future. You may not be able to influence the design of systems at the most basic level when it comes to vendors and suppliers, but you can vote with your dollars to back the solutions that give you the flexibility to get your job done. I can promise you that when the revenue from proprietary, non-open underlay technologies goes down the suppliers will start asking you the questions you need to answer for them.