DPUs Could Change The Network Forever

You wouldn’t think that AWS re:Invent would be a big week for networking, would you? Most of the announcements focus on everything related to the data center, and teasing out the networking-specific pieces isn’t easy. That’s why I found mention of a new-ish protocol in an unrelated article to be fascinating.

In this Register piece about CPUs there’s a mention of the Nitro DPU. More importantly there’s also a reference to something that Amazon has apparently been working on for the last couple of years. It turns out that the world’s largest online bookstore and data center company is looking to get rid of TCP.

Rebuilding Transport

The new protocol was developed in 2020. Referred to as Scalable Reliable Datagram (SRD), it was built to solve specific challenges Amazon was seeing related to performance in their cloud. Amazon decided that TCP had issues significant enough that they needed to be addressed.

The first was that dropped packets require retransmission. In an environment like the Internet that makes sense. You want to recover the data you lost. However, when TCP was developed fifty years ago, the amount of data lost in transit was tiny compared to the flows of today. Likewise, TCP doesn’t really know how to take advantage of multipath forwarding. Because TCP guarantees in-order delivery, any packet that arrives out of order stalls the stream until the gap is filled and everything can be reassembled for the operating system. That makes total sense when you think about the relative power of a computer back in the 70s and 80s. If the CPU is already super busy trying to do other things, you don’t want it to have to deal with a messy stream of packets too. Halting the flow until it’s reassembled makes sense.

Today, Amazon would rather have the packets arrive over multiple different paths and be reassembled higher up the stack. Instead of forcing the sender to stall the entire flow until the lost or out-of-order pieces are sorted out, they’d rather get the whole thing delivered and start putting the puzzle together while the I/O interface is processing the next flow. That’s because the size of these flows is creating communication issues between systems. They would rather accept slightly higher per-packet latency in exchange for better overall performance.
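
To make that idea concrete, here’s a minimal sketch of the kind of reorder buffer a transport like SRD depends on. Amazon hasn’t published SRD’s wire format or the Nitro card’s internals, so the names and structure below are purely illustrative assumptions, not their implementation:

```python
# Conceptual sketch only: SRD's actual wire format and Nitro's reassembly
# logic aren't public, so everything here is an illustrative assumption.

class ReorderBuffer:
    """Accept packets from any path, in any order, and release a contiguous stream."""

    def __init__(self):
        self.expected = 0   # next sequence number the host should see
        self.pending = {}   # out-of-order packets held until the gap is filled

    def receive(self, seq, payload):
        self.pending[seq] = payload
        ready = []
        # Deliver everything that is now contiguous instead of stalling the
        # sender the way strict in-order delivery would.
        while self.expected in self.pending:
            ready.append(self.pending.pop(self.expected))
            self.expected += 1
        return ready

buf = ReorderBuffer()
print(buf.receive(1, b"world"))    # [] -- packet 0 hasn't arrived, so hold it
print(buf.receive(0, b"hello "))   # [b'hello ', b'world'] -- gap filled, both released
```

The interesting part isn’t the loop itself. It’s where that loop runs, which is what makes the next section matter.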

DPUs Or Bust

If this is so obvious, why has no one rewritten TCP before? Well, we hinted at it when we discussed the fact that TCP stops the flow to sort out the mess. With something like SRD, the CPU would be fully tasked with reassembling flows because network communications are constant. If you’re hoping that whatever TCP’s successor turns out to be will just magically sort things out, you’re going to find a system that is quite busy with work that shouldn’t be taking so much time. The perceived gains in performance are going to evaporate.

The key to how SRD will actually work lies not in the protocol but in the way it’s implemented in hardware. SRD only works if you’re using an AWS Nitro DPU. When Amazon says they want the packets to be reassembled “up the stack” they’re really saying they want the DPU to do the work of putting the pieces back together before presenting the reassembled data to the system. The system itself doesn’t know the packets came in out of order. The system doesn’t even know how the packets arrived. It just knows it sent data somewhere else and that it arrived with no errors.

The magic here is the DPU. TCP works for the purpose it was designed for. If you’re transmitting packets over a wide area in serial fashion and there’s packet loss on the link, TCP is the way to go. Amazon’s SRD only works with Nitro-configured systems in AWS. It’s a given that many servers are now in AWS and more than a few are going to have this additional piece of hardware installed and configured. The value is that having it enabled is going to increase performance. But there’s a cost.

I think back to configuring jumbo Ethernet frames for a storage array back in 2011. Enabling 9000-byte frames between the switch and the array really increased performance. However, if you plugged the array in anywhere else, or plugged anything else into that port on the switch, it broke. Why? Because 9000-byte Ethernet isn’t the standard. It’s a special configuration that only works when explicitly enabled.

Likewise, SRD works well within Amazon. You need to specifically enable it on your servers, and those performance gains aren’t going to happen if you need to talk to something that isn’t SRD-enabled or isn’t configured with a Nitro DPU. Because the hard work is being done by the DPU, you don’t have to worry about a custom hardware configuration in your application ruining your ability to migrate. However, if you’re relying on a specific performance level in the app to make things happen, like database lookups or writing files to a SAN, you’re not going to be able to move that app to another cloud without incurring a performance penalty.


Tom’s Take

I get that companies like Amazon are heavily invested in Ethernet but want higher performance. I also get that DPUs are going to enable us to do things with systems that we’ve never even really considered before, like getting rid of TCP and optimizing how packets are sent through the network. It’s a grand idea, but it not only breaks compatibility, it creates a gap in performance expectations as well. We already see this in the wireless space. People want gigabit file transfers with Wi-Fi 6 and are disappointed because their WAN connection isn’t that fast. If Amazon is promising super-fast communications between systems, users are going to be disappointed when they don’t get that in other clouds or between non-DPU systems. It creates lock-in and sets expectations that can’t be met. The future of networking is going to involve DPUs. But in order to really change the way we do things, we need to work with the standards we have and make sure everyone can benefit.

Clear Skies for IBM and Red Hat

There was a lot of buzz this week when IBM announced they were acquiring Red Hat. A lot has been discussed about this in the past five days, including some coverage that I recorded with the Gestalt IT team on Monday. What I wanted to discuss quickly here are the aspirations that IBM now has for the cloud. Or, more appropriately, what they aren’t going to be doing.

Build Your Own Cloud

It’s funny how many cloud providers started springing from the earth as soon as AWS started turning a profit. Microsoft and Google seem to be doing a good job of challenging for the crown. But the next tier down is littered with people trying to make a go of it. VMware with vCloud Air before they sold it. Oracle. Digital Ocean. IBM. And that doesn’t count the number of companies offering a specific function, like storage, and calling themselves a cloud service provider.

IBM was well positioned to be a contender in the cloud service provider (CSP) market. Except they started the race with a huge disadvantage. IBM was a company that was focused on selling solutions to their customers. Just like Oracle, IBM’s primary customer was external. The people they cared most about wrote them checks and they shipped out eServers and OS/2 Warp boxes.

Compare and contrast that with Amazon. Who is Amazon’s customer? Well, anyone that wants to buy something. But who consumes the products that Amazon builds for IT? Amazon people. Their infrastructure is built to provide a better experience for the people using their site. Amazon is very good at building systems that can handle a huge number of users and a heavy load without buckling.

Now, contrary to the myth, Amazon did not start AWS with spare capacity. While that makes for a good folk tale, Amazon probably didn’t have a lot of spare capacity lying around. Instead, they had a lot of great expertise in building large-scale reliable systems and they parlayed that into a solution that could be used to bring even more people into the Amazon ecosystem. They built AWS with an eye toward selling services, not hardware.

Likewise, Microsoft’s biggest customers are their developers. They are focused on making the best applications and operating systems they can. They don’t sell hardware, aside from the occasional foray into phones or laptops. But they wanted their customers to benefit from the years of work they had put into developing robust systems. That’s where Azure came from.

IBM is focused on customers buying their hardware and their expertise for installing it. AWS and Microsoft want to rent out their expertise and software for building platforms. That difference in perspective is why IBM’s cloud aspirations were never going to take off to new heights. They couldn’t challenge for one of the top three places unless Google suddenly decided to shut down Google Cloud Platform. And no matter how hard they tried, Larry Ellison was always going to be nipping at their heels, pouring money into his cloud offerings to get to the top. He may never get there, but he is determined to make the best showing he can.

Putting On The Red Hat

Where does that leave IBM after buying Red Hat? Well, Red Hat sells software and the services to support it. But those services are all focused on integration. Red Hat has never built their own cloud platform. Instead, they work effectively on everyone else’s platforms. They can deploy an OS or a container system on Amazon or Azure with no hiccups.

IBM has to realize now that they will never unseat Amazon. The momentum behind this 850-lb gorilla is just too much to challenge. The remaining players are fighting for a small piece of third or fourth place at this point. And yes, while Google has a comfortable hold on third place right now, they do have a tendency to kill projects that aren’t AdWords or the search engine homepage. Anything else lives in a world of uncertainty.

So, how does IBM compete? They need to leverage their expertise. They’ve sold off anything that has blinking lights, save for the mainframe division. They need to embrace their Global Services heritage and shepherd the SMEs that are afraid of the cloud. They need to help enterprises in the mid-range build into AWS and Azure instead of trying to make a huge profit off them and leave them high and dry. The days of making a fortune from Fortune 100 companies with no cloud aspirations are over. Just like the fight for cloud dominance, the battle lines are drawn and the prize isn’t one or two big companies. It’s a bunch of smaller ones.

The irony isn’t lost on me that IBM’s future lies in smaller companies. The days of “No one ever got fired for buying IBM” are long past in the rearview mirror. Instead, companies need the help of smart people to move into the cloud. But they also need to do it natively. They don’t need to keep running their old hybrid infrastructure. They need a trusted advisor that can help them build something that will last. IBM could be that company with the help of Red Hat. They could reinvent themselves all over again and beat the coming collapse of providers of infrastructure. As more companies start to look toward the cloud, IBM can help them along the path. But it’s going to take some realistic looks at what IBM can provide. And the end of IBM’s hope of running their own CSP.


Tom’s Take

I’m an old IBMer. At least, I interned there in 2001. I was around for all the changes that Lou Gerstner was trying to implement. I worked in IBM Global Services where they made the AS/400. As I’m fond of saying over and over again, IBM today is not Tom Watson’s IBM. It’s a different animal that changed with the times at just the right time. IBM is still changing today, but they aren’t as nimble as they were before. Their expertise lies all over the landscape of hot new tech, but people don’t want blockchain-enabled AI for IoT edge computing. They want a trusted partner that can help them with the projects they can’t get done today. That’s how you keep your foot in the door. Red Hat gives IBM that advantage. The key is whether or not IBM can see that the way forward for them isn’t as cloudy as they had first imagined.

Pushing Everyone’s Buttons In IT


We have officially reached the point in our long and storied IT careers where we, as old fogies, have earned the right to complain about the next generation of users and professionals. Just as the gray beards before us complained about the way we did things, so too is it our turn to moan about the state of affairs. Today, I’d like to point out how driving IT to the point of pushing simple buttons is destroying the way we do things.

Easy Buttons

The fact that IT work can be distilled into a series of simple button-pushing exercises is very thrilling. We’ve spent a lot of time and effort building devices and frameworks that take the hard part out of building devices and frameworks. We no longer have to invent languages to build things or hardware to do things. Instead, we can refine our programming capabilities or use general-purpose hardware in new combinations to provide environments for our users.

That’s one of the things that is driving people to the cloud. Cloud isn’t just about exciting hardware or keeping your data in other places. It is just as much about predictable, repeatable frameworks and workflows that let people accomplish tasks without much thought. Just like the assembly line of the eighties, we are taking boring, repetitive tasks away from people and giving them to machines that don’t get bored and love repetition. Doing that drives costs down and makes people more productive.

But what happens when those processes break down? What happens when the magic smoke is let out of the machine? That’s when the real issues start cropping up. Perhaps the framework developers are great at figuring out what exactly went wrong in their solution. However, they’re likely clueless about all of the things that their solution is built upon. Imagine if a mechanic couldn’t diagnose the engine or the various subsystems of your car. You’d have a fit, right?

The troubleshooting level of modern framework engineering only extends to the edge of their solution. Once you get into a problem with a service they’ve leveraged, such as AWS, then the buck is passed and the troubleshooting must move to a new team. The problem with having your teams building next-generation IT services on top of existing things is that they lose visibility into the things they didn’t build. It becomes a vicious cycle when those first-level services aren’t reliable enough to keep your solution running.

Push A Button, Push A Button!

As bad as things are for IT departments in a push-button world, it’s even crazier for users in the same boat. That’s because users understand two extremes of service delivery. Either the thing is working or it isn’t. There is no in between. In the old days of non-instant IT, users would just keep asking until a thing was done.

Today, things are much different. Automation and orchestration have allowed frameworks and platforms to create services instantly. That means users have been able to create their own platforms for building things. So anything that isn’t instant is broken. Take, for example, a recent discussion of bandwidth from the spring ONUG meeting. An IT professional was frustrated that it took the better part of a day to transfer 1 petabyte of information across the country. He wasn’t upset that it failed, mind you. He was mad because it didn’t happen in five minutes. Non-instant things are broken.
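
For a sense of scale, here’s the back-of-the-envelope math on that transfer. These are rough, illustrative numbers assuming a decimal petabyte and a perfectly efficient link, not the actual ONUG figures:

```python
# Rough arithmetic only -- assumes 1 PB = 10^15 bytes and zero protocol overhead.
petabyte_bits = 1e15 * 8

seconds_in_a_day = 24 * 3600
five_minutes = 5 * 60

print(petabyte_bits / seconds_in_a_day / 1e9)   # ~92.6  -> ~93 Gbit/s sustained to finish in a day
print(petabyte_bits / five_minutes / 1e12)      # ~26.7  -> ~27 Tbit/s to finish in five minutes
```

Finishing in a day already implies roughly 93 Gbit/s sustained across the country. Finishing in five minutes would take tens of terabits per second.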

The story repeats itself over and over again. Networking resources that can’t be automatically provisioned are broken. Cloud services that need more than a few minutes to spin up aren’t working correctly. Mobile apps and sites that don’t instantly pop up on your phone must be working incorrectly. The patience level of a user isn’t even a tenth of the average IT professional’s. IT pros know why something is running slow. Users fall back on slow things being broken.


Tom’s Take

The problem is investment. Given an infinite amount of funding, everything can be fast. But people are trying to find ways to not pay for things in today’s IT world. Maybe they want to pay for AWS because it works. But OpenStack gives the promise of AWS for free. Except things like OpenStack and Ansible aren’t free. The currency you use to pay for them is time.

Time is just as precious a commodity as money. We don’t get refunds on time. We can’t ask for discounts on time. We can only invest and hope that there is a payoff down the road. The real outcome of this time investment looks very similar to the environment we have today. Things should just work and allow us to build new things on top of them. But that missing time investment is the key to the whole enterprise. If we don’t spend the time building and extending things, then we won’t have the expertise we need to fix things when they break. That time investment also helps everyone appreciate just how hard it is to build things in the first place. And that’s something no button will ever duplicate.

The Myth of Chargeback

 

Cash register by the National Cash Register Co., Dayton, Ohio, United States, 1915.

Imagine a world where every aspect of a project gets charged correctly. Where the massive amount of compute time for a given project gets attributed to the proper department and billed correctly. Where resources can be allocated and associated with the projects that need them. It’s an exciting prospect, isn’t it? I’m sure that at least one person out there said “chargeback” when I started mentioning all these lofty ideas. I would have agreed with you before, but I don’t think that chargeback actually exists in today’s IT environment.

Taking Charge

The idea of chargeback is very alluring. It’s been on slide decks for the last few years as a huge benefit to the analytics capabilities in modern converged stacks. By collecting information about the usage of an application or project, you can charge the department using that resource. It’s a bold plan to change IT departments from cost centers to revenue generators.

IT is the red-headed stepchild of the organization. IT is necessary for business continuity and function. Nothing today can run without computers, networking, or phones. However, we aren’t a visible part of the business. Much like the plumbers and landscapers around the organization, IT’s job is to make things happen and not be seen. The only time users acknowledge IT is when something goes wrong.

That’s where chargeback comes into play. By charging each department for their usage, IT can seek to ferret out extraneous costs and reduce usage. Perhaps the goal is to end up a footnote in the weekly management meeting where Brian is given recognition for closing a $500,000 deal and IT gets a shout-out for figuring out marketing was using 45% more Exchange server space than the rest of the organization. Sounds exciting, doesn’t it?

In theory, chargeback is a wonderful way to keep departments honest. In practice, no one uses it. I’ve talked to several IT professionals about chargeback. About half of them chuckled when I mentioned it. Their collective experience can best be summarized as “They keep talking about doing that around here but no one’s actually figured it out yet.”

The rest have varying levels of implementation. The most advanced ones that I’ve spoken to use chargeback only for physical assets in a project. If Sales needs a new server and five new laptops for Project Hunter, then those assets are charged back correctly to the department. This keeps Sales from asking for more assets than they need and hoping that the costs can be buried in IT somewhere.

No one that I’ve spoken to is using chargeback for the applications and software in an organization. We can slice the pie as finely as we want when allocating assets you can touch, but when it comes to figuring out how to make Operations pay their fair share of the bill for the new CRM application, we’re stuck. We can pull analytics all day long, but we can’t seem to get them matched to the right usage.
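
For what it’s worth, the arithmetic side of application chargeback is trivial. Here’s a minimal sketch with made-up departments, metrics, and dollar figures; the part nobody has solved is agreeing on the usage metric and actually capturing it per department:

```python
# Hypothetical example -- the departments, metric, and dollar figures are invented.
monthly_crm_cost = 12000.0   # what the CRM application costs to run each month

# Pick a usage metric (API calls, seats, storage) and defend it to the department heads.
usage = {"Sales": 5400, "Operations": 2100, "Marketing": 1500}

total = sum(usage.values())
for dept, amount in usage.items():
    share = monthly_crm_cost * amount / total
    print(f"{dept:<12} ${share:,.2f}")
```

The loop is the easy part. Producing a usage table that every department head will accept is the part that keeps stalling out.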

Worse yet, politics plays a big role in chargeback. If a department head disagrees with the way their group is being characterized for IT usage, they can go to their superiors and talk about how critical their operation is to the business and how they need to be able to work without the restrictions of being billed for their usage. A memo goes out the next day and suddenly the department vanishes from the records with an admonishment to “let them do their jobs”.

Cloud Charges

The next thing that always comes up is public cloud. Chargeback proponents are waiting for widespread adoption of public cloud. That’s because the billing method for cloud is completely democratic. Everyone pays the price no matter what. If an AWS instance is running, someone needs to pay for it. If those systems can be isolated to a specific application or department, then the chargeback takes care of itself. Everyone is happy in the end. IT gets to avoid blame for not producing and the other departments get their resources.

Of course, the real problem comes when the bills start piling up. Cloud isn’t cheap. It exposes the dirty little secret that sunk-cost hardware has a purpose. When you bill based on CPU hours you’ll find that a lot of systems sit idle. Management will come unglued trying to figure out why cloud costs so much. The commercials and sales pitches said we would save money!
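
A quick illustration of why the bills surprise people: with per-hour billing, an idle instance costs exactly as much as a busy one. The hourly rate below is invented for the example, not a real cloud price:

```python
# Illustrative numbers only -- not a real cloud price sheet.
hourly_rate = 0.10           # assumed on-demand rate for a mid-size instance, in $/hr
hours_in_month = 730

always_on = hourly_rate * hours_in_month   # ~$73 a month whether it does work or not
business_hours = hourly_rate * 8 * 22      # ~$17.60 a month if it only runs 8 hours, 22 days

print(always_on, business_hours, always_on - business_hours)
```

That gap is the sunk-cost secret in action. On-prem hardware that sits idle was already paid for, while idle cloud capacity shows up again on next month’s bill.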

Then the politics start all over again. IT gets blamed because cloud was implemented wrong. No amount of protesting will fix that. Then come the rapid cost-cutting measures. Systems not in use get shut off. Databases lose data capture during down periods. People can’t access systems during off hours. Work falls off and the cloud project gets scrapped for the old, cheaper way.

Cloud is the model that chargeback should follow. But we need to remember that those numbers have to be correctly attributed. Just pushing a set of usage statistics down without context will lead to finger-pointing and scrambling for explanations. Instead, we need to provide context from the outset. Maybe Marketing used an abnormally high amount of IT resources last week. But did it have anything to do with the end of the quarter? Can we track that usage back to higher profits from sales? That context is critical to figuring out how usage statistics affect things overall.


Tom’s Take

Chargeback is the stick that we use to threaten organizations to shape up and fly right. We make plans to implement a process to track all the evil things that are hidden in a department, and by the time the project is ready to kick off we find that costs are down and productivity is up. That becomes the new baseline and we go on about our day thinking about how chargeback would have let us catch it before it became a problem.

In reality, chargeback is a solution that will take time to implement and cost money and time to get right. We need data context and allocation. We need actionable information and the ability to coordinate across departments. We need to know where the charges are coming from and why, not just complain about bills. And there can be no exceptions. That’s the only way to put chargeback in charge.