Monitoring Other People’s Problems

It’s Always the Network is a refrain that causes operations teams to shudder. No matter what your flavor of networking might be it’s always your fault. Even if the actual problem is DNS, a global BGP outage, or even some issue with the SaaS provider. Why do we always get blamed? And how can you prevent this from happening to you?

User Utopia

Users don’t know about the world outside of their devices. As soon as they click on something in a browser window they expect it to work. It’s a lot like ordering a package and having it delivered. It’s expected that the package arrives. You don’t concern yourself with the details of how it needs to be shipped, what routes it will take, and how factors that exist half a world away could cause disruptions to your schedule at home.

The network is the same to the users. If something doesn’t work with a website or a remote application it must be the “network” that is at fault. Because your users believe that everything not inside of their computer is the network. Networking is the way that stuff happens everywhere else. As professionals we know the differences between the LAN, the WAN, the cloud, and all points in between. However your users do not.

Have you heard about MTTI? It’s a humorous metric that we like to quote now that stands for “mean time to innocence”. Essentially we’re saying that we’re racing to find out why it’s not our problem. We check the connections, look at the services in our network, and then confidently explain to the user that someone else is at fault for what happened and we can’t fix that. Sounds great in theory, right?

Would you proudly proclaim that it’s someone else’s fault to the CEO? The Board of Directors? How about to the President of the US? Depending on how important the person you are speaking to might be you might change the way you present the information. Regular users might get “sorry, not my fault” but the CEO could get “here’s the issue and we need to open a ticket to fix it”. Why the varying service levels? Is it because regular users aren’t as important as the head of the company? Or is it because we don’t want to put in the extra effort for a knowledge workers that we would gladly do for the person that in ultimately going to approve a raise for us if we do extra work? Do you see the fallacy here?

Keeping Your Eyes Peeled

The answer, of course, is that every user at least deserves an explanation of where the problem is. You might be bent on justifying that it’s not in your network so you don’t have to do any more work but you also did enough work to place the blame outside of your area of control. Why not go one step further? You could check a dashboard for services on a cloud provider to see if they’re degraded. Or do a quick scan of news sites or social media to see if it’s a more widespread issue. Even checking a service to see if the site is down for more than just you is more information than just “not my problem”.

The habit of monitoring other services allows you to do two things very quickly. First is that it lowers the all-important MTTI metric. If you can quickly scan the likely culprits for outages you can then say with certainty that it’s out of your control. However, the biggest benefit to monitoring services outside of your organization that you rely on is that you’ll have forewarning for issues in your own organization. If your users come to you and say that they can’t get to Microsoft 365 and you already know that’s because they messed up a WAN router configuration you’ll head off needless blame. You’ll also know where to look next time to understand challenges.

If you’re already monitoring the infrastructure in your organization it only makes sense to monitor the infrastructure that your users need outside of it. Maybe you can’t get AWS to send you the SNMP strings you need to watch all their internal switches but you can easily pull information from their publicly available sources to help determine where the issues might lie. That helps you in the same way that monitoring switch ports in your organization helps you today. You can respond to issues before they get reported so you have a clear picture of what the impact will be.

Tom’s Take

When I worked for a VAR one of my biggest pet peeves was “not my problem” thinking. Yes, it may not be YOUR problem but it is still A problem for the user. Rather than trying to shift the blame let’s put our heads together to investigate where the problem lies and create a fix or a workaround for it. Because the user doesn’t care who is to blame. They just want the issue resolved. And resolution doesn’t come from passing the buck.

Linux and the Quest for Underlays

TuxUnderlay

I’m at the OpenStack Summit this week and there’s a lot of talk around about building stacks and offering everything needed to get your organization ready for a shift toward service provider models and such. It’s a far cry from the battles over software networking and hardware dominance that I’m so used to seeing in my space. But one thing came to mind that made me think a little harder about architecture and how foundations are important.

Brick By Brick

The foundation for the modern cloud doesn’t live in fancy orchestration software or data modeling. It’s not because a retailer built a self-service system or a search engine giant decided to build a cloud lab. The real reason we have a growing market for cloud providers today is because of Linux. Linux is the underpinning of so much technology today that it’s become nothing short of ubiquitous. Servers are built on it. Mobile operating systems use it. But no one knows that’s what they are using. It’s all just something running under the surface to enable applications to be processed on top.

Linux is the vodka of operating systems. It can run in a stripped down manner on a variety of systems and leave very little trace behind. BSD is similar in this regard but doesn’t have the driver support from manufacturers or the ability to strip out pieces down to the core kernel and few modifications. Linux gives vendors and operators the flexibility to create a software environment that boots and gets basic hardware working. The rest is up to the creativity of the people writing the applications on top.

Linux is the perfect underlay. It’s a foundation that is built upon without getting in the way of things running above it. It gives you predictable performance and a familiar environment. That’s one of the reasons why Cumulus Networks and Dell have embraced Linux as a way to create switch operating systems that get out of the way of packet processing and let you build on top of them as your needs grow and change.

Break The Walls Down

The key to building a good environment is a solid underlay, whether it be be in systems or in networking. With reliable transport and operations taken care of, amazing things can be built. But that doesn’t mean that you need to build a silo around your particular area of organization.

The shift to clouds and stacks and “new” forms of IT management aren’t going to happen if someone has built up a massive blockade. They will work when you build a system that has common parts and themes and allows tools to work easily on multiple parts of the infrastructure.

That’s what’s made Linux such a lightning rod. If your monitoring tools can monitor servers, SANs, and switches with little to no modification you can concentrate your time on building on those pieces instead of writing and rewriting software to get you back to where you started in the first place. That’s how systems can be extensible and handle changes quickly and efficiently. That’s how you build a platform for other things.


Tom’s Take

I like building Lego sets. But I really like building them with the old fashioned basic bricks. Not the fancy new ones from licensed sets. Because the old bricks were only limited by your creativity. You could move them around and put them anywhere because they were all the same. You could build amazing things with the right basic pieces.

Clouds and stacks aren’t all that dissimilar. We need to focus on building underlays of networking and compute systems with the same kinds of basic blocks if we ever hope to have something that we can build upon for the future. You may not be able to influence the design of systems at the most basic level when it comes to vendors and suppliers, but you can vote with your dollars to back the solutions that give you the flexibility to get your job done. I can promise you that when the revenue from proprietary, non-open underlay technologies goes down the suppliers will start asking you the questions you need to answer for them.