I recently had a technician come to my home to troubleshoot a phone issue. I still have a landline with my cable provider, mostly because it would be too expensive to change to a package without a phone. The landline does come in handy on occasion, so I needed to have it fixed. While speaking with the technician who came to fix things, I asked about something the customer service people on the phone had said about upgrading my equipment. The field tech told me, “You don’t want that. Your old system is much better.” When he explained how the low voltage system would be replaced by a full voice over IP (VoIP) router, I agreed with him. My main concern was the uptime of my phone in the event of a power outage.
Uptime is something that we have grown accustomed to in today’s world. If you don’t believe me, go unplug your wireless router for the next five minutes. If your family isn’t ready to burn you at the stake, then you are luckier than most. The rest of us measure our happiness in the availability of services. Cloud email, streaming video, and Internet access all have to be available at the touch of a button. Whether for work or for personal use, uptime is very important in a connected world.
It still surprises me that people don’t focus on uptime as an important metric of their solutions. Selling redundant equipment or ensuring redundant paths should be one of the first considerations you have when planning a system. As Greg Ferro once told me, “When I tell you to buy one switch, I always mean two.” Backup equipment is as important as anything you can install.
You have to test your uptime as well. You don’t have to go to all the trouble of building your own chaos monkey, but you need to pull the plug on the primary every so often to be sure everything works. You also need to make sure that your backup systems are covered all the way down. Switches may function just fine with two control engines, but everything stops without power. Generators and battery backups are important. In the above case, I would need to put my entire network on a battery backup system in order to ensure I have the same phone uptime that I enjoy now with a relatively low-tech system.
You have to account for other situations as well. Several gaming sites were taken offline recently by a group launching distributed denial of service (DDoS) attacks against soft targets like login servers. You have to make sure that the important aspects of your infrastructure are protected against external issues like this. Customers don’t know the difference between a security-related attack and an outage. They look the same in the eyes of a person paying for your service.
We should all strive to provide the most uptime possible for everything that we do. Potential customers may scoff at the idea of paying for extra parts they don’t currently use. That usually falls away once you explain what happens in the event of an outage. We should also strive to point out issues with contingency plans when we see them. Redundant circuits from a provider aren’t really redundant if they share the same last mile. You’ll never know how this affects you until you test your settings. When it comes to uptime, take nothing for granted. Test everything until you know that it won’t quit when failure happens.
Don’t just plan for downtime. Forget how many nines you support. I was once told that a software vendor had “seven nines” of uptime. I responded by telling them, “That’s three seconds of downtime allowed per year. Wouldn’t it just be easier to say you never go down?” Rather than having the mindset that something will eventually fail you should instead have the idea that everything will stay up and running. It’s a subtle shift in thinking, but changing your perception does wonders for designing solutions that are always available.
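The “seven nines” arithmetic is easy to check for yourself. Here is a minimal sketch in Python, assuming a 365-day year with no leap days. The function name allowed_downtime_seconds is made up for illustration and does not come from any vendor’s tooling.

```python
# Seconds in a year, assuming 365 days (an assumption; leap years are ignored).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(nines: int) -> float:
    """Return the yearly downtime budget, in seconds, for N nines of uptime.

    Three nines means 99.9% availability, so the unavailable fraction
    is 10 ** -3 of the year.
    """
    unavailable_fraction = 10.0 ** -nines
    return SECONDS_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Print the downtime budget for two through seven nines.
    for nines in range(2, 8):
        budget = allowed_downtime_seconds(nines)
        print(f"{nines} nines: {budget:,.2f} seconds per year")
```

Running it for seven nines gives a budget of a little over three seconds per year, which is where the figure in the quote comes from.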
Reblogged this on The Cosmopolitan Geek and commented:
Numbers mean things 🙂
I don’t agree with the last part: “Rather than having the mindset that something will eventually fail you should instead have the idea that everything will stay up and running.”
No no, customers and managers already have that idea, which leads to awkward situations when something does fail. I do agree with you that the focus of networking operations should be on keeping things up and running (proactive), not on improving the speed at which you patch things up after failures and then using that figure as ‘redundancy’ or ‘uptime’.