Continuity is Not Recovery

It was a long weekend for me but it wasn’t quite as long as it could have been. The school district my son attends is in the middle of a ransomware attack. I got an email from them on Friday afternoon telling us to make sure that any district-owned assets are powered off until further notice to keep our home networks from being compromised. That’s pretty sound advice so we did it immediately.

I know that the folks working on the problem spent the whole weekend trying to clean it up and make sure there isn’t any chance of getting reinfected. However, I also wondered how that would impact school this week. So much coursework now happens online or is delivered via computer that going from that to a full stop with no devices is probably jarring. That got me thinking once more about the difference between continuity and recovery.

Keeping The Lights On

We talk about disaster recovery a lot. Backups of any kind are designed to get back what was lost. Whether it’s a natural disaster or a security incident you want to be able to get things back to the way they were before the disaster. We talk about making sure the data is protected and secured, whether from attackers or floods or accidental deletion. It’s a sound strategy but I feel it’s missing a key component.

Aside from how much data you can afford to lose, which is measured by the recovery point objective (RPO), you also need to consider how long it’s going to take to get it back. That’s called the recovery time objective (RTO). For a few files the RTO could be minutes. For an entire data center it could be weeks. The RTO can even change based on the nature of the disaster. If you lose power to the building due to a natural disaster you may not even be able to start recovery for days, which will extend the RTO due to circumstances outside your control.
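To put some numbers on that, here is a minimal Python sketch of how a delayed start eats into an RTO. The function names and the hour figures are made up for illustration and don't come from any real recovery plan.

```python
def total_outage_hours(delay_before_start, restore_duration):
    """Hours from the start of the disaster until the data is back."""
    return delay_before_start + restore_duration


def meets_rto(rto_hours, delay_before_start, restore_duration):
    """True if the whole outage fits inside the recovery time objective."""
    return total_outage_hours(delay_before_start, restore_duration) <= rto_hours


# Hypothetical numbers: a 24-hour RTO and a restore that takes 8 hours.
print(meets_rto(24, 0, 8))   # restore can start right away
print(meets_rto(24, 48, 8))  # building has no power for two days first
```

The restore itself takes the same eight hours in both cases. The only thing that changes is how long you had to wait before you could start.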

For a business or organization looking to stay up and running during a disaster, RTO is critical but so too is the need for business continuity. How critical is it? The category was renamed to “Disaster Recovery and Business Continuity” many years ago. It’s not enough to get your data back. You have to stay up and running as much as possible during the process. You’ve probably experienced this if you’ve ever been to a store that didn’t have working registers or the ability to process credit cards. How can you pay for something if you can’t ring it up or process a payment option?

Business continuity isn’t about the data. It’s about keeping the lights on while you recover it. In the case of my son’s school they’re going to teach the old-fashioned way. Lectures and paper are going to replace videos and online quizzes. Teachers are thankfully very skilled at this. They’ve spent hundreds if not thousands of hours in a classroom instructing with a variety of techniques. Are your employees equally skilled when everything goes down? Could they get the job done if your Exchange Server goes down or they’re unable to log into Salesforce?

Back to Good, Eventually

In order to make sure you have a business left to recover you need to have some sort of a continuity plan. Especially in a world where cyberattacks are common you need to know what you have to do to keep things going while you work on fixing the damage. Most bad actors are counting on you not being able to conduct business as a driver to pay the ransom. If you’re losing thousands of dollars per minute you’re more likely to cave in and pay than try to spend days or weeks recovering.
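The math behind that pressure is simple enough to sketch in Python. The dollar amount and outage length below are hypothetical, and the function assumes losses pile up around the clock, which won't be true for every business.

```python
def downtime_cost(loss_per_minute, outage_days):
    """Total loss over an outage, assuming losses accrue 24 hours a day."""
    minutes = outage_days * 24 * 60
    return loss_per_minute * minutes


# Hypothetical: losing $2,000 per minute during a three-day recovery.
cost = downtime_cost(2000, 3)
print(f"${cost:,}")  # $8,640,000
```

A continuity plan that keeps even part of the business running shrinks the per-minute figure, which is exactly the number the attacker is counting on.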

Your continuity plan needs to exist separately from your backup RTOs. It may sound pessimistic but you need to have a plan for what happens if the RTO is met and also one for what happens if you miss it. You don’t want to count on a quick return to normal operations as your continuity plan only to find out you’re not going to get there.

The other important thing to keep in mind is that continuity plans need to be functional, not perfect. You use the systems you use for a reason. Credit card machines make processing payments quick and easy. If they’re down you’re not going to have the same functionality. Yes, using the old manual process with paper slips and carbon copies is a pain and takes time. It’s also the only way you’re going to be able to take those payments when you can’t use the computer.

You also need to plan around how to handle your continuity plan. If you’re suddenly using more paper, such as invoices or credit card slips, where do you store those? How will you process them once the systems come back online? Will you need to destroy anything after it’s entered? Does that need to happen in a special way? All of these questions should be asked now so there is time to debate them instead of waiting until you’re in the middle of a disaster to solve them.


Tom’s Take

Disasters are never fun and we never really want them to happen. However, we need to make sure we’re ready when they do. You need to have a plan for how to get everything back as well as how to keep doing everything you can until that happens. You may not be able to do 100% of the things you could before but if you don’t try to at least do some of them you’re going to lose a lot more in the long run. Have a plan and make sure everyone knows what to do when disaster strikes. Don’t count on getting everything back as the only way to recover.

Should We Embrace Points of Failure?

There was a tweet making the rounds this last week that gave me pause. Max Clark said that we should embrace single points of failure in the network. His post was an impassioned plea to networking rock stars out there to drop redundancy out of their networks and instead embrace these Single Points of Failure (SPoF). The main points Mr. Clark made boil down to a couple of major statements:

  1. Single-device networks are less complex and easier to manage and troubleshoot. Don’t have multiple devices when an all-in-one works better.
  2. Consumer-grade hardware is cheaper and easier to understand, therefore it’s better. Plus, if you need a backup you can just buy a second one and keep it on the shelf.

I’m sure most networking pros out there are practically bristling at these suggestions. Others may read through the original tweet and think this was a tongue-in-cheek post. Let’s look at the argument logically and understand why this has some merit but is ultimately flawed.

Missing Minutes Matter

I’m going to tackle the second point first. The idea that you can use cheaper gear and have cold standby equipment just sitting on the shelf is one that I’ve heard of many times in the past. Why pay more money to have a hot spare or a redundant device when you can just stock a spare part and swap it out when necessary? If your network design decisions are driven completely by cost then this is the most appealing thing you could probably do.

I once worked with a technology director who insisted that we forgo our usual configuration of RAID-5 with a hot spare drive in the servers we deployed. His logic was that the hot spare drive was spinning without actually doing anything. If something did go wrong it was just as easy to take the spare drive off the shelf, slip it in, and fire it up then, instead of running the risk that the spare might fail while sitting in the server. His logic seemed reasonable enough but there was one variable that he wasn’t thinking about.

Time is always the deciding factor in redundancy planning. In the world of backup and disaster recovery they use the acronym RTO, which stands for Recovery Time Objective. Essentially, how long can your systems be offline before the data is restored? Can you go days without getting your data back? Or do you need it back up and running within hours or even minutes? For some organizations the RTO could even be measured in mere seconds. Every reduction in RTO adds complexity and cost.

If you can go days without your data then a less expensive tape solution is best because it is the cheapest per byte stored and lasts forever. If your RTO is minutes or less you need to add hardware that replicates changes or mirrors the data between sites to ensure there is always an available copy somewhere out there. Time is the deciding factor here, just as it is in the redundancy example above.
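As a rough sketch of that decision, the Python below maps an RTO to a class of backup solution. The cutoffs of one day and one hour are my own assumptions, as is the middle tier, since the only cases described above are "days" and "minutes or less."

```python
from datetime import timedelta


def suggest_backup_approach(rto):
    """Map an RTO to a rough class of backup solution.

    The thresholds and the middle tier are illustrative assumptions.
    """
    if rto >= timedelta(days=1):
        return "tape"
    if rto >= timedelta(hours=1):
        return "something in between: weigh cost against restore speed"
    return "replication or mirroring between sites"


print(suggest_backup_approach(timedelta(days=3)))
print(suggest_backup_approach(timedelta(minutes=5)))
```

The point isn't the exact cutoffs. It's that the answer changes entirely based on one input, and that input is time.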

Can your network tolerate hours of downtime while you swap in a part from off the shelf? Remember that you’re going to need to copy the configuration over to it and ensure it’s back up and running. If it is a consumer-grade device there probably isn’t an easy way to console in and paste the config. Maybe you can upload a file from the web GUI but the odds are pretty good that you’re looking at downtime at least in the half-hour range if not more. If your office can deal with that then Max’s suggestions should work just fine.

For organizations that need to be back up and running in less than hours, you need to have fault tolerance in your network. Redundant paths for traffic or multiple devices to eliminate single points of failure are the only way to ensure that traffic keeps flowing in the event of a hardware failure. Sure, it’s more complicated to troubleshoot. But the time you spend making it work correctly is not time you’re going to spend copying configurations to a cold device while users and stakeholders are yelling at you to get things back online.

All-In-Wonder

Let’s look at the first point here. Single box solutions are better because they are simple to manage and give you everything you could need. Why buy a separate switch, firewall, and access point when you can get them all in one package? This is the small office / branch office model. SD-WAN has even started moving down this path for smaller deployments by pushing all the devices you need into one footprint.

It’s not unlike the TVs you can buy in the big box stores that have DVD players, VHS players, and even some streaming services built in. They’re easy to use because there are no extra wires to plug in and no additional remote controls to lose. Everything works from one central location and it’s simple to manage. The package is a great solution when you need to watch old VHS tapes or DVDs from your collection infrequently.

Of course, most people understand the drawbacks of this model. Those devices can break. They are much harder to repair when they’re all combined. Worse yet, if the DVD player breaks and you need to get it repaired you lose the TV completely during the process instead of just the DVD player. You also can’t upgrade the components individually. Want to trade out that DVD for a Blu-Ray player? You can’t unless you install one on its own. Want to keep those streaming apps up-to-date? Better hope the TV has enough memory to keep current. Even state-of-the-art streaming boxes will eventually be incapable of running the latest version of popular software.

All-in-one devices are best left to the edges of the network. They function well in offices with a dozen or so workers. If something goes bad on the device it’s easier to just swap the whole thing instead of trying to repair the individual parts. That same kind of mentality doesn’t work quite so well in a larger data center. The fact that most of these unified devices don’t take rack mounting ears or fit into a standard data center rack should be a big hint that they aren’t designed for use in a place that keeps the networking pieces off of someone’s desk.


Tom’s Take

I smiled a bit when I read the tweet that started this whole post. I’m sure that the networks that Max has worked on work much better with consumer all-in-one devices. Simple configurations and cold spares are a perfectly acceptable solution for law offices or tag agencies or other places that don’t measure their downtime in thousands of dollars per second. I’m not saying he’s wrong. I’m saying that his solution doesn’t work everywhere. You can’t run the core of an ISP with some SMB switches. You shouldn’t run your three-person law office with a Cat6500. You need to decide what factors are the most important for you. Don’t embrace failure without thought. Figure out how tolerant you or your customers are of failure and design around it as best you can. Once you can do that you’ll have a much better idea of how to build your network with the fewest points of failure.