Continuity is Not Recovery

It was a long weekend for me but it wasn’t quite as long as it could have been. The school district my son attends is in the middle of a ransomware attack. I got an email from them on Friday afternoon telling us to make sure that any district-owned assets are powered off until further notice to keep our home networks from being compromised. That’s pretty sound advice so we did it immediately.

I know that the folks working on the problem spent the whole weekend trying to clean it up and make sure there isn’t any chance of getting reinfected. However, I also wondered how that would impact school this week. So much coursework now happens online or is delivered via computer that going from that to a full stop with no devices is probably jarring. That got me thinking once more about the difference between continuity and recovery.

Keeping The Lights On

We talk about disaster recovery a lot. Backups of any kind are designed to get back what was lost. Whether it’s a natural disaster or a security incident you want to be able to recover things back to the way they were before the disaster. We talk about making sure the data is protected and secured, whether from attackers or floods or accidental deletion. It’s a sound strategy but I feel it’s missing a key component.

Aside from how much data you can afford to lose, which is measured by the recovery point objective (RPO), you also need to consider how long it’s going to take to get everything back. That’s called the recovery time objective (RTO). RTO tells you how long it will be until you can get your stuff back. For a few files the RTO could be minutes. For an entire data center it could be weeks. The RTO can even change based on the nature of the disaster. If you lose power to the building due to a natural disaster you may not even be able to start recovery for days, which will extend the RTO due to circumstances outside your control.
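To make the two objectives concrete, here is a minimal Python sketch that compares an incident timeline against them. It treats RPO as the largest window of lost data you can tolerate and RTO as the longest outage you can tolerate. The `RecoveryObjectives` class, the `assess_recovery` function, and every date and threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # largest tolerable window of lost data
    rto: timedelta  # longest tolerable time until service is restored


def assess_recovery(objectives, last_good_backup, incident_start, service_restored):
    """Compare what happened in an incident against the RPO and RTO."""
    data_loss_window = incident_start - last_good_backup
    downtime = service_restored - incident_start
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


# Hypothetical numbers: nightly backups, an incident on a Friday afternoon,
# and systems restored the following Tuesday morning.
objectives = RecoveryObjectives(rpo=timedelta(hours=24), rto=timedelta(hours=48))
result = assess_recovery(
    objectives,
    last_good_backup=datetime(2021, 10, 8, 1, 0),
    incident_start=datetime(2021, 10, 8, 15, 0),
    service_restored=datetime(2021, 10, 12, 8, 0),
)
print(result)
```

In this made-up timeline the data loss stays inside the RPO but the outage runs well past the RTO, which is exactly the gap a continuity plan has to cover.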

For a business or organization looking to stay up and running during a disaster, RTO is critical but so too is the need for business continuity. How critical is it? The category was renamed to “Disaster Recovery and Business Continuity” many years ago. It’s not enough to get your data back. You have to stay up and running as much as possible during the process. You’ve probably experienced this if you’ve ever been to a store that didn’t have working registers or the ability to process credit cards. How can you pay for something if you can’t ring it up or process a payment option?

Business continuity isn’t about the data. It’s about keeping the lights on while you recover it. In the case of my son’s school they’re going to teach the old fashioned way. Lectures and paper are going to replace videos and online quizzes. Teachers are thankfully very skilled in this manner. They’ve spent hundreds if not thousands of hours in a classroom instructing with a variety of techniques. Are your employees equally skilled when everything goes down? Could they get the job done if your Exchange Server goes down or they’re unable to log into Salesforce?

Back to Good, Eventually

In order to make sure you have a business left to recover you need to have some sort of a continuity plan. Especially in a world where cyberattacks are common you need to know what you have to do to keep things going while you work on fixing the damage. Most bad actors are counting on you not being able to conduct business as a driver to pay the ransom. If you’re losing thousands of dollars per minute you’re more likely to cave in and pay than try to spend days or weeks recovering.

Your continuity plan needs to exist separately from your backup RTOs. It may sound pessimistic but you need to have a plan for what happens if the RTO is met but also one for what happens if you miss your RTO. You don’t want to count on a quick return to normal operations as your continuity plan only to find out you’re not going to get there.

The other important thing to keep in mind is that continuity plans need to be functional, not perfect. You use the systems you use for a reason. Credit card machines make processing payments quick and easy. If they’re down you’re not going to have the same functionality. Yes, using the old manual process with paper slips and carbon copies is a pain and takes time. It’s also the only way you’re going to be able to take those payments when you can’t use the computer.

You also need to plan around how to handle your continuity plan. If you’re suddenly using more paper, such as invoices or credit card slips, where do you store those? How will you process them once the systems come back online? Will you need to destroy anything after it’s entered? Does that need to happen in a special way? All of these questions should be asked now so there is time to debate them instead of waiting until you’re in the middle of a disaster to solve them.


Tom’s Take

Disasters are never fun and we never really want them to happen. However, we need to make sure we’re ready when they do. You need to have a plan for how to get everything back as well as how to keep doing everything you can until that happens. You may not be able to do 100% of the things you could before but if you don’t try to at least do some of them you’re going to lose a lot more in the long run. Have a plan and make sure everyone knows what to do when disaster strikes. Don’t count on getting everything back as the only way to recover.

Getting In Front of Future Regret

Yesterday I sat in on the keynote from Commvault Connections21 and participated in a live blog of it on Gestalt IT. There was a lot of interesting info around security, especially related to how backup and disaster recovery companies are trying to add value to the growing ransomware issue in global commerce. One thing that I did take away from the conversation wasn’t specifically related to security though and I wanted to dive into it a bit more.

Reza Morakabati, CIO for Commvault, was asked what he thought teams needed to do to advance their data strategy. And his response was very insightful:

Ask your team to imagine waking up to hear some major incident has happened. What would their biggest regret be? Now, go to work tomorrow and fix it.

It’s a short, sweet, and powerful statement. Technology professionals are usually focused on implementing new things to improve productivity or introduce new features to users and customers. We focus on moving fast and making people happy. Security is often seen as running counter to this ideal. Security wants to keep people safe and secure. It’s not unlike the parents that hold on to their child’s bicycle after the training wheels come off just to make sure the kids are safe. The kids want to ride and be free. The parents aren’t quite sure how secure they’re going to be just yet.

Regret Storming

Thought exercises make for entertaining ways to scare yourself to death some days. If you spend too much time thinking about all the ways that things can go wrong you’re going to spend far too much energy focused on the negative aspects of your work. However, you do need to occasionally open yourself up to the likelihood that things are going to go wrong at some point.

For the thought exercise above, it’s not crucial to think about how they could go wrong. It’s more important to think about what could be the worst thing that could happen as a result of those bad things and how much you’ll regret it. You need to identify those areas and try to figure out how they can be mitigated.

Let me give you a specific example from my area. In May 2013 a massive tornado ripped through Moore, OK just north of where I live. It was a tragic event that had loss of life. People were displaced and homes and businesses were destroyed. One of the places that was damaged severely was the Moore Public Schools administration building. In the aftermath of trying to clean up the debris and find survivors, one of my friends that worked for an IT vendor told me he spent hours helping sift through the rubble of the building looking for hard disk drives for the district’s servers. Why? Because the tornado had struck just before the payroll for the district’s teachers and staff was due to be run. Without the drives they couldn’t run payroll or print paychecks for those employees. With an even greater need to have funds to pay for food or start repairs on their homes you can imagine that not getting paid was going to be a big deal for those educators and staff.

There are a lot of regrets that came out of the May 2013 tornado. Loss of life and loss of property are always at the top of the list. The psychological damage of enduring something like that is also a huge impact. But for the school district one of the biggest regrets they faced was not having a contingency plan for what to do about paying their employees to help them deal with the disaster. It sounds small in comparison to the millions of dollars of damage that happened but it also represents something important that can be controlled. The school system can’t upgrade the warning system or build businesses that can withstand the most powerful storms imaginable. But they can fix their systems to prevent teachers from going without resources in the event of an emergency.

In this case, the regret is not being able to pay teachers if the district data center goes down. How could we fix that regret today if we had imagined it beforehand? We could have migrated the data center to the cloud so no one weather event could take it out. Likewise, we could have moved to a service that provides payroll entry and check printing that could be accessed from anywhere. We could also have encouraged our teachers and employees to use direct deposit functions to reduce the need to physically print checks. Technology today provides us with a number of solutions to the regret we face. We can put together plans to implement any one of them quickly. We just need to identify the problem and build a resolution for it.

Building Your Future

It’s not easy to foresee every possible outcome. Nor should it be. But if you focus on the feelings those unknown outcomes could bring you’ll have a much better sense for what’s important to protect and how to go about doing it. Are you worried your customer data is going to be stolen and shared on the Internet? Then you need to focus your efforts on protecting it. Are you concerned your AWS bill is going to skyrocket if someone steals your credentials and starts borrowing your resource pool? Then you need to have governance in place to prevent unauthorized users from doing that thing.

You don’t have to have a solution for every possible regret. You may even find that some of the things you thought you might end up regretting are actually pretty mild. If you’re not concerned about what would happen to your testing environment because you can just clone it from a repository then you can put that to bed and not worry about it any longer. Likewise, you may discover some regrets you didn’t anticipate. For example, if you’re using Active Directory credentials to back up your server data, you need to make sure you’re backing up Active Directory as well. You’re going to find yourself infuriated if you have the data you need to get back to business but it’s locked behind cryptographic locks that you can’t open because someone forgot to back up a domain controller.
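The Active Directory example is really a coverage check: everything a restore depends on has to be backed up too. The following is a minimal Python sketch of that idea. The function name `missing_backup_dependencies`, the system names, and the dependency map are all hypothetical and stand in for whatever inventory you actually keep.

```python
def missing_backup_dependencies(restore_requires, backed_up):
    """Return, for each backed-up system, the prerequisites that are not backed up."""
    gaps = {}
    for system, prerequisites in restore_requires.items():
        if system not in backed_up:
            continue  # nothing to restore, so nothing to check
        missing = sorted(p for p in prerequisites if p not in backed_up)
        if missing:
            gaps[system] = missing
    return gaps


# Hypothetical inventory: what each system needs in place before it can be restored.
restore_requires = {
    "file-server": ["active-directory"],
    "payroll-db": ["active-directory", "file-server"],
    "test-env": [],
}
backed_up = {"file-server", "payroll-db"}

gaps = missing_backup_dependencies(restore_requires, backed_up)
for system, missing in gaps.items():
    print(f"{system} cannot be restored without: {', '.join(missing)}")
```

Running a check like this against a real inventory is one way to surface the regrets you didn’t anticipate before they happen.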


Tom’s Take

I’ve been told that I’m somewhat negative because I’m always worried about what could go wrong with a project or an event. It’s not that I’m a pessimist as much as I’ve got a track record for seeing how things can go off the rails. Thanks to Commvault I’m going to spend more time thinking of my regrets and trying to plan for them to be mitigated ahead of time so all the possible ways things could fail won’t consume my thoughts. I don’t have to have a plan for everything. I just need to get in front of the regrets before I feel them for real.

What Can You Learn From Facebook’s Meltdown?

I wanted to wait to put out a hot take on the Facebook issues from earlier this week because failures of this magnitude always have details that come out well after the actual excitement is done. A company like Facebook isn’t going to do the kind of in-depth post-mortem that we might like to see but the amount of information coming out from other areas does point to some interesting circumstances causing this situation.

Let me start off the whole thing by reiterating something important: Your network looks absolutely nothing like Facebook. The scale of what goes on there is unimaginable to the normal person. The average person has no conception of what one billion looks like. Likewise, the scale of the networking that goes on at Facebook is beyond the ken of most networking professionals. I’m not saying this to make your network feel inferior. More that I’m trying to help you understand that your network operations resemble those at Facebook in the same way that a model airplane resembles a space shuttle. They’re alike on the surface only.

Facebook has unique challenges that they have to face in their own way. Network automation there isn’t a bonus. It’s a necessity. The way they deploy changes and analyze results doesn’t look anything like any software we’ve ever used. I remember moderating a panel that had a Facebook networking person talking about some of the challenges they faced all the way back in 2013:

That technology that Najam Ahmad is talking about is two or three generations removed from what is being used today. They don’t manage switches. They manage racks and rows. They don’t buy off-the-shelf software to do things. They write their own tools to scale the way they need them to scale. It’s not unlike a blacksmith making a tool for a very specific use case that would never be useful to any non-blacksmith.

Ludicrous Case Scenarios

One of the things that compounded the problems at Facebook was the inability to see what the worst case scenario could bring. The little clever things that Facebook has done to make their lives easier and improve reaction times ended up harming them in the end. I’ve talked before about how Facebook writes things from a standpoint of unlimited resources. They build their data centers as if the network will always be available and bandwidth is an unlimited resource that never has contention. The average Facebook programmer likely never lived in a world where a dial-up modem was high-speed Internet connectivity.

To that end, the way they build the rest of their architecture around those presumptions creates the possibility of absurd failure conditions. Take the report of the door entry system. According to reports, part of the reason things were slow to come back up was that the door entry system for the Facebook data centers wouldn’t allow access to the people that knew how to revert the changes that caused the issue. Usually, the card readers will retain their last good configuration in the event of a power outage to ensure that people with badges can access the system. It could be that the ones at Facebook work differently or just went down with the rest of their network. But whatever the case the card readers weren’t allowing people into the data center. Another report says that the doors didn’t even have the ability to be opened by a key. That’s the kind of planning you do when you’ve never had to break open a locked door.

Likewise, I find the situation with the DNS servers to be equally crazy. Per other reports the DNS servers at Facebook are constantly monitoring connectivity to the internal network. If that goes down for some reason the DNS servers withdraw the BGP routes being advertised for the Facebook AS until the issue is resolved. That’s what caused the outage from the outside world. Why would you do this? Sure, it’s clever to basically have your infrastructure withdraw the routing info in case you’re offline to ensure that users aren’t hammering your system with massive amounts of retries. But why put that decision in the hands of your DNS servers? Why not have some other more reliable system do it instead?
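To see why that design is risky, here is a toy Python simulation of the behavior as the reports describe it. It is not Facebook’s code or tooling. The function `sites_still_advertising` and the site names are hypothetical, and the only rule modeled is “a DNS site that cannot reach the internal network withdraws its routes.”

```python
def sites_still_advertising(dns_sites, can_reach_internal_network):
    """Return the DNS sites that keep announcing their routes.

    Each site decides on its own: if its health check against the
    internal network fails, it withdraws its announcement.
    """
    return [site for site in dns_sites if can_reach_internal_network.get(site, False)]


dns_sites = ["site-a", "site-b", "site-c"]

# Partial failure: one site loses the internal network and steps aside.
partial = sites_still_advertising(
    dns_sites, {"site-a": True, "site-b": False, "site-c": True}
)

# Total failure: the internal network is gone everywhere at once,
# so every site makes the same "safe" local decision.
total = sites_still_advertising(dns_sites, {site: False for site in dns_sites})

print("partial failure:", partial)
print("total failure:", total)
```

The rule works well when one site fails. When every site fails the same health check at the same moment, each local decision is reasonable and the combined result is that nothing is advertised at all.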

I get that the mantra at Facebook has always been “fail fast” and that their architecture is built in such a way as to encourage individual systems to go down independently of others. That’s why Messenger can be down but the feed stays up or why WhatsApp can have issues but you can still use Instagram. However, why was there no test of “what happens when it all goes down?” It could be that the idea of the entire network going offline is unthinkable to the average engineer. It could also be that the response to the whole network going down all at once was to just shut everything down anyway. But what about the plan for getting back online? Or, worse yet, what about all the things that impacted the ability to get back online?

Fruits of the Poisoned Tree

That’s where the other part of my rant comes into play. It’s not enough that Facebook didn’t think ahead to plan on a failure of this magnitude. It’s also that their teams didn’t think of what would be impacted when it happened. The door entry system. The remote tools used to maintain the networking equipment. The ability for anyone inside the building to do anything. There was no plan for what could happen when every system went down all at once. Whether that was because no one knew how interdependent those services were or because no one could think of a time when everything would go down all at once is immaterial. You need to plan for the worst and figure out what dependencies look like.
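Figuring out what dependencies look like can start as a simple graph walk. The Python sketch below finds everything that directly or transitively depends on a failed system. The function `impacted_by` and the dependency map are hypothetical, loosely modeled on the systems named above rather than on any real inventory.

```python
from collections import deque


def impacted_by(failed, depends_on):
    """Return every system that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for system, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(system)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for system in dependents.get(current, ()):
            if system not in impacted:
                impacted.add(system)
                queue.append(system)
    return impacted


# Hypothetical dependency map: each system lists what it needs to function.
depends_on = {
    "dns": ["internal-network"],
    "door-badge-readers": ["internal-network"],
    "remote-management-tools": ["internal-network", "dns"],
    "rollback-procedure": ["remote-management-tools", "door-badge-readers"],
}

impacted = impacted_by("internal-network", depends_on)
print(sorted(impacted))
```

In this made-up map the rollback procedure never mentions the network directly, yet it still ends up in the impacted set. Those indirect entries are the ones worth writing a plan for.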

Amazon learned this the hard way a few years ago when US-East-1 went offline. No one believed it at the time because the status dashboard still showed green lights. The problem? The board was hosted in the region that went down and the lights couldn’t change! That problem was remedied soon afterwards but it was a chuckle-worthy issue for sure.

Perhaps it’s because I work in an area where disasters are a bit more common but I’ve always tried to think ahead to where the issues could crop up and how to combat them. What if you lose power completely? What if your network connection is offline for an extended period? What if the same tornado that takes out your main data center also wipes out your backup tapes? It might seem a bit crazy to consider these things but the alternative is not having an answer in the off chance it happens.

In the case of Facebook, the question should have been “what happens if a rogue configuration deployment takes us down?” The answer better not be “roll it back” because you’re not thinking far enough ahead. With the scale of their systems it isn’t hard to create a change to knock a bunch of it offline quickly. Most of the controls that are put in place are designed to prevent that from happening but you need to have a plan for what to do if it does. No one expects a disaster. But you still need to know what to do if one happens.

Thus Endeth The Lesson

What we need to take away from this is that our best intentions can’t defeat the unexpected. Most major providers were silent on the schadenfreude of the situation because they know they could have been the one to suffer from it. You may not have a network like Facebook but you can absolutely take away some lessons from this situation.

You need to have a plan. You need to have a printed copy of that plan. It needs to be stored in a place where people can find it. It needs to be written in a way that people that find it can implement it step-by-step. You need to keep it updated to reflect changes. You need to practice for disaster and quit assuming that everything will keep working correctly 100% of the time. And you need to have a backup plan for everything in your environment. What if the doors seal shut? What if the person with the keys to unlock the racks is missing? How do we ensure the systems don’t come back up in a degraded state before they’re ready? The list is endless but that’s only because you haven’t started writing it yet.


Tom’s Take

There is going to be a ton of digital ink spilled on this outage. People are going to ask questions that don’t have answers and pontificate about how it could have been avoided. Hell, I’m doing it right now. However, I think the issues that compounded the problems are ones that can be addressed no matter what technology you’re using. Backup plans are important for everything you do, from campouts to dishwasher installations to social media websites. You need to plan for the worst and make sure that the people you work with know where to find the answers when everything fails. This is the best kind of learning experience because so many eyes are on it. Take what you can from this and apply it where needed in your enterprise. Your network may not look anything like Facebook, but with some planning now you don’t have to worry about it crashing like theirs did either.

What’s Your Work From Home DR Plan?

It’s almost December and the signs are pointing to a continuation of the current state of working from home for a lot of people out there. Whether it’s a surge in cases that is causing businesses to close again or a change in the way your company looks at offices and remote work, you’re likely going to ring in the new year at your home keyboard in your pajamas with a cup of something steaming next to your desk.

We have all spent a lot of time and money investing in better conditions for ourselves at home. Perhaps it was a fancy new mesh chair or a more ergonomic keyboard. It could have been a bigger monitor with a resolution increase or a better webcam for the dozen or so Zoom meetings that have replaced the water cooler. There may even be more equipment in store, such as a better home wireless setup or even a corporate SD-WAN solution to help with network latency. However, have you considered what might happen if it all goes wrong and you need to be online?

In and Outage

Outages happen more often than we realize. That’s never been more evident than the situation we find ourselves in now. There are providers that do maintenance during the day because most of their customers are at work. When that work happened in a building covered on a different grid or service line it was fine to reboot things in the afternoon. When everyone is at home working on video calls or remote classrooms it’s no longer ideal. And those are just the planned outages. What about the ones that happen without warning?

Between the extra usage at home and the increased stress on the system, I’m finding that my Internet connection is becoming much less stable than it has been in the past. When you’re on a consumer-grade line, you pay for the privilege of getting online when it works. When it doesn’t you get lumped in with the same group of people in your neighborhood or on your local loop. The provider response is usually a shrug and a “we’re working on it.” Business lines cost twice as much for less speed but you gain the ability to call and at least file a complaint or a ticket. If you’re lucky enough to have one with a good SLA you might even get a truck roll to your location within a few hours. Otherwise, you need to plan for the worst.

This is nothing new to the enterprise. The best SLA in the world is only a piece of paper with a specific promise. It doesn’t protect you from the North American Fiber Seeking Backhoe or an ice storm that knocks out power to a few square miles of your town. We need to have plans in place to deal with the potential for not having what we need when we need it. In the enterprise that was part of the job. We bought firewalls in pairs and had expensive power equipment run into the data center to protect our services. The cloud is armored against outages, so long as the engineers keep their fingers off of things. But our house is neither the enterprise nor the cloud. How can we ensure we’re able to work when nothing else looks like it’s going to get the job done?

Planning To Recover

In order to keep working from home in the event of an outage, you need to consider three important situations: a Connectivity Outage, a Power Outage, and a Location Outage.

Connectivity outages are the most basic situation we’re going to find ourselves in. The Internet is down or severely degraded. We need to get online and figure out how to keep working. That means we need a traffic plan. And we need to figure out how to get it working. The first step is going to require you to figure out how you’re going to get back online. Do you want to use your cell phone as a tethering device? Do you want to “borrow” the neighbor’s wireless (with permission, of course)? Do you want to try and work completely from your mobile device? You need to figure this out ahead of time. You also need to test it.

If you’re going to rely on your phone for tethering, make sure you have that option enabled ahead of time. Find out how much data you have available and what happens if you go over. Do you get throttled to a slower speed? Do you need to pay more? You don’t want to find out what happens in the middle of a critical call. You also need to test the speed of the tethering at your house. For example, my LTE coverage at home is pretty terrible, so I need to fall back to 3G in order to have a stable signal. That means no video calls for me until my broadband connection comes back online.
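A quick back-of-the-envelope calculation helps answer the “how much data do I have” question before the outage. The Python sketch below estimates how many hours of work a tethering allowance covers at a steady bitrate. The function `tethering_hours` and all of the numbers are hypothetical, and real usage is burstier than a constant rate, so treat the result as a rough estimate.

```python
def tethering_hours(data_cap_gb, already_used_gb, usage_mbps):
    """Estimate hours of tethering left before the data cap is reached.

    Uses decimal units: 1 GB = 8,000 megabits.
    """
    if usage_mbps <= 0:
        raise ValueError("usage_mbps must be positive")
    remaining_gb = max(data_cap_gb - already_used_gb, 0)
    remaining_megabits = remaining_gb * 8 * 1000
    seconds = remaining_megabits / usage_mbps
    return seconds / 3600


# Hypothetical plan: 15 GB of tethering, 5 GB already used this month,
# and a video call that averages about 2 Mbps.
hours = tethering_hours(data_cap_gb=15, already_used_gb=5, usage_mbps=2)
print(f"Roughly {hours:.1f} hours of video calls before hitting the cap")
```

Swapping in the numbers from your own plan tells you whether tethering covers a single bad afternoon or an entire week.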

If your plan is to use your neighbor’s connection as a backup, please get permission first. You never want to find out someone is borrowing your network and you don’t want to do that to someone else. You also should verify that the neighboring connection is a diverse circuit. It won’t do you much good to hop on their connection only to find out they’re on the same provider and everything is offline for the entire block. Test it ahead of time and make sure their data plan works for you. And remember that you’re doubling the amount of traffic pouring through their circuit. You’re going to have to decide, as above, what traffic is critical to fail over. And give your neighbor a heads up so they don’t panic when everything gets slower.

Limited Power

Electricity is a bigger issue because it affects everything at a lower level of the stack. You can have bad connectivity and just work offline with your devices. But a power outage changes the game because half your devices may be out of commission. If my power goes out, my 4K monitor goes with it. That means I need to work from my laptop until I can get everything under control again. My time is limited to how long my laptop battery can hold out.

If you want to solve the power outage problem, you need to solve your power issues. The easiest way is a backup battery system, like a personal uninterruptible power supply (UPS) under your desk. Remember that a UPS isn’t designed to keep you running for days. It’s something that is designed to keep you going just long enough to put a plan in place or shut things down gracefully. Those batteries last for 5-7 minutes at best. And the more devices you have connected to them, the less time they can stay up.

What other devices, you might ask? Keeping your computer on a UPS is smart. What about your WAN connectivity device, like your DSL or cable modem? Have to have that if you want to stay online, right? What about your wireless access points? Are they plugged into the wall? Or are you using PoE? Did you plug the PoE switch into the UPS too? That is going to reduce the UPS runtime. Unlike a data center, your house likely doesn’t have the infrastructure to run a huge UPS with enough battery to go for half an hour. You need something that is going to fit under your desk and has a fan small enough to not create a white noise generator.
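The trade-off between connected devices and runtime can be roughed out with simple arithmetic. The Python sketch below assumes runtime scales linearly with load, which is optimistic because real batteries deliver less capacity at high discharge rates and lose capacity as they age. The function `ups_runtime_minutes`, the battery size, the efficiency figure, and the device wattages are all hypothetical.

```python
def ups_runtime_minutes(battery_wh, load_watts, efficiency=0.8):
    """Rough linear estimate of UPS runtime in minutes.

    battery_wh: nominal battery capacity in watt-hours
    load_watts: total draw of everything plugged into the UPS
    efficiency: assumed fraction of capacity that reaches the load
    """
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    return battery_wh * efficiency / load_watts * 60


# Hypothetical loads in watts for a home office.
devices = {"desktop and monitor": 400, "cable modem": 15, "PoE switch and APs": 45}

desktop_only = ups_runtime_minutes(84, devices["desktop and monitor"])
everything = ups_runtime_minutes(84, sum(devices.values()))

print(f"Desktop only: about {desktop_only:.1f} minutes")
print(f"Everything plugged in: about {everything:.1f} minutes")
```

Even with generous assumptions, every extra device plugged into the battery side shortens the window you have to shut down gracefully or switch to another plan.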

Speaking of generators, if you have regular power issues or you live in a part of the country that is prone to storms that knock out power for days on end, you need to consider a home generator. These are larger, more expensive, and require some kind of fuel source. The upside is you can run your entire house off the generated electricity until the power is restored. No need to hook things up to batteries. The downside is that your house uses a lot of power for appliances and other things and you still need to prioritize what is going to get first crack at the leftover juice. You also need to have a plan to keep the generator going. If it runs off of combustible fuel, like diesel, you need to have a supply ready to go. You also need to test it regularly to ensure it’s going to work when you need it. Just like in the enterprise data center you need to know things work before everything goes sideways.

Getting Out of Dodge

The last situation can be a combination of the above factors or something totally unrelated. What happens if your workspace isn’t usable? Maybe your power is out and your heat is gone too. Maybe your Internet is down and you need to go somewhere with a better signal in order to have that big call with the CEO. Maybe you have a tree trimming service that has set up camp outside your office window for the rest of the day and the chainsaw symphony is driving you insane. If you need to move, you need to have a plan.

Where are you going to go? A friend’s house works great, provided they are home and there isn’t a quarantine order. You could try a coffee shop, provided they aren’t closed for some reason. Maybe you’re alone in a distant city and you need to get some work done. You need to figure out what’s available and have at least two backup plans in place. Maybe you want to use a local coffee shop. Make sure they’re open and make sure you aren’t violating local laws. If they’re closed, consider somewhere more local to you. Perhaps the parking lot of a store or other place that offers wireless connectivity. There are times when those wireless setups extend a bit outside the store proper and you can borrow it there. Not everyone has a vehicle, so make sure wherever you are going is easy to reach through your preferred transit method. Oh, and if it’s a restaurant or coffee shop, make sure you tip well on your purchases as a form of “rent” for the table.


Tom’s Take

Enterprises have DR plans in place for everything. Natural disasters, security incidents, and even good old fashioned human error all have a place in the Big Binder of Getting Back to Work. Homes don’t have that, even though they should. You need to know what has to happen to get you back to working if something goes wrong. You need to write it all down, test it thoroughly, and keep updating it as you go. Your boss may understand the first time you can’t work because your power went out or they’re doing circuit maintenance in your neighborhood. But, as an IT professional, you need to have a plan in place in case it becomes a regular occurrence. And when you do get everything ready to go for your home DR plan, make sure you update your enterprise DR plans too.