Outing Your Outages

How are you supposed to handle outages? What happens when everything around you goes upside down in an instant? How much communication is “too much”? Or “not enough”? And is all of this written down ahead of time instead of being figured out while the world is on fire?

Team Players

You might have noticed this week that Webex Teams spent most of the week down. Hard. Well, you might have noticed if you used Microsoft Teams, Slack, or any other messaging service that wasn’t offline. Webex Teams went offline about 8:00pm EDT Monday night. At first, most people just thought it was a momentary outage and things would be back up. However, as the hours wore on and Cisco started updating the incident page with more info it soon became apparent that Teams was not coming back soon. In fact, it took until Thursday for most of the functions to be restored from whatever knocked them offline.

What happened? Well, most companies don’t like to admit exactly what went wrong. For every provider like CloudFlare that posts full outage disclosures on its site, there are many more companies that will eventually release a statement with the least amount of technical detail possible to avoid any embarrassment. Cisco is currently in the latter category, with most guesses landing on some sort of errant patch that mucked things up big time behind the scenes.

It’s easy to see when big services go offline. If Netflix or Facebook is down hard, it can impact the way we go about our lives. On the occasions when our work tools like Slack or Google Docs are inoperable, it impacts our productivity more than our personal lives. But each and every outage has lessons that we can take away and apply to our own IT infrastructure or software operations. Don’t think that companies that big, with redundancy everywhere, can’t be affected by outages regularly.

Stepping Through The Minefield

How do you handle your own outage? Well, sometimes it does involve eating some humble pie.

  1. Communicate – This one is both easy and hard. You need to tell people what’s up. You need to let everyone know things aren’t working right and that you’re working to make them right. Sometimes that means telling people exactly what’s affected. Maybe you can log into Facebook but not use Chat or Messages. Tell people what they’re going to see (a rough sketch of a structured status update follows this list). If you don’t communicate, you’re going to have people guessing. That’s not good.
  2. Triage – Figure out what’s wrong. Make educated guesses if nothing stands out. Usually, the culprits are big changes that were just made or perhaps there is something external that is affecting your performance. The key is to isolate and get things back as soon as possible. That’s why big upgrades always have a backout plan. In the event that things go sideways, you need to get back to functional as soon as you can. New features that are offline aren’t as good as tried-and-true stuff that’s reachable.
  3. Honest Post-Mortem – This is the hardest part. Once you have things back in place, you have to figure out why they broke. This is usually where people start running for the hills and getting evasive. Did someone apply a patch at the wrong time? Did a microcode update get loaded to the wrong system? How can this be prevented in the future? The answers to these questions are often hard to get because the people that were affected and innocent often want to find the guilty parties and blame someone so they don’t look suspect. The guilty parties want to avoid blame and hide in plain sight with the rest of the team. You won’t be able to get to the bottom of things unless you find out what went wrong and correct it. If it’s a process, fix it. If it’s a person, help them. If it’s a strange confluence of unrelated events that created the perfect storm, make sure that can never happen again.
  4. Communicate (Again) – This is usually where things fall over for most companies. Even the best ones get really good at figuring out how to prevent problems. However, most of them rarely tell anyone else what happened. They hide it all and hope that no one ever asks about anything. Yet, transparency is key in today’s world. Services that bounce up and down for no reason are seen as unstable. Communicating about that volatility is the only way to make sure that people have faith the service is going to stay available. Once you’ve figured out what went wrong and who did it, you need to tell someone what happened. Because the alternative is going to be second-guessing and theories that don’t help anyone.
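
If you want to make the communication steps harder to skip in the heat of the moment, it helps to decide ahead of time what a status update has to contain. Below is a minimal sketch in Python of a structured update. The field names, the service and component names, and the JSON output are all illustrative assumptions on my part, not any vendor’s actual status page API.

    # A minimal sketch of a structured incident update. The field names and the
    # example values are illustrative assumptions, not any real status page API.
    import json
    from datetime import datetime, timezone

    def build_incident_update(service, affected, status, workaround=None):
        """Bundle the facts people need: what's broken, what still works,
        what you're doing about it, and when you'll say more."""
        return {
            "service": service,
            "affected_components": affected,        # tell people exactly what they'll see
            "current_status": status,               # "investigating", "identified", "monitoring"
            "workaround": workaround,               # what people *can* still do
            "posted_at": datetime.now(timezone.utc).isoformat(),
            "next_update_by": "within 60 minutes",  # commit to the next communication
        }

    if __name__ == "__main__":
        update = build_incident_update(
            service="Webex Teams",
            affected=["Messaging", "Spaces"],
            status="identified",
            workaround="1:1 calling still works",
        )
        print(json.dumps(update, indent=2))

Even a template this small forces answers to the two questions people actually have during an outage: what exactly is broken, and when will we hear from you again.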

Tom’s Take

I don’t envy the people at Cisco that spent their entire week working to get Webex Teams back up and running. I do appreciate their work. But I want to figure out where they went wrong. I want to learn. I want to say to myself, “Never do that thing that they did.” Or maybe it’s a strange situation that can be avoided down the road. The key is communication. We have to know what happened and how to avoid it. That’s the real learning experience when failure comes around. Not the fix, but the future of never letting it happen again.

Tale From The Trenches: The Debug Of Damocles

My good friend and colleague Rich Stroffolino (@MrAnthropology) is collecting Tales from the Trenches about times when we did things that we didn’t expect to cause problems. I wanted to share one of my own here about the time I knocked a school offline with a debug command.

I Got Your Number

The setup for this is pretty simple. I was deploying CallManager for a multi-site school system. I was using local gateways at every site with FXS/FXO ports to hook up the fax lines and fire alarms that needed to dial out. Everything else got backhauled to a voice gateway at the high school with a PRI running MGCP.

I was trying to figure out why the station IDs that were being sent by the sites weren’t going out over caller ID. Everything was showing up as the high school number. I needed to figure out what was being sent. I was at the middle school location across town and trying to debug via telnet. I logged into the router and figured I would make a change, dial my cell phone from the VoIP phone next to me, and see what happened. Simple troubleshooting, right?

I did just that. My cell phone got the wrong caller ID. And my debug command didn’t show me anything. So I figured I must be debugging the wrong thing. Not debug isdn q931 after all. Maybe it’s a problem with MGCP. I’ll check that. But I just need to make sure I’m getting everything. I’ll just debug it all.

debug mgcp packet detail

Can You Hear Me Now?

Veterans of voice are probably screaming at me right now. For those who don’t know, debug anything detail generates a ton of messages. And I’m not consoled into the router. I’m remote. And I didn’t realize how big of a problem that was until my telnet window started scrolling at 100 miles an hour. And then froze.

Turns out, when you overwhelm a router CPU with debug messages, it shuts off the telnet window. It shuts off the console as well, but I wouldn’t have known that because I was far away from that port. But I did start hearing people down the hall saying, “Hello? Hey, are you still there? Weird, it just went dead.”

Guess what else a router isn’t doing when it’s processing a tidal wave of debug messages? It’s not processing calls. At all. For five school sites. I looked down at my watch. It was 2:00pm. That’s bad. Elementary schools get a ton of phone calls within the last hour of being in session. Parents calling to tell kids to wait in a pickup line or ride a certain bus home. Parents wanting to check kids out early. All kinds of things. That need phones.

I raced out of my back room. I acknowledged the receptionist’s comment about the phones not working. I jumped in my car and raced across town to the high school. I managed not to break any speed limits, but I also didn’t loiter one bit. I jumped out of my car and raced into the building. The look on my face must have warded off any comments about phone system issues because no one stopped me before I got to the physical location of the voice gateway.

I knew things were bad. I didn’t have time to console in and remove the debug command. I did what every good CCIE has been taught since the beginning of time when they need to remove a bad configuration that broke their entire lab.

I pulled the power cable and cycled the whole thing.

I was already neck deep in it. It would have taken me at least five minutes to get my laptop ready and consoled in. In hindsight, that would have been five wasted minutes since the MGCP debugger would have locked out the console anyway. As the router was coming back up, I nervously looked at the terminal screen for a login prompt. Now that the debugger wasn’t running, everything looked normal. I waited impatiently for the MGCP process to register with CallManager once more.

I kept repeating the same status CLI command while I refreshed the gateway page in CallManager over and over. After a few more tense minutes, everything was back to normal. I picked up a phone next to the rack and dialed my cell phone. It rang. I was happy. I walked back to the main high school office and told them that everything was back to normal.

Tom’s Take

My post-mortem was simple. I did dumb things. I shouldn’t have debugged remotely. I shouldn’t have used the detail keyword for something so simple. In fact, watching my screen fill up with five sites’ worth of phone calls in a fraction of a second told me there was too much going on behind the scenes for me to comprehend anyway.

That was the last time I ever debugged anything in detail. I made sure from that point forward to start out small and then build from there to find my answers. I also made sure that I did all my debugging from the console and not a remote access window. And the next couple of times I did it were always outside of production hours with a hand ready to yank the power cable just in case.
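
If you ever do find yourself touching debugs over a remote session anyway, the guardrails can be automated. Here’s a rough sketch using Netmiko against a hypothetical IOS gateway. The host, credentials, and the choice of debug isdn q931 are placeholder assumptions; the point is simply that the debug is narrow, time-boxed, and always followed by undebug all.

    # A rough sketch of time-boxed, narrow debugging over SSH. The device details
    # are placeholders; adapt (and test outside production hours) for your own gear.
    import time
    from netmiko import ConnectHandler

    DEVICE = {
        "device_type": "cisco_ios",
        "host": "10.0.0.1",        # hypothetical voice gateway
        "username": "admin",
        "password": "example",
        "secret": "example",
    }

    def narrow_debug(session, debug_cmd, hold_seconds=30):
        """Run one narrow debug for a short window, then always turn it all off."""
        # Keep debug output in the log buffer instead of spraying it at the VTY/console.
        session.send_config_set([
            "no logging console",
            "no logging monitor",
            "logging buffered 16384 debugging",
        ])
        try:
            session.send_command(debug_cmd)      # e.g. "debug isdn q931" -- never "detail"
            time.sleep(hold_seconds)             # place the test call during this window
            return session.send_command("show logging")
        finally:
            session.send_command("undebug all")  # the polite version of pulling the power cable

    if __name__ == "__main__":
        conn = ConnectHandler(**DEVICE)
        conn.enable()
        print(narrow_debug(conn, "debug isdn q931"))
        conn.disconnect()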

I didn’t save the day. At best, all I did was cover for my mistake. If it had been a support call center or a hospital I probably would have been fired for it. I made a bad decision and managed to get back to operational without costing money or safety.

Remember when you’re doing your job that you need to keep an eye on how your job will affect everything else. We don’t troubleshoot in a vacuum after all.

Is Patching And Tech Support Bad? A Response

Hopefully, you’ve had a chance to watch this 7-minute video from Greg Ferro about why better patching systems can lead to insecure software. If you haven’t, you should.

Greg is right that moral hazard is introduced because, by definition, the party providing the software is “insured” against the risks of the party using the software. But, I also have a couple of issues with some of the things he said about tech support.

Are You Ready For The Enterprise?

I’ve been working with some Ubiquiti access points recently. So far, I really enjoy them and I’m interested to see where their product is going. After doing some research, the most common issue with them seems to be their tech support offerings. A couple of Reddit users even posted in a thread that the lack of true enterprise tech support is the key that is keeping Ubiquiti from reaching real enterprise status.

Think about all the products that you’ve used over the last couple of years that offered some other kind of support aside from phone or rapid response. Maybe it was a chat window on the site. Maybe it was an asynchronous email system. Hell, if you’ve ever installed Linux in the past on your own you know the joys of heading to Google to research an issue and start digging into post after post on a long-abandoned discussion forum. DenverCoder9 would agree.

By Greg’s definition, tech support creates negative incentive to ship good code because the support personnel are there to pick up the pieces for all your mistakes when writing code. It even incentivizes people to set unrealistic goals and push code out the door to hit milestones. On the flip side, try telling people your software is so good that you don’t need tech support. It will run great on anything every time and never mess up once. If you don’t get laughed out of the building you should be in sales.

IT professionals are conditioned to need tech support. Even if they never call the number once, they’re going to want the option. In the VAR world, this is the “throat to choke” mentality. When the CEO is coming down to find out why the network is down or his applications aren’t working, they are usually not interested in the solution. They want to yell at someone to feel like they’ve managed the problem. Likewise, tech support exists to find the cause of the problem in the long run. But it’s also a safety blanket that exists for IT professionals to have someone to yell at after the CEO gets to them. Tech support is necessary. Not because of buggy code, but because of peace of mind.

How Complex Are You?

Greg’s other negative aspect to tech support is that it encourages companies to create solutions more complex than they need to be because someone will always be there to help you through them. Having done nationwide tech support for a long time, I can tell you that most of the time that complexity is there for a reason. Either you hide it behind a pretty veneer or you risk having people tell you that your product is “too simple”.

A good example case here is Meraki. Meraki is a well-known SMB/SME networking solution. Many businesses run on Meraki and do it well. Consultants sell Meraki by the boatload. You know who doesn’t like Meraki? Tinkerers. People that like to understand systems. The people that stay logged into configuration dashboards all day long.

Meraki made the decision when they started building their solution that they weren’t going to have a way to expose advanced functionality in the system to the tinkerers. They either buried it and set it to do the job it was designed to do or they left it out completely. That works for the majority of their customer base. But for those that feel they need to be in more control of the solution, it doesn’t work at all. Try proposing a Meraki solution to a group of die-hard traditionalist IT professionals in a large enterprise scenario. Odds are good you’re going to get slammed and then shown out of the building.

I too have my issues with Meraki and their ability to be a large enterprise solution. It comes down to choosing to leave out features for the sake of simplicity. When I was testing my Ubiquiti solution, I quickly found the Advanced Configuration setting and enabled it. I knew I would never need to touch any of those things, yet it made me feel more comfortable knowing I could if I needed to.

Making a solution overly complex isn’t a design decision. Sometimes the technology dictates that the solution needs to be complex. Yet, we knock solutions that require more than 5 minutes to deploy. We laud companies that hide complexity and claim their solutions are simple and easy. And when they all break we get angry because, as tinkerers, we can’t go in and do our best to fix them.


Tom’s Take

In a world where software is king, quality matters. But, so does speed and reaction. You can either have a robust patching system that gives you the ability to find and repair unforeseen bugs quickly or you can have what I call the “Blizzard Model”. Blizzard is a gaming company that has been infamous for shipping things “when they’re done” and not a moment before. If that means a project is 2-3 years late, then so be it. But, when you’re a game company and this is your line of work, you want it to be perfect out of the box and you’re willing to forego revenue to get it right. When you’re a software company that is expected to introduce new features on a regular cadence and keep the customers happy, it’s a bit more difficult.

Perhaps creating a patching system incentivizes people to do bad things. Or, perhaps it incentivizes them to try new things and figure out new ideas without the fear that one misstep will doom the product. I guess it comes down to whether or not you believe that people are inherently going to game the system.

When Redundancy Strikes

Networking and systems professionals preach the value of redundancy. When we tell people to buy something, we really mean “buy two”. And when we say to buy two, we really mean buy four of them. We try to create backup routes, redundant failover paths, and we keep things from being used in a way that creates a single point of disaster. But, what happens when something we’ve worked hard to set up causes us grief?

Built To Survive

The first problem I ran into was one I knew how to solve. I was installing a new Ubiquiti Security Gateway. I knew that as soon as I pulled my old edge router out, I was going to need to reset my cable modem in order to clear the ARP cache. That’s always a thing that needs to happen when you’re installing new equipment. Having done this many times, I knew the shortcut method was to unplug my cable modem for a minute and plug it back in.

What I didn’t know this time was that the little redundant gremlin living in my cable modem was going to give me fits. After fifteen minutes of not getting the system to come back up the way that I wanted, I decided to unplug my modem from the wall instead of the back of the unit. That meant the lights on the front were visible to me. And that’s when I saw that the lights never went out when the modem was unplugged.

Turns out that my modem has a battery pack installed since it’s a VoIP router for my home phone system as well. That battery pack was designed to run the phones in the house for a few minutes in a failover scenario. But it also meant that the modem wasn’t letting go of the cached ARP entries either. So, all my efforts to make my modem take the new firewall were being stymied by the battery designed to keep my phone system redundant in case of a power outage.

The second issue came when I went to turn up a new Ubiquiti access point. I disconnected the old Meraki AP in my office and started mounting the bracket for the new AP. I had already warned my daughter that the Internet was going to go down. I also thought I might have to reprogram her device to use the new SSID I was creating. Imagine my surprise when both my laptop and her iPad were working just fine while I was hooking the new AP up.

Turns out, both devices did exactly what they were supposed to do. They connected to the other Meraki AP in the house and used it while the old one was offline. Once the new Ubiquiti AP came up, I had to go upstairs and unplug the Meraki to fail everything back to the new AP. It took some more programming to get everything running the way that I wanted, but my wireless card had done the job it was supposed to do. It failed to the SSID it could see and kept on running until that SSID failed as well.

Finding Failure Fast

When you’re trying to troubleshoot around a problem, you need to make sure that you’re taking redundancy into account as well. I’ve faced a few problems in my life when trying to induce failure or remove a configuration issue was met with difficulty because of some other part of the network or system “replacing” my hard work with a backup copy. Or, I was trying to figure out why packets were flowing around a trouble spot or not being inspected by a security device only to find out that the path they were taking was through a redundant device somewhere else in the network.

Redundancy is a good thing. Until it causes issues. Or until it makes your network behave in such a way as to be unpredictable. Most of the time, this can all be mitigated by good documentation practices. Being able to figure out quickly where the redundant paths in a network are going is critical to diagnosing intermittent failures.

It’s not always as easy as pulling up a routing table either. If the entire core is down you could be seeing traffic routing happening at the edge with no way of knowing the redundant supervisors in the chassis are doing their job. You need to write everything down and know what hardware you’re dealing with. You need to document redundant power supplies, redundant management modules, and redundant switches so you can isolate problems and fix them without pulling your hair out.
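
Even a rough machine-readable inventory beats tribal knowledge here. Below is a small sketch of the idea in Python; the device names, part counts, and pairings are made up, and the check is deliberately naive, but it shows how written-down redundancy data lets you spot single points of failure before an outage does it for you.

    # A small sketch of documenting redundancy so single points of failure are
    # obvious. Device names, part counts, and pairings are made-up examples.
    inventory = {
        "core-sw-1":  {"power_supplies": 2, "supervisors": 2, "paired_with": "core-sw-2"},
        "core-sw-2":  {"power_supplies": 2, "supervisors": 2, "paired_with": "core-sw-1"},
        "edge-rtr-1": {"power_supplies": 1, "supervisors": 1, "paired_with": None},
    }

    def single_points_of_failure(devices):
        """Flag anything with no redundant partner or only one critical part."""
        flagged = []
        for name, parts in devices.items():
            if parts["paired_with"] is None or parts["power_supplies"] < 2:
                flagged.append(name)
        return flagged

    print(single_points_of_failure(inventory))   # ['edge-rtr-1']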


Tom’s Take

I rarely got to work with redundant equipment when I was installing it through E-Rate. The government doesn’t believe in buying two things to do the job of one. So, when I did get the opportunity to work with redundant configurations, I usually found myself trying to figure out why things were failing in ways I couldn’t predict. After a while, I realized that I needed to start making my own notes and doing some investigation before I actually started troubleshooting. And even then, like my cable modem’s battery, I ran into issues. Redundancy keeps you from shooting yourself in the foot. But it can also make you stab yourself in the eye in frustration.

How Should We Handle Failure?

I had an interesting conversation this week with Greg Ferro about the network and how we’re constantly proving whether a problem is or is not the fault of the network. I postulated that the network gets blamed when old software has a hiccup. Greg’s response was:

Which led me to think about why we have such a hard time proving the innocence of the network. And I think it’s because we have a problem with applications.

Snappy Apps

Writing applications is hard. I base this on the fact that I am a smart person and I can’t do it. Therefore it must be hard, like quantum mechanics and figuring out how to load the dishwasher. The few people I know that do write applications are very good at turning gibberish into usable radio buttons. But they have a world of issues they have to deal with.

Error handling in applications is a mess at best. When I took C Programming in college, my professor was an actual coder during the day. She told us during the error handling portion of the class that most modern programs are a bunch of statements wrapped in if clauses that jump to some error condition if they fail. All of the power of your phone apps is likely a series of if this, then that, otherwise this failure condition statements.

Some of these errors are pretty specific. If an application calls a database, it will respond with a database error. If it needs an SSL certificate, there’s probably a certificate specific error message. But what about those errors that aren’t quite as cut-and-dried? What if the error could actually be caused by a lot of different things?
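
As a small illustration of why that matters, here’s a sketch of a basic reachability check in Python. The hostname and port are placeholders for whatever an application depends on; the point is that the same “can’t reach the service” symptom splits into very different conversations depending on which exception actually fired.

    # A sketch of turning one vague symptom into specific causes. The host and
    # port are placeholders for whatever service the application depends on.
    import socket

    def check_service(host, port, timeout=3):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return "ok -- the path and the listener are both fine"
        except socket.gaierror:
            return "DNS lookup failed -- go look at the resolvers, not the routers"
        except ConnectionRefusedError:
            return "host reachable but nothing listening -- that's an application problem"
        except socket.timeout:
            return "connection timed out -- now the network path is a legitimate suspect"
        except OSError as err:
            return f"other network error: {err}"

    print(check_service("app.example.com", 443))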

My first real troubleshooting on a computer came courtesy of LucasArts’ X-Wing. I loved the game and played the training missions constantly until I was able to beat them perfectly just a few days after installing it. However, when I started playing the campaign, the program would crash to DOS with an out-of-memory error message when I started the second mission. In the days before the Internet, it took a lot of research to figure out what was going on. I had more than enough RAM. I met all the specs on the side of the box. What I didn’t know and had to learn is that X-Wing required the use of Expanded Memory (EMS) to run properly. Once I decoded enough of the message to find that out, I was able to change things and make the game run properly. But I had to know that the memory X-Wing was complaining about was EMS, not the XMS RAM that I needed for Windows 3.1.

Generic error messages are the bane of the IT professional’s existence. If you don’t believe me, ask yourself how many times you’ve spent troubleshooting a network connectivity issue for a server only to find that DNS was misconfigured or down. In fact, Dr. House has a diagnosis for you: it’s always DNS.

Unintuitive Networking

If the error messages are so vague when it comes to network resources and applications, how are we supposed to figure out what went wrong and how to fix it? I mean, humans have been doing that for years. We take the little bits of information that we get from various sources. Then we compile, cross-reference, and correlate it all until we have enough to make a wild guess about what might be wrong. And when that doesn’t work, we keep trying even more outlandish things until something sticks. That’s what humans do. We fill in the gaps and come up with crazy ideas that occasionally work.

But computers can’t do that. Even the best machine learning algorithms can’t extrapolate data from a given set. They need precise inputs to find a solution. Think of it like a Google search for a resolution. If you don’t have a specific enough error message to find the problem, you aren’t going to have enough data to provide a specific fix. You will need to do some work on your own to find the real answer.

Intent-based networking does little to fix this right now. All intent-based networking products are designed to create a steady state from a starting point. No matter how cool it looks during the demo or how powerful it claims to be, it will always fall over when it fails. And how easily it falls over is up to the people that programmed it. If the system is a black box with no error reporting capabilities, it’s going to fail spectacularly with no hope of repair beyond an expensive support contract.

It could be argued that intent-based networking is about making networking easier. It’s about setting things up right the first time and making them work in the future. Yet, no system in a steady state works forever. With the possible exception of the pitch drop experiment, everything will fail eventually. And if the control system in place doesn’t have the capability to handle the errors being seen and propose a solution, then your fancy provisioning system is worth about as much as a car with no seat belts.

Fixing The Issues

So, how do we fix this mess? Well, the first thing that we need to do is scold the application developers. They’re easy targets and usually the cause behind all of these issues. Instead of giving us a vague error message about network connectivity, we need something more like the classic “lp0 on fire”. We need to know what was going on when the function call failed. We need a trace. We need context around the process to be able to figure out why something happened.

Now, once we get all these messages, we need a way to correlate them. Is that an NMS or some other kind of platform? Is that an incident response system? I don’t know the answer there, but you and your staff need to know what’s going on. They need to be able to see the problems as they occur and deal with the errors. If it’s DNS (see above), you need to know that fast so you can resolve the problem. If it’s a BGP error or a path problem upstream, you need to know that as soon as possible so you can assess the impact.
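
I don’t know what the right platform is either, but even a crude first pass at correlation can be scripted. Here’s a tiny sketch; the regex patterns and log lines are invented for illustration, and a real system would obviously need far richer rules and real telemetry behind it.

    # A tiny sketch of pattern-based triage: map raw error text to the subsystem
    # most likely at fault so the right people look first. Patterns and sample
    # log lines are invented for illustration.
    import re

    RULES = [
        (re.compile(r"NXDOMAIN|SERVFAIL|name resolution", re.I), "DNS"),
        (re.compile(r"%BGP|hold time expired|neighbor .* down", re.I), "Routing / upstream"),
        (re.compile(r"certificate (expired|verify failed)", re.I), "PKI / certificates"),
    ]

    def classify(log_line):
        for pattern, subsystem in RULES:
            if pattern.search(log_line):
                return subsystem
        return "Unknown -- needs a human"

    for line in [
        "app01 connect error: SERVFAIL resolving db.internal",
        "%BGP-5-ADJCHANGE: neighbor 203.0.113.1 Down",
        "gateway timed out after 30s",
    ]:
        print(f"{classify(line):<20} <- {line}")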

And lastly, we’ve got to do a better job of documenting these things. We need to take charge of keeping track of what works and what doesn’t. And keeping track of error messages and their meanings as we work on getting app developers to give us better ones. Because no one wants to be in the same boat as DenverCoder9.


Tom’s Take

I made a career out of taking vague error messages and making things work again. And I hated every second of it. Because as soon as I became the Montgomery Scott of the Network, people started coming to me with ever more bizarre messages. Becoming the Rosetta Stone for error messages typed into Google is not how you want to spend your professional career. Yet, you need to understand that even the best things fail and that you need to be prepared for it. Don’t spend money building a very pretty facade for your house if it’s built on quicksand. Demand that your orchestration systems have the capabilities to help you diagnose errors when everything goes wrong.

Is It Really Always The Network?

[Image: “Keep Calm and Blame the Network,” via Thomas LaRock]

I had a great time over the last month writing a series of posts with my friend John Herbert (@MrTugs) over on the SolarWinds Geek Speak Blog. You can find the first post here. John and I explored the idea that people are always blaming the network for a variety of issues that are often completely unrelated to the actual operation of the network. It was fun writing some narrative prose for once, and the feedback we got was actually pretty awesome. But I wanted to take some time to explain the rationale behind my madness. Why is it that we are always blaming the network?!?

Visibility Is Vital

Think about all the times you’ve been working on an application and things start slowing down. What’s the first thing you think of? If it’s a standalone app, it’s probably some kind of processing lag or memory issues. But if that app connects to any other thing, whether it be a local network or a remote network via the Internet, the first culprit is the connection between systems.

It’s not a large logical leap to make. We have to start by assuming that the people that made the application knew what they were doing. If hundreds of other people aren’t having this problem, it must not be the application, right? We’ve already started eliminating the application as the source of the issues even before we start figuring out what went wrong.

People will blame the most visible part of the system for issues. If that’s a standalone system sealed off from the rest of the world, it obviously must be the application. However, we don’t usually build these kinds of walled-off systems any longer. Almost every application in existence today requires a network connection of some kind. Whether it’s to get updates or to interact with other data or people, the application needs to talk to someone or maybe even everyone.

I’ve talked before about the need to make the network more of a “utility”. Part of the reason for this is that it lowers the visibility of the network to the rest of the IT organization. Lower visibility means fewer issues being incorrectly blamed on the network. It also means that the network is going to be doing more to move packets and less to fix broken application issues.

Blame What You Know

If your network isn’t stable to begin with, it will soon become the source of all issues in IT, even if the network has nothing to do with the app. That’s because people tend to identify problem sources based on their own experience. If you are unsure of that, work on a consumer helpdesk sometime and try to keep track of the number of calls that get blamed on “viruses” even when there’s no indication that a virus is involved. It’s staggering.

The same thing happens in networking and other enterprise IT. People only start troubleshooting problems from areas of known expertise. This usually breaks down by people shouting out solutions like, “I saw this once so it must be that! I mean, the symptoms aren’t similar and the output is totally different, but it must be that because I know how to fix it!”

People get uncomfortable when they are faced with troubleshooting something unknown to them. That’s why they fall back on familiar things. And if they constantly hear how the network is the source of all issues, guess what the first thing to get blamed is going to be?

Network admins and engineers have to fight a constant battle to disprove the network as the source of issues. And for every win they get it can come crashing down when the network is actually the problem. Validating the fears of the users is the fastest way to be seen as the issue every time.

Mean Time To Innocence

As John and I wrote the pieces for SolarWinds, what we wanted to show is that a variety of issues can look like network problems up front but have vastly different causes behind the scenes. What I felt was very important for the piece was having the main character, Amanda, go beyond the infamous Mean Time To Innocence (MTTI) metric. In networking, we all too often find ourselves going only so far as to prove that it’s not the network and then leaving it there. As soon as we’re innocent, it’s done.

Cross-functional teams and other DevOps organizations don’t believe in that kind of boundary. Lines between silos blur or are totally absent. That means that instead of giving up once you prove it’s not your problem, you need to work toward fixing what’s at issue. Fix the problem, not the blame. If you concentrate on fixing the problems, it will soon become noticeable that the networking team may not always be the problem. And even when the network is at fault, the network team will work to fix it and any other issues that you see.


Tom’s Take

I love the feedback that John and I have gotten so far on the series we wrote. Some said it feels like a situation they’ve been in before. Others have said that they applaud the way things were handled. I know that the narrative allows us to bypass some of the unsavory things that often happen, like arguments and political posturing inside an organization to make other department heads look bad when a problem strikes. But what we really wanted to show is that the network is usually the first to get blamed and the last to be proven innocent in a situation like this.

We wanted to show that it’s not always the network. And the best way for you to prove that in your own organization is to make sure the network isn’t just innocent, but helpful in solving as many problems as possible.

Repeat After Me

Thanks to Tech Field Day, I fly a lot. As Southwest is my airline of choice and I have status, I tend to find myself sitting in the slightly more comfortable exit row seats. One of the things that any air passenger sitting in the exit row knows by heart is the exit row briefing. You must listen to the flight attendant brief you on the exit door operation and the evacuation plan. You are also required to answer with a verbal acknowledgment.

I know that the verbal acknowledgment is required by federal law. I’ve also seen some people blatantly disregard the need to verbally accept responsibility for their seating choice, leading to some hilarious stories. But it also made me think about why making people talk to you is the best way to make them understand what you’re saying.

Sotto Voce

Today’s society is full of distractions, from auto-play videos on Facebook to Pokemon Go parks at midnight, all designed to capture a human’s attention for a few fleeting seconds. Even playing a mini-trailer before a movie trailer is designed to capture someone’s attention for a moment. That’s fine in a world where distraction is assumed and people try to multitask several different streams of information at once.

People are competing for our ears as well. Pleasant voices are everywhere trying to sell us things. High-volume voices are trying to cut through the din to sell us even more things. People walk around with headphones in most of the day. Those that don’t pollute what’s left of the quiet with cute videos that sound like they were recorded in an echo chamber.

This has also led to a larger amount of non-verbal behavior being misinterpreted. I’ve experienced this myself on many occasions. People distracted by a song on their phone or thinking about lyrics in their mind may nod or shake their head in rhythm. If you ask them a question just before the “good part” and they don’t hear you clearly, they may appear to agree or disagree even though they don’t know what they just heard.

Even worse is when you ask someone to do something for you and they agree, only to turn around and ask, “What was it you wanted again?” or “Sorry, I didn’t catch that.” It’s become acceptable in society to agree to things without understanding their meaning. This leads to breakdowns in communication and pieces of the job left undone, because you assumed someone was going to do something when they agreed, yet they never understood what they were supposed to do.

Fortississimo!

I’ve found that the most effective way to get someone to understand what you’ve told them is to ask them to repeat it back in their own words. It may sound a bit silly to hear back what you just told them, but think about the steps that they must go through:

  • They have to stop for a moment and think about what you said.
  • They then have to internalize the concepts so they understand them.
  • They then must repeat back to you those concepts in their own phrasing.

Those three steps mean that the ideas behind what you are stating or asking must be considered for a period of time. It means that the ideas will register and be remembered because they were considered when repeating them back to the speaker.

Think about this in a troubleshooting example. A junior admin is supposed to go down the hall and tell you when a link light comes on for port 38. If the admin just nods and doesn’t pay attention, ask them to repeat those instructions back. The admin will need to remember that port 38 is the right port and that they need to wait until the link light is on before saying something. It’s only two pieces of information, but it does require thought and timing. By making the admin repeat the instructions, you make sure they have them down right.


Tom’s Take

Think about all the times recently when someone has repeated something back to you. A food order or an amount of money given to you to pay for something. Perhaps it was a long list of items to accomplish for an event or a task. Repetition is important to internalize things. It builds neural pathways that force the information into longer-term memory. That’s why a couple of seconds of repetition are time well invested.