About networkingnerd

Tom Hollingsworth, CCIE #29213, is a former network engineer and current organizer for Tech Field Day. Tom has been in the IT industry since 2002, and has been a nerd since he first drew breath.

On Old Configs and Automation

I used to work with a guy who would configure servers for us and always include an extra SCSI card in the order. When I asked him about it one day, he told me, “I left it out once and it delayed the project. So now I just put one on every order.” Even after I explained, over and over, that we didn’t need it, he assured me that one day we might.

Later, when I started configuring networking gear, I would always set a telnet password on every VTY line going into the switch. One day, a junior network admin asked me why I configured all sixteen lines (vty 0 15) instead of just the first five (vty 0 4) like the Cisco guides teach. I shrugged my shoulders and just said, “That’s how I’ve always done it.”

The Old Ways

There’s no more dangerous phrase than “That’s the way it’s always been.”

Time and time again we find ourselves falling back on an old rule of thumb or an old configuration that we’ve made work for us before. It’s comfortable for the human mind to work from a point of reference toward new things, and we do it all the time, whether we’re basing a new configuration on something we’ve used before or trying to understand a new technology by comparing it to something we’ve worked on in the past.

But how many times have those old configurations caused us grief? How many times have we been troubleshooting a problem only to find that something was configured in a way it never should have been? Maybe it was an old method of doing hunt groups. Or perhaps it was a legacy QoS configuration that isn’t supported anymore.

Part of our issue as networking professionals is that we want to concentrate on the important things. We want to implement solutions and ideas that work for our needs, not sweat the minutiae of configuration. Sure, taking a switch from bare metal to working config is awesome the first time you do it. But configuring the fifteenth one in a row is far less awesome. That’s why copy-and-paste configurations are so popular with people who just want to get the job done.

New Hotness

This idea of reusing old configurations for new things takes on even more importance when you start replacing the boring old configuration methods with new automation and API-driven configuration models. Yes, APIs make it a lot easier to configure a switch programmatically. And automation tools like Puppet and Ansible make it much faster to bring a switch from nothing to running in the minimum amount of time.
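
To make that concrete, here’s a minimal sketch of programmatic provisioning, assuming the Python netmiko library and a lab switch at a placeholder address (the hostname, credentials, and baseline config below are all hypothetical):

```python
# A hedged sketch: push a baseline config to a switch in one shot.
# Assumes netmiko (pip install netmiko) and a reachable Cisco IOS device.
from netmiko import ConnectHandler

device = {
    "device_type": "cisco_ios",
    "host": "192.0.2.10",       # documentation address; swap in your switch
    "username": "admin",        # placeholder credentials
    "password": "changeme",
}

baseline = [
    "vlan 100",
    "name Users",
    "interface range GigabitEthernet1/0/1 - 24",
    "switchport access vlan 100",
]

conn = ConnectHandler(**device)
print(conn.send_config_set(baseline))   # the whole baseline in one call
conn.disconnect()
```

The speed is great, but notice that a script like this will happily push whatever is in that baseline, stale VLANs and deprecated commands included.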

However, even with this faster configuration method, are we still using old, outdated configurations to solve problems? Sure, I don’t have to worry about configuring VLANs on the switch one at a time. But if my configuration still references VLANs that are no longer in the system, it becomes very difficult to keep the newer switches running optimally. And that’s just assuming the configuration is stale. What if we’re still using deprecated commands?
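
Catching that kind of rot is exactly what a small script can do before a config ever reaches an API. A quick sketch, assuming a saved IOS-style config file and a known list of live VLANs (both are placeholders):

```python
# A hedged sketch: flag VLAN references in a saved config that don't
# match the set of VLANs actually in use. Filename and set are made up.
import re

active_vlans = {10, 20, 100}              # VLANs that really exist today

with open("switch01.cfg") as f:           # hypothetical saved config
    config = f.read()

# Grab every VLAN number mentioned anywhere in the config.
referenced = {int(v) for v in re.findall(r"\bvlan (\d+)", config, re.IGNORECASE)}

stale = referenced - active_vlans
if stale:
    print(f"References to VLANs that no longer exist: {sorted(stale)}")
```

Run something like this against every saved config before it gets replayed through an API.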

APIs are great because they won’t support unsupported things. But if we don’t scrub the configuration now and then to remove these old lines and protocols, we’ll quickly find ourselves in a world of trouble, because those outdated and broken things will bring the API to a halt. Yes, the valid commands will still be entered correctly, but if those valid commands rely on something invalid to work properly, you’re going to find things broken very fast.

What makes this whole thing even more irritating is that most configurations need to be validated and approved before being implemented, which makes the whole process even more difficult to manage. Part of the reason old configs live for so long is that they need weeks or months of validation to be implemented effectively. When new platforms or configuration methods crop up, that validation can delay new equipment installation. This sometimes leads to people installing new gear with “approved” configs that might not be the best fit, just to get that new equipment into service.


Tom’s Take

So, how do we fix all this? What’s the trick? Well, it’s really a combination of things. We need to audit configs regularly to keep the old stuff from living on past its expiration date. We also need to continually resubmit new configurations to the approvals process. Just like disaster recovery documentation, configurations are living, breathing documents that should always be kept current. The more current your configurations, the less likely you are to have old cruft keeping your systems running at subpar performance. And the less likely you are to have to worry about breaking new things like APIs and automation systems in the future.


Reclaiming 1.1.1.1 For The Internet

Hopefully by now you’ve seen the announcement that Cloudflare has opened a new DNS service at the address 1.1.1.1. We covered a bit of it on this week’s episode of the Gestalt IT Rundown. Next to Gmail, it’s probably the best April Fools’ announcement I’ve seen. However, it would seem that the Internet isn’t quite ready for a DNS resolver service that’s easy to remember, and that’s thanks in part to the accumulation of bad address hygiene.

Not So Random Numbers

The address range 1/8 is owned by APNIC. They’ve had it for many years now but have never announced it publicly. Nor have they ever made any assignments of addresses in that space to clients or customers. In a world where IPv4 space is at a premium, why would an RIR choose to lose 16 million addresses?

Edit: As pointed out by Dale Carder of ES.net in a comment below, APNIC has been assigning address space out of 1/8 since 2010. However, the most commonly leaked prefixes in that block, which are difficult to assign because of bogus announcements, come from 1.0.0.0/14.

As it turns out, 1/8 is a pretty bad address space for two reasons: 1.1.1.1 and 1.2.3.4. These two addresses are responsible for most of the inadvertent announcements in the entire 1/8 space. 1.2.3.4 is easy to figure out. It’s the most common example IP address given when talking about something. Don’t believe me? Google is your friend. Instead of using 192.0.2.0/24 like we should be, we use the most common PIN, password, and luggage combination in the world. But at least 1.2.3.4 makes sense.

Why is 1.1.1.1 so popular? Well, the first reason is thanks to Airespace wireless controllers. Airespace used 1.1.1.1 as the default virtual interface address for just about everything. Here’s a good explanation from Andrew von Nagy. When Airespace was sold to Cisco, this became a very popular address for Cisco wireless networks. Except now that it’s in use as a DNS resolver, there are issues with keeping it. The wireless experts I’ve talked to recommend changing that address to 192.0.2.1, since that range is reserved for documentation examples and will never be globally routable.

The other place where 1.1.1.1 shows up quite frequently is in Cisco ASA failover interfaces. Cisco documentation recommended using 1.1.1.1 for the primary ASA failover interface and 1.1.1.2 for the secondary. The heartbeats between those two interfaces were active as long as the units were paired together. But if they were active and reachable, any traffic destined for those globally routable addresses would be black holed. Now, ASAs should probably be using 192.0.2.1 and 192.0.2.2 instead. But beware that this will likely require downtime to reconfigure.

The 1.1.1.1 address confusion doesn’t stop there. Some systems, like Nomadix, use it as the default logout address. Vodafone used to use it as an image caching server. ISPs are blocking it upstream in some ACLs. Some security organizations even recommend dropping traffic to 1/8 as a bogon prevention measure. There’s every chance that 1.1.1.1 is going to be dropped by something in your network or along the transit path.
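
If you want to check whether the “junk” addresses in your configs sit in space that is, or could become, globally routable, Python’s ipaddress module makes the test trivial. A small sketch; the configured address list is a made-up example:

```python
# A hedged sketch: flag configured addresses that are globally routable
# instead of living in RFC 5737 documentation or RFC 1918 private space.
import ipaddress

# Reserved-for-documentation ranges that will never be routed (RFC 5737).
doc_ranges = [ipaddress.ip_network(n) for n in
              ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")]

configured = ["1.1.1.1", "1.2.3.4", "192.0.2.1", "10.0.0.1"]  # placeholders

for addr in configured:
    ip = ipaddress.ip_address(addr)
    if any(ip in net for net in doc_ranges) or ip.is_private:
        print(f"{addr}: safe to squat on (documentation or private space)")
    else:
        print(f"{addr}: globally routable -- find it a new home")
```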

Planning Not To Fail

So, how are you going to proceed if you really, really want to use CloudFlare DNS? Well, the first step is to make sure that you don’t have 1.1.1.1 configured anywhere in your network. That means checking WLAN controllers, firewalls, and example configurations. Odds are good you’re running RFC1918 space. But you should try to ping 1.1.1.1 anyway. If you can ping it, then you should traceroute the address. If the traceroute leaves your local network, you probably have a good path.
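
That sanity check is easy to script, too. A quick sketch, assuming a Unix-like host with ping and traceroute on the path (the flags below are Linux-style):

```python
# A hedged sketch: one ping, then a traceroute if the ping succeeds.
import subprocess

TARGET = "1.1.1.1"

# One ICMP echo with a two-second timeout; returncode 0 means a reply came back.
ping = subprocess.run(["ping", "-c", "1", "-W", "2", TARGET],
                      capture_output=True, text=True)
if ping.returncode == 0:
    print(f"{TARGET} answers ping; checking the path...")
    trace = subprocess.run(["traceroute", "-m", "10", TARGET],
                           capture_output=True, text=True)
    print(trace.stdout)  # eyeball this: does it leave your local network?
else:
    print(f"No reply from {TARGET} -- something local may be eating it")
```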

Once you’ve determined that you’re capable of reaching 1.1.1.1, you need to test it. Configure it on a test machine or VM and make sure it’s actually resolving addresses for you. Better safe than sorry. Once you know it’s really working the way you want, configure it on your internal DNS servers as a forwarder. You still want internal control of DNS thanks to things like Active Directory. But configuring it as a forwarder means you can take advantage of all the features Cloudflare is building into the system while still retaining everything you’ve done locally.
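
For the “make sure it’s actually resolving” step, a few lines of Python will do, assuming the dnspython package (2.x) as the test tool:

```python
# A hedged sketch: ask 1.1.1.1 directly for a couple of well-known names.
import dns.resolver  # pip install dnspython

resolver = dns.resolver.Resolver(configure=False)  # ignore local resolv.conf
resolver.nameservers = ["1.1.1.1"]

for name in ("example.com", "cloudflare.com"):
    answer = resolver.resolve(name, "A")
    print(name, "->", [rr.to_text() for rr in answer])
```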


Tom’s Take

I’m glad Cloudflare and APNIC are reclaiming 1.1.1.1 for some useful purpose. Cloudflare can take the traffic load of all the horribly misconfigured systems in the world. APNIC can use this setup to do some analytics work to find out exactly how screwed up things are. I wouldn’t be shocked to see something similar happen to 1.2.3.4 in the future if this bet pays off.

I’ve been using 1.1.1.1 since April 2nd and it works. It’s fast and hasn’t broken yet, which is the best that you can hope for from a DNS server. I’m sure I’ll play around with some of the advanced features as they come online but for now I’m just happy that one of the most recognizable IP addresses in the world is working for me.

Wireless Thoughts From Aruba Atmosphere

I just got back from Aruba Atmosphere this week and I thought it would be a good chance to go over some of the cool stuff that I saw there.

  • Rasa is now Aruba NetInsights. That platform is going to be a big one for Aruba in the future. There’s a lot of information being gleaned from installations, and it’s fueling some hard looks at best practices. It’s also funny that it’s being installed primarily on university campuses to profile coverage and client capabilities. Those are usually pretty hostile environments for users and administrators alike.
  • The security pieces that were shown off were also very interesting. The idea of port profiles has always made me a bit skeptical, but this time I think Aruba has it. It’s also really cool that it can be done with non-managed devices in the middle. I think the key is that Aruba is doing actual traffic profiling instead of just looking at the basics behind the packets, like ports or VLANs. Real, automatic port security could be a huge win for places that need on-the-fly access for rapidly changing conditions. Like, say, university campuses.
  • Live demos rock when your technology is solid enough that you don’t have to worry about it blowing up on stage. We’re past the point of the wireless network blowing up the iPhone 4 for Steve Jobs. In fact, the best part was that the demo was slower because the demo guy had an extra espresso shot and the camera couldn’t hold still!

Look for some interesting discussion from Tech Field Day and stay tuned to hear more about Atmosphere from the delegates!

Is Patching And Tech Support Bad? A Response

Hopefully, you’ve had a chance to watch this seven-minute video from Greg Ferro about why better patching systems can lead to insecure software. If you haven’t, you should.

Greg is right that moral hazard is introduced because, by definition, the party providing the software is “insured” against the risks of the party using the software. But, I also have a couple of issues with some of the things he said about tech support.

Are You Ready For The Enterprise?

I’ve been working with some Ubiquiti access points recently. So far, I really enjoy them and I’m interested to see where the product line is going. After doing some research, the most common complaint about them seems to be the tech support offerings. A couple of Reddit users even posted in a thread that the lack of true enterprise tech support is the one thing keeping Ubiquiti from reaching real enterprise status.

Think about all the products that you’ve used over the last couple of years that offered some other kind of support aside from phone or rapid response. Maybe it was a chat window on the site. Maybe it was an asynchronous email system. Hell, if you’ve ever installed Linux in the past on your own you know the joys of heading to Google to research an issue and start digging into post after post on a long-abandoned discussion forum. DenverCoder9 would agree.

By Greg’s definition, tech support creates a negative incentive to ship good code because the support personnel are there to pick up the pieces from all your coding mistakes. It even incentivizes people to set unrealistic goals and push code out the door to hit milestones. On the flip side, try telling people your software is so good that you don’t need tech support, that it will run great on anything every time and never mess up once. If you don’t get laughed out of the building, you should be in sales.

IT professionals are conditioned to need tech support. Even if they never call the number once, they’re going to want the option. In the VAR world, this is the “throat to choke” mentality. When the CEO comes down to find out why the network is down or the applications aren’t working, they’re usually not interested in the solution. They want to yell at someone to feel like they’ve managed the problem. Yes, tech support exists to find the cause of the problem in the long run. But it’s also a safety blanket for IT professionals, someone to yell at after the CEO gets to them. Tech support is necessary. Not because of buggy code, but because of peace of mind.

How Complex Are You?

Greg’s other negative aspect of tech support is that it encourages companies to create solutions more complex than they need to be, because someone will always be there to help you through them. Having done nationwide tech support for a long time, I can tell you that most of the time the complexity is there for a reason. Either you expose it to the user or you hide it behind a pretty veneer and risk having people tell you that your product is “too simple”.

A good example case here is Meraki. Meraki is a well-known SMB/SME networking solution. Many businesses run on Meraki and do it well. Consultants sell Meraki by the boatload. You know who doesn’t like Meraki? Tinkerers. People who like to understand systems. The people who stay logged into configuration dashboards all day long.

When Meraki started building their solution, they decided they weren’t going to expose advanced functionality to the tinkerers. They either buried it, set to do the job it was designed to do, or left it out completely. That works for the majority of their customer base. But for those that feel they need more control of the solution, it doesn’t work at all. Try proposing a Meraki solution to a group of die-hard traditionalist IT professionals in a large enterprise scenario. Odds are good you’re going to get slammed and then shown out of the building.

I too have my issues with Meraki and their ability to be a large enterprise solution. It comes down to choosing to leave out features for the sake of simplicity. When I was testing my Ubiquiti solution, I quickly found the Advanced Configuration setting and enabled it. I knew I would never need to touch any of those things, yet it made me feel more comfortable knowing I could if I needed to.

Making a solution overly complex isn’t a design decision. Sometimes the technology dictates that the solution needs to be complex. Yet, we knock solutions that require more than 5 minutes to deploy. We laud companies that hide complexity and claim their solutions are simple and easy. And when they all break we get angry because, as tinkerers, we can’t go in and do our best to fix them.


Tom’s Take

In a world where software is king, quality matters. But so do speed and responsiveness. You can either have a robust patching system that gives you the ability to find and repair unforeseen bugs quickly, or you can have what I call the “Blizzard Model”. Blizzard is a gaming company famous for shipping things “when they’re done” and not a moment before. If that means a project is 2-3 years late, then so be it. When you’re a game company, you want your product to be perfect out of the box and you’re willing to forgo revenue to get it right. When you’re a software company expected to introduce new features on a regular cadence and keep the customers happy, it’s a bit more difficult.

Perhaps creating a patching system incentivizes people to do bad things. Or perhaps it incentivizes them to try new things and figure out new ideas without the fear that one misstep will doom the product. I guess it comes down to whether you believe people are inherently going to game the system.

When Redundancy Strikes

Networking and systems professionals preach the value of redundancy. When we tell people to buy something, we really mean “buy two”. And when we say to buy two, we really mean buy four of them. We try to create backup routes, redundant failover paths, and we keep things from being used in a way that creates a single point of disaster. But, what happens when something we’ve worked hard to set up causes us grief?

Built To Survive

The first problem I ran into was one I knew how to solve. I was installing a new Ubiquiti Security Gateway. I knew that as soon as I pulled my old edge router out, I was going to need to reset my cable modem in order to clear the ARP cache. That’s always a thing that needs to happen when you’re installing new equipment. Having done this many times, I knew the shortcut method was to unplug my cable modem for a minute and plug it back in.

What I didn’t know this time was that the little redundant gremlin living in my cable modem was going to give me fits. After fifteen minutes of not getting the system to come back up the way I wanted, I decided to unplug the modem from the wall instead of from the back of the unit. That put the lights on the front in my line of sight, and that’s when I saw that they never went out when the modem was unplugged.

Turns out that my modem has a battery pack installed, since it’s also a VoIP router for my home phone system. That battery pack was designed to run the phones in the house for a few minutes in a failover scenario. But it also meant the modem wasn’t letting go of the cached ARP entries. So all my efforts to make my modem take the new firewall were being stymied by the battery designed to keep my phone system redundant in case of a power outage.

The second issue came when I went to turn up a new Ubiquiti access point. I disconnected the old Meraki AP in my office and started mounting the bracket for the new AP. I had already warned my daughter that the Internet was going to go down. I also thought I might have to reprogram her device to use the new SSID I was creating. Imagine my surprise when both my laptop and her iPad were working just fine while I was hooking the new AP up.

Turns out, both devices did exactly what they were supposed to do. They connected to the other Meraki AP in the house and used it while the old one was offline. Once the new Ubiquiti AP came up, I had to go upstairs and unplug the Meraki to fail everything back to the new AP. It took some more programming to get everything running the way I wanted, but my wireless devices had done the job they were supposed to do: they failed over to the SSID they could see and kept running until that SSID failed as well.

Finding Failure Fast

When you’re trying to troubleshoot around a problem, you need to take redundancy into account as well. I’ve faced a few problems where trying to induce a failure or remove a configuration was complicated by some other part of the network or system “replacing” my hard work with a backup copy. Or where I was trying to figure out why packets were flowing around a trouble spot, or not being inspected by a security device, only to find that the path they were taking was through a redundant device somewhere else in the network.

Redundancy is a good thing. Until it causes issues. Or until it makes your network behave unpredictably. Most of the time, this can all be mitigated by good documentation practices. Being able to quickly figure out where the redundant paths in a network go is critical to diagnosing intermittent failures.

It’s not always as easy as pulling up a routing table, either. If the entire core is down, you could be seeing traffic routed at the edge with no way of knowing whether the redundant supervisors in the chassis are doing their job. You need to write everything down and know what hardware you’re dealing with. You need to document redundant power supplies, redundant management modules, and redundant switches so you can isolate problems and fix them without pulling your hair out.


Tom’s Take

I rarely got to work with redundant equipment when I was installing it through E-Rate. The government doesn’t believe in buying two things to do the job of one. So, when I did get the opportunity to work with redundant configurations, I usually found myself trying to figure out why things weren’t failing in the ways I predicted. After a while, I realized that I needed to start making my own notes and doing some investigation before I actually started troubleshooting. And even then, like my cable modem’s battery, I ran into issues. Redundancy keeps you from shooting yourself in the foot. But it can also make you stab yourself in the eye in frustration.

Cisco Live CAE and Guest Keynote Announcements

As you may have heard by now, there have been a few exciting announcements from Cisco Live 2018 regarding the venue for the customer appreciation event and the closing keynote speakers.

Across The Universe

The first big announcement is the venue for the CAE. When you’re in Orlando, there are really only two options for the CAE: you either go to the House of the Mouse or you go to Universal Studios. The last two times Cisco Live went to Orlando, the party was at Universal, and 2018 marks the third time!

Cisco is going big this year. They’ve rented the ENTIRE Universal Studios park. Not just the backlot. Not just the side parks. The WHOLE thing. You can get your fix on the Transformers ride, visit Harry Potter, or partake of some of the other attractions as well. It’s a huge park with a lot of room for people to spread out and enjoy the scenery.

That’s not all. The wristband that gets you into the CAE also gets you access to Islands of Adventure before the full park opens! You can pregame the party by hanging out at Hogwarts, going to Jurassic Park, or joining your favorite superheroes for a picture or two for the kids. Access to Islands of Adventure isn’t exclusive, so you’ll be there with all the other tourists from around the world but it’s a great place to hang out before the party gets going!

Note that this year you will need the new Imagine pass or the Party Pass Add-on in order to access the CAE. There is no standalone social pass option or social add-on for conference passes.

Welcome To The Future

The closing keynote speakers have also been announced. Dr. Michio Kaku and Amy Webb will be on stage talking about the future of technology and how it will be impacting our society. Given the keynote that Rowan Trollope delivered during Cisco Live Barcelona, this comes as no surprise to me.

Cisco is very much trying to show that they are getting back on the leading edge of technology and driving innovation in the market. The problem with being the “800lb gorilla” is that you’re also big and difficult to move. IBM faced the same problem before they shed their legacy businesses and became a leaner, more future-focused company. Others that tried to follow in their footsteps were less successful and either split apart or got scooped up in mergers.

Cisco is going through a transition period after the departure of John Chambers. Chuck Robbins is turning the ship as quickly as possible, but there need to be more outward signs that the company is looking toward a future where hardware isn’t as important as the innovation happening in software. By bringing in two of the most well-known futurists in science and technology, Cisco is sending a signal to their audience of users and investors that the focus is going to be on emerging technology. It’s a bit of a gamble for Cisco, but one they hope pays off.

Note that there are also going to be other speakers in the Big Ideas Theater on the World of Solutions floor during the event. Access to the World of Solutions is restricted behind the new Imagine pass or full conference pass. There is no Social Pass option, and the party pass add-on does not grant access to the World of Solutions floor.


Tom’s Take

The Cisco Live CAE in Orlando is pretty much a known thing. It’s nice to see all of Universal this year with access to the new attractions at Islands of Adventure. People should be able to enjoy being outside in the Florida humidity instead of the blistering Las Vegas inferno. As well, the rides are going to be fun for a large number of the attendees.

It’s also good to see future-looking keynote speakers that are going to give their viewpoints on things that will impact our lives. With two speakers, I’m expecting another “interview” style closing keynote, which isn’t quite my favorite. But this is a step in the right direction. Here’s hoping that these additions to the event make Cisco Live a great show for those that will be attending.

Memcached DDoS – There’s Still Time to Save Your Mind

In case you haven’t heard, there’s a new vector for Distributed Denial of Service (DDoS) attacks out there right now and it’s pretty massive. The first mention I saw this week was from Cloudflare, where they detailed a huge influx of traffic from UDP port 11211. That’s the port used by memcached, a high-performance memory caching system.

Surprisingly, or not, there were thousands of companies that had left UDP/11211 open to the entire Internet. And, by design, memcached responds to anyone that queries that port. Worse, carefully crafted packets can elicit massive responses: in Cloudflare’s testing, they were able to send a 15-byte packet and get a 134KB response. Because the protocol runs over UDP, it will happily answer packets with forged source addresses, which is exactly how attackers made life miserable for Cloudflare and, now, GitHub, which got blasted with the largest DDoS attack on record.

How can you fix this problem in your network? There are many steps you can take, whether you are a system admin or a network admin:

  • Go to Shodan and see if you’re affected. Just plug in your company’s IP address ranges and have it search for UDP 11211. If you pop up, you need to find out why memcached is exposed to the internet. (If you’d rather check for yourself, see the probe sketch after this list.)
  • If memcached isn’t supposed to be publicly available, you need to block it at the edge. Don’t let anyone connect to UDP port 11211 on any device inside your network from outside of it. That sounds like a no-brainer, but you’d be surprised how many firewall rules aren’t carefully crafted in that way.
  • If you have to have memcached exposed, make sure you talk to that team and find out what their bandwidth requirements are for the application. If it’s something small-ish, create a policer or QoS policy that rate limits the memcached traffic so there’s no way it can exceed that amount. And if that amount is more than 100Mbit/s of traffic, you need to have an entirely different discussion with your developers.
  • From Cloudflare’s blog, you can disable UDP in memcached at startup by adding the -U 0 flag. Make sure you check with the team that uses it before you disable it, though, so you don’t break something.
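
The amplification is easy to demonstrate against a host you own. Memcached’s UDP frames start with an eight-byte header, and a tiny “stats” request to an exposed server comes back with far more data than you sent. A hedged sketch (the target address is a placeholder, and you should only ever probe your own systems):

```python
# A hedged sketch: send memcached's "stats" command over UDP and compare
# the size of what comes back. Only point this at hosts you own.
import socket

HOST = "198.51.100.25"  # placeholder -- substitute a server of YOURS
PORT = 11211

# memcached UDP frame header: request ID, sequence number, total
# datagrams, reserved (two bytes each), followed by the ASCII command.
probe = b"\x00\x01\x00\x00\x00\x01\x00\x00stats\r\n"

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2.0)
sock.sendto(probe, (HOST, PORT))
try:
    data, _ = sock.recvfrom(65535)
    print(f"Sent {len(probe)} bytes, got {len(data)} back -- UDP is exposed!")
except socket.timeout:
    print("No response -- UDP 11211 looks closed or filtered")
finally:
    sock.close()
```

If the response dwarfs the probe, you’ve found your amplification factor.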

Tom’s Take

Exposing unnecessary services to the Internet is asking for trouble. Given an infinite amount of time, a thousand monkeys on typewriters will create a Shakespearean play that details how to exploit that service for a massive DDoS attack. Protocols are built to be helpful, and that helpfulness doesn’t make our jobs any easier: they respond to what they hear and deliver what they’re asked. We have to prevent bad actors from getting away with things at the network and system level, because application developers rarely ask “may I” before turning on every feature to make users happy.

Make sure you check your memcached settings today and immunize yourself against this problem. If GitHub got blasted with 1.3Tbps of traffic this week, there’s no telling who’s going to get hit next.