The Death of TRILL



Networking has come a long way in the last few years. We’ve realized that hardware and ASICs aren’t the constant we can rely on when making decisions for the next three to five years. We’ve thrown in with software and the quick development cycles that allow us to iterate and roll out new features weekly or even daily. But the hardware versus software battle has played out a little differently than we all expected. And the primary casualty of that battle was TRILL.

Symbiotic Relationship

Transparent Interconnection of Lots of Links (TRILL) was proposed as a solution to the complexity of spanning tree. Radia Perlman realized that her bridging loop solution wouldn’t scale in modern networks. So she worked with the IETF to solve the problem with TRILL. We also received Shortest Path Bridging (SPB) from the IEEE along the way as an alternative solution to the layer 2 issues with spanning tree. The motive was sound, but the industry has rejected the premise entirely.

Large layer 2 networks have all kinds of issues. ARP traffic, broadcast amplification, and numerous other problems plague layer 2 when it tries to scale to multiple hundreds or a few thousand nodes. The general rule of thumb is that a layer 2 broadcast domain should never grow larger than 250-500 nodes lest problems start occurring. And in theory that works rather well. But in practice we have issues at the software level.

Applications are inherently complicated. Software written in the pre-Netflix era of public cloud adoption doesn’t like it when the underlay changes. So things like IP addresses and ARP entries were assumed to be static. If those data points change, you have chaos in the software. That’s why we have vMotion.

At the core, vMotion is a way for software to mitigate hardware instability. As I outlined previously, we’ve been fixing hardware with software for a while now. vMotion could ensure that applications behaved properly when they needed to be moved to a different server or even a different data center. But it also required the network to be flat to overcome limitations in things like ARP and IP. And so we went on a merry journey of making data centers as flat as possible.

The problem came when we realized that data centers could only be so flat before they collapsed in on themselves. ARP and spanning tree limited the amount of traffic in layer 2, and those limits were impossible to overcome. Loops had to be prevented, yet the simplest solution disabled the very bandwidth needed to make things run smoothly. That caused the IEEE and IETF to come up with their layer 2 solutions that used IS-IS, the CLNS link-state protocol, to solve loops. And it was a great idea in theory.

The Joining

In reality, hardware can’t be spun that fast. TRILL was used as a reference platform for proprietary protocols like FabricPath and VCS. All the important things were there but they were locked into hardware that couldn’t be easily integrated into other solutions. We found ourselves solving problem after problem in hardware.

Users became fed up. They started exploring other options. They finally decided that hardware wasn’t the answer. And so they looked to software. And that’s where we started seeing the emergence of overlay networking. Protocols like VXLAN and NV-GRE emerged to tunnel layer 2 packets over layer 3 networks. As Ivan Pepelnjak is fond of saying, layer 3 transport solves all of the issues with scaling. And even the most unruly application behaves when it thinks everything is running on layer 2.

Protocols like VXLAN solved an immediate need. They removed limitations in hardware. Tunnels and fabrics used novel software approaches to solve insurmountable hardware problems. An elegant solution for a thorny problem. Now, instead of waiting for a new hardware spin to fix scaling issues, customers could deploy solutions to fix the issues inherent in hardware on their own schedule.
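To make the encapsulation concrete, here is a minimal Python sketch of the VXLAN framing defined in RFC 7348: an 8-byte header carrying a 24-bit VNI is prepended to the original Ethernet frame, and the result rides inside an ordinary UDP datagram (destination port 4789) across the routed underlay. The VNI value and inner frame bytes below are hypothetical, and this is only an illustration of the packet format, not any vendor’s implementation.

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned UDP destination port for VXLAN (RFC 7348)

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header from RFC 7348.

    Layout: 8 flag bits (only the I bit set, marking the VNI as valid),
    24 reserved bits, a 24-bit VNI, and a final 8 reserved bits.
    """
    flags = 0x08                    # I flag set, all other flag bits zero
    word1 = flags << 24             # flags in the top byte, reserved bits zero
    word2 = (vni & 0xFFFFFF) << 8   # VNI in the top 24 bits, reserved byte zero
    return struct.pack("!II", word1, word2)

def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend a VXLAN header to an inner L2 frame.

    The result is what a VTEP would place in the payload of a UDP datagram
    sent to port 4789 over the routed layer 3 underlay.
    """
    return vxlan_header(vni) + inner_frame

if __name__ == "__main__":
    # Hypothetical, truncated inner Ethernet frame and VNI, purely for illustration.
    inner = bytes.fromhex("ffffffffffff001122334455" + "0806")
    print(encapsulate(inner, vni=5000).hex())
```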

This is the moment where software defined networking (SDN) took hold of the market. Not when words like automation and orchestration started being thrown about. No, SDN became a real thing when it enabled customers to solve problems without buying more physical devices.


Tom’s Take

Looking back, we realize now that building large layer 2 networks wasn’t the best idea. Given the number of providers and end users running BGP all the way to their top-of-rack (ToR) switches, it’s clear that layer 3 scales much better. It took us too long to figure out that the best solution to a problem sometimes takes a bit of thought to implement.

Virtualization is always going to be limited by the infrastructure it’s running on. Applications are only as smart as the programmer. But we’ve reached the point where developers aren’t counting on having access to layer 2 protocols that solve stupid decision making. Instead, we have to understand that the most resilient way to fix problems is in the software. Whether that’s VXLAN, NV-GRE, or a real dev team not relying on the network to solve bad design decisions.

17 thoughts on “The Death of TRILL”

  1. An interesting and related discussion would be about the relationship between the server silo (including server virtualization) and the network silo (including the access control and multitenancy provided by modern networks).

    The real tug of war over VXLAN had to do with which silo controls what aspect of access control and security, in particular at the point where a VM meets a vSwitch. By choosing an L2-over-L3 overlay network to enable clean vMotion, VXLAN enabled a server virtualization centric solution to (at least parts of) the access control and multitenancy problems. The need for large L2 was eliminated as a consequence of the L3 based design. TRILL and SPB/PBB were just collateral damage.

    This tug of war is not over. Cisco/Insieme built a network-centric design using VXLAN. Calico (and its evolution this week) is a server centric design built over a dumb L3 network. Facebook’s Open/R this week, while on the surface unrelated, allows aspects of a distributed application to be assimilated into the network’s L3 routing.

    The long term answer, of course, is dissolving the boundary between the server and the network, so that instead of competing for control of various aspects of the solution (and creating all sorts of unnecessary complexity at the boundary between server and network), there is one seamless solution. (No, Mr. IPTables, you cannot do it all.) (No, Mr. network control plane, you cannot do it all, either.) (No, buying separate server and network technologies which come from two different groups of people in the same company isn’t what I’m talking about here.) Will be interesting to see what emerges, erasing these silo lines, over the next decade or two.

  2. One other point — part of the reason is that TRILL tried to boil the ocean, rather than taking on a small set of problems and building up over time as other requirements were proven to be “real.” This is becoming a common event in standards, for various reasons…

  3. I agree with the statement that TRILL is dead. No vendor has picked up TRILL and made it a standard design for their products, so there has not been much development and innovation for TRILL lately.
    I disagree with the same statement for SPB. SPB has evolved and, besides L2 scalability, has addressed many other problems. Most of the benefits that are addressed by an overlay solution like VXLAN are also natively integrated into SPB. Here are some examples: in an SPB fabric you see each virtual network as a service, no matter whether it is a layer 2 or layer 3 service; distributed layer 3 gateways; multicast routing; the access layer as the only configuration point; etc.
    I see value in having an elegant way to achieve an automated network. Of course that can also be done with an overlay approach, but you will always have an extra level of complexity. If you can achieve the same result without an extra overlay, it also helps with troubleshooting.
    And at the end of the day, even the best overlay network still depends on a reliable underlay. All the software defined magic is still running on hardware.

    • Biggest problem there is that ONLY the big players have implemented SPB.

      There is no solution for anyone trying to solve similar problems down at the IoT/M2M level unless it involves using Avaya or similar…

      At least with TRILL, I can TRY to implement it on an embedded Linux system.

      Everybody and his damn dog is claiming it’s ‘dead’ and the like…and it’s applicable to MUCH more than enterprise systems, much like SPB is…but NOBODY’s DONE IT down at any real level except for the class you lot are describing.

      Simply put…it’s not a damn answer if I can’t friggin’ use it. I don’t give a tinker’s damn if it’s a “standard” or not.

      • SPBm does address everything Ethernet at the L2 and L3 level, unicast and multicast. You do not need to implement it in the OS. I had the same questions as well, and after talking with some of the SPB engineers, I agree it’s an extra layer that is not needed. Do you have a specific reason you want to extend an ISID to the OS?

        SPB is in data centers and the base is growing every day. There are already second and third generations of this networking out there as well.

        One interesting story is that those that put it in never take it out, specifically with multicast networks. Nothing else works in a converged environment.

        • “SPBm does address everything Ethernet at the L2 and L3 level, unicast and multicast.” You’re damned right it does. It also solves many multicast issues that are related not to the switches but to the devices hanging off the switches. For example, several IP CCTV manufacturers are still producing IP-based cameras, encoders, etc. that do not meet the complete multicast standard, only part of it. Avaya/Extreme’s implementation is terrific, as it solved the problems with the manufacturers I have had to deal with. SPBm is a success story for me.
          And we are fully converged. Damned shame Juniper and Cisco did not jump on the SPBm bandwagon.

          Just stating my experience, nothing more.

          I believe one of the Avaya sales guys (take this with a grain of salt) told me they hit 35K streams on the VSP series gear, and the increase was 2 percentage points more.

  4. Pingback: BGP: The Application Networking Dream | The Networking Nerd

  5. Pingback: Worth Reading: The death of TRILL - 'net work

  6. I have to agree with Dominik here, heavily in fact. I will go even further, though: SPBm (NOT v) actually solves a HUGE network problem, multicast. SPBm can handle many-to-many multicast traffic on the order of 15K-plus streams on very little hardware with very little CPU overhead. In fact, in testing they have not found a limit yet, but I think 15K is a good enough number in my book. Multicast is a great idea that normally cannot be implemented easily by any vendor, and even when it is finally implemented, it is usually fragile. SPBm can support it even in a multi-tenant environment SECURELY. SPBm was designed by people who made MPLS but decided they wanted to fix the loop issues and configuration mess.

    If someone is interested in SPB, give us a call, 844-993-7726, x710 to reach me.

      • You have to write a book to post here? 😉 I only find interest growing more every day. In fact, someone reading here called me asking for more information. I lost their contact information, but I am posting here what I meant to send to him.

        This might provide you with a good base for SPB; there are other links you can follow from here too. IEEE 802.1aq is the standard.

        Ed Koehler posts many videos on SPB and various topics surrounding SPB:
        https://www.youtube.com/channel/UCn8AhOZU3ZFQI-YWwUUWSJQ

        There are many things I didn’t even get to, like IP Shortcuts, something VMware has been trying to reinvent with NSX and their API. This allows dynamic edge routing so traffic never enters the core.

        Also, the core switches never see the edge MACs; those are only known on the edge switch.

        -Walt

  7. Dominik and Walter are right about SPB-M.
    It’s a “MacGyver” protocol: it takes the common standards and makes something better…
    No STP
    No 4K VLAN limit
    No PIM, multicast out of the box
    MPLS/VPLS-style multi-tenancy features
    Simple configuration, edge only

    The only problem is that the major vendors pushing the protocol are not Cisco or Juniper…
    I get it, for them PBB/SPB-M on enterprise hardware is bad for the carrier business…

  8. 2018 has come. Is TRILL (FP, VCS) dead? It seems like the small shops with only a few hundred nodes / a handful of access switches were experiencing the biggest benefits. FabricPath’s greatest selling point is that it is EASY to configure & manage. It was truly plug-n-play, and it scaled fairly well. I have clients abandoning FabricPath in favor of VXLAN w/EVPN, not because it’s better, but because they have lost vendor support on L2 fabrics.

  9. Pingback: Does SPB Mean "Secure Path Bridging"? - Gestalt IT

  10. Pingback: The importance of Calico's pluggable data plane - Tigera

  11. Pingback: Sự tiến hóa của kiến trúc mạng doanh nghiệp & Datacenter (Phần 2) - Myon.edu.vn
