Distributed Router Fabrics

Every port has two costs associated with it: the port itself and the optical pluggable. Historically the ratio was even bigger than 10:1, but the optics people have managed to preserve their margins better.

I have not tried to calculate it for a couple of years, but back then the cost of a router port was comparable to a single-mode pluggable; multi-mode was cheaper, and copper cheaper still.

(It was funny to see the discount being demanded from the networking vendor while the biggest payment went to the optics vendor, especially for switches.)

Two hops through a Clos architecture means 4 ports and 4 pluggables; all of them are typically (in telco) single-mode, for unification.

In the case of a chassis, the crossbar is electrical and costs very little.

Actually, a chassis-based router also has 4 ports on the traffic path, but single-mode pluggables sit only on the external ports; the internal ports are cost-comparable to copper pluggables.

Hence, the chassis-based router has a natural advantage: less spend on SFPs/QSFPs/etc. Strictly speaking, pluggables are not the networking vendor's business, especially for new high-speed interfaces.
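As a rough illustration of that pluggable-cost gap, here is a minimal back-of-the-envelope sketch. Every price below is a made-up placeholder, not a number from this thread or any vendor; plug in your own quotes.

```python
# Back-of-the-envelope sketch of the pluggable-cost argument above.
# Every price is a hypothetical placeholder; substitute your own quotes.

PRICE = {
    "router_port": 500,            # hypothetical cost of the port/silicon itself
    "single_mode_pluggable": 400,  # hypothetical single-mode optic
    "copper_equivalent": 40,       # hypothetical DAC/copper-class interconnect
}

def clos_two_hop_path() -> int:
    # Two hops through a Clos fabric: 4 ports and 4 single-mode pluggables.
    return 4 * PRICE["router_port"] + 4 * PRICE["single_mode_pluggable"]

def chassis_path() -> int:
    # Also 4 ports on the traffic path, but single-mode optics only on the
    # 2 external ports; the internal hops cross an electrical crossbar and
    # are priced like copper.
    return (4 * PRICE["router_port"]
            + 2 * PRICE["single_mode_pluggable"]
            + 2 * PRICE["copper_equivalent"])

print("Clos path cost:   ", clos_two_hop_path())   # 3600 with these placeholders
print("Chassis path cost:", chassis_path())        # 2880 with these placeholders
```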

The actual situation, that a pizza box (even from a respected vendor, with all features available) is cheaper per port than a chassis-based router, is attributable to non-technical reasons, primarily stronger competition.

Bear in mind that a “modern switch” is, in effect, just packaging for “optical pluggables”.

Hence, DC people come up with very strange designs (with a lot of compromises, typically oversubscription) to downgrade multi-mode pluggables to copper ones, and then claim it as a big achievement. That is where the money is.

PS: I agree (with Tom) that the feature list is important even for the pizza box.

Eduard

“what benefits is OP seeing here when it comes to pizzabox”

I’m more learning and questioning than stating. I’ve thoroughly enjoyed the thread.

One of the main advantages I saw from the outset was that I could start with a single box and then grow if needed. Other than recabling (if not planned for accordingly), it seems like I can still do that. You incur an increased cost once you have to add a fabric box, but by then you've already reached some amount of scale. With a chassis system, you carry the larger cost up front, before you even know how you're going to scale. It's more difficult to plan what size of solution to buy, and no matter what you do, you'll probably pick the wrong one.

“Differences in ASICs, buffers, etc can really create traffic problems if you mix wrong”

This is why I wanted to create this thread. In the information I've read so far, the marketing speak was more or less "it just works, move on". It's good to learn that there are caveats that need to be explored.

“makes your head hurt how much overcomplication”

Aren’t there memes about Silicon Valley re-inventing things we already have in a more complicated and cumbersome way?

It has been an interesting discussion. Always willing to see what others are doing in this space and evaluate it against what we're doing / thinking about going forward.

We were just quoted 250k+ for a Cisco ASR9902 with one route processor card (list, not our price). It can take 2 RPs, but they don't talk to each other; who thought that was a good idea? We were just looking for something that could take a couple of full tables and had 2-4 100G connections.

We're still using ASR920s (the 12 x 10G port variant, rather uncommon) at most of our PoPs, but have seen the writing on the wall that 10G will not be enough at some point and 100G will be necessary.

I did have an interesting conversation with Ribbon (we have a C15 phone switch) about their Neptune platform. Surprisingly affordable when you look at what Cisco is charging. Though they didn't have as many 100G interfaces as I'd like; I can't see using them as a BGP-speaking router, but for internal transport stuff it was definitely attractive.

Again, always nice to see what others are using / considering for similar stuff. Too often all we hear about are the 'really big guys' and how they're deploying X for 400G now, etc.

Shawn

All major vendors espouse both chassis and fabrics, depending on what you are doing. I'm typically more of a fan of fabric-based models, but as others mentioned, it depends on what you are doing. When I say fabric, I mean something using Ethernet and a standard control plane, not the proprietary interconnects and fabric encapsulation you see in some BRCM-based solutions. Those are basically virtual/multi-chassis systems, which also have their own pros and cons. What vendor/box you choose is mostly dependent on the feature set you need.

There is a giant list of pros and cons between traditional chassis and distributed “fabrics” but here are a few.

Management and control plane scale can be an issue until that gets figured out. Doing a 1:1 replacement of traditional large chassis with fabrics can add a lot of routers to a network.

Power depends on the chassis and power distribution design. However, you can build a fabric as you need it; it doesn't require all the power on day one the way a traditional chassis does.

Upgrading chassis switch fabrics and moving/mixing generations of line cards is almost always a painful experience.

Phil

>From: NANOG nanog-bounces+tony=wicks.co.nz@nanog.org On Behalf Of Shawn L via NANOG
>Sent: Wednesday, 25 December 2024 9:07 am
>To: NANOG nanog@nanog.org
>Subject: Re: Distributed Router Fabrics

> We were just quoted 250k+ for a Cisco ASR9902 with one route processor card (list, not our price). It can take 2 RPs, but they don't talk to each other; who thought that was a good idea? We were just looking for something that could take a couple of full tables and had 2-4 100G connections.

A Nokia 7750 SR-2s would do what you want at a much better price point, I would suggest.

> Power depends on the chassis and power distribution design. However, you can build a fabric as you need it; it doesn't require all the power on day one the way a traditional chassis does.

Power is a huge part of the equation that I think many people overlook. When you look at what a really big chassis takes in terms of power feeds, it's not uncommon to need relatively specialized 3-phase 240 V feeds for the very-high-end chassis boxes that give you the same kind of high-speed port densities a folded-Clos fabric of pizza boxes can yield.
(Not to pick on any vendor, but here's an example of the types of power feeds a large chassis can require:)

"AC Power Distribution Modules (PDMs)

The (REDACTED MODEL #) supports connection of a single-phase or three-phase (delta or wye) AC PDM. Four AC PDM models are available: three-phase delta, three-phase wye, seven-feed single-phase, and nine-feed single-phase.

- Each three-phase AC PDM requires two three-phase feeds to be connected. Each phase from each of the two feeds is distributed among one or two PSMs. One feed has each phase going to two PSMs, and the other feed has each phase going to a single PSM.

- The single-phase AC PDM provides an AC input connection from the single-phase AC power source, and also provides an input power interface to the PSM through a system power midplane. The single-phase AC PDMs accept seven or nine AC power cords from a single-phase AC source.

- Each AC input is independent and feeds one PSM. Up to nine PSMs can be connected through the AC PDM."

Generally speaking, you're getting a licensed electrician to run three-phase power feeds for them; you're not going to just ask the colo provider for a couple more outlets. The chassis listed above takes 4 power distribution modules, each with two 3-phase AC feeds, for a total of 8 3-phase AC connections: 4 primary and 4 secondary. That's a lot of custom electrical work to feed your chassis, not to mention 2 x 12 kW of provisioned power.

By comparison, you can get a similar amount of port density and fabric throughput with a folded-Clos design using 1RU 24-port 400G rack switches, each of which requires a redundant 10 A, 120 V power feed; absolutely standard, and your normal rack PDU handles them quite well. Start with the 12 switches for your spine, add leaf switches as needed to scale up, and by the time you've hit the same leaf port density as the big chassis box, you've provisioned only about half the power of the big chassis box. Over time, that difference in provisioned power can make a huge difference in operational costs.

At this point, I'd be hard-pressed to find a reason to recommend a big single-chassis solution to anyone other than an enterprise customer that wants to outsource most of its network support needs to a vendor. In that model, yes, the one-big-chassis approach can make sense. But for everyone else, it's seriously time to look at the scalability and operational cost benefits of clustered pizza boxes.

Thanks!

Matt

Thank you for sharing that, Mike; however, I was curious about the specific case of LACP the OP stated.

This and most of the differences are implementation details, not fundamental.

Even if we assume some fantastical world where you can make anything appear out of thin air at no cost, no one would use the same interfaces to connect customers and to build the interconnect between chips, because the SerDes fabric interfaces are lower in power, pin count, thermals, and cost than real customer-facing ports, and you can have a higher density of them.

At a very high and impractical level, that's the difference: how do you interconnect chips? Do you use some specialised solution that knows both ends are going to be chips in the same rack, or do you use a generic solution that makes no assumptions about what will be connected on the other end?

In practice, people build these stacks of switches because they want right-sized platforms, and no vendor they care to deploy offers quite the right size. So the right-sized stack ends up being commercially more viable and more energy-efficient, because the right box for the application has no commercial availability.

Luckily for most of us, these problems do not matter, as chip densities are front-running even most hyperscalers. When Amazon presented their stack-of-switches solution a couple of years ago, giving the densities and how many front-facing ports were 'wasted' on internal interconnect, vendors were already shipping single-chip solutions matching the stack-of-switches' non-interconnect port densities; i.e., that hyperscaler solution could already have been a single-chip device, not a stack of switches, not a fabric box. And this is true for almost every buyer in the market: you need just a single-chip box today, densities are absurd for almost everyone.

I don’t find this explained in any of the literature I’ve looked at so far.

In a distributed fabric, where is the traditional control plane run? Say I've got 100 BGP sessions of upstream, peer, and downstream across ten routers. Is each pizza box grinding this out on its own, or is the work done on the x86 box mentioned for the larger installations? If each box is doing it on its own, are there route reflectors somewhere making all of the decisions?

one way to think of it is that each pizza box (customer-facing ports) recognizes control-plane messages (e.g. TCP port 179) and "punts" them to the control-plane box, aka the routing engine.

randy

fwiw, that is pretty much what line cards on a big-box fabric do: punt to the RE.

randy
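For what it's worth, here is a minimal sketch of that punt idea. It is purely conceptual, not any vendor's implementation; a real line card or pizza box does this with ACL/TCAM rules in hardware, and punts far more than just BGP.

```python
# A conceptual sketch of the "punt" path described above: match control-plane
# traffic (here just BGP on TCP port 179 plus OSPF as examples) and hand it
# to the routing engine instead of forwarding it in hardware.

BGP_PORT = 179

def is_control_plane(pkt: dict) -> bool:
    """Very rough classifier; a real punt path is an ACL/TCAM rule set."""
    if pkt.get("proto") == "tcp" and BGP_PORT in (pkt.get("sport"), pkt.get("dport")):
        return True                      # BGP session traffic
    if pkt.get("proto") == "ospf":       # IP protocol 89, also punted
        return True
    return False

def forward(pkt: dict) -> str:
    if is_control_plane(pkt):
        # In a distributed fabric this gets encapsulated toward the remote
        # RE / x86 box rather than a local CPU.
        return "punt to routing engine"
    return "forward in hardware"

# Example: an inbound BGP packet gets punted, ordinary transit does not.
print(forward({"proto": "tcp", "sport": 54100, "dport": 179}))
print(forward({"proto": "tcp", "sport": 443, "dport": 55000}))
```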

> one way to think of it is that each pizza box (customer-facing ports) recognizes control-plane messages (e.g. TCP port 179) and "punts" them to the control-plane box, aka the routing engine.

This is similar to the way I think about it.

In a router (or switch) with a bunch of line cards (1 or more), there is a set of match rules (ACLs, if you will) which match traffic bound for the control plane and forward it via a management path to the control-plane CPU; conveniently, this is also where you implement your control-plane protection. If you substitute "Ethernet switch" for "line card", you are broadly in a similar place conceptually, except that the main control-plane processor is at some remove from the switch that is now a line card. At some point you need to encapsulate the control-plane messages, because the management next hop is remote rather than local and the neighbor is also a switch.

For me, the realization a decade ago was that enclosing a large number of ASICs in sheet metal with the associated midplane and glue logic was a higher capital risk and reduced the size of the addressable market versus smaller switches which could be assembled piecemeal into a larger switch. The white-box vendors and the ODMs are loath to build something with limited market addressability, so over time it becomes more attractive to build large assemblies of boxes, leveraging the atom of switching that a 1/2RU pizza box can enclose and simply buying more of them, rather than building large boxes. That said, the maximum radix of a single large switch ASIC right now is 512 x 100G ports, so you need to be able to build a box that can enclose that many ports, or the 64 x 800G that maps to, which is still a very hefty pizza box.
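A quick check of that radix arithmetic, trivially assuming the full 51.2 Tb/s is exposed as front-panel capacity:

```python
# Radix arithmetic for a 512 x 100G (51.2 Tb/s) switch ASIC.

ASIC_GBPS = 512 * 100                       # 51,200 Gb/s of front-panel capacity

for speed in (100, 400, 800):
    print(f"{ASIC_GBPS // speed:4d} x {speed}G ports")
# -> 512 x 100G, 128 x 400G, 64 x 800G
```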

nods Yeah, I knew that’s how a traditional chassis worked. In a distributed setup, you have the option for a single “line card”, which obviously doesn’t happen in the traditional chassis world.

I do see in a DDCv2 document that they briefly mention 2 compute boxes, so now that makes sense. I had to look up some of the acronyms because the document didn't define them.

Again, it depends.

DFs at the edge, as you're talking about, are tricky. We worked on some designs a couple of years ago. FIB management can become a real challenge with a lot of big peers and/or connections to the DFZ. If you do it wrong, you can get nasty hotspotting or bouncing issues with your N/S traffic.

It’s doable of course, but in many circumstances I think these make the most sense down in the aggregation layers of a design.

In the articles I’ve read and videos I’ve watched, they have mentioned varying amounts of reduced power. I didn’t commit them to memory because that wasn’t the part I was interested in at the moment.

I'd think that, especially as data rates climb, power consumption is going to get really important, fast. When a single device requires ~50 kW to run ... I think you'll want to make sure you have the space/power to deal with that :(

I'm not sure that distributed fabric plans make that problem better? (Maybe it's all the same problem in the end, because the fabric interconnect is going to be distance-limited/etc. too.)

Management of these things is a big concern I've had going into more modern systems. So often there's hand-waving around the orchestration piece of non-traditional systems. From what I've seen (and I would love to be wrong), you either build it in-house (not a small lift) or you buy something that ends up taking away all of the cost advantages that path had.

You almost certainly get into (pretty quickly) something that smells a lot like: "here's my pile of ansible recipes for..." (choice of ansible here for example only; s/ansible// of course to whatever you feel like).

That's maybe fine if that's your jam? I think it's hard to reason/plan/build without some automation plan 'now', and it looks like a ton of folks start without one and then try to retrofit once "omg this is very large now... ugh" happens. (1-10 devices? Sure, fine, do it by hand; but you really ought to have had an automation plan by ~5... my opinion, clearly.)

Failure domain stuff is part of what I’m trying to learn more about, which goes back to more about the fundamentals of how the fabric works.

Yea... this part (reasoning about failure domains), I assume, is also a tad hard.
A scenario: "I built this 200T fabric; I interconnect to the outside with ~100T max and internally with ~100T."
Now that ~100T breaks and (ideally!) everything on the outside re-routes to a different front door... oops, are you prepared for an extra ~100T arriving?
How do you deal with parts (fabric parts) failing in part? "Oops, only 50T of my 100T can get through here, and... I'm also still telling my external neighbors all is good."

Really, that failure-domain problem is tightly linked to the 'manage a ton of things' problem too... at least for containing damage quickly.
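To make that "partially failed but still advertising" scenario concrete, here is a toy sketch. The function, headroom factor, and verdicts are all hypothetical; a real design would drive this from telemetry and tie it into route withdrawal or de-preferencing.

```python
# Toy sanity check for the partial-failure scenario above: if part of the
# internal fabric capacity is gone but you are still advertising everything
# externally, will the surviving capacity carry the offered load?

def fabric_health(nominal_tbps: float, surviving_tbps: float,
                  offered_tbps: float, headroom: float = 0.8) -> str:
    """Return a coarse verdict on whether to keep attracting traffic."""
    usable = surviving_tbps * headroom          # keep some burst headroom
    if offered_tbps <= usable:
        return "ok: surviving capacity covers offered load"
    if surviving_tbps < nominal_tbps:
        return "degraded: withdraw or de-pref external advertisements"
    return "overloaded even at full capacity: add capacity"

# The example from the thread: ~100T of capacity, half of it lost, while the
# outside world still thinks the full front door is open.
print(fabric_health(nominal_tbps=100, surviving_tbps=50, offered_tbps=80))
```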

(sorry I should have also mentioned one other thing below)

> I'm not sure that distributed fabric plans make that problem better? (Maybe it's all the same problem in the end, because the fabric interconnect is going to be distance-limited/etc. too.)

One thing to really think about here is:
   "What is the traffic pattern you expect to see?"

I mean here:
  * "I have jewels in the center of my network, and everyone comes
through the edge to the jewel(s), and then back out"
    - "just have a ton of pizza boxes, all traffic is edge-to-core
and I'm not spending optics/etc in the local edge area bouncing
traffic around"
  * "I have no jewels, I'm providing any-to-any connectivity, folk
might just bounce traffic through me and right back out the same
location to someone else"
    - "oops, now I need to spray traffic in the local pop/metro and
do not need as much core-facing capacity out of that local area"

anyway, interesting conversation :)