I've noticed that the whitebox hardware vendors are pushing distributed router fabrics, where you can keep buying pizza boxes and hooking them into a larger and larger fabric. Obviously, at some point, buying a big chassis makes more sense. Does it make sense to build up to that point? What are your thoughts on that direction?
When you say distributed router fabrics, are you thinking of the OCP concept with an interconnect switch doing ATM-like cell relay (after the flowery speeches about "not betting against Ethernet", of course)?
https://www.youtube.com/watch?v=l_hyZwf6-Y0
https://www.ufispace.com/company/blog/what-is-a-distributed-disaggregated-chassis-ddc
mostly advocated by Drivenets. It has been a while, but from what I remember, the argument (and it has a lot of merit) is that you can scale to a much bigger "chassis" than you could with any big-iron device. If you look at Broadcom's latest interconnect specs https://www.broadcom.com/products/ethernet-connectivity/switching/stratadnx/bcm88920, you can build a pretty big PoP, and while they are mostly trying to appeal to the AI-cluster crowd, one could build aggregation services with that, or something smaller. You get incremental scaling and possibly higher availability, since everything is separated and you could even get enough RPs for proper consensus. I admit I have never seen it outside of a lab environment, but AT&T appears to like it. Plus, all the mechanics of getting traffic through your fabric are still handled by the vendor, and you manage it like a single node.
One could argue that with chassis systems you can still scale incrementally, use different line card ports for access and aggregation, and your leaf/interconnect is purely electrical, so you are not spending money on optics. So it does not exactly invalidate the chassis setup, and that is why every big vendor will sell you both, especially if you are not at AT&T scale.
There is of course the other design, with normal Ethernet fabrics based on Fat Tree or some other topology and all the normal protocols between the devices, but then you are in charge of setting up, traffic engineering, and scaling those protocols yourself. The IETF has done interesting things with these scaling ideas, and some vendors may have even implemented them to the point that they work. But the "too many devices" argument starts creeping in.
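To make "you are in charge of setting it up" concrete, here is a toy sketch (not any vendor's tooling; all the names and numbers are hypothetical) of the kind of bookkeeping you end up owning with a self-managed fabric: carving /31s for the leaf-spine links and handing out a private ASN per leaf for an eBGP underlay.

# Toy underlay plan for a small leaf-spine fabric; purely illustrative.
import ipaddress

SPINES = 4
LEAVES = 8
P2P_POOL = ipaddress.ip_network("10.0.0.0/24")   # hypothetical link-address pool
LEAF_ASN_BASE = 64512                            # start of the private ASN range

subnets = P2P_POOL.subnets(new_prefix=31)        # one /31 per point-to-point link

plan = []
for leaf in range(LEAVES):
    for spine in range(SPINES):
        link = next(subnets)
        spine_ip, leaf_ip = link[0], link[1]     # the two usable addresses of the /31
        plan.append({
            "link": f"spine{spine}<->leaf{leaf}",
            "spine_ip": f"{spine_ip}/31",
            "leaf_ip": f"{leaf_ip}/31",
            "leaf_asn": LEAF_ASN_BASE + leaf,
        })

for entry in plan:
    print(entry)

Multiply that by VLANs, policy, TE, and fault detection, and you can see why the tooling question keeps coming up.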
Yan
Yeah, UfiSpace is where I had first seen it, but then I saw it elsewhere.
It’s just tradeoffs.
Many of the benefits (smaller failure domains, power savings, incremental expandability) can be counterbalanced by increased operational complexity. From my experience, if you don't have proper automation/tooling for management/configuration and fault detection, it's a nightmare. If you do have those things, then the benefits can be substantial.
I think every network will have a tipping point at which such a model starts to make more sense, but at smaller scales I think a fat chassis is still likely a better place to be.
Oh, so you're saying that small networks benefit more from a traditional chassis than from a distributed fabric? I would have thought it the other way around, in that you could start with a single pizza box and then add another as the need grows, then another, as opposed to trying to figure out whether you needed a 4-, 8-, 13-, 16-, or 20-slot chassis and ending up either over- or under-buying.
I guess it also depends on one’s definition of small.
I guess it also depends on what tooling is available. So often I see platforms offer a bunch of programmability, but then no one commercially (or open source) provides the tooling; they expect you to build it yourself. Most anyone can sit down at XYZ chassis and figure it out, but if it's an obscure distributed system without centralized tooling, that could be tricky. Well, if you have more than a handful of boxes.
Maybe more like medium, but if you know that you won't grow beyond a certain size and growth trajectory, a chassis would make life easier. If you are dealing with some compute and you know how many racks you have, same thing. In fact, with small networks you are actually starting out with more than you need, since you have to install "line card" and "backplane" boxes, plus route processors. So if you are thinking of going beyond the capacity of a single pizza box (or half of one), you are effectively starting with a chassis.
If you are going down the road of pizza boxes, it could be easier to standardize deployments on a single type of device and not think about which chassis to buy.
True. Small networks would just have a single pizza box and call it a day.
I haven’t looked that deeply yet. I was assuming you could just start with a single pizza box and add more on as requirements matured. It certainly gets more complicated quickly if you can’t do that.
Fabric, fabric, fabric.
For a modular chassis, the fabric capacity has to be sized so it's fully non-blocking for a fully populated set of line cards. Even if you're only using 2 of 10 slots, it still needs to be sized for all 10.
For a distributed chassis, you can get away with sizing your fabric stage for exactly what you need, but you still have to be aware of expansion, port allocations, etc. You would normally pre-plan the max size, since you can't easily just shuffle those 'internal links' around. You also tend to want to buy switches for the full fabric in one shot anyway; mixing buffer sizes at the same stage/level is death.
When the vendor builds it in, you don’t have to think about all of these things. But if you build it yourself you really have to understand it.
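As a back-of-the-envelope illustration of those two sizing exercises, here is a rough sketch; every number is a made-up placeholder, not from any datasheet.

# Fabric sizing arithmetic with hypothetical numbers.
SLOT_CAPACITY_GBPS = 14_400      # hypothetical: 36 x 400G per line card / line-card box
TOTAL_SLOTS = 10                 # planned maximum
USED_SLOTS = 2                   # what you actually deploy today

# Modular chassis: the fabric is provisioned for a fully populated box, always.
chassis_fabric = TOTAL_SLOTS * SLOT_CAPACITY_GBPS

# Distributed fabric: size the fabric stage for today's boxes, but pre-plan
# ports and cabling for the maximum you ever intend to reach.
todays_fabric = USED_SLOTS * SLOT_CAPACITY_GBPS
planned_max = TOTAL_SLOTS * SLOT_CAPACITY_GBPS

print(f"chassis fabric provisioned: {chassis_fabric} Gb/s")
print(f"distributed fabric today:   {todays_fabric} Gb/s")
print(f"distributed planned max:    {planned_max} Gb/s")

The chassis makes the first number your problem on day one; the distributed build lets you defer most of it, as long as you planned the last number correctly.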
“Obviously, at some point, buying a big chassis…”
Actually, as I read more about it and watch more videos, it seems like that isn't necessarily true. The claims at the top end surpass what any chassis platform I've seen is capable of, though I don't know whether they have actually pushed the upper bounds of what's possible in the real world.
I wonder how large a failure domain folk are willing to accept.
I also don't know that it's actually better to have 1 thing vs N things, since management of the things is probably the expensive part (once you get past space/power, which don't seem to be part of the calculations here, at least not in my brief read of the thread).
-chris
In the articles I’ve read and videos I’ve watched, they have mentioned varying amounts of reduced power. I didn’t commit them to memory because that wasn’t the part I was interested in at the moment.
Management of the things is a big thing I've been concerned about going into more modern systems. So often there's hand waving regarding the orchestration piece of non-traditional systems. From what I've seen (and I would love to be wrong), you either build it in-house (not a small lift) or you buy something that ends up taking away all of the cost advantages that path had.
Failure domain stuff is part of what I’m trying to learn more about, which goes back to more about the fundamentals of how the fabric works.
From experience I can tell you that once you fully operationalize the pizza box model, you will never go back to the chassis model. Why would you trade an open, standards-based model for interconnect (OSPF and BGP work great at scale) for proprietary black boxes that do stupid router tricks to make a bunch of discrete components pretend to be one, along with giving you the benefit of a huge blast radius when the software inevitably breaks? Distributed ARP/ND: solved. Actually distributed BFD (not "it's all running on one line card, because customers like LACP bundles spread between line cards and that's really hard to distribute reliably"): solved.

The pizza box model means the boxes are fungible, so you can competitively bid between multiple suppliers and pick and choose who you want to buy from depending on what is most important at the time (delivery dates? price? which of them is annoying you the least at that moment?). They are also infinitely more scalable(*) than any big chassis model. State of the art 5 years ago had Internet edge systems deploying with 8k 400G ports and datacenter deployments with 65k 400G ports using the same fundamental design.
The real downside: vendors don't like the flexibility it affords the customer, or the way it renders differentiation between vendors meaningless and drives operators to avoid it.
David
(*) - Among the critical things to get right from the outset is the peak scale you want for the fabric, because recabling is not something to be taken lightly…
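For what it's worth, those port counts roughly line up with plain Clos arithmetic. A sketch, assuming idealized non-blocking builds out of a single hypothetical 400G switch radix (real designs lose ports to oversubscription choices, breakouts, and so on):

# Rough Clos-fabric scaling arithmetic; illustrative only.

def three_stage_ports(radix: int) -> int:
    """Folded 3-stage Clos (leaf/spine): each leaf splits its radix
    half down (edge ports) and half up (one uplink per spine)."""
    leaves = radix                # each spine can reach at most 'radix' leaves
    return leaves * (radix // 2)  # edge-facing ports across all leaves

def five_stage_ports(radix: int) -> int:
    """Classic 5-stage (3-tier) fat tree built from one switch size:
    k pods, (k/2)**2 edge ports per pod, k**3/4 total."""
    return (radix ** 3) // 4

print(three_stage_ports(128))  # 8192  -> in the ballpark of "8k 400G ports"
print(five_stage_ports(64))    # 65536 -> in the ballpark of "65k 400G ports"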
Isn’t the solution here the same? Have all LACP member ports in the same chip?
In the articles I’ve read and videos I’ve watched, they have mentioned varying amounts of reduced power. I didn’t commit them to memory because that wasn’t the part I was interested in at the moment.
The aggregate power load from the 1U boxes is by itself generally (but not always) going to be less than the equivalent-sized big chassis. (If you start having to add middle stages, that can sometimes not hold true.)
On top of that, many designs allow for DACs to be used for a large percentage of connections, which also has significant power savings.
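The math is simple enough to sanity-check yourself. A back-of-the-envelope sketch where every number is a made-up placeholder (plug in real datasheet values); it also shows why "generally but not always" is the right hedge:

# Hypothetical power comparison; none of these are vendor figures.
FABRIC_BOXES = 13          # e.g. 10 "line card" boxes + 3 fabric boxes
WATTS_PER_BOX = 550        # assumed typical draw per 1U box
CHASSIS_WATTS = 9000       # assumed fully loaded modular chassis

INTERNAL_LINKS = 120       # leaf<->fabric links in the distributed build
WATTS_PER_OPTIC = 12       # assumed 400G optic, per link end
WATTS_PER_DAC = 1          # assumed DAC/AEC, per link end

def distributed_power(optics_fraction: float) -> float:
    """Total draw of the pizza-box build; optics_fraction is the share
    of internal links that need optics instead of DACs."""
    boxes = FABRIC_BOXES * WATTS_PER_BOX
    links = INTERNAL_LINKS * 2 * (          # two ends per link
        optics_fraction * WATTS_PER_OPTIC
        + (1 - optics_fraction) * WATTS_PER_DAC
    )
    return boxes + links

print(distributed_power(1.0))   # all-optics internal links: can exceed the chassis
print(distributed_power(0.2))   # mostly DACs: comfortably below it
print(CHASSIS_WATTS)            # chassis comparison point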
Isn’t the solution here the same? Have all LACP member ports in the same chip?
Which any self-respecting operator usually doesn’t want to do because of failure domains.
I'm just struggling to understand the difference between stack-of-pizza and chassis here.
Much of this is right, but again with caveats.
- The boxes are fungible, to a point. Differences in ASICs, buffers, etc. can really create traffic problems if you mix them wrong. You don't want to be yolo'ing whatever is cheap this month in there.
- You're going to eventually have a feature need that commercial management software doesn't account for. Can they build it for you, and how much does that cost? If you build your own software to manage it, how much does that cost you?
- You’re very correct about how initial mistakes or things you didn’t know can bite you hard later. The wrong growing pain can really hurt if you’re not prepared for it.
- Really have to think about the internals and the design. There are some companies who have presented on how they built these things, and when you listen to their protocol design, it makes your head hurt how much overcomplication was built in.
Like I said before, distributed fabrics CAN be amazing, but there are always tradeoffs. There are some things you don’t have to care about with a big chassis, but you do with a DF. And the other way around as well. It’s about picking which set of things you WANT to deal with, or are better for you to deal with than the other.
It's possible I substituted a different meaning of "chip" in my head than you intended, and I am answering a different question.
I generally won’t put all LAG members on the same ASIC, or even same linecard, for failure domain reasons. I also don’t really care about possible challenges with BFD there, because I just use micro-BFD on members + min-links.
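If it helps, the decision logic there is simple enough to state in a few lines. This is just a toy model of "per-member micro-BFD plus min-links", not any vendor's implementation:

# Toy model: each LAG member runs its own BFD session; the bundle stays
# up only while at least min_links members are usable.
from dataclasses import dataclass

@dataclass
class Member:
    name: str
    lacp_up: bool
    micro_bfd_up: bool   # per-member BFD session state

    @property
    def usable(self) -> bool:
        return self.lacp_up and self.micro_bfd_up

def lag_state(members: list[Member], min_links: int) -> str:
    usable = [m for m in members if m.usable]
    return "up" if len(usable) >= min_links else "down"

members = [
    Member("member-1", lacp_up=True, micro_bfd_up=True),
    Member("member-2", lacp_up=True, micro_bfd_up=False),  # BFD detected a fault
    Member("member-3", lacp_up=True, micro_bfd_up=True),
]
print(lag_state(members, min_links=2))   # "up": 2 usable members
print(lag_state(members, min_links=3))   # "down": only 2 usable

Spreading the members across ASICs/line cards then buys you the failure-domain isolation without needing the vendor to distribute a single BFD session for the whole bundle.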
Quite, it depends what is important for your case. You may want to put all members in one chip for better feature parity in terms of QoS, counters, et al., especially if you want them to fail as one, because you're doing it purely for capacity, not for redundancy.
And indeed, without uBFD you're going to run LACP over one interface in one chip at most anyhow, and with uBFD each member is going to run its own session anyway.
So I wonder, what benefits is the OP seeing here when it comes to pizza boxes? To me a pizza box seems identical here to a chassis box with LACP spanning only a single chip.
Yeah that’s a good question.