Re: maximum ipv4 bgp prefix length of /24 ?

Among the issues:

Suppose the FIB has all the /24 components to make a /20, so it programs a /20.

Then one of the /24s changes next hop. The FIB now has to undo all that compression by reinstalling some of the routes and figuring out the minimum set of /21, /22, /23 and /24 prefixes to make it happen. Then, to avoid a transient, it needs to make before break. Quite a bit of FIB programming needs to happen just to modify a single /24. Then the next /24 in the set also changes its next hop, and so on for 100,000 routes. All because a peer link flapped. Convergence suffers.

Then you need to buy a line card that can hold all the individual routes anyway, because you can't always compress: not all the routes in your compressed set have the same next hop during a transient.

Finally, it's all nicely compressed. Now what? You have lots of empty slots in your FIB.

I'm sure lots of nerds can come up with transient-reduction algorithms, but I'd rather not.
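A minimal sketch of the decomposition step described above, assuming a simple binary split (the prefixes and function names are illustrative, not any vendor's FIB code): given an installed aggregate and the one /24 inside it whose next hop changed, it computes the sibling prefixes that must be reinstalled with the old next hop before the aggregate can be withdrawn.

import ipaddress

def decompose(aggregate, changed):
    """Yield the covering prefixes that keep the old next hop when one
    /24 inside `aggregate` moves to a different next hop."""
    current = ipaddress.ip_network(aggregate)
    chg = ipaddress.ip_network(changed)
    while current.prefixlen < chg.prefixlen:
        left, right = current.subnets(prefixlen_diff=1)
        if chg.subnet_of(left):
            yield right            # untouched half: reinstall as one prefix
            current = left         # keep splitting the half holding the /24
        else:
            yield left
            current = right

# 10.0.0.0/20 was compressed from sixteen /24s sharing a next hop;
# 10.0.5.0/24 just moved to a different next hop.
print([str(p) for p in decompose("10.0.0.0/20", "10.0.5.0/24")])
# -> ['10.0.8.0/21', '10.0.0.0/22', '10.0.6.0/23', '10.0.4.0/24']

Even in this toy form, a single next-hop change turns one FIB write into four reinstalls with the old next hop, an install for the changed /24, and a withdrawal of the /20, which is the churn described above.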

Yeah... all this stuff is on the same level of complexity as
implementing a B-Tree. Standard task on the road to an undergraduate
computer science degree. Compared to decoding a BGP update message,
where nearly everything is variable length and you have to nibble away
at the current field to find the start of the next field, this is a
cakewalk.
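For a concrete sense of that "nibble away" parsing, here is a minimal sketch (buffer contents invented for illustration) of walking the NLRI field of a BGP UPDATE, where per RFC 4271 each entry is a one-octet prefix length followed by just enough octets to hold the prefix:

def parse_nlri(buf: bytes):
    """Walk the variable-length NLRI field: each entry's size tells you
    where the next entry begins (RFC 4271, section 4.3)."""
    prefixes = []
    i = 0
    while i < len(buf):
        plen = buf[i]                    # prefix length in bits
        nbytes = (plen + 7) // 8         # octets actually carried
        octets = buf[i + 1:i + 1 + nbytes] + b"\x00" * (4 - nbytes)
        prefixes.append(("%d.%d.%d.%d" % tuple(octets), plen))
        i += 1 + nbytes                  # hop to the start of the next entry
    return prefixes

# 10.0.0.0/20 and 192.0.2.0/24 encoded back to back:
print(parse_nlri(bytes([20, 10, 0, 0, 24, 192, 0, 2])))
# -> [('10.0.0.0', 20), ('192.0.2.0', 24)]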

It doesn't actually get complicated until you want to do more than
just joining adjacent address blocks.

Regards,
Bill Herrin

While I did allude to some of the complexity, my main point is that FIB compression does not allow you to install a FIB with less memory, because you must be prepared for transients during which the FIB needs to store the routes mostly uncompressed anyway. All it does is increase convergence time.

Hi Jakob,

The math disagrees. It's called "oversubscription," and we use it all
over the place in network engineering.

There are only a handful of route patterns that'd result in no
compression at all. They'd have to be intentionally created, and
that'd be a hacking challenge in and of itself. The patterns in
question don't align with the distribution of addresses on the
Internet.

If you're at 80% FIB after compression, a compression transient could
plausibly bump you to 85%. The odds of a natural transient bumping you
to 100% are infinitesimal. If you try to run at 95% after
compression... well, I'm sure someone will try it, but that's PEBKAC
not compression's fault.

FIB compression yields savings ranging from about 30% in simple core scenarios to more than 90% in edge scenarios with advanced compression. Even keeping reasonable slack for transients, you're going to get some bang for your buck. All it means is that you have to keep an eye on your FIB
size as well, since it's no longer the same as your RIB size.

Regards,
Bill Herrin

The point Jakob is making is that when using FIB compression, the FIB size depends on both RIB size and RIB complexity. I.e., what was previously a deterministic 1:1 ratio between RIB and FIB - which is straightforward to handle from an operational point of view - becomes non-deterministic. The difficulty with this is that if you end up with a FIB overflow, your router will no longer route.

That said, there are cases where FIB compression makes a lot of sense, e.g. leaf sites, etc. Conversely, it's not a generally appropriate technology for a dense dfz core device. It's a tool in the toolbox, one of many.

Nick

The difficulty with this is that if you end up with a
FIB overflow, your router will no longer route.

Hi Nick,

That depends. When the FIB gets too big, routers don't immediately
die. Instead, their performance degrades. Just like what happens with
oversubscription elsewhere in the system.

With a TCAM-based router, the least specific routes get pushed off the
TCAM (out of the fast path) up to the main CPU. As a result, the PPS
(packets per second) degrades really fast.

With a DRAM+SRAM cache system, the least used routes fall out of the
cache. They haven't actually been pushed out of the fast path, but the
fast path gets a little bit slower. The PPS degrades, but not as
sharply as with a TCAM-based router.
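As a toy illustration of that DRAM+SRAM behaviour (not any vendor's design), think of an LRU cache in front of the full table: overflow only makes lookups slower for cold routes instead of dropping them.

from collections import OrderedDict

class RouteCache:
    """Tiny LRU route cache: cold routes fall back to the full table."""
    def __init__(self, full_table, size):
        self.full = full_table           # complete FIB in slower memory
        self.size = size                 # fast-path slots available
        self.hot = OrderedDict()         # LRU-ordered fast-path entries

    def lookup(self, prefix):
        if prefix in self.hot:           # fast-path hit
            self.hot.move_to_end(prefix)
            return self.hot[prefix]
        nexthop = self.full[prefix]      # slower path, but still forwards
        self.hot[prefix] = nexthop
        if len(self.hot) > self.size:    # least-used route falls out
            self.hot.popitem(last=False)
        return nexthop

cache = RouteCache({"192.0.2.0/24": "A", "198.51.100.0/24": "B"}, size=1)
cache.lookup("192.0.2.0/24")             # miss: fetched, then cached
cache.lookup("198.51.100.0/24")          # miss: evicts the first entry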

That said, there are cases where FIB compression makes a lot of sense,
e.g. leaf sites, etc. Conversely, it's not a generally appropriate
technology for a dense dfz core device. It's a tool in the toolbox, one
of many.

The case for FIB compression deep in the core is... not as obvious as
the case near the edge for sure. But I wouldn't discount it on any
installation that has a reasonably defined notion of "upstream," as
opposed to installations where the only sorts of interfaces are either
lateral or downstream.

Look at it this way: here are some numbers from last Friday's BGP report:

BGP routing table entries examined: 930281
    Prefixes after maximum aggregation (per Origin AS): 353509
    Deaggregation factor: 2.63
    Unique aggregates announced (without unneeded subnets): 453312

Obviously adjacent routes to the same AS aren't always going to have
the same next hop. But I'll bet you that they do more often than not,
even deep in the core. Even if only half the adjacent routes from the
same AS have the same next hop when found deep in the core, according
to these numbers that's still a 30% compression. If you keep a 10%
slack for transients, you still have a 20% net gain in your
equipment's capability versus no compression.
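Back-of-the-envelope, the same arithmetic (using the report's figures and the half-share assumption from the paragraph above) looks like this:

rib = 930_281                      # BGP routing table entries examined
max_aggregated = 353_509           # prefixes after maximum aggregation
removable = rib - max_aggregated   # routes aggregation could absorb
fib = rib - removable // 2         # assume only half actually merge
compression = 1 - fib / rib
print(f"compressed FIB ~{fib:,} entries ({compression:.0%} smaller)")
print(f"net gain with 10% transient slack: ~{compression - 0.10:.0%}")
# -> roughly 31% compression and a ~21% net gain, in line with the estimate.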

Regards,
Bill Herrin

In my experience, that's incorrect. FIB compression allows you to run with full tables on devices that can no longer accommodate full [uncompressed] tables in whatever is used to store the FIB. Obviously, it does require more RAM due to the complexity of keeping track of more state. But RAM is typically far more easily (and cheaply) upgraded than TCAM.

Even if it does increase convergence time, that's a compromise many are willing to make if the alternative is toss your current gear and replace it with newer gear years sooner.

That depends. When the FIB gets too big, routers don’t immediately
die. Instead, their performance degrades. Just like what happens with
oversubscription elsewhere in the system.

If you consider blackholing traffic (because the relevant next hops aren’t present in the FIB to be looked up) to be “degradation”… I guess?

If you keep a 10%
slack for transients, you still have a 20% net gain in your
equipment’s capability versus no compression.

This ignores whatever percentage of something else you have to give up to get effective compression.

Spit-balling here, is there a possible design for not-Tier-1 providers where routing optimality (which is probably not a word) degrades rather than packet-shifting performance?

If the FIB is full, can we start making controlled and/or smart decisions about what to install, rather than either of the simple overflow conditions?

For starters, as long as you have *somewhere* you can point a default at in the worst case, even if it's far from the *best* route, you make damn sure you always install a default.

Then you could have knobs for what other routes you discard when you run out of space. Receiving a covering /16? Maybe you can drop the /24s, even if they have a different next hop - routing will be sub-optimal, but it will work. (I know, previous discussions around traffic engineering and whether the originating network must / does do that in practice...)

Understand which routes your customers care about / where most of your traffic goes? Set the "FIB-preference" on those routes as you receive them, to give them the greatest chance of getting installed.

Not a hardware designer, I have little idea as to how feasible this is - I suspect it depends on the rate of churn, complexity of FIB updates, etc. But it feels like there could be a way to build something other than "shortest -> punt to CPU" or "LRU -> punt to CPU".
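A hedged sketch of those knobs (the fib_preference field and the function below are hypothetical, not an existing feature): always install a default, then fill the remaining slots by operator preference, accepting sub-optimal routing when space runs out.

import ipaddress

def select_for_fib(routes, capacity):
    """routes: dicts with 'prefix', 'nexthop' and a hypothetical 'fib_preference' knob."""
    default = [r for r in routes
               if ipaddress.ip_network(r["prefix"]).prefixlen == 0]
    rest = [r for r in routes if r not in default]
    # Highest operator preference first, then the broadest covering prefixes.
    rest.sort(key=lambda r: (-r["fib_preference"],
                             ipaddress.ip_network(r["prefix"]).prefixlen))
    installed = list(default)            # the default always makes it in
    for r in rest:
        if len(installed) >= capacity:
            break                        # sub-optimal routing, but it routes
        installed.append(r)
    return installed

routes = [
    {"prefix": "0.0.0.0/0",       "nexthop": "upstream", "fib_preference": 0},
    {"prefix": "203.0.113.0/24",  "nexthop": "peer1",    "fib_preference": 90},
    {"prefix": "198.51.100.0/24", "nexthop": "peer2",    "fib_preference": 10},
]
print([r["prefix"] for r in select_for_fib(routes, capacity=2)])
# -> ['0.0.0.0/0', '203.0.113.0/24']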

Or is everyone who could make use of this already doing the same filtering at the RIB level, and not trying to fit a quart RIB into a pint FIB in the first place?

Thanks,
Tim.

Seems like we’ve reached the limits of a priori speculation. At this point I’d like to see data demonstrating that it’s at least viable from a statistical perspective.

If someone is motivated to demonstrate this, a “backtest” against historical data would be the next step. Later, one could design the study to reveal “transient degradation” (loops, drops, etc.) and their probability, though the duration would be more a function of the implementation. It would be best to “backtest” the status quo as a control because it too has transient degradation, for a more apples-to-apples comparison.

I’m not sufficiently motivated (nor knowledgeable in statistics) to take this on. I see this more in the domain of vendors to determine the best approach for their implementation.

Then you could have knobs for what other routes you discard when you run out of space. Receiving a covering /16? Maybe you can drop the /24s, even if they have a different next hop - routing will be sub-optimal, but it will work. (I know, previous discussions around traffic engineering and whether the originating network must / does do that in practice…)

What you are describing is exactly what the /24 convention is doing already, just with different mask lengths.

By and large, RIB/FIB size can be effectively managed today by thoughtful use of policies. If a point is reached that doesn’t work anymore, it’s probably time to re-evaluate the hardware or the design.

While I did allude to some of the complexity, my main point is that FIB compression does not allow you to install a FIB with less memory, because you must be prepared for transients during which the FIB needs to store the routes mostly uncompressed anyway. All it does is increase convergence time.

It was never 1:1 if you have more than one path for any route. The FIB only contains the chosen path(s) to any destination even without fib compression.

That said, there are cases where FIB compression makes a lot of sense, e.g. leaf sites, etc. Conversely, it's not a generally appropriate technology for a dense dfz core device. It's a tool in the toolbox, one of many.

Even at a dense DFZ core device, there are a large number of single-origin consecutive /24s in the table which can be FIB-compressed with no loss. For a long time, someone was teaching up-and-coming operators in Asia that they should always announce everything as disaggregated /24s to guard against route hijacking. This unfortunate practice persists, making FIB compression quite practical even at core nodes.
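A minimal sketch of that lossless case (the grouping by next hop is an illustrative assumption, not any vendor's FIB algorithm): consecutive single-origin /24s behind the same next hop collapse into a shorter covering prefix.

import ipaddress
from collections import defaultdict

def compress(routes):
    """routes: iterable of (prefix, nexthop) -> compressed (prefix, nexthop) list."""
    by_nexthop = defaultdict(list)
    for prefix, nexthop in routes:
        by_nexthop[nexthop].append(ipaddress.ip_network(prefix))
    compressed = []
    for nexthop, nets in by_nexthop.items():
        compressed += [(str(n), nexthop)
                       for n in ipaddress.collapse_addresses(nets)]
    return compressed

# Four consecutive deaggregated /24s behind one next hop become a single /22.
print(compress([("198.18.0.0/24", "peer1"), ("198.18.1.0/24", "peer1"),
                ("198.18.2.0/24", "peer1"), ("198.18.3.0/24", "peer1")]))
# -> [('198.18.0.0/22', 'peer1')]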

Owen

Isn’t that pretty much what Geoff Huston has done with the weekly reports William quoted earlier in this thread?

Sure, that’s from a limited set of perspectives, but it probably represents the minimum achievable compression in most circumstances.

Owen

More where those came from if you google "BGP FIB compression paper."

Regards,
Bill Herrin

Come on man, go re-read the post. The two paragraphs you cut literally
explained what happens -instead of- routes dropping out of the FIB or
being black holed.

Regards,
Bill Herrin

That depends. When the FIB gets too big, routers don’t immediately
die. Instead, their performance degrades. Just like what happens with
oversubscription elsewhere in the system.

With a TCAM-based router, the least specific routes get pushed off the
TCAM (out of the fast path) up to the main CPU. As a result, the PPS
(packets per second) degrades really fast.

With a DRAM+SRAM cache system, the least used routes fall out of the
cache. They haven’t actually been pushed out of the fast path, but the
fast path gets a little bit slower. The PPS degrades, but not as
sharply as with a TCAM-based router.

Spit-balling here, is there a possible design for not-Tier-1 providers where routing optimality (which is probably not a word) degrades rather than packet-shifting performance?

If the FIB is full, can we start making controlled and/or smart decisions about what to install, rather than either of the simple overflow conditions?

For starters, as long as you have somewhere you can point a default at in the worst case, even if it’s far from the best route, you make damn sure you always install a default.

Then you could have knobs for what other routes you discard when you run out of space. Receiving a covering /16? Maybe you can drop the /24s, even if they have a different next hop - routing will be sub-optimal, but it will work. (I know, previous discussions around traffic engineering and whether the originating network must / does do that in practice…)

The problem with this approach is you now have non-deterministic routing.

Depending on the state of FIB compression, packets may flow out interfaces that are not what the RIB thinks they will be.
This can be a good recipe for routing micro-loops that come and go as your FIB compression size ebbs and flows.

Taking your example:

RTR-A----------RTR-B---------RTR-C

RTR-A is announcing a /16 to RTR-B.
RTR-C is announcing a /24 from within the /16 to RTR-B, which is passing it along to RTR-A.

If RTR-B’s FIB compression fills up, and falls back to “drop the /24, since I see a /16”, packets destined to the /24 arriving from RTR-A will reach RTR-B,
which will check its FIB, and send them back towards RTR-A…which will send them back to RTR-B, until TTL is exceeded.

BTW, this scenario holds true even when it’s a default route coming from RTR-A, so saying “well, OK, but we can do FIB compression easily as long as we have a default route to fall back on” still leads to packet-ping-ponging on your upstream interface towards your default if you ever drop a more specific from your FIB that is destined downstream of you.
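A toy trace of that ping-pong, using Matt's topology (the 10.1.0.0/16 and 10.1.1.0/24 prefixes are placeholders): RTR-B has compressed away the /24 learned from RTR-C, so its longest match for that traffic points back toward RTR-A.

import ipaddress

def longest_match(fib, dst):
    """Return the next hop for dst by longest-prefix match over fib."""
    addr = ipaddress.ip_address(dst)
    best = max((ipaddress.ip_network(p) for p in fib
                if addr in ipaddress.ip_network(p)),
               key=lambda n: n.prefixlen)
    return fib[str(best)]

fibs = {
    "RTR-A": {"10.1.1.0/24": "RTR-B"},   # learned via RTR-B
    "RTR-B": {"10.1.0.0/16": "RTR-A"},   # the /24 via RTR-C was compressed out
}

hop, ttl = "RTR-A", 6
while ttl:
    nxt = longest_match(fibs[hop], "10.1.1.10")
    print(f"{hop} -> {nxt} (ttl={ttl})")
    hop, ttl = nxt, ttl - 1
print("TTL exceeded; the packet never reaches RTR-C")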

You’re better off doing the filtering at the RIB end of things, so that RTR-B no longer passes the /24 to RTR-A; sure, routing breaks at that point, but at least you haven’t filled up the RTR-A to RTR-B link with packets ping-ponging back and forth.

Your routing protocols depend on packets being forwarded along the interfaces the RIB thinks they’ll be going out in order for loop-free routing to occur.
If the FIB decisions are made independent of the RIB state, your routing protocols might as well just give up and go home, because no matter how many times they run Dijkstra, the path to the destination isn’t going to match where the packets ultimately end up going.

You could of course fix this issue by propagating the decisions made by the FIB compression algorithm back up into the RIB; at least then, the network engineer being paged at 3am to figure out why a link is full will instead be paged to figure out why routes aren’t showing up in the routing table that policy says should be showing up.

Understand which routes your customers care about / where most of your traffic goes? Set the “FIB-preference” on those routes as you receive them, to give them the greatest chance of getting installed.

Not a hardware designer, I have little idea as to how feasible this is - I suspect it depends on the rate of churn, complexity of FIB updates, etc. But it feels like there could be a way to build something other than “shortest → punt to CPU” or “LRU → punt to CPU”.

Or is everyone who could make use of this already doing the same filtering at the RIB level, and not trying to fit a quart RIB into a pint FIB in the first place?

The sane ones who care about the sanity of their network engineers certainly do. ^_^;

Thanks,
Tim.

Thanks!

Matt

Come on man, go re-read the post. The two paragraphs you cut literally
explained what happens -instead of- routes dropping out of the FIB or
being black holed.

Ok

Had NOT considered the looping - that’s what you get for writing in public without thinking it all the way through. *blush*

Thanks for poking holes appropriately,
Tim.

Like I said, it's going to be a messy experiment - for probably a decade, at least.

As Saku has highlighted as well, vendors are likely to find a lot of sludge in this experiment that they will never be able to share with us... either because it will be too complicated for us to understand in a coherent way, or because the experiment changes so rapidly that it makes no sense to tell us about issues which will quickly become obsolete.

Many lessons will be learned, but ultimately, one would be naive to think this black box will just work.

All I want is a "set routing-options fib-compression disable" present for Christmas.

Mark.