Route table growth and hardware limits...talk to the filter

From: owner-nanog@merit.edu on behalf of Jared Mauch
Sent: Sat 9/8/2007 8:17 AM
To: William Allen Simpson
Cc: nanog@nanog.org
Subject: Re: Route table growth and hardware limits...talk to the filter

        I think this is the most important point so far. There are a lot
of providers that think that their announcements need to be global
to manage link/load balancing with their peers/upstreams. Proper use
of no-export (or similar) on the more specifics and the aggregate
being sent out will reduce the global noise significantly.
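
To make the effect of no-export concrete, here is a small sketch of what a receiving AS would pass on to its own eBGP neighbors. The community value is the well-known NO_EXPORT from RFC 1997; the function name, the data layout, and the prefixes are assumptions made up for the example, not any router's actual behavior or configuration syntax.

```python
NO_EXPORT = (65535, 65281)  # well-known NO_EXPORT community (RFC 1997)

def exportable_to_ebgp(routes):
    """Return the prefixes a receiving AS may pass on to its eBGP neighbors.

    routes: list of (prefix_string, set_of_communities) -- hypothetical layout.
    """
    return [prefix for prefix, communities in routes
            if NO_EXPORT not in communities]

received = [
    ("192.168.0.0/16", set()),        # aggregate, no restriction
    ("192.168.1.0/24", {NO_EXPORT}),  # more specific, for local TE only
    ("192.168.3.0/24", {NO_EXPORT}),  # more specific, for local TE only
]
print(exportable_to_ebgp(received))  # → ['192.168.0.0/16']
```

The upstream still sees the more specifics and can balance on them, while everyone further away only carries the aggregate.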

        Perhaps some of the providers to these networks will nudge them
a bit more to use proper techniques.

        I'm working on routing leaks this month. There have already been
over 2600 leak events today that could have been prevented with as-path
filters of some sort, either on a customer or peer. (This would obviously
be in addition to prefix-list filters.)
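
One possible shape for such an as-path filter on a customer session is sketched below. It assumes the provider keeps a list of the ASNs it expects to see behind that customer; the function name and that list are hypothetical, and the ASNs come from the documentation range (RFC 5398).

```python
def accept_from_customer(as_path, allowed_asns):
    """Accept a customer route only if every ASN in its path is expected.

    as_path: sequence of ASNs, neighbor first, origin last.
    allowed_asns: the customer's own ASN plus any downstreams it may transit
                  (a hypothetical, manually maintained set).
    """
    return len(as_path) > 0 and all(asn in allowed_asns for asn in as_path)

customer_cone = {64496, 64497}

# Customer 64496 announcing its downstream 64497: accepted.
print(accept_from_customer((64496, 64497), customer_cone))         # → True
# Same customer re-announcing a route learned via 64511: a leak, rejected.
print(accept_from_customer((64496, 64511, 64497), customer_cone))  # → False
```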

        - Jared

Maybe this is a dumb question, but why isn't there a BGP option to just
filter more specific routes that have the same AS path as the larger
aggregate? This would allow the networks that announce more specifics for
traffic engineering to still accomplish that, while throwing away the
garbage from someone else that decides to announce their /19 as 33 routes
for no apparent reason. Sure, this would fail if a network decided to
only announce /24's for example without a larger aggregate, but how many
networks are really doing that?

Forrest

> Maybe this is a dumb question, but why isn't there a BGP option to just
> filter more specific routes that have the same AS path as the larger
> aggregate? This would allow the networks that announce more specifics for
> traffic engineering to still accomplish that, while throwing away the
> garbage from someone else that decides to announce their /19 as 33 routes
> for no apparent reason. Sure, this would fail if a network decided to
> only announce /24's for example without a larger aggregate, but how many
> networks are really doing that?

http://www3.ietf.org/proceedings/03nov/I-D/draft-grow-bounded-longest-match-00.txt

As a matter of fact.

:-)

Russ

--
riw@cisco.com CCIE <>< Grace Alone

> Sure, this would fail if a network decided to only announce
> /24's for example without a larger aggregate, but how many
> networks are really doing that?

More than you probably imagine.

Consider the following table:

      asn | count | c24 | c23

You're right, that's way more than I would have imagined. Ok, why not
combine the idea of throwing away more specific routes that have the same
AS path as the larger aggregate with a mechanism that will do something
like the CIDR-REPORT and aggregate bunches of routes that all have the
same AS path. Or is the processing power/memory just not available to
accomplish that?
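
As a rough illustration of the aggregation half of that idea, the sketch below groups routes by AS path and collapses each group with Python's standard ipaddress module. The function name and the (prefix, AS path) layout are assumptions for the example, all prefixes in a group are assumed to be the same IP version, and nothing here says anything about what it would cost on a real route processor.

```python
import ipaddress
from collections import defaultdict

def aggregate_by_as_path(routes):
    """Collapse adjacent or overlapping prefixes that share an identical AS path.

    routes: iterable of (prefix_string, as_path_tuple) -- hypothetical layout.
    Returns a dict mapping as_path_tuple -> list of collapsed networks.
    """
    groups = defaultdict(list)
    for prefix, as_path in routes:
        groups[tuple(as_path)].append(ipaddress.ip_network(prefix))
    # collapse_addresses only merges when the result covers exactly the same
    # address space, so no unannounced space is pulled in.
    return {path: list(ipaddress.collapse_addresses(nets))
            for path, nets in groups.items()}

routes = [
    ("192.168.0.0/24", ("AS11111", "AS22222")),
    ("192.168.1.0/24", ("AS11111", "AS22222")),
    ("192.168.4.0/24", ("AS11111", "AS55555")),
]
for path, nets in aggregate_by_as_path(routes).items():
    print(" ".join(path), [str(n) for n in nets])
```

The two adjacent /24s with the same path come out as a single 192.168.0.0/23; the third route has a different path and is left alone.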

It seems either option would be better for not breaking connectivity than
to simply reject anything longer than a /21 in 64/7 for example.
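
For comparison, the blunt length-based filter mentioned above could be sketched like this. The function name and defaults are hypothetical; the /21 limit in 64/7 is just the example figure from this message.

```python
import ipaddress

def passes_length_filter(prefix, block="64.0.0.0/7", max_len=21):
    """Return False for prefixes inside `block` that are longer than max_len."""
    net = ipaddress.ip_network(prefix)
    blk = ipaddress.ip_network(block)
    # Check the IP version first: subnet_of() raises TypeError on a mismatch.
    if net.version == blk.version and net.subnet_of(blk):
        return net.prefixlen <= max_len
    return True  # outside the block, this filter has no opinion

print(passes_length_filter("64.1.0.0/21"))     # → True
print(passes_length_filter("65.2.3.0/24"))     # → False
print(passes_length_filter("192.168.1.0/24"))  # → True
```

The second route is dropped whether or not a covering aggregate exists, which is exactly how this kind of filter can break connectivity.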

Forrest

[snip]

> In this, "count" is the number of prefixes originated by the
> AS that are not covered by any longer prefix (without regard
> to origin);

I of course meant "shorter", not "longer"... oops

That draft seems pretty sensible and in my opinion does more good than the
other options like filtering all routes that are longer than the RIR
minimum or hoping that the offenders magically wake up one day and decide
to clean up their announcements.

I think my suggestion is less complicated than what is contained in the
draft, however. I'm simply saying that we need an option, we'll call it
squash-worthless-more-specifics, that you can apply on any specific BGP
neighbor. Suppose you receive the following routes:

192.168.0.0/16 AS11111 AS22222 AS33333
192.168.1.0/24 AS11111 AS22222 AS33333
192.168.2.0/24 AS11111 AS55555 AS44444 AS33333
192.168.3.0/24 AS11111 AS22222 AS33333

It would keep the 192.168.0.0/16 and 192.168.2.0/24 because they have
different AS Paths and throw away 192.168.1.0/24 and 192.168.3.0/24.
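
The rule can be written down in a few lines. This is only a sketch of the idea, assuming the routes from one neighbor are available as (prefix, AS path) pairs; the function name is made up, and the quadratic scan is for clarity, not something a router would do.

```python
import ipaddress

def squash_more_specifics(routes):
    """Drop any route covered by a shorter prefix with the identical AS path.

    routes: list of (prefix_string, as_path_tuple) received from one neighbor.
    """
    parsed = [(ipaddress.ip_network(p), tuple(path)) for p, path in routes]
    kept = []
    for net, path in parsed:
        covered = any(
            other.version == net.version
            and other.prefixlen < net.prefixlen
            and net.subnet_of(other)
            and other_path == path
            for other, other_path in parsed
        )
        if not covered:
            kept.append((str(net), path))
    return kept

received = [
    ("192.168.0.0/16", ("AS11111", "AS22222", "AS33333")),
    ("192.168.1.0/24", ("AS11111", "AS22222", "AS33333")),
    ("192.168.2.0/24", ("AS11111", "AS55555", "AS44444", "AS33333")),
    ("192.168.3.0/24", ("AS11111", "AS22222", "AS33333")),
]
for prefix, path in squash_more_specifics(received):
    print(prefix, " ".join(path))
# prints:
# 192.168.0.0/16 AS11111 AS22222 AS33333
# 192.168.2.0/24 AS11111 AS55555 AS44444 AS33333
```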

Judging from the CIDR-REPORT this would eliminate a lot of garbage without
affecting connectivity to people that are multi-homing with smaller PA
blocks, or announcing more specifics to different providers for traffic
engineering.

Forrest

> Maybe this is a dumb question, but why isn't there a BGP option to just
> filter more specific routes that have the same AS path as the larger
> aggregate?

i think i filed that request case three or more years ago. zero response.

randy

IIRC, this has come up on cisco-nsp before, and the response has been that it's very "icky" to do and doesn't really save anything on most platforms.

In the example case of

1) 192.168.0.0/16 AS11111 AS22222 AS33333
2) 192.168.1.0/24 AS11111 AS22222 AS33333
3) 192.168.2.0/24 AS11111 AS55555 AS44444 AS33333
4) 192.168.3.0/24 AS11111 AS22222 AS33333

Forrest says the router should be smart and reject paths 2 and 4 because they're covered by 1. Now what happens when 1 is revoked? Do we lose connectivity to 2 and 4, or does the router have to keep track of all these dependent routes and reinstall 2 and 4 when 1 is lost?

Granted, the overhead involved would maybe be worth it on a platform like the 6500/7600, where you can end up with a surplus of RAM and not enough TCAM, if only the "active" routes were stored in TCAM, but it's not exactly in cisco's best interest to extend the life of gear they'd like to see replaced with new cisco gear. I just can't understand why they won't/haven't done a Sup32-3bxl for those using this platform but not moving enough Gbps to need the traffic capabilities of the Sup720-3bxl.

> IIRC, this has come up on cisco-nsp before, and the response has been that
> it's very "icky" to do and doesn't really save anything on most platforms.
>
> In the example case of
>
> 1) 192.168.0.0/16 AS11111 AS22222 AS33333
> 2) 192.168.1.0/24 AS11111 AS22222 AS33333
> 3) 192.168.2.0/24 AS11111 AS55555 AS44444 AS33333
> 4) 192.168.3.0/24 AS11111 AS22222 AS33333
>
> Forrest says the router should be smart and reject paths 2 and 4 because
> they're covered by 1. Now what happens when 1 is revoked? Do we lose
> connectivity to 2 and 4, or does the router have to keep track of all
> these dependent routes and reinstall 2 and 4 when 1 is lost?

Based on what seems to be reported by the CIDR-REPORT, I would say that if
#1 is revoked then it's likely all of the routes with the same AS Path
will be revoked anyway. But if not, rather than the router having to
recalculate whether the more specifics should or should not be accepted
at each routing update, you could apply the same principles that route
flap dampening uses. Reject paths #2 and #4 for X number of minutes
before you bother checking again to see if the larger aggregate is still
there.

> it's not exactly in cisco's best interest to extend the life of gear
> they'd like to see replaced with new cisco gear.

Perhaps that's true, but perhaps another company like Juniper would
implement it feeling that it would give their equipment an edge over their
competitors. If the number of routes was causing me a large problem with
my routers, I would certainly look more closely at another vendor's gear
if it offered a better solution for dealing with the problem than
filtering based on RIR minimums.

Forrest

>> IIRC, this has come up on cisco-nsp before, and the response has been that
>> it's very "icky" to do and doesn't really save anything on most platforms.
>>
>> In the example case of
>>
>> 1) 192.168.0.0/16 AS11111 AS22222 AS33333
>> 2) 192.168.1.0/24 AS11111 AS22222 AS33333
>> 3) 192.168.2.0/24 AS11111 AS55555 AS44444 AS33333
>> 4) 192.168.3.0/24 AS11111 AS22222 AS33333
>>
>> Forrest says the router should be smart and reject paths 2 and 4 because
>> they're covered by 1. Now what happens when 1 is revoked? Do we lose
>> connectivity to 2 and 4, or does the router have to keep track of all
>> these dependent routes and reinstall 2 and 4 when 1 is lost?

> Based on what seems to be reported by the CIDR-REPORT, I would say that if
> #1 is revoked then it's likely all of the routes with the same AS Path
> will be revoked anyway. But if not, rather than the router having to
> recalculate whether the more specifics should or should not be accepted
> at each routing update, you could apply the same principles that route
> flap dampening uses. Reject paths #2 and #4 for X number of minutes
> before you bother checking again to see if the larger aggregate is still
> there.

The problem with this is that if you reject the routes initially and then later need them, then they're not in your incoming BRIB to reconsider. BGP is an incremental protocol. You can either save an update or you can ignore it, but if you ignore it, it's just plain gone.

If you do save it in your BRIB, then you can do this filtering between RIB and FIB. That turns out to be a completely local feature, requiring no protocol changes or additions whatsoever, and thus does not even require an RFC or Internet draft. This feature has been seen in some circles under the name "ORIB". Ask YFRV's PM for it. ;-)

Note that this feature *is* CPU intensive. This also does not decrease the RP RAM usage the way that update filtering would. In fact, due to the overhead of tracking filtered and non-filtered prefixes, there is additional RP RAM usage. YMMV.
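
A toy model of that local RIB-to-FIB filtering might look like the following. The class and method names are invented for the illustration and have nothing to do with any vendor's "ORIB" implementation. It keeps every route in the RIB, so a withdrawn aggregate automatically exposes the more specifics again, and it recomputes the whole FIB on every call, which is the kind of CPU and memory cost described above.

```python
import ipaddress

class FilteredFib:
    """Keep every route in a RIB; install only non-redundant ones in the FIB."""

    def __init__(self):
        self.rib = {}  # ip_network -> AS path tuple (nothing is ever discarded)

    def announce(self, prefix, as_path):
        self.rib[ipaddress.ip_network(prefix)] = tuple(as_path)

    def withdraw(self, prefix):
        self.rib.pop(ipaddress.ip_network(prefix), None)

    def _covered(self, net, path):
        # Walk from the next-shorter length down to /0, looking for a covering
        # route that carries the same AS path.
        for plen in range(net.prefixlen - 1, -1, -1):
            if self.rib.get(net.supernet(new_prefix=plen)) == path:
                return True
        return False

    def fib(self):
        """Recompute from scratch the prefixes that would be installed."""
        return sorted(str(n) for n, p in self.rib.items()
                      if not self._covered(n, p))

table = FilteredFib()
same = ("AS11111", "AS22222", "AS33333")
table.announce("192.168.0.0/16", same)
table.announce("192.168.1.0/24", same)
table.announce("192.168.2.0/24", ("AS11111", "AS55555", "AS44444", "AS33333"))
table.announce("192.168.3.0/24", same)
print(table.fib())  # → ['192.168.0.0/16', '192.168.2.0/24']

table.withdraw("192.168.0.0/16")
print(table.fib())  # → ['192.168.1.0/24', '192.168.2.0/24', '192.168.3.0/24']
```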

Tony

> If you do save it in your BRIB, then you can do this filtering between
> RIB and FIB. That turns out to be a completely local feature, requiring
> no protocol changes or additions whatsoever, and thus does not even
> require an RFC or Internet draft. This feature has been seen in some
> circles under the name "ORIB". Ask YFRV's PM for it. ;-)
>
> Note that this feature *is* CPU intensive. This also does not decrease
> the RP RAM usage the way that update filtering would. In fact, due to
> the overhead of tracking filtered and non-filtered prefixes, there is
> additional RP RAM usage. YMMV.

so, bottom line, no help other than reducing fib?

randy

Hm, are you going to communicate the filtered version of this BGP
table to your BGP peers, or just pass everything but install the
filtered RIB into FIB?

(This is where someone pipes up with some modelling/research done into
convergence as a function of npeers, CPU available, RIB, etc.)

Adrian

> I just can’t understand why they
> won’t/haven’t done a Sup32-3bxl for those using this platform but not
> moving enough Gbps to need the traffic capabilities of the Sup720-3bxl.

This part here just boggles the mind. Not everybody out there that needs full routes is pushing enough bandwidth to justify the cost of a 720 Gbps backplane – medium sized datacenters, regional ISPs, etc. all really like full routes but may never see even 30 Gbps of traffic. Everybody I’ve talked to about this particular problem has the same feelings – that big C is hanging their 6509 user base out to dry.

>>> If you do save it in your BRIB, then you can do this filtering between
>>> RIB and FIB. That turns out to be a completely local feature, requiring
>>> no protocol changes or additions whatsoever, and thus does not even
>>> require an RFC or Internet draft. This feature has been seen in some
>>> circles under the name "ORIB". Ask YFRV's PM for it. ;-)
>>>
>>> Note that this feature *is* CPU intensive. This also does not decrease
>>> the RP RAM usage the way that update filtering would. In fact, due to
>>> the overhead of tracking filtered and non-filtered prefixes, there is
>>> additional RP RAM usage. YMMV.

>> so, bottom line, no help other than reducing fib?

> Not unless you're actually willing to accept a real change in the results.

how about a filter between in-rib and what you actually crank through
the churning clothes washer? pass on the in-rib, calc on the phyltered
data. so when shorter prefix is withdrawn, you can look for next best
candidate.

note that my original proposal/case some years back allowed a number of
flavors of phylter, longer+same-next-hop, longer+same-as-path,
longer+same_origin-as.
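
The three flavors differ only in what "same" means, which a sketch can make explicit. Everything below is illustrative: the Route tuple, the flavor names as dictionary keys, and the next-hop addresses (taken from the 192.0.2.0/24 documentation range) are assumptions, and AS paths are assumed to be non-empty.

```python
import ipaddress
from collections import namedtuple

Route = namedtuple("Route", "prefix next_hop as_path")

# Each flavor decides when a longer prefix counts as redundant with a shorter one.
FLAVORS = {
    "same-next-hop": lambda a, b: a.next_hop == b.next_hop,
    "same-as-path": lambda a, b: a.as_path == b.as_path,
    "same-origin-as": lambda a, b: a.as_path[-1] == b.as_path[-1],
}

def phylter(routes, flavor):
    """Return the prefixes that survive the chosen flavor of filter."""
    same = FLAVORS[flavor]
    nets = [(ipaddress.ip_network(r.prefix), r) for r in routes]
    kept = []
    for net, route in nets:
        redundant = any(
            other_net.version == net.version
            and other_net.prefixlen < net.prefixlen
            and net.subnet_of(other_net)
            and same(route, other)
            for other_net, other in nets
        )
        if not redundant:
            kept.append(route.prefix)
    return kept

direct = ("AS11111", "AS22222", "AS33333")
routes = [
    Route("192.168.0.0/16", "192.0.2.1", direct),
    Route("192.168.1.0/24", "192.0.2.1", direct),
    Route("192.168.2.0/24", "192.0.2.1",
          ("AS11111", "AS55555", "AS44444", "AS33333")),
    Route("192.168.3.0/24", "192.0.2.1", direct),
]
print(phylter(routes, "same-as-path"))    # → ['192.168.0.0/16', '192.168.2.0/24']
print(phylter(routes, "same-origin-as"))  # → ['192.168.0.0/16']
print(phylter(routes, "same-next-hop"))   # → ['192.168.0.0/16']
```

With all four routes learned from the same neighbor, only the same-as-path flavor preserves the traffic-engineered /24; the other two squash it along with the rest.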

randy

> The problem with this is that if you reject the routes initially and
> then later need them, then they're not in your incoming BRIB to
> reconsider. BGP is an incremental protocol. You can either save an
> update or you can ignore it, but if you ignore it, it's just plain
> gone.

If BGP is an incremental protocol (which of course, I know it is), why
doesn't a certain vendor treat it that way?

*cough* BGP Scanner *cough*.

In any event, if the feature was implemented post-received routes (just
like prefix-lists were with soft-reconfig), having a copy of the table
that was sent to you by a peer, this would be trivial to do in code.
Would it be CPU intensive? Perhaps, but so is having 225k routes and
climbing. I'd submit that the CPU burned to do a route lookup on a
BGP-RIB when a route is withdrawn or announced to see if something less
specific exists would not in fact be that bad -- routing lookups, isn't
that what a router is supposed to do?
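
To put a rough bound on that lookup: with the received table keyed by prefix, checking for a less specific is at most one probe per shorter prefix length (32 for IPv4). The sketch below assumes a plain Python set stands in for the BGP RIB; the function name is hypothetical and this says nothing about how any real implementation stores its table.

```python
import ipaddress

def find_less_specific(rib, prefix):
    """Return the longest covering prefix in rib shorter than `prefix`, or None.

    rib: set (or dict) keyed by ipaddress network objects.
    """
    net = ipaddress.ip_network(prefix)
    # One membership test per candidate length, from longest to /0.
    for plen in range(net.prefixlen - 1, -1, -1):
        candidate = net.supernet(new_prefix=plen)
        if candidate in rib:
            return candidate
    return None

rib = {ipaddress.ip_network("192.168.0.0/16"),
       ipaddress.ip_network("10.0.0.0/8")}
print(find_less_specific(rib, "192.168.1.0/24"))  # → 192.168.0.0/16
print(find_less_specific(rib, "172.16.0.0/24"))   # → None
```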

Randy Bush wrote:

> note that my original proposal/case some years back allowed a number of
> flavors of phylter, longer+same-next-hop, longer+same-as-path,
> longer+same_origin-as.

I vaguely remember it. Point to the draft for us again, please?

Does anybody implement anything like this? Time to name vendor names!

Interesting you should mention this as we are planning to
test an "improvement" to the BGP Scanner process, BGP
Support for Next-Hop Address Tracking.

Some notes from the vendor:

"The BGP Support for Next-Hop Address Tracking feature is
enabled by default when a supporting Cisco IOS software
image is installed. BGP next-hop address tracking is event
driven. BGP prefixes are automatically tracked as peering
sessions are established. Next-hop changes are rapidly
reported to the BGP routing process as they are updated in
the RIB. This optimization improves overall BGP convergence
by reducing the response time to next-hop changes for
routes installed in the RIB. When a bestpath calculation is
run in between BGP scanner cycles, only next-hop changes
are tracked and processed."

How much of an improvement this will make is what we are
hoping to find out.

Cheers,

Mark.

There are Vendor C platforms that can push much more than 30Gbit, and take a full table comfortably, that cost a lot less than 6500 series kit.

>> note that my original proposal/case some years back allowed a number of
>> flavors of phylter, longer+same-next-hop, longer+same-as-path,
>> longer+same_origin-as.

> I vaguely remember it. Point to the draft for us again, please?

hot and humid, here in s'pore. no draft except in aircon. i stopped
doing that <bleep> years ago. waste of time.

randy

[snip]

> It seems either option would be better for not breaking connectivity than

[snip]
Flatly, in my experience breaking connectivity for the apathetic
or clueless folks abusing the commons is the only way to get them
to change behavior. At worst, your own customers are inconvenienced
while the other party gets rulers and prepares for a locker room
measuring contest, and you relent first, poking a hole in a policy.
At best, clued technical people trapped in the remote networks'
organization get an "I told you so" reason to Do The Right Thing.

You can rathole the discussion on specific implementations and
memory structures all the livelong day, but that won't change any
individual operator's behavior. Are you confident YFRV will
deliver any updated feature[s] in a timescale that fits your own
networks' projected FIB & memory crush? Will it actually address
the problem or just move the curve a little further into the future?

Cheers,

Joe