Global BGP - 2001-06-23 - Vendor X's statement...

E.B. Dreger wrote:

> Date: Wed, 27 Jun 2001 13:21:20 -0400
> From: Matt Levine <matt@deliver3.com>

> Agreed, so throw the bad route to the bit bucket and leave the bgp
> session open, or at the very least (as others have suggested) give me
> an OPTION to do that. Bad enough we were only operating at 33%
> capacity, however, if we only had transit from the 4 that were giving
> us the bad route, we would have lost connectivity totally. While it

<imesho>

On the surface, this appears to be correct.

But let's ask ourselves _why_ those upstreams had bad routes. It's
because _they_ did not filter at the edge. If bad routes leak, but are
filtered before reaching the core, then they never make it to you.

IOW, your concern is a non-issue if the large providers apply similar
filtering at the edge. You wouldn't be cutting yourself off because the
provider in question would have filtered it long ago.

Correct. However, this means I have to place my complete trust in them
to Do Things Right (well, them, and more importantly in this case, their
vendors). As Saturday has demonstrated, this is not a safe assumption,
in that there appears to be some significant number of boxes in the core
which will propagate bad routing data, even if they are also resetting
the sessions which it came from (note: I'm not saying it's Cisco. It
might be; historically, Ciscos have done this before. But I have no
direct evidence that they did, or didn't; only the inference that it
had to be *something* used on a very widespread basis, given the number
of peers that had the problem simultaneously. Oh, and I *do* know,
from direct observation, that the Ciscos facing us were either causing
this bug themselves (possible, but it doesn't seem terribly likely
given the spread of them), or transiting the route to us when they
should have been ditching it, along with the session).

Do it at the edge, and the Internet does not become any more brittle.

The same with source-filtering IPs. Do it at the edge, and the problem
goes away. Now, *how* long has it taken to implement this?
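
For illustration only, source filtering at the edge can be sketched roughly
as below. This is a minimal sketch, not any vendor's implementation: the
port names, the CUSTOMER_PREFIXES table and the permit_source() helper are
all hypothetical, and the prefixes are documentation ranges.

```python
import ipaddress

# Hypothetical mapping: customer-facing port -> prefixes assigned to that customer.
CUSTOMER_PREFIXES = {
    "cust-a": [ipaddress.ip_network("192.0.2.0/24")],
    "cust-b": [
        ipaddress.ip_network("198.51.100.0/24"),
        ipaddress.ip_network("203.0.113.0/25"),
    ],
}


def permit_source(port: str, src: str) -> bool:
    """Return True only if src falls inside a prefix assigned to this port.

    Unknown ports have no assigned prefixes, so everything from them is dropped.
    """
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in CUSTOMER_PREFIXES.get(port, []))
```

The point of the sketch is where the check sits: at the port facing the
customer, so a spoofed source never gets far enough for anyone else to have
to deal with it.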

Someone said, a few messages ago, that the purpose of a routing protocol
is to avoid loops. I disagree. The purpose of a routing protocol is to
propagate good, viable routing information. Thus, it MUST have a way
to deal with bad routing information, but it SHOULD (IMO) have a way
to deal with said information that is not necessarily fatal. We have
quite clearly demonstrated that it is a non-trivial possibility that
A) bad routes will manage to become widespread, through various bugs,
and B) it is possible to have one or two bad routes in an otherwise
useful table of 100,000 routes.

When reality says the basis of your design theory is inaccurate, well,
it's time to look at revamping the design to accommodate it, if
that can be done without trashing the whole thing (sometimes even if
it takes that, but I see no call for it in this case, as it's not that
severe, and it is entirely fixable without tossing out everything that
has worked so far).

As for making money... if the general agreement is that "BGP death
penalty" is correct, let the violators and bad BGP speakers face the
consequences of spewing garbage.

When the violators are "almost every major transit provider", this means
you'll be off in a corner playing Internet by yourself. This isn't very
attractive to most potential customers, no matter how RFC compliant you
are. Again, Saturday showed that this is, in fact, the case. I would love
to see the core problem fixed, and never *need* to invoke anything that
ditches single bad routes because the only breakages occur when a peer
goes completely nuts and spews garbage at me. Unfortunately, this hasn't
been the case for a long time now, and doesn't appear terribly likely to
be fixed tomorrow, given what the press releases have said about various
vendors...

It seems that the right way to handle a malformed route or two depends on
who's speaking and who's listening.

If I'm a backbone provider and I hear a bad route from a customer, I'm
going to drop that connection. I have no incentive to take any risks.
This is just as the RFC currently reads.

If I'm a customer, I really don't want to shut off the service that I'm
paying for. If I'm not going to propagate the routes beyond my borders,
why should I drop the whole session? The risk is entirely mine, and a
partially corrupted table is better than no connectivity at all.

From this point of view, it seems that the RFC should be loosened to allow
configuration of a BGP peer to continue the session and ignore the route.

Perhaps there should be wording to the effect that it is not acceptable
practice to propagate routes from the offending router beyond your
borders.
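
As a rough sketch of what such a per-peer option might look like (purely
hypothetical, not taken from any RFC or router software): OnMalformed, Peer,
receive(), exportable() and the has_as_path() check are all invented names,
and has_as_path() is only a stand-in for real UPDATE validation.

```python
from dataclasses import dataclass, field
from enum import Enum


class OnMalformed(Enum):
    RESET_SESSION = "reset"    # tear the session down on a bad route
    DISCARD_ROUTE = "discard"  # proposed option: drop the route, keep the session


@dataclass
class Peer:
    name: str
    on_malformed: OnMalformed
    session_up: bool = True
    suspect: bool = False                      # set once a bad route has been seen
    table: dict = field(default_factory=dict)  # prefix -> attributes


def has_as_path(attrs: dict) -> bool:
    """Stand-in validity check (assumption): a route needs a non-empty AS path."""
    return bool(attrs.get("as_path"))


def receive(peer: Peer, prefix: str, attrs: dict, is_well_formed=has_as_path) -> None:
    """Apply the peer's configured policy to one received route."""
    if not peer.session_up:
        return
    if is_well_formed(attrs):
        peer.table[prefix] = attrs
        return
    if peer.on_malformed is OnMalformed.RESET_SESSION:
        peer.session_up = False
        peer.table.clear()   # all routes from this peer are withdrawn
    else:
        peer.suspect = True  # bad route is ignored; the session stays up


def exportable(peer: Peer) -> dict:
    """Routes from a suspect peer are used locally but never re-advertised."""
    return {} if peer.suspect else dict(peer.table)
```

A backbone would configure its customer-facing peers with RESET_SESSION; a
customer could configure its upstreams with DISCARD_ROUTE and keep using the
rest of the table, while exportable() keeps anything learned from the
suspect peer inside its own borders.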

Maybe there is even a way to phrase it that means "it's not OK to
propagate routes from a suspect router back into the core of the
Internet." In practice these words have meaning because "upstream" and
"downstream" are defined by the flow of money, and economics suppresses
loops.

Steve Schaefer

Dashbit - The Leader In Internet Topology
www.dashbit.com www.traceloop.com