RE: Global BGP - 2001-06-23 - Vendor X's statement...

> There will always be cases where Vendor A thinks they are correct and
> Vendor B thinks they are correct, and they differ. And you are
> correct, either the sender has done something wrong or the receiver
> has done something wrong, hence the Internet motto.

But there should be no room for debate; one side is right and the
other side is wrong. If there is really a grey area, the solution is to
fix the wording of the standards document, not to try and overlook the
problem.

I'm not proposing we overlook the problem. However, software is very
bad at deciding who is right and who is wrong. Other than malware, most
vendor software does not deliberately send bad data. The software, or
rather the programmer who wrote the software, thought the program was
sending correct data. Later when humans looked at the data, humans
decided the data was wrong and fixed the software.

What do we do between the time the software makes an error and the time
the humans can intervene?

Have the software, with no human oversight, nuke everything? The Blue
Screen Of Death may be a very "safe" thing for software to do when it
encounters an error. However, it is not a very good thing for system
availability.

I agree error handling is "hard."

Aborting the entire BGP session makes the Internet more brittle than
necessary. In the hours/days between the software sending the data, and
the humans fixing it, the network was hurting a lot more than you would
expect from a single bad route. The constant cycle of abort, reset, and
route flap had an amazing multiplier effect on one bad route.

I agree that in this case it would have been possible to ignore the bad
AS PATH and drop the route without disturbing the session originating the
bad information. This one specific example could probably have been handled
better with a non-fatal notification (with big red lights and buzzers).
However, it was unacceptable for that router to propagate the bad
information to others.

I agree, you must have both sides (conservative send, and liberal receive).

Sending bad data is not acceptable. Cisco should not send bad data.

Crashing/aborting when you receive bad data isn't acceptable either. Bad
data happens, Vendor X should not abort if it had other options.

Sometimes there is no alternative besides aborting. However, the RFC makes
aborting a requirement. There are errors BGP implementations could recover
from (with blinking red lights and loud buzzers). The RFC should give
implementations the option of continuing.
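
To make the distinction concrete, here is a minimal sketch of a receiver
that picks the narrowest response it can. It is not taken from any vendor's
code or from the RFC; the function name, the parsed inputs, and the checks
(an external peer, 2-byte AS numbers) are all assumptions for illustration.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    DROP_ROUTE = "drop route, keep session, raise alarm"
    RESET_SESSION = "reset session"


def classify(framing_ok, as_path, alarms):
    """Pick the narrowest response that still contains the damage.

    framing_ok: False when the message header or length could not be parsed.
    as_path:    list of AS numbers from an external peer (hypothetical
                already-parsed form, assuming 2-byte AS numbers).
    alarms:     list that collects operator-visible alarm text.
    """
    if not framing_ok:
        # Message boundaries are lost, so nothing after this point in the
        # byte stream can be trusted. Here aborting really is the only option.
        alarms.append("unparseable message: resetting session")
        return Action.RESET_SESSION
    if not as_path or any(not (0 < asn < 65536) for asn in as_path):
        # The message parsed, but this one route is nonsense: drop it, do not
        # propagate it, and make noise so a human looks at it.
        alarms.append(f"bad AS path {as_path!r}: route dropped, session kept")
        return Action.DROP_ROUTE
    return Action.ACCEPT
```

The point of the sketch is only the shape of the decision: the session
reset is reserved for the case where the receiver can no longer find
message boundaries, and everything else is contained to the one route.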

"I was following the standard" isn't a good reason to crash. If following
the standard causes the Internet to flap like a hummingbird for a day,
we need to get the standard changed (as well as fix the existing
implementations).

These are not mutually exclusive goals.

   1) Modify the standard so an error does not have as much impact worldwide
   2) Fix the current implementations

Yes, a pedestrian may have the right of way in the crosswalk. But proving
your point by having the semi-truck flatten you isn't very smart.

Date: 26 Jun 2001 19:23:42 -0700
From: Sean Donelan <sean@donelan.com>

[ heavy snipping throughout ]

> I agree, you must have both sides (conservative send, and liberal
> receive).

> Sending bad data is not acceptable. Cisco should not send bad data.

I think that everyone agrees here... the question is, what penalty to
apply and with what scope when some router spews bad data?

> Crashing/aborting when you receive bad data isn't acceptable either.
> Bad data happens, Vendor X should not abort if it had other options.

1. Flapping. If the route is bad, put the route in "time out corner".
2. AS-PATH filtering. If the as-path looks funny, kill the route.
3. Bogon/spoofing filtering. If the source IP is funny, block traffic
   from that IP.
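
As an illustration of item 1, the "time out corner" can be sketched as a
per-route penalty that grows with each flap and decays exponentially. This
is a rough sketch in the spirit of route flap damping, not any router's
implementation; the class name and the numeric defaults (penalty per flap,
suppress and reuse thresholds, half-life in seconds) are illustrative
assumptions.

```python
class DampedRoute:
    """Per-route flap penalty with exponential decay (illustrative values)."""

    def __init__(self, half_life=900.0, suppress=2000.0, reuse=750.0,
                 per_flap=1000.0):
        self.half_life = half_life  # seconds for the penalty to halve
        self.suppress = suppress    # penalty at which the route is suppressed
        self.reuse = reuse          # penalty below which it is usable again
        self.per_flap = per_flap    # penalty added for each flap
        self.penalty = 0.0
        self.last = 0.0             # time of the last penalty update
        self.suppressed = False

    def _decay(self, now):
        # Halve the penalty once per half_life of elapsed time.
        self.penalty *= 0.5 ** ((now - self.last) / self.half_life)
        self.last = now

    def flap(self, now):
        self._decay(now)
        self.penalty += self.per_flap
        if self.penalty >= self.suppress:
            self.suppressed = True

    def usable(self, now):
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False
        return not self.suppressed
```

Note that the penalty applies to one route only, and that the router has
to carry a penalty and a timestamp for every route it tracks.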

Solutions to routing problems follow a "punishment fitting the crime"
system. In this sense, I agree with your logic about penalizing a single
route being of appropriate scope for bad BGP.

Heavy flapping is bad because of a two-word phrase: state maintenance.
Any proposed solution should avoid intensive state maintenance, else it
will be as much of a pain as flapping.

My gut feel is that I'd rather nuke the connection with a bad router,
deducing "we don't trust this one". Looking at the above 1-3, however,
this sort of behavior does not make sense:

1. If a route flaps, do we damp[en] all routes from it, because { one is |
   some are } "bad"? No.
2. When some idiot redistributes their upstreams' routes, do we kill their
   BGP session? I wish, but the answer is no.
3. When funky packets land, do we blackhole anything from the sending
   router? Nope; this would be increasingly dangerous as one got farther
   into the core.

The above are examples of layer-eight mistakes. If we consider bad data
to be the result of a loose nut between the keyboard and the chair, then
we should probably penalize on a per-route basis.

Up to this point, I agree with you, Sean [Donelan]. But the $100k
question (100kroute question?) is:

"Does bad data fit in this category, or does it mean that the router on
the other end is so fscked that we kill the connection?"

It would seem that the RFCs imply the latter. If we suggest otherwise, I
should think that we should argue on these grounds... this is where it is
handy to have data that will either prove or disprove the claim that "bad
data = bad router".

My $0.01 (only $0.01 because I'm at the edge),
Eddy

How about if there was a tool you could run against a BGP speaker
which sent a series of deliberately pathological and bogus updates,
and logged the behaviour of the box under test?
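
A rough sketch of the message-building half of such a tool follows. It only
constructs UPDATE messages as bytes (one well-formed, three with a damaged
AS_PATH attribute); it does not open a session, send anything, or log the
response, all of which a real tool would need. The helper names, the test
case names, and the addresses and AS numbers are made up for illustration.

```python
import socket
import struct

MARKER = b"\xff" * 16   # BGP header marker, all ones
UPDATE = 2              # BGP message type code for UPDATE


def attr(flags, type_code, value):
    # Path attribute with a one-byte length field (no extended length).
    return struct.pack("!BBB", flags, type_code, len(value)) + value


def as_path(seg_type, asns, claimed_count=None):
    # One AS_PATH segment; claimed_count lets the count field lie on purpose.
    count = len(asns) if claimed_count is None else claimed_count
    body = b"".join(struct.pack("!H", a) for a in asns)
    return struct.pack("!BB", seg_type, count) + body


def update(path_value, prefix="192.0.2.0", prefix_len=24,
           next_hop="198.51.100.1"):
    attrs = (
        attr(0x40, 1, b"\x00")                        # ORIGIN = IGP
        + attr(0x40, 2, path_value)                   # AS_PATH under test
        + attr(0x40, 3, socket.inet_aton(next_hop))   # NEXT_HOP
    )
    nlri = bytes([prefix_len]) + socket.inet_aton(prefix)[:(prefix_len + 7) // 8]
    # No withdrawn routes, then the attributes, then the announced prefix.
    body = struct.pack("!H", 0) + struct.pack("!H", len(attrs)) + attrs + nlri
    return MARKER + struct.pack("!HB", 19 + len(body), UPDATE) + body


# Hypothetical test cases: segment type 2 is AS_SEQUENCE.
CASES = {
    "valid": update(as_path(2, [64500, 64501])),
    "empty_segment": update(as_path(2, [])),
    "unknown_segment_type": update(as_path(9, [64500])),
    "count_overruns_attribute": update(as_path(2, [64500], claimed_count=5)),
}
```

Every case keeps a correct message header and length, so the box under
test can always find the message boundaries; only the AS_PATH contents are
wrong. That makes it possible to see whether it drops the route or drops
the session.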

I haven't heard anybody say that vendor X, Y or Z are refusing to
fix bugs when they are pointed out to them (quite the contrary).
The trick would seem to be to report the bugs before they are found
in the wild.

What BGP acceptance tests do people currently run against prospective
vendors' hardware?

Joe