Persistent BGP peer flapping - do you care?

A good rule of thumb (possibly from RFC 822) is, be liberal in what you
accept and strict in what you send.

That's a good rule of thumb in general, but I'm not sure it makes sense
to apply it to the routing fabric of the entire Internet. Any router
that sends you a malformed update is without question broken. I think
the point of saying "drop the session on receipt of a bad update" is
that accepting updates from broken routers is bad practice when those
updates are being used as the basis for routing in the global Internet.
If you leave the session up, and just drop the malformed update, you
are then accepting and passing on (assuming you have peers or
downstreams) routing updates from a router known to be broken.

In terms of the original rule, if you're liberal in what you accept
from BGP (i.e. you reject the malformed update, but accept the other
updates from the same router), you are also (if you have any peers or
downstreams) effectively being liberal (instead of strict) in what you
send them. (Sure, you'll be strict about the *formatting* of what you
send. But you're being liberal in the sense that you're passing on
routes from a known-to-be-broken router.)

I would suspect that during implementation, brand C routers were the
victims during testing, and perhaps the change was made to avoid that.

ISTM that if that were the case, Brand C would have chosen to reject
the update but maintain the session, as opposed to accepting and
passing on the update. My guess is that it was just an ordinary bug in
the AS_PATH validation code, that resulted in the BGP implementation
failing to realize that the update was malformed.

The ensuing meltdowns caused by the bug are essentially a problem of the
homogeneity of the Internet. The malformed update could only
spread from Brand C to Brand C; had there been a lot more diversity in
the core of the Internet, the update would probably not have spread as
far or had as great an impact.

My suggestion would be, rather than a back-off of resetting BGP sessions,
to first attempt strict interpretation (to insulate against completely
insane routers), and then loose interpretation. The model is "Fool me once,
shame on you, fool me twice, shame on me."

On first receiving a bad update, reset. If upon re-establishing the session,
the same bad update is heard, drop the bad update but keep the session up
(along with the messages back, etc.)
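
To make the proposal above concrete, here is a rough sketch of how I
read it. This is not any vendor's implementation; the class and method
names are made up for illustration, and I'm assuming "the same bad
update" means a byte-for-byte match on the raw message.

```python
from enum import Enum


class Action(Enum):
    RESET_SESSION = "reset session"
    DROP_UPDATE = "drop update, keep session up"


class FoolMeOncePolicy:
    """Hypothetical per-peer policy: strict first, loose on a repeat."""

    def __init__(self):
        # Raw bytes of malformed updates that already cost us a reset.
        # Assumption: "same update" is decided by exact byte equality.
        self._seen = set()

    def on_malformed_update(self, raw_update: bytes) -> Action:
        if raw_update in self._seen:
            # Fool me twice: we already reset over this exact update.
            return Action.DROP_UPDATE
        # Fool me once: remember the update and tear the session down.
        self._seen.add(raw_update)
        return Action.RESET_SESSION
```

Note that a *different* malformed update from the same peer still
triggers a reset under this reading; only the repeat is tolerated.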

The potential risk I see is that you are still passing on updates from
a router known to be broken. From a purely reactive perspective, we
look at past failures and say "when it happened last time, that would
have been a good idea, because all the other updates were good". But
from a proactive, more general perspective, the receiving router really
has no way of knowing just how broken the router on the other end of
the link is.

I do agree, though, with the observation that this can vary on a case
by case basis. For example, a multi-homed end user isn't generally
propagating any of its BGP-received routes. So it might make sense in
such a case to just reject only the malformed packet, because the
alternative is to significantly degrade their connectivity over a
single routing update. (And there is no offsetting benefit to the core
routing fabric of the Internet, because such an end-user isn't really
participating in that.) So, yes, having a knob to control the behavior
might be a good idea. But I would stop short of saying that everyone
in the core should configure that knob to leave the session up with a
known-to-be-broken neighbor.

Resetting BGP more than a small, finite number of times is, IMHO, a bad
idea. After all, BGP is a stateful protocol, and state changes should be
triggered deterministically, even if that requires operator input.

Yes, I agree with that also. Dropping a session to a misbehaving peer
is a good idea; restarting it immediately after every drop (so you can
just drop it again when it misbehaves again) is bad.
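
As a rough illustration of "a small, finite number of times", something
like the following would do. Again, this is only a sketch: the names
and the default limit of 2 are my own assumptions, not anything from a
real BGP implementation.

```python
class RestartLimiter:
    """Hypothetical per-peer cap on automatic session restarts."""

    def __init__(self, max_auto_restarts: int = 2):
        # Assumed default; the right number is an operator decision.
        self.max_auto_restarts = max_auto_restarts
        self.restarts_used = 0
        self.held_down = False

    def may_restart(self) -> bool:
        """Call after dropping the session because the peer misbehaved.

        Returns True if the session may be re-established automatically.
        """
        if self.held_down or self.restarts_used >= self.max_auto_restarts:
            # Budget exhausted: stay down until an operator steps in.
            self.held_down = True
            return False
        self.restarts_used += 1
        return True

    def operator_clear(self) -> None:
        """Explicit operator input: allow restarts again."""
        self.restarts_used = 0
        self.held_down = False
```

With the default limit, the session comes back on its own twice; the
third drop leaves it down until someone calls operator_clear().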

     -- Brett