Persistent BGP peer flapping - do you care?

Jake_Khuon4 · January 17, 2002, 10:21pm

### On Thu, 17 Jan 2002 17:00:06 -0500, "Christopher A. Woodfield"
### <rekoil@semihuman.com> casually decided to expound upon Susan Hares
### <skh@nexthop.com> the following thoughts about "Re: Persistent BGP peer
### flapping - do you care?":

I agree with your holddown timer proposal in cases of the peer being dropped due to
errors, as the resultant loops can result in extreme prefix dampening. But my
assertion is that BGP peering sessions should be a bit more robust and not drop
everything at the first sign of trouble.

Well, as I recall, the original reason for dropping the entire session and
thereby flushing that peer from the table is that an invalid advertisement
may be symptomatic of a larger-scale table corruption on the part of the
peer, so all of its advertisements should be invalidated. Dropping the peer and
thereby initiating a coldstart/reset was the conservative solution. I think
some form of peer damping with an exponential decay timer, much like route
flap damping, would be a good thing. Simply reject the OPEN until the decay
timer has expired.
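
A minimal sketch of the peer damping idea described above, assuming a penalty model borrowed from route flap damping. The class name, the thresholds, and the 15-minute half-life are all hypothetical values chosen for illustration and do not come from any spec or implementation.

```python
class PeerDamping:
    """Hypothetical per-peer penalty with exponential decay (all values are assumptions)."""

    def __init__(self, penalty_per_drop=1000.0, suppress_limit=2000.0,
                 reuse_limit=750.0, half_life=900.0):
        self.penalty_per_drop = penalty_per_drop  # added each time the session drops
        self.suppress_limit = suppress_limit      # start rejecting OPENs at this penalty
        self.reuse_limit = reuse_limit            # accept OPENs again below this penalty
        self.half_life = half_life                # seconds for the penalty to halve
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Exponential decay of the accumulated penalty since the last update.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def session_dropped(self, now):
        # Called when the session to this peer is torn down because of an error.
        self._decay(now)
        self.penalty += self.penalty_per_drop
        if self.penalty >= self.suppress_limit:
            self.suppressed = True

    def accept_open(self, now):
        # Called when the peer sends an OPEN; False means reject it for now.
        self._decay(now)
        if self.suppressed and self.penalty <= self.reuse_limit:
            self.suppressed = False
        return not self.suppressed
```

With these assumed numbers, a single drop does not suppress the peer, three drops within a few seconds do, and the peer is accepted again once the penalty has decayed below the reuse limit.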

As for propagation of the bad prefix... well that soapbox has worn paint on
top. If people aren't going to bother following specs in the first place
I'm not sure a new spec will solve anything.

Dave_Israel1 · January 17, 2002, 10:34pm

It's a question of robustness; if the new spec includes a way to be
tolerant of how the spec is (or can be) commonly abused, then the
followers of the spec will not be at the mercy of those who deviate.

In this case, I think that having the option to keep a session that
gives bad routes up, and just dropping the route, is a good answer.
That would allow the user to determine which is preferable for a given
peer: possible corruption or certain disconnection.

-Dave

Vijay_Gill1 · January 17, 2002, 10:42pm

If you have a "bad route" how do you know the rest of the update is good?
The NLRI may have gotten corrupted on the wire or between the interface
and the processor (parity error, or some sort of corruption on the bus).
Given that case, in an update, I am not sure you can determine which NLRI
are good and selectively propagate and process those. See also
meltdowns circa Nov 1998.

/vijay

Dave_Israel1 · January 17, 2002, 11:03pm

You don't. That's why I suggested an option. If you're talking about
somebody who lives in the core, then you probably never want to trust
somebody who hands you a bad update. If, however, you're on the edge,
you might decide to keep trying to talk to somebody who hands you a
bad update, at least until the error rate reaches some threshold (or
the other router goes down in flames), rather than turn off what may
be your only remaining connectivity and sulk alone in your corner.
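
A rough sketch of the per-peer option described above, assuming a simple sliding-window error count. The class name, the `tolerant` flag, and the default threshold and window are hypothetical, not part of any BGP spec or router configuration.

```python
from collections import deque


class BadUpdatePolicy:
    """Hypothetical per-peer knob: strict (core) or tolerant (edge) handling of bad updates."""

    def __init__(self, tolerant, max_errors=5, window=300.0):
        self.tolerant = tolerant      # False: drop the session on the first bad update
        self.max_errors = max_errors  # bad updates tolerated within the window
        self.window = window          # window length in seconds
        self.errors = deque()         # timestamps of recent bad updates

    def on_bad_update(self, now):
        """Return 'drop_session' or 'drop_route' for a bad update seen at time `now`."""
        if not self.tolerant:
            return "drop_session"
        self.errors.append(now)
        # Forget errors that have aged out of the window.
        while self.errors and now - self.errors[0] > self.window:
            self.errors.popleft()
        if len(self.errors) > self.max_errors:
            return "drop_session"
        return "drop_route"
```

A core router would use `BadUpdatePolicy(tolerant=False)`, while an edge router with a single upstream might use `BadUpdatePolicy(tolerant=True)` and accept the risk of corruption in exchange for staying connected.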

-Dave

SKH · January 17, 2002, 11:27pm

Dave:

The state machine plus an option in the MIB can make this workable
via the specification. It is important to let the user decide on
a per-peer basis which is worse.

Thank you very much for this input. Your input makes
the next choices for the BGP spec easier.

Thanks again!!

Sue Hares

SKH · January 19, 2002, 2:39am

vijay:

What else causes repetitive peer bounces other than the broken prefix?

sue

PS - I'm away from work from now until Monday morning.

Joel_Aelwyn · January 20, 2002, 9:02pm

There was another notion that never made it off the drawing board (not even
into proposal) regarding "graceful error recovery", a way to assume that
your peer's *entire* table wasn't suspect, just the malformed part, and
notify the peer that there was a problem. Do this too many times, and you
drop the session, still, of course.
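
To make the "graceful error recovery" notion concrete, here is a small sketch under heavy assumptions: prefixes arrive as already-decoded strings, "malformed" just means Python's `ipaddress` module refuses to parse them, and the function name, return values, and `max_errors` default are hypothetical. Real malformation happens at the wire-encoding level, which this does not model.

```python
import ipaddress


def handle_update(prefixes, error_count, max_errors=3):
    """Keep the well-formed prefixes, flag the rest, and count bad updates per session.

    Returns (accepted, rejected, new_error_count, action), where action is
    'ok', 'notify_peer', or 'drop_session'.
    """
    accepted, rejected = [], []
    for prefix in prefixes:
        try:
            ipaddress.ip_network(prefix)  # strict parsing: host bits must be zero
            accepted.append(prefix)
        except ValueError:
            rejected.append(prefix)

    if not rejected:
        return accepted, rejected, error_count, "ok"

    error_count += 1
    if error_count > max_errors:
        # Too many bad updates: fall back to treating the whole table as suspect.
        return [], rejected, error_count, "drop_session"
    return accepted, rejected, error_count, "notify_peer"
```

The caller keeps `error_count` per session and passes it back in on each update, so the session is still dropped once the peer has misbehaved more than `max_errors` times.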

The not-even-a-formal-draft is still around somewhere.