Global BGP - 2001-06-23

Brett Frankenberger wrote:

> A) Ciscos flap sessions, according to the only reports I've heard.

Is it an invalid AS_PATH? If so, a Cisco that receives it is required by
the RFC to drop the session. Failing to do so (and then propagating the
bogus advertisement) was the cause of the original problem ... AFAIK,
the fix (which was released a long time ago, but may not yet be running
everywhere) makes the Cisco behave properly, which is to drop the
session.
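A minimal sketch of the receive-side check being described, assuming 2-octet ASNs as in RFC 1771 BGP-4 and ignoring confederation segment types: parse the AS_PATH attribute value and, if it is syntactically incorrect, answer with a NOTIFICATION carrying error code 3 (UPDATE Message Error), subcode 11 (Malformed AS_PATH), after which the session is closed. The `Notification` class and `parse_as_path` function are hypothetical names for illustration, not any vendor's code.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

AS_SET = 1
AS_SEQUENCE = 2

UPDATE_MESSAGE_ERROR = 3  # NOTIFICATION error code
MALFORMED_AS_PATH = 11    # UPDATE Message Error subcode


@dataclass
class Notification:
    """Hypothetical stand-in for a BGP NOTIFICATION; the session closes after sending it."""
    code: int
    subcode: int


Segment = Tuple[int, List[int]]


def parse_as_path(data: bytes) -> Union[List[Segment], Notification]:
    """Parse an AS_PATH attribute value (2-octet ASNs assumed).

    Returns a list of (segment_type, [asn, ...]) on success, or a
    Notification(UPDATE Message Error, Malformed AS_PATH) on bad syntax.
    """
    segments: List[Segment] = []
    pos = 0
    while pos < len(data):
        # Each segment header is 2 octets: segment type, then ASN count.
        if pos + 2 > len(data):
            return Notification(UPDATE_MESSAGE_ERROR, MALFORMED_AS_PATH)
        seg_type, count = data[pos], data[pos + 1]
        pos += 2
        if seg_type not in (AS_SET, AS_SEQUENCE):
            return Notification(UPDATE_MESSAGE_ERROR, MALFORMED_AS_PATH)
        end = pos + 2 * count
        if end > len(data):
            # Declared ASN count overruns the attribute: malformed.
            return Notification(UPDATE_MESSAGE_ERROR, MALFORMED_AS_PATH)
        asns = [int.from_bytes(data[i:i + 2], "big") for i in range(pos, end, 2)]
        segments.append((seg_type, asns))
        pos = end
    return segments


print(parse_as_path(bytes([2, 2, 0x00, 0x01, 0xFD, 0xE8])))  # [(2, [1, 65000])]
print(parse_as_path(bytes([2, 3, 0x00, 0x01])))  # Notification(code=3, subcode=11)
```

The point of returning a NOTIFICATION rather than a "best effort" path is the one made above: the receiver refuses the bad data and resets, instead of passing it along to its own peers.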

Clarification: Ciscos take a buggy route and turn it into an invalid
one. This causes their Cisco peers to flap the session (yes, as they
should), while some other vendors (B, below) appear to have more
serious issues.

> B) <X> routers were crashing, either due to the bug, or the session resets.
> Thus, <X> is being flogged. I have reports of at least one <Y> having
> problems, as well.

Well, OK. If <X> is crashing, then <X> has a problem, and I didn't
mean to imply otherwise. Mostly, I was posting because I frequently
hear the "Bay vs. Cisco" crashes of yore reported as "Bays were
dropping BGP sessions". That implies that the Bay was broken, when
in reality Bay (and most other non-Cisco implementations) was doing
what was required by the RFC.

Not knowing who <X> is (although I could probably guess) or what <X>
was doing, I posted to clarify that routers that drop BGP sessions upon
receiving invalid advertisements are not broken; rather, they are doing
what is required.

A good point, and entirely true. I apologize for not being clear about
the bug, but I was/am trying to step carefully around the NDAs. And yes,
they're annoying, and there are probably some people who believe I'm
violating them even now. (Hopefully not the lawyers...)

> I have no data on Bay; my apologies if this wasn't clear. Bay was *only*
> being referenced as a historical point of note. No attempt at FUD, and my
> apologies if anyone read it that way.

And I wasn't attempting to defend them, either -- I'm just curious
about the problem.

Anyway, someone had to be passing this advertisement around ... if the
Ciscos were dropping the session in response to it, and <X>s were
crashing, who's left to pass the bad advertisement around? Ciscos with
older code that propagated the advertisement upon receipt, instead of
sending a NOTIFICATION and tearing the session down?

I'm not entirely clear on this; the bug ID implies that iBGP peers may
be treated differently from external peers (part of it possibly
involves prepending one's own ASN; again, I'm not entirely clear on it,
even after reading the bug report).
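For reference, the standard BGP-4 rule this seems to touch on: a speaker prepends its own ASN only when advertising to an external peer, and leaves the AS_PATH unmodified toward internal (iBGP) peers. Below is a simplified sketch, assuming 2-octet ASNs, no confederations, and a hypothetical `outbound_as_path` helper; it says nothing about what the actual bug was. A leading AS_SET, an empty path, or a leading AS_SEQUENCE already holding 255 ASNs (the segment length field is one octet) gets a new AS_SEQUENCE segment instead.

```python
from typing import List, Tuple

AS_SET = 1
AS_SEQUENCE = 2
MAX_SEGMENT_LEN = 255  # segment length is a single octet on the wire

Segment = Tuple[int, List[int]]


def outbound_as_path(path: List[Segment], local_as: int,
                     peer_is_external: bool) -> List[Segment]:
    """Return the AS_PATH to advertise to a peer (simplified BGP-4 rule)."""
    # Copy so the caller's stored path is never mutated.
    out = [(seg_type, list(asns)) for seg_type, asns in path]
    if not peer_is_external:
        # iBGP: the AS_PATH is passed along unchanged.
        return out
    if out and out[0][0] == AS_SEQUENCE and len(out[0][1]) < MAX_SEGMENT_LEN:
        # Prepend own AS to the existing leftmost AS_SEQUENCE.
        out[0][1].insert(0, local_as)
    else:
        # Empty path, leading AS_SET, or full sequence: start a new AS_SEQUENCE.
        out.insert(0, (AS_SEQUENCE, [local_as]))
    return out


# 64512 is a private ASN used here purely as an example.
print(outbound_as_path([(AS_SEQUENCE, [1, 65000])], 64512, True))   # [(2, [64512, 1, 65000])]
print(outbound_as_path([(AS_SEQUENCE, [1, 65000])], 64512, False))  # [(2, [1, 65000])]
```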

Naturally, you might be unable to answer the above, due to NDA ...
mostly, I'm just fishing for details (from anywhere) on what happened.

Sorry. As Sean said... most of it is covered by NDAs, and this is
exactly what will lead to required outage reporting for everyone if
they don't start relaxing them some. From our point of view (here), a
lot of the issues were second-order, caused by the number of flaps in
the global table from various directions and/or by the bug in vendor
<X>'s equipment causing rapid reboots. That said, to their credit,
<X> handled the ticket well and had engineers talking to us quickly,
etc., etc. Reasonable handling, IMO.