Did your BGP crash today?

Date: Mon, 30 Aug 2010 10:55:03 -0500
From: Jack Bates <jbates@brightok.net>

Florian Weimer wrote:
> This whole thread is quite schizophrenic because the consensus appears
> to be that (a) a *researcher is not to blame* for sending out a BGP
> message which eventually leads to session resets, and (b) an
> *implementor is to blame* for sending out a BGP message which
> eventually leads to session resets. You really can't have it both
> ways.
>

As good a place to break in on the thread as any, I guess. Randy and
others believe more testing should have been done. I'm not completely
sure they didn't test against XR. They could very well have tested over
a one-on-one connection and seen everything look fine.

I don't know the full details, but at what point did the corruption
appear, and was it visible? We know that it was corrupt on the output
which caused peer resets, but was it necessarily visible in the router
itself?

Do we require a researcher to set up a chain of every vendor's BGP
speaker in every possible configuration and order to verify that a bug
doesn't cause things to break? In this case, one would very likely need
an XR both receiving and transmitting updates to detect the failure, so
no fewer than 3 routers with the XR in the middle.
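
To make the 3-router point concrete, here is a toy model in Python. It is
only a sketch: the Speaker class, the run_chain helper, and the
"corrupts on output" flag are all made up for illustration and say
nothing about what XR actually does internally. The assumption is simply
that the middle box accepts the update cleanly and damages it only when
re-advertising.

```python
# Toy model only. "update" is a dict standing in for a BGP UPDATE, and
# the corrupt-on-output flag is an assumed behaviour, not the real bug.

class Speaker:
    def __init__(self, name, corrupts_on_output=False):
        self.name = name
        self.corrupts_on_output = corrupts_on_output
        self.rib = []      # updates this speaker accepted
        self.resets = 0    # sessions it tore down on bad input

    def receive(self, update):
        """Accept a clean update; reset the session on a malformed one."""
        if update.get("malformed"):
            self.resets += 1
            return False
        self.rib.append(update)
        return True

    def advertise(self, update):
        """Return the update as it would appear on the wire to the next peer."""
        out = dict(update)
        if self.corrupts_on_output and "unknown_attr" in out:
            out["malformed"] = True  # damage exists only in the outbound copy
        return out


def run_chain(speakers, update):
    """Propagate an update down a chain; return the name of the speaker
    that reset its session, or None if nobody did."""
    for spk in speakers:
        if not spk.receive(update):
            return spk.name
        update = spk.advertise(update)
    return None


update = {"prefix": "192.0.2.0/24", "unknown_attr": "opaque"}

# Originator -> XR only: the XR accepts the update and nothing resets.
one_on_one = run_chain([Speaker("xr", corrupts_on_output=True)], update)

# Originator -> XR -> downstream: the downstream peer resets.
in_the_middle = run_chain(
    [Speaker("xr", corrupts_on_output=True), Speaker("downstream")], update
)
```

In this model the one-on-one test returns None (no reset anywhere), and
the failure only shows up once a second peer sits behind the corrupting
speaker.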

What about individual configurations? Perhaps the update is received and
altered by one vendor due to a specific configuration, sent to the next
vendor, accepted and altered again (because of the first alteration,
whereas it wouldn't have been altered had the original update been
received), which causes the next vendor to reset. Add to this that it may
pass silently through several intermediate vendors' routers without
problems, and we realize the scope of such problems and why connecting to
the Internet is so unpredictable.

The only way they could have caught this one was to have tested against
a CRS which had another router to which it was announcing the attribute
in a malformed packet. Worse, the resets should just keep happening, as
the CRS would still have the route with the unknown attribute, which
would just generate another malformed update and cause the session to
reset again.
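
A rough sketch of why the resets would repeat, again in Python and again
purely illustrative: flap_count and the has_unknown_attr field are
hypothetical names, and the only assumption is that the full table is
re-advertised each time the session comes back up.

```python
def flap_count(rib, corrupts_on_output, reconnect_attempts):
    """Count the resets a downstream peer sees across reconnects.

    rib is the sender's table: a list of dicts with a boolean
    'has_unknown_attr' field (a made-up stand-in for the real attribute).
    """
    resets = 0
    for _ in range(reconnect_attempts):
        # Each new session starts with the sender re-advertising its table.
        for route in rib:
            if corrupts_on_output and route["has_unknown_attr"]:
                resets += 1
                break  # session drops; the rest of the table never arrives
    return resets


rib = [{"prefix": "192.0.2.0/24", "has_unknown_attr": True}]

# While the offending route stays in the table, every reconnect fails.
stuck = flap_count(rib, corrupts_on_output=True, reconnect_attempts=5)

# Once that route is gone (withdrawn or filtered upstream), the flapping stops.
cleared = flap_count([], corrupts_on_output=True, reconnect_attempts=5)
```

Under those assumptions the session never stabilizes until the route
itself leaves the sender's table.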

While it may be possible to recover from something like this, it sure
would not be easy.

We experienced something like this a year ago on a couple of Quagga boxes. At least we had the source code to go through, and the resources to make use of it to find the problem and implement a quick workaround. It's for situations like this that debug logging is ooooohhh so important.
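
As an example of the kind of debug logging I mean, here is a small Python sketch that walks the path attributes of an UPDATE and logs the offending bytes when the framing is wrong. This is not Quagga's code; walk_path_attributes is a made-up name, and the only thing taken from the BGP spec is the attribute header layout (flags octet, type code octet, then a one-octet length, or two octets when the Extended Length flag 0x10 is set).

```python
import logging

log = logging.getLogger("bgp.attrs")

EXTENDED_LENGTH = 0x10  # flag bit: the length field is two octets, not one


def walk_path_attributes(buf):
    """Split a path-attribute blob into (flags, type_code, value) tuples.

    Raises ValueError, after logging a hex dump, if an attribute header
    is truncated or its length runs past the end of the buffer.
    """
    attrs = []
    pos = 0
    while pos < len(buf):
        remaining = len(buf) - pos
        if remaining < 3:
            log.debug("truncated header at offset %d: %s", pos, buf[pos:].hex())
            raise ValueError("truncated attribute header")
        flags, type_code = buf[pos], buf[pos + 1]
        if flags & EXTENDED_LENGTH:
            if remaining < 4:
                log.debug("truncated header at offset %d: %s", pos, buf[pos:].hex())
                raise ValueError("truncated attribute header")
            length = int.from_bytes(buf[pos + 2:pos + 4], "big")
            header_len = 4
        else:
            length = buf[pos + 2]
            header_len = 3
        end = pos + header_len + length
        if end > len(buf):
            log.debug(
                "attr type %d at offset %d claims %d bytes, only %d left: %s",
                type_code, pos, length, remaining - header_len, buf[pos:].hex(),
            )
            raise ValueError("attribute overruns buffer")
        attrs.append((flags, type_code, bytes(buf[pos + header_len:end])))
        pos = end
    return attrs
```

With logging at DEBUG level, a bad update leaves a hex dump and an offset in the log instead of just a reset, which is most of what you need to start digging.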

What did people do in this case to identify the issue? Did you just pass it off to your vendor, or did anyone try to diagnose it locally? If so, what did you do?

         ---Mike