Global BGP - 2001-06-23

Jared Mauch wrote:

> Can anyone verify whether Cisco still does BGP this way? (Propagate, then
> kill originating session). If so, it rather clearly answers the question
> about how this managed to make it throughout the network...

  I'm fairly sure that is not the case anymore.

> (For the record: I'm not trying to Cisco-bash here. All vendors have
> problems, and when you have a huge market share, your problems tend to
> show up much more obviously, when they appear. However, Cisco does still
> have a huge market share, meaning this affected a whole lot of people,
> if true... so, I'm curious).

  From what I can tell, this time it was not Cisco's fault. It
appears that the vendor that had the problem just had an issue with
a specific "valid" announcement that others propagated to it.

All I can say is that the only report I have had about what caused the
whole mess to start was a Cisco BugID regarding a mangling done by some
IOS versions on a particular sort of route update that made it invalid
(or perhaps, "more invalid"). And if Cisco is no longer propagating routes
before shutting down the source session, then we're back to wondering how
this particular issue managed to cause flaps at the same time across at
least 5 "big player" networks that I've had reports about (including 3
by direct observation). This person must have some pretty
impressive connectivity, if they managed to get what appears to be well
over a dozen routers at the absolute minimum, and more likely in the range
of "hundreds" if the rumor volume is at all accurate, to each display the
bug (since, if a bad announcement isn't propagated, it will never reach
anything but the direct peers; thus, this person would have to be directly
peered with every router that anyone saw flapping sessions to a customer).
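
The reachability argument above can be illustrated with a toy model. This is
only a sketch under stated assumptions: routers are either "lenient" (they
accept the bad announcement and pass it on to their peers) or "strict" (they
reset the session it arrived on and pass nothing on). The topology, the router
names, and the function `dropped_sessions` are all made up for illustration
and do not describe any real network or vendor.

```python
from collections import deque


def dropped_sessions(peers, strict, origin):
    """Return the set of (sender, receiver) sessions reset by strict receivers.

    peers  -- dict mapping each router name to a list of its BGP peers
    strict -- set of routers that reset the session instead of forwarding
    origin -- router that first announces the bad route
    """
    dropped = set()
    seen = {origin}          # lenient routers that already hold the route
    queue = deque([origin])
    while queue:
        sender = queue.popleft()
        for receiver in peers[sender]:
            if receiver in seen:
                continue
            if receiver in strict:
                # Strict receiver: session goes down, route goes no further.
                dropped.add((sender, receiver))
            else:
                # Lenient receiver: keeps the route and re-announces it.
                seen.add(receiver)
                queue.append(receiver)
    return dropped


# Hypothetical topology: one origin, three lenient routers, three strict ones.
peers = {
    "origin": ["A"],
    "A": ["origin", "B", "C", "X3"],
    "B": ["A", "X1"],
    "C": ["A", "X2"],
    "X1": ["B"],
    "X2": ["C"],
    "X3": ["A"],
}

far_reaching = dropped_sessions(peers, {"X1", "X2", "X3"}, "origin")
contained = dropped_sessions(peers, {"A", "X1", "X2", "X3"}, "origin")
print(sorted(far_reaching))
print(sorted(contained))
```

With lenient routers in the middle, strict routers several hops from the
origin see session resets. If the origin's only direct peer is strict, the
damage stops at that one session, which is the point being made above.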

Now, I'll grant, it would be possible to do this, but for them to have
hit just *our* network, they would have to be on 3 major carriers in 3
states, including some places that a normal class B-type announcer just
isn't terribly likely to have a peering session.

  What is interesting is one could use this to see what
providers are using vendor "X" at exchange points.

Quite true. Though I suspect that in some cases, this might only tell you
what routing code they use. Making too many inferences is probably unwise.
Especially given the number of folks who thought they knew who "X" was,
only to state their guess and come out wrong...

Vendor X released a limited statement to their customers describing the
issue - and their view on it. The large incumbent vendor that we all
know and love has confirmed the issue, and released a "patch" to some of
their customers. Vendor X also went on to state that at no time did
their boxes crash, mis-forward, reset, or have any issue resulting from
the events of the past weekend.

From Vendor X's statement:

1. Another vendor's implementation of BGP contained a bug that caused
   EBGP peers to leak CONFEDERATION information across AS boundaries,
   interpreted as malformed AS_PATH

2. Vendor X's implementation of BGP-4 fully complies with the BGP-4
   specification (RFC 1771) and accordingly, terminates a BGP session
   to a BGP peer who forwards malformed AS_PATH

3. Unfortunately, this other vendor does not adhere to the standard in
   the same manner and as a result, malformed AS_PATH announcements are
   propagated to other BGP peers. This is contrary to RFC 1771. Vendor X
   believes that these vendors should modify their implementation to
   adhere to the guidelines as stated per RFC 1771 (see section 6 - BGP Error

4. In light of the events of the past weekend and with input from a
   number of the affected service providers (point #1 above), Vendor X
   has concluded that a review of our BGP implementation is unnecessary
   at this time.
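
To make points 1 and 2 of the statement concrete, here is a minimal Python
sketch of the kind of check a strict receiver could apply to an AS_PATH
attribute arriving over an EBGP session. It assumes 2-octet AS numbers and
the segment layout from RFC 1771 (a type octet, a count octet, then that many
AS numbers). Segment types 1 (AS_SET) and 2 (AS_SEQUENCE) are the ones RFC
1771 defines; types 3 and 4 are the confederation segment types. The function
names and sample bytes are hypothetical and are not taken from any vendor's
code or from the actual announcement.

```python
import struct

# Segment types defined by RFC 1771 itself.
AS_SET, AS_SEQUENCE = 1, 2

# NOTIFICATION values from RFC 1771: UPDATE Message Error, Malformed AS_PATH.
UPDATE_MESSAGE_ERROR = 3
MALFORMED_AS_PATH = 11


def parse_as_path(value: bytes):
    """Split an AS_PATH attribute value into (segment_type, [asns]) tuples.

    Assumes 2-octet AS numbers. Raises ValueError on truncated input.
    """
    segments = []
    offset = 0
    while offset < len(value):
        if offset + 2 > len(value):
            raise ValueError("truncated segment header")
        seg_type, count = value[offset], value[offset + 1]
        offset += 2
        end = offset + 2 * count
        if end > len(value):
            raise ValueError("truncated segment body")
        asns = list(struct.unpack(f"!{count}H", value[offset:end]))
        segments.append((seg_type, asns))
        offset = end
    return segments


def check_ebgp_as_path(value: bytes):
    """Return None if the path looks acceptable on an EBGP session.

    Otherwise return the (error code, subcode) pair a strict receiver
    would put in its NOTIFICATION before closing the session.
    """
    try:
        segments = parse_as_path(value)
    except ValueError:
        return (UPDATE_MESSAGE_ERROR, MALFORMED_AS_PATH)
    for seg_type, _asns in segments:
        if seg_type not in (AS_SET, AS_SEQUENCE):
            # Confederation (or unknown) segment leaked across an AS boundary.
            return (UPDATE_MESSAGE_ERROR, MALFORMED_AS_PATH)
    return None


# Illustrative inputs only.
good = bytes([AS_SEQUENCE, 2]) + struct.pack("!2H", 65001, 65002)
leaked = bytes([3, 1]) + struct.pack("!H", 65100) + good

print(check_ebgp_as_path(good))
print(check_ebgp_as_path(leaked))
```

A lenient implementation, in this model, would skip the check and
re-announce the path to its own peers unchanged.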

If you happen to be running Vendor X's software and think you may have
experienced the issue, you can use the following to verify.

  ssh@vendor-x-119.chi03#sh ip bgp neighbor last

  BGP4: 86 bytes hex dump of packet received from neighbor that contains

  ffffffff ffffffff ffffffff ffffffff 00560200
  00003bff ffffffff ffffffff ffffffff ffffff00
  2d0104fd e8005ac0 a8803a10 02060104 00010001
  02028000 02020200 ffffffff ffffffff ffffffff
  ffffffff 001318d0 f20118d0 c10e18d0 f20018d0
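
For anyone wanting to sanity-check a dump like this by hand: every BGP
message starts with the fixed 19-byte header from RFC 1771, which is a
16-byte marker, a 2-byte length, and a 1-byte type. A minimal Python sketch
follows, assuming the dump is pasted as hex text. `parse_bgp_header` is a
hypothetical helper name, not a Vendor X tool, and it decodes only the
header, not the rest of the packet.

```python
import struct

# Message type codes from RFC 1771.
MESSAGE_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}


def parse_bgp_header(hex_text: str):
    """Decode the fixed 19-byte BGP header from a whitespace-separated hex dump."""
    raw = bytes.fromhex("".join(hex_text.split()))
    if len(raw) < 19:
        raise ValueError("need at least 19 bytes for a BGP header")
    # Network byte order: 16-byte marker, 2-byte length, 1-byte type.
    marker, length, msg_type = struct.unpack("!16sHB", raw[:19])
    return {
        "marker_all_ones": marker == b"\xff" * 16,
        "length": length,
        "type": MESSAGE_TYPES.get(msg_type, f"unknown({msg_type})"),
    }


# First line of the dump above.
header = parse_bgp_header("ffffffff ffffffff ffffffff ffffffff 00560200")
print(header)
# {'marker_all_ones': True, 'length': 86, 'type': 'UPDATE'}
```

The decoded length of 86 matches the "86 bytes" in the router's own output,
and the type shows the offending packet was an UPDATE.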

(not speaking for Vendor X in any way shape or form. Just passing along
info that I was sent.)

<sigh>... If the RFC jumped off a cliff...

--
Matt Levine
ICQ : 17080004

The correct analogy is "If the RFC said 'stop, drop, and roll if you're on fire'".

Which, incidentally, *IS* approximately what it says.

> <sigh>... If the RFC jumped off a cliff...

Pointless and irrelevant. Do you follow the accepted standard or not -
that is what it comes down to. Bugs are bugs and everyone has them, big
deal. However, there is a general consensus about how things are
supposed to work - interoperability is somewhat difficult in this day
and age without it. So which is it? Follow the standards - be they RFC,
STD, draft, de facto, or de jure - or roll your own and pray?

No one has stated that closing the session is a bad thing, and the general
feeling is that it's a good thing. So what is it that you want?

(rambling on only for himself and not representing anyone else)

What I would like is for my routers to not drop 4 of our 6 transit
providers, RFC, standard, not standard, whatever. We've suggested to
our vendor that there at least be some option to control this; we are
not at the core, we are an end user. When following the RFC dictates
that our routing equipment loses connectivity to the internet, then I
say that there is a problem. It's really nice that they can say
"it's not a bug, it's a feature", but this is a feature I'd at the
very least like to have the ability to turn off.


--
Matt Levine
ICQ : 17080004
