Did your BGP crash today?

Haven't seen a thread on this one, so I thought I'd start one.

RIPE tested a new attribute that crashed the internet, is that true?

Kim

I did see some attribute 99 stuff go around earlier today and have not yet researched it.

Unknown BGP attribute 99 (flags: 240)
Unknown BGP attribute 99 (flags: 240)
Unknown BGP attribute 99 (flags: 240)
Unknown BGP attribute 99 (flags: 240)
Unknown BGP attribute 99 (flags: 240)

- Jared
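
For anyone wondering what "flags: 240" in those log lines means: 240 is 0xF0, and the high four bits of the attribute flags octet are defined in RFC 4271, section 4.3. A minimal sketch of decoding them (the function and table names here are just for illustration):

```python
# Flag bits of the BGP path attribute flags octet (RFC 4271, section 4.3).
# The low four bits are unused.
FLAG_BITS = [
    (0x80, "optional"),
    (0x40, "transitive"),
    (0x20, "partial"),
    (0x10, "extended-length"),
]


def decode_attr_flags(flags):
    """Return the names of the flag bits set in an attribute flags octet."""
    return [name for bit, name in FLAG_BITS if flags & bit]


print(decode_attr_flags(240))
# prints ['optional', 'transitive', 'partial', 'extended-length']
```

So the logged attribute was optional and transitive, already marked partial, and used the two-octet length field.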

If it in fact "crashed the internet", as opposed to "gave a few buggy routers
here and there indigestion", you wouldn't be posting to NANOG looking for
confirmation. :)

https://www.ams-ix.net/statistics/

Not the whole internet, but a part of it. And the "few buggy routers here and there" were mostly Cisco CRS-1s, which didn't understand the new attribute and sent a malformed message to all peers, causing them to close the BGP session.

I think most of the impact was limited to Europe, especially the Amsterdam area.

In a way it reminds me of the ASN4 bug. Until a vendor fix is available, I guess the details are better left off public mailing lists.
http://www.uknof.org.uk/uknof12/Davidson-4_byte_asn.pdf

Yes, it had an effect on ISPs which are connected to RIS. http://www.ripe.net/ris/
AFAIK this means ASes at LINX and AMS-IX. The LINX graph shows a similar (but smaller) dip of 50-60 GB.

Thomas

FYI:

So much for "better left off public mailing lists" ! sigh !

Thomas

sorry - found via google...

- Lucy

Not only. We don't peer with RIS, but about 8-10 of our peers announce RIS routes to us. The nasty update came to us from a completely different AS, not RIS.

You may just check whether you see AS12654 - it is RIS.

Yes, the BGP message had a transitive attribute - sorry if I was not clear.
That said, you may want to ask why you are getting RIS routes if you are not peering with them directly :P

RIS is peering worldwide ( http://www.ripe.net/ris/docs/peering.html ) but the mail was only sent to linx-ops and tech-l, so the announcement may have been limited to Europe (for all I know).

Thomas

Just out of curiosity, at what point will we as operators rise up
against the ivory tower protocol designers at the IETF and demand that
they add a mechanism to not bring down the entire BGP session because of
a single malformed attribute? Did I miss the memo about the meeting?
I'll bring the punch and pie.

Complain to your vendor; C & J especially have enough influence
on the IETF to make such a change possible.

I can agree with tearing the session down when one encounters an
improperly formatted message, but an unknown attribute, while the rest
of the message format is fine, is a silly thing to hang up on indeed.

Greets,
Jeroen

I think it's actually an implementation problem where it got out-of-sync.

You can't exactly blame the IETF for a vendor having poor code quality.

(at least not in this case IMHO).

I seem to recall there was something like this in the past that caused
some significant problems with people also running XR/CRS-1. They quickly
got a fix and Cisco issued a PSIRT as a result:

http://www.cisco.com/en/US/products/products_security_advisory09186a0080af150f.shtml#summary

I would hope these people updated their software for that impact as well.

Without knowing what the defect impact was on those devices, and without talking to
PSIRT today, I don't know if an advisory is pending. Perhaps it's a new defect
and the bug is going to be triggered again soon for those that don't patch
their devices.

- jared

When you are processing something, it's sometimes hard to tell if something
just was mis-parsed (as I think the case is here with the "missing-2-bytes")
vs just getting garbage. Perhaps there should be some way to "re-sync" when
you are having this problem, or a parallel "keepalive" path similar to
MACA/MCAS/MIDCAS/TCAS between the devices to talk when something bad is
happening.

- Jared
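
To make the "out-of-sync" point concrete, here is a rough sketch of walking path attributes. It assumes only the RFC 4271 header layout (flags octet, type octet, then a one-octet length, or two octets when the extended-length bit is set). It does not reproduce the actual vendor defect, whose details have not been posted; the function names and the sample bytes are made up for illustration.

```python
import struct

EXTENDED_LENGTH = 0x10  # flag bit: length field is two octets instead of one


def walk_attrs(buf):
    """Return [(type_code, value), ...]; raise ValueError if a length overruns."""
    attrs = []
    offset = 0
    while offset < len(buf):
        try:
            flags, type_code = struct.unpack_from("!BB", buf, offset)
            if flags & EXTENDED_LENGTH:
                (length,) = struct.unpack_from("!H", buf, offset + 2)
                value_offset = offset + 4
            else:
                (length,) = struct.unpack_from("!B", buf, offset + 2)
                value_offset = offset + 3
        except struct.error:
            raise ValueError("truncated attribute header at offset %d" % offset)
        end = value_offset + length
        if end > len(buf):
            raise ValueError("attribute %d overruns the buffer" % type_code)
        attrs.append((type_code, buf[value_offset:end]))
        offset = end
    return attrs


# Hypothetical attribute 99 (flags 0xF0, two-octet length 3) followed by ORIGIN.
good = bytes([0xF0, 99, 0x00, 0x03]) + b"abc" + bytes([0x40, 1, 1, 0])
print(walk_attrs(good))

# Same bytes, but the flags no longer agree with the length encoding.
# The parser reads a one-octet length of 0 and then treats payload as headers.
bad = bytes([0xE0, 99, 0x00, 0x03]) + b"abc" + bytes([0x40, 1, 1, 0])
try:
    walk_attrs(bad)
except ValueError as exc:
    print("desync:", exc)
```

Once the flags and the length encoding disagree, every later "header" is read from the wrong offset, which is why it is hard for the receiver to tell a small mis-encoding from plain garbage.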

I know it wasn't there originally, and isn't mandatory now, but there is
an MD5 hash that can be added to the packet. If the TCP hash checks
out, then you know the packet wasn't garbled, and just contained
information you didn't grok. That seems like enough evidence to be able
to shrug and toss the packet without dropping the session.

-Dave

Where's the change management process in all of this?
Basically, now we are going to start changing things that can
potentially have an adverse effect on users without letting anyone know
beforehand ... Interesting concept.

you are running bgp, you are connected to the 'internet'... congrats
you are part of the experiment.

I suppose one view is that "at least it wasn't someone with ill
intent, or a misconfigured MikroTik!"

(you are asking your vendors to run full bit sweeps of each protocol
in a regimented manner checking for all possible edge cases and
properly handling them, right?)

-chris

About the same time vendors' BGP implementations start to work correctly?

I agree such a knob would be useful, but seems to me that actually following the current standard would largely curb the issue by itself.

I recall one of the previous times something like this happened (and with a much wider impact), I believe it was $C that was accepting a bad attribute and passing it along. The effect was that other vendors ($F in particular, I think) would drop the session (per RFC), which made it look like they were the broken ones. Instead of saying "why was this accepted from its source?" the community reaction seemed more to me to be "hey, BGP is breaking the internet!"

If -everyone- dropped the session on a bad attribute, it likely wouldn't make it far enough into the wild to cause these problems in the first place.

-c

That works fine for malformed attributes. It blows chunks for legally formed
but unknown attributes - how would you ever deploy a new attribute?
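
For reference, RFC 4271 already separates the two cases for a well-formed attribute whose type the receiver does not recognize. A minimal sketch of that decision, with the action names ("error", "propagate", "ignore") invented here for illustration:

```python
OPTIONAL = 0x80
TRANSITIVE = 0x40
PARTIAL = 0x20


def handle_unrecognized(flags):
    """Return (action, new_flags) for a well-formed attribute of unknown type."""
    if not flags & OPTIONAL:
        # Unrecognized well-known attribute: this is an UPDATE message error.
        return ("error", flags)
    if flags & TRANSITIVE:
        # Unrecognized optional transitive: keep it, set Partial, pass it on.
        return ("propagate", flags | PARTIAL)
    # Unrecognized optional non-transitive: quietly ignore, do not pass on.
    return ("ignore", flags)


print(handle_unrecognized(240))
# prints ('propagate', 240)
```

Under those rules an unknown attribute with flags 240 should simply be passed along, so a session reset suggests the message was malformed somewhere along the path rather than merely unknown.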

About the same time the operators get back into the
IETF and become involved again. There was a time
when operators played a large role in the development
of things BGP (e.g. Tony Bates, Enke Chen, both at
iMCI).

No one is stopping us, the 'ivory tower' has no gate.

jy