RFC 1771's guidance on what to do with an announcement may have
been filled with good intentions; however, the few times it has
been executed, it seems to have created excessive harm to the Internet.
I propose it is time to revise RFC 1771.
RESOLVED, the error handling mechanism should be revised to reject
only the portion of the announcement in dispute instead of disrupting
the entire BGP session.
This would result in a small number of networks, most likely those
who originated the disputed announcement or are closely associated
with the announcer, being disrupted. But the portion of the Internet
with generally accepted announcements would not be affected. This limits
the collateral damage from these disputed announcements to the small
number of networks in dispute.
The basic issue is one of scale vs integrity. However, I think this
particular case is one in which the RFC-dictated behavior is the
correct choice. The problem is that one [set of] router[s] did not
follow such behavior and thus escalated the scale of the problem.
Given that the malformed route in question was most likely originated
from a single router, the only damage that should have been done was
a loss of routability for networks behind that one router. While of
course that could arguably be a significant number of networks, I
think it's a safe assumption that X losing its peers is pretty much
always a smaller impact than all of X's peers losing -their- peers.
If network XYZ's routers have N peers each, the RFC-dictated
behavior gives us N peering sessions lost (assuming the offending
route was advertised to all peers), instead of N^2 (or greater)
sessions as was the case.
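
As a rough worked version of that arithmetic, here is a minimal Python
sketch. It assumes an idealized topology in which every router has
exactly N peers and no two routers share a peer. The function name
sessions_lost and its parameters are illustrative only, not taken from
any BGP implementation.

```python
def sessions_lost(n_peers: int, propagated: bool) -> int:
    """Count BGP sessions reset by one malformed route (idealized model).

    propagated=False: the direct peers drop their session to the
    offending router and do not forward the route, so N sessions go down.
    propagated=True: the direct peers forward the route first, so each
    of the N peers loses all N of its own sessions, giving N * N.
    """
    if n_peers < 1:
        raise ValueError("n_peers must be positive")
    return n_peers * n_peers if propagated else n_peers


if __name__ == "__main__":
    for n in (5, 10, 50):
        print(n, sessions_lost(n, False), sessions_lost(n, True))
```

Real topologies share peers and have uneven peer counts, so this only
shows the order of magnitude, not an exact count.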
I think the logic of dropping the session is sound. If a router
originates one malformed route, who's to say the rest of its routes
are correct? Perhaps other routes are corrupted, but not in ways
detectable by the router's sanity checks. Since the offending route
is indeed malformed, it's not unreasonable to stop trusting the
router from which it originated. Since it's likely only a single
router is originating the route, dropping sessions to that one
router controls the blast radius.
This is not to say that the issue of scale is unimportant. It most
certainly is. However, again, if the first router(s) to receive
the route had behaved properly, the scale of the problem would
have been small. The only place you'd see a flap of 100,000
routes is if the offending router was your upstream's. Everyone
else would only see (at most) a flap of the routes originated by
and/or behind that router (in BGP topology terms).
Perhaps a knob to control the behavior would be an acceptable
compromise for some. I think it's a bad idea for two reasons.
First, it allows bugs such as this to go unfixed, because when
it happens people just adjust the knob to keep their BGP sessions
stable. Second, it circumvents the integrity control. If a router
has many corrupted routes, but only a few trigger the sanity
checks for malformation, the session stays alive and the remaining
corrupted routes are then propagated network-wide. While this may
seem like a paranoid philosophy, a little paranoia can be good
when considering the integrity of the larger whole.
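
To make the second objection concrete, here is a small Python sketch of
the two behaviors such a knob would select between. Everything in it is
hypothetical: the Route fields, the accepted_routes function, and the
reset_session_on_error flag are names made up for illustration, and the
model assumes some corruption is invisible to the receiver's sanity
checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    corrupted: bool   # the route is actually wrong
    detectable: bool  # the receiver's sanity checks can tell it is wrong


def accepted_routes(routes, reset_session_on_error):
    """Return the prefixes a receiver keeps from a single peer."""
    accepted = []
    for route in routes:
        if route.corrupted and route.detectable:
            if reset_session_on_error:
                # RFC-dictated behavior: tear down the session, which
                # discards every route learned from this peer.
                return []
            # Knob set to tolerant: skip only the route that failed.
            continue
        accepted.append(route.prefix)
    return accepted


if __name__ == "__main__":
    from_peer = [
        Route("192.0.2.0/24", corrupted=False, detectable=False),
        Route("198.51.100.0/24", corrupted=True, detectable=False),
        Route("203.0.113.0/24", corrupted=True, detectable=True),
    ]
    print(accepted_routes(from_peer, reset_session_on_error=True))
    print(accepted_routes(from_peer, reset_session_on_error=False))
```

With the tolerant setting, the silently corrupted 198.51.100.0/24 is
kept and would be passed along, which is the integrity concern above.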
 = Yes, "likely" is a relative term. I know there are plenty of
cases where the same route is originated by multiple routers,
but the odds of more than one of them corrupting a route
at the same time are probably slim compared to the odds of
a single one doing so.
 = In this specific case, as I understand it, the direct peers did
in fact drop the offending BGP session, but they propagated
the offending announcement to their peers before doing so. In
this case, of course, the blast radius is not controlled.
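
The difference that ordering makes can be sketched on a toy topology.
This is a simplified Python model, not a description of any real router:
the reset_sessions function, the forward_before_drop flag, and the
router names are all invented for illustration, and it assumes every
router behaves the same way and forwards the route at most once.

```python
from collections import deque


def reset_sessions(adjacency, origin, forward_before_drop):
    """Return the set of sessions reset by one malformed route.

    adjacency maps each router to a list of its peers (symmetric).
    Each session is represented as a frozenset of its two endpoints.
    """
    resets = set()
    forwarded = {origin}      # routers that have sent the route onward
    queue = deque([origin])
    while queue:
        sender = queue.popleft()
        for receiver in adjacency[sender]:
            session = frozenset((sender, receiver))
            if session in resets:
                continue      # session already down, nothing crosses it
            resets.add(session)  # receiver detects the error and resets
            if forward_before_drop and receiver not in forwarded:
                forwarded.add(receiver)
                queue.append(receiver)
    return resets


if __name__ == "__main__":
    topology = {
        "X": ["A", "B"],
        "A": ["X", "C", "D"],
        "B": ["X", "E"],
        "C": ["A"],
        "D": ["A"],
        "E": ["B"],
    }
    print(len(reset_sessions(topology, "X", forward_before_drop=False)))
    print(len(reset_sessions(topology, "X", forward_before_drop=True)))
```

In this toy topology, checking before forwarding confines the resets to
the sessions touching X, while forwarding first takes down every
session in the graph.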