Time to revise RFC 1771

RFC 1771's guidance on what to do with a malformed announcement may have
been filled with good intentions, but the few times it has been
executed, it seems to have caused excessive harm to the Internet.

I propose it is time to revise RFC 1771.

WHEREAS, the error handling of RFC 1771 is invoked when there is
a dispute between two implementations of BGP.

WHEREAS, these disputes may arise for any number of reasons,
including both implementations doing what each believes is a
reasonable action.

WHEREAS, the resolution of these disputes may be time-consuming,

RESOLVED, the error handling mechanism should be revised to reject
only the portion of the announcement in dispute instead of disrupting
the entire BGP session.

This would result in a small number of networks, most likely those
that originated the disputed announcement or are closely associated
with the announcer, being disrupted. But the portion of the Internet
with generally accepted announcements would not be affected. This limits
the collateral damage from these disputed announcements to the small
number of networks in dispute.
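
For illustration only, here is a minimal Python sketch of the two
strategies being contrasted (tear down the whole session vs. reject just
the disputed announcement). The types and function names are hypothetical
and not drawn from any real BGP implementation:

    # Hypothetical sketch, not any real BGP implementation. The Update and
    # Session types below are invented purely to contrast the two behaviors.
    from dataclasses import dataclass, field

    @dataclass
    class Update:
        nlri: list          # prefixes carried in this UPDATE
        malformed: bool     # result of whatever sanity checks the receiver runs

    @dataclass
    class Session:
        peer: str
        rib_in: list = field(default_factory=list)
        up: bool = True

    def handle_update_rfc1771(session: Session, update: Update) -> None:
        """RFC 1771 style: any malformed UPDATE tears down the whole session."""
        if update.malformed:
            session.up = False      # models a NOTIFICATION plus TCP teardown
            session.rib_in.clear()  # every route learned from this peer is lost
            return
        session.rib_in.extend(update.nlri)

    def handle_update_proposed(session: Session, update: Update) -> None:
        """Proposed revision: reject only the disputed announcement."""
        if update.malformed:
            return                  # log and drop just this UPDATE; session stays up
        session.rib_in.extend(update.nlri)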

> RFC 1771's guidance on what to do with a malformed announcement may have
> been filled with good intentions, but the few times it has been
> executed, it seems to have caused excessive harm to the Internet.
>
> I propose it is time to revise RFC 1771.
>
> [snip]
>
> RESOLVED, the error handling mechanism should be revised to reject
> only the portion of the announcement in dispute instead of disrupting
> the entire BGP session.
>
> This would result in a small number of networks, most likely those
> that originated the disputed announcement or are closely associated
> with the announcer, being disrupted. But the portion of the Internet
> with generally accepted announcements would not be affected. This limits
> the collateral damage from these disputed announcements to the small
> number of networks in dispute.

The basic issue is one of scale vs integrity. However, I think this
particular case is one in which the RFC-dictated behavior is the
correct choice. The problem is that one [set of] router[s] did not
follow such behavior and thus escalated the scale of the problem
significantly.

Given that the malformed route in question was most likely originated
from a single router, the only damage that should have been done was
a loss of routability for networks behind that one router. While of
course that could be arguably a significant number of networks, I
think it's a safe assumption that X losing its peers is pretty much
always a smaller impact than all of X's peers losing -their- peers.
If network XYZ's routers have N peers each, the RFC-dictated
behavior gives us N peering sessions lost (assuming the offending
route was advertised to all peers), instead of N^2 (or greater)
sessions as was the case.
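
To put rough numbers on that, a back-of-the-envelope sketch in Python;
the peer count is invented, and only the N vs. N^2 comparison matters:

    # Back-of-the-envelope arithmetic only; the peer count is made up.
    peers_per_router = 20   # N: peers of the router that originated the bad route

    # RFC-dictated behavior: only the originating router's own sessions drop.
    sessions_lost_rfc = peers_per_router

    # Observed behavior: each of those peers passed the route along before
    # dropping, so each of *their* peers reset a session as well.
    sessions_lost_observed = peers_per_router ** 2   # N^2, or greater

    print(sessions_lost_rfc, sessions_lost_observed)   # 20 vs. 400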

I think the logic of dropping the session is sound. If a router
originates one malformed route, who's to say the rest of its routes
are correct? Perhaps other routes are corrupted, but not in ways
detectable by the router's sanity checks. Since the offending route
is indeed malformed, it's not unreasonable to stop trusting the
router from which it originated. Since it's likely[1] only a single
router is originating the route, dropping sessions to that one
router controls the blast radius[2].

This is not to say that the issue of scale is unimportant. It most
certainly is. However, again, if the first router(s) to receive
the route had behaved properly, the scale of the problem would
have been small. The only place you'd see a flap of 100,000
routes is if the offending router was your upstream's. Everyone
else would only see (at most) a flap of the routes originated by
and/or behind that router (in BGP topology terms).

Perhaps a knob to control the behavior would be an acceptable
compromise for some. I think it's a bad idea for two reasons.
First, it allows bugs such as this to go unfixed, because when
it happens people just adjust the knob to keep their BGP sessions
stable. Second, it circumvents the integrity control. If a router
has many corrupted routes, but only a few trigger the sanity
checks for malformation, the session stays alive and the remaining
corrupted routes are then propagated network-wide. While this may
seem like a paranoid philosophy, a little paranoia can be good
when considering the integrity of the larger whole.

-c

[1] = Yes, "likely" is a relative term. I know there are plenty of
      cases where the same route is originated by multiple routers,
      however the odds of more than one of them corrupting a route
      at the same time are probably slim compared to the odds of
      a single one doing so.

[2] = In this specific case, as I understand it, the direct peers did
      in fact drop the offending BGP session, however they propagated
      the offending announcement to their peers before doing so. In
      this case, of course, the blast radius is not controlled.

This ignores three basic facts:

1) Networks tend to be homogeneous in platform.
2) Platforms tend to accept their own implementation quirks.
3) Networks peer at borders.

Therefore, under the "drop the session rule," my bad announcement
gets to all my borders fine, and all my external peers who are not
running forgiving/compatible implementations drop their connections
to me and all my traffic to/from them hits the floor.

One CRC error does not make PPP drop. Why make one route cause
a catastrophic loss of connectivity? Report the bad route,
drop it, and move on; let layer 8 resolve it.

-Dave

> This ignores three basic facts:
>
> 1) Networks tend to be homogeneous in platform.
> 2) Platforms tend to accept their own implementation quirks.
> 3) Networks peer at borders.
>
> Therefore, under the "drop the session rule," my bad announcement
> gets to all my borders fine, and all my external peers who are not
> running forgiving/compatible implementations drop their connections
> to me and all my traffic to/from them hits the floor.

In this case, vendor C's implementation was neither forgiving nor
compatible. It still dropped the peer(s) in question. It just had
the much more harmful quirk that it forwarded the bad route on to
its peers before doing so. Under that behavior, a homogeneous network
would not only lose its border sessions, it would lose all internal
ones through which the route was advertised.
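
A hypothetical sketch of that ordering difference (the function names are
placeholders, not vendor C's actual code path):

    # Hypothetical sketch; none of these names correspond to any vendor's
    # actual code path. It only illustrates why "flood first, validate later"
    # widens the blast radius.
    def is_malformed(update: dict) -> bool:
        # Stand-in for whatever sanity checks a receiver runs on an UPDATE.
        return update.get("malformed", False)

    def flood_to_peers(update: dict, peers: list) -> None:
        for peer in peers:
            peer.setdefault("received", []).append(update)

    def drop_session(session: dict) -> None:
        session["up"] = False   # models a NOTIFICATION plus session teardown

    def receive_spec_behavior(update, session, peers):
        """What the spec intends: validate before propagating anything."""
        if is_malformed(update):
            drop_session(session)        # the damage stops at this hop
            return
        flood_to_peers(update, peers)

    def receive_observed_quirk(update, session, peers):
        """The harmful variant: the bad route escapes before the session drops."""
        flood_to_peers(update, peers)    # the bad route is already on its way...
        if is_malformed(update):
            drop_session(session)        # ...and the session is torn down anyway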

> One CRC error does not make PPP drop. Why make one route cause
> a catastrophic loss of connectivity? Report the bad route,
> drop it, and move on; let layer 8 resolve it.

Because, arguably, we don't know that it's just one route. We just
know that one route set off the alarm. Do you feel safe assuming that
whatever bug caused one corrupted route left all the other routes
alone?

Plus, a CRC error can occur between two valid, compliant, bug-free
implementations. A bad route, by definition, can't. We're not talking
about external faults here, but broken implementations. When one side
of a protocol session simply breaks the rules, I don't think it's
reasonable to say that the other side needs to be "fixed" to accept
that breakage. Fix the broken side.

This has gotten everyone's attention because of the unique way in
which the breakage occurred. If all implementations were changed
to drop the single bad route and keep the sessions intact, the damage
would not have been what it was. If all implementations followed the
current specs and dropped the session with the router that first
originated the bad route, the damage would not have been what it was.
To say that one way causes massive damage and the other doesn't is
inaccurate. The damage was caused by the implementation in question
doing something resembling the latter, but with harmful behavior thrown in.

-c

> > This ignores three basic facts:
> >
> > 1) Networks tend to be homogeneous in platform.
> > 2) Platforms tend to accept their own implementation quirks.
> > 3) Networks peer at borders.
> >
> > Therefore, under the "drop the session rule," my bad announcement
> > gets to all my borders fine, and all my external peers who are not
> > running forgiving/compatible implementations drop their connections
> > to me and all my traffic to/from them hits the floor.

> In this case, vendor C's implementation was neither forgiving nor
> compatible. It still dropped the peer(s) in question. It just had
> the much more harmful quirk that it forwarded the bad route on to
> its peers before doing so. Under that behavior, a homogeneous network
> would not only lose its border sessions, it would lose all internal
> ones through which the route was advertised.

I'm certainly not defending (or attacking) either vendor's
implementation; in the current environment, I believe following
the RFC is the correct course. I was more concerned with
future implementations of BGP, and how (I feel) they should handle
problems like this, since, as we add more and more features to
BGP, how we handle what appears to be a bad route (or a bad
NLRI) is going to become more important.

> > One CRC error does not make PPP drop. Why make one route cause
> > a catastrophic loss of connectivity? Report the bad route,
> > drop it, and move on; let layer 8 resolve it.

> Because, arguably, we don't know that it's just one route. We just
> know that one route set off the alarm. Do you feel safe assuming that
> whatever bug caused one corrupted route left all the other routes
> alone?

No, but I feel secure that, if it corrupted a large enough number of
routes, the effect will not be worse than dropping the session.
Somebody mentioned what happens if there are 100,000 bad routes and 1
good one. You keep the good one and drop the 100,000 bad ones.
Dropping routes is even easier than using them. Besides, which tends
to be harder on a router: dropping bad routes, or tearing down and
restarting a TCP session?
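
A toy illustration of that filtering; the route data and the
looks_valid() check are fabricated for the example:

    # Toy example only; the route data and the check are fabricated.
    def looks_valid(route: dict) -> bool:
        return not route.get("malformed", False)

    updates = [{"prefix": f"prefix-{i}", "malformed": i != 0}
               for i in range(100_001)]

    accepted = [r for r in updates if looks_valid(r)]
    print(len(accepted))   # 1 -- the lone good route is kept, the session stays up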

> Plus, a CRC error can occur between two valid, compliant, bug-free
> implementations. A bad route, by definition, can't. We're not talking
> about external faults here, but broken implementations. When one side
> of a protocol session simply breaks the rules, I don't think it's
> reasonable to say that the other side needs to be "fixed" to accept
> that breakage. Fix the broken side.

A "bad route" can happen whenever one implementation differs from
another. Both can be valid according to some definition of the
standard. Determining who is wrong, and fixing it, takes time. If
you're dropping a few of my routes during that time, that's
unavoidable. If every customer of mine cannot reach every customer of
yours while we fight over whose implementation is wrong and who needs
to change what, then who wins? And how is this fight more legitimate
than the one you have with your telco provider over how they built
your circuit and where your errors are coming from?

> This has gotten everyone's attention because of the unique way in
> which the breakage occurred. If all implementations were changed
> to drop the single bad route and keep the sessions intact, the damage
> would not have been what it was. If all implementations followed the
> current specs and dropped the session with the router that first
> originated the bad route, the damage would not have been what it was.
> To say that one way causes massive damage and the other doesn't is
> inaccurate. The damage was caused by the implementation in question
> doing something resembling the latter, but with harmful behavior thrown in.

I think the issue has gone beyond what happened, and into what will
happen. It's a simple design philosophy question: Do you build
protocols that are robust and resilient under stress, or do you
build protocols that refuse to interoperate until everything
completely agrees? Ideally, I can see the beauty of the second,
but realistically, I think you need to be permissive.