Soliciting your opinions on Internet routing: A survey on BGP convergence

Hi NANOG,

We often read that the Internet (i.e. BGP) is "slow to converge". But how slow
is it really? Do you care anyway? And can we (researchers) do anything about it?
Please help us find out by answering our short anonymous survey
(<10 minutes).

Survey URL: https://goo.gl/forms/WW7KX5kT45m6UUM82

** Background:

While existing fast-reroute mechanisms enable sub-second convergence upon
local outages (planned or not), they do not apply to remote outages happening
further away from your AS, as their detection and protection mechanisms only
work locally.

Remote outages therefore mandate a "BGP-only" convergence, which tends to be
slow, as long streams of BGP UPDATEs (up to hundreds of thousands of messages)
must be propagated router by router. Our initial measurements indicate that it can
take state-of-the-art BGP routers dozens of seconds to process and propagate
these large streams of BGP UPDATEs. During this time, traffic for important
destinations can be lost.
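
For a rough sense of scale, here is a back-of-envelope sketch in Python; every
number in it is an illustrative assumption, not a result of our measurements:

    # Back-of-envelope: how long might a large burst of BGP UPDATEs take to
    # ripple across several routers in series? All numbers are illustrative
    # assumptions, not measurements.
    prefixes_in_burst = 500_000     # e.g. a large fraction of the table re-routed (assumed)
    prefixes_per_second = 25_000    # per-router processing + propagation rate (assumed)
    hops = 3                        # routers the burst must cross one after another

    per_hop_seconds = prefixes_in_burst / prefixes_per_second
    print(f"~{per_hop_seconds:.0f} s per hop, "
          f"~{per_hop_seconds * hops:.0f} s across {hops} hops (ignoring pipelining)")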

** This survey:

This survey aims to evaluate the impact of slow BGP convergence on
operational practices. We expect the findings to improve the understanding of
how BGP convergence is perceived in the Internet, which could in turn help
researchers design better fast-reroute mechanisms.

We expect the questionnaire to be filled out by network operators whose job relates
to BGP operations. It has a total of 17 questions and should take less than 10 minutes
to answer. The survey and the collected data are anonymous (so please do *not*
include information that may help to identify you or your organization).
All questions are optional, so if you don't like a question or don't know the answer,
please skip it.

A summary of the aggregate results will be published as a part of a scientific
article later this year.

Thank you so much in advance, and we look forward to reading your responses!

Laurent Vanbever (ETH Zürich, Switzerland)

PS: It goes without saying that we would also be extremely grateful if you could
forward this email to any operator you might know who may not read NANOG.

Hello

I find that the type of outage that affects our network the most is neither of the two options you describe. As is probably typical for smaller networks, we do not have redundant uplinks to all of our transits. If a transit link goes down, for example because we had to reboot a router, traffic is supposed to reroute to the remaining transit links. Internally, our network handles this fairly fast for egress traffic.

However, the problem is the ingress traffic - it can be 5 to 15 minutes before everything has settled down. This is the time it takes before everyone else on the internet has processed that they have to switch to your alternate transit.

The only solution I know of is to have redundant links to all transits. Going forward I will make sure we have this, because it is a huge disadvantage not to be able to take a router out of service without causing downtime for all users. Not to mention that a router crash or link failure, which should take seconds at most to reroute around, instead causes at least 5 minutes of unstable internet.

Regards,

Baldur

One of the phenomena that is relatively easy to observe by withdrawing a
prefix entirely is the convergence towards longer and longer AS paths
until the route disappears entirely. That is, providers that are further
away will keep advertising the route, and in the interim their
neighbors will ingest the available path until they too process
the withdraw. It can take a comically long time (like 5 minutes) to see
the prefix ultimately disappear from the internet. When withdrawing a
prefix from a peer with which you have a single adjacency, this can
easily happen in miniature.
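
A minimal, self-contained Python sketch of this path-hunting behaviour (a toy
4-AS topology with shortest-AS-path selection only, no MRAI, no policies; the
topology and numbers are illustrative assumptions, not a model of any real
network):

    # Toy simulation of BGP path hunting: after the origin withdraws a prefix,
    # other ASes fall back to longer and longer alternate AS paths before the
    # route finally disappears everywhere.
    from collections import deque

    links = {(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (2, 4)}   # AS 1 is the origin
    neighbors = {n: set() for n in range(1, 5)}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    rib = {x: {n: None for n in neighbors[x]} for x in neighbors}  # path heard per neighbor
    best = {x: None for x in neighbors}
    origin_up = True                       # does AS 1 still originate the prefix?

    def best_path(x):
        if x == 1:
            return [1] if origin_up else None
        cands = [p for p in rib[x].values() if p is not None and x not in p]
        return min(cands, key=len) if cands else None

    def run(messages):
        """Process a FIFO queue of (receiver, sender, as_path_or_None) messages."""
        queue, step = deque(messages), 0
        while queue:
            x, n, path = queue.popleft()
            rib[x][n] = path
            new = best_path(x)
            if new != best[x]:
                best[x] = new
                step += 1
                print(f"step {step}: AS{x} best path -> {new}")
                out = ([x] + new) if new is not None else None
                for m in neighbors[x]:
                    queue.append((m, x, out))

    best[1] = [1]                                   # converge on the initial announcement
    run([(m, 1, [1]) for m in neighbors[1]])
    print("--- origin withdraws the prefix ---")
    origin_up = False
    best[1] = None
    run([(m, 1, None) for m in neighbors[1]])       # watch paths lengthen, then vanish

With these assumptions, the printout typically shows ASes 2-4 first falling
back to two-hop (and sometimes longer) detours before they finally lose the
route, which is the lengthening-then-vanishing pattern described above.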

Dear Baldur,

> I find that the type of outage that affects our network the most is neither of the two options you describe. As is probably typical for smaller networks, we do not have redundant uplinks to all of our transits. If a transit link goes down, for example because we had to reboot a router, traffic is supposed to reroute to the remaining transit links. Internally, our network handles this fairly fast for egress traffic.
>
> However, the problem is the ingress traffic - it can be 5 to 15 minutes before everything has settled down. This is the time it takes before everyone else on the internet has processed that they have to switch to your alternate transit.

Thanks a lot for your input. Indeed, that case is a bit special. I'd say it is a kind of remote outage that remote ASes experience towards your prefix and, as such, it requires a "BGP-only" convergence. If your prefixes via the alternate transit are not visible at all prior to the switch (and I guess they are not), this is a kind of "extreme" convergence in which routes have to be withdrawn/updated Internet-wide. This reminds me of the paper by Craig Labovitz et al. (http://conferences.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-5-2.pdf), which I think classifies these events as Tlong ("An active route with a short ASPath is implicitly replaced with a new route possessing a longer ASPath. This represents both a route failure and failover"). And indeed, these are the second slowest, just behind the Internet-wide withdrawal of a prefix.

You're right that our survey targets more the case in which large bursts of UPDATEs/WITHDRAWs are exchanged. I guess a parallel case to the one you mention could be that your primary transit performs a planned maintenance (or experiences a failure) that triggers sending out WITHDRAWs for your prefixes.

> The only solution I know of is to have redundant links to all transits. Going forward I will make sure we have this, because it is a huge disadvantage not to be able to take a router out of service without causing downtime for all users. Not to mention that a router crash or link failure, which should take seconds at most to reroute around, instead causes at least 5 minutes of unstable internet.

Maybe you could advertise better routes (i.e., with shorter AS-PATHs/longer prefixes) via the alternate transit prior to the takedown? Ideally, if you could somehow make your primary transit switch to an alternate transit prior to the maintenance (maybe with a special community?), you could completely avoid a disruption. This would go in the direction of minimizing the amount of WITHDRAWs in favor of UPDATEs. But, of course, this would only work in the case of planned maintenance.
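
As a purely illustrative sketch of the "longer prefixes" idea, using only
Python's standard library (the aggregate below is a made-up example, and
whether announcing more-specifics is acceptable is of course a local policy
question):

    # Sketch: enumerate more-specifics of an aggregate to pre-announce via the
    # alternate transit before planned maintenance. The prefix is a made-up
    # example; the router syntax for the actual announcements is left out.
    import ipaddress

    aggregate = ipaddress.ip_network("198.51.100.0/22")         # hypothetical aggregate
    for more_specific in aggregate.subnets(prefixlen_diff=1):   # two /23s
        print(f"pre-announce {more_specific} via the alternate transit")
    print(f"keep announcing {aggregate} (possibly AS-path prepended) via the primary")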

We would definitely welcome more input on the convergence issue you face!

Best,
Laurent

Hi Joel,

Alternatively, if you reboot a router, perhaps you could first shut down
the eBGP sessions, then wait 5 to 10 minutes for the traffic to drain
away (should be visible in your NMS stats), and then proceed with the
maintenance?

Of course, this only works for planned reboots, not surprise reboots.
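
A minimal sketch of the "wait for the traffic to drain" step, assuming your
NMS can export recent samples for the transit interface as CSV lines of
unix_timestamp,bits_per_second (the file name and thresholds below are
hypothetical):

    # Decide when it is safe to proceed with maintenance after shutting down
    # the eBGP sessions, by checking that traffic has drained away.
    import csv, time

    SAMPLES_FILE = "transit_link_bps.csv"   # hypothetical NMS export
    DRAINED_BPS = 5_000_000                 # "drained" threshold: 5 Mbps (assumed)
    QUIET_SECONDS = 300                     # must stay below the threshold this long

    def drained(samples, now):
        """True if every sample in the last QUIET_SECONDS is below the threshold."""
        recent = [bps for ts, bps in samples if now - ts <= QUIET_SECONDS]
        return bool(recent) and all(bps < DRAINED_BPS for bps in recent)

    with open(SAMPLES_FILE) as f:
        samples = [(float(ts), float(bps)) for ts, bps in csv.reader(f)]

    if drained(samples, time.time()):
        print("traffic has drained; OK to reboot the router")
    else:
        print("traffic still flowing; keep waiting")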

Kind regards,

Job

> If a transit link goes down, for example because we had to reboot a router,
> traffic is supposed to reroute to the remaining transit links.
> Internally, our network handles this fairly fast for egress traffic.
>
> However, the problem is the ingress traffic - it can be 5 to 15 minutes
> before everything has settled down. This is the time it takes before everyone
> else on the internet has processed that they have to switch to
> your alternate transit.
>
> The only solution I know of is to have redundant links to all transits.

> Alternatively, if you reboot a router, perhaps you could first shut down
> the eBGP sessions, then wait 5 to 10 minutes for the traffic to drain
> away (should be visible in your NMS stats), and then proceed with the
> maintenance?
>
> Of course, this only works for planned reboots, not surprise reboots.

...or link failures.

One other comment:

There has been a long history of poorly behaving BGP stacks that would
take quite some time to hunt through the paths. While this can still
occur for people with near-ancient software and hardware still in use,
many of the modern software/hardware options enable things like BGP-PIC
(in your survey) by default.

Many of the options you document as best practices, like path MTU discovery,
are well-known fixes for networks, as is using a jumbo MTU internally to
obtain a 9k+ MSS for high-performance TCP. Vendors have not always chosen to
enable the TCP options by default the way they have for protocol features
(e.g. BGP-PIC) and, as in Jakob's response, tend to tout other solutions
instead of fixing the TCP stack first.

Many of these performance fixes were documented in 2002 and are considered
best practices by many networks, but due to their obscure knobs they may not
be widely deployed, or are seen as risky to configure. (We had a vendor panic
when we discovered a bug in their TCP-SACK code; they were almost frozen into
not fixing it because touching TCP felt dangerous and there was an inadequate
testing culture around something seen as 'stable'.)

Here's the presentation from IETF 53; I don't see it in the proceedings handily:

http://morse.colorado.edu/~epperson/courses/routing-protocols/handouts/bgp_scalability_IETF.ppt
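
For what it's worth, a rough, assumption-laden illustration of why the MSS
matters when pushing a full table; every number below is made up, and the
point is only the relative packet counts:

    # Rough comparison of how the TCP MSS affects the number of packets (and,
    # if the session is packet-processing bound, the time) needed to push a
    # full BGP table. Every number here is an illustrative assumption.
    import math

    routes = 900_000           # routes in a full table (assumed)
    bytes_per_route = 60       # average encoded UPDATE bytes per route (assumed)
    pkts_per_second = 5_000    # control-plane packet processing rate (assumed)

    total_bytes = routes * bytes_per_route
    for mss in (536, 1460, 8960):
        packets = math.ceil(total_bytes / mss)
        print(f"MSS {mss:>5}: {packets:>7} packets, "
              f"~{packets / pkts_per_second:.1f} s at {pkts_per_second} pkts/s")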

- Jared

If I tear down my eBGP sessions the upstream router withdraws the
route and the traffic just stops. Are your upstreams propagating
withdraws without actually updating their own routing tables?

I believe the simple explanation of the problem can be seen by firing
up an inbound mtr from a distant network and then withdrawing the route
from the path it is taking. It should show either destination
unreachable or a routing loop which "retreats" (under the right
circumstances I have observed it distinctly move 1 hop at a time)
until it finds an alternate path.
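
Along the same lines, a small sketch that watches a remote path converge by
re-running traceroute and timestamping every change (it assumes a Unix-like
traceroute with the common -n/-q/-w flags; the target address is a
placeholder):

    # Repeatedly traceroute towards a destination and print a timestamped line
    # whenever the hop list changes, e.g. while the covering route is being
    # withdrawn. Stop with Ctrl-C. The target is a placeholder address.
    import subprocess, time

    TARGET = "192.0.2.1"        # placeholder: a host inside the affected prefix

    def hops():
        out = subprocess.run(["traceroute", "-n", "-q", "1", "-w", "2", TARGET],
                             capture_output=True, text=True, timeout=120)
        lines = out.stdout.splitlines()[1:]           # skip the header line
        return tuple(line.split()[1] if len(line.split()) > 1 else "*"
                     for line in lines)

    previous = None
    while True:
        current = hops()
        if current != previous:
            print(time.strftime("%H:%M:%S"), " -> ".join(current))
            previous = current
        time.sleep(5)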

My observed convergence times for a single withdraw are, however, in the
sub-10-second range to get all the networks in the original path
pointing at a new one. My view on the problem is that if you are
failing over frequently enough for a customer to notice and report it,
you have bigger problems than convergence times.

- Mike Jones