[outages] Major Level3 (CenturyLink) Issues

the RFO is making the rounds
http://seele.lamehost.it/~marco/blind/Network_Event_Formal_RFO_Multiple_Markets_19543671_19544042_30_August.pdf

it kinda explains the flowspec issue but completely ignores the stuck
routes, which imiho was the more damaging problem.

randy

I suppose now would be a good time for everyone to re-open their CenturyLink ticket and ask why the RFO doesn’t address the most important defect, i.e. the inability to withdraw announcements even by shutting down the session?

Best regards,
Martijn

The more work the BGP process has, the longer it takes to complete that
work. You could try in your RFP/RFQ to get a provider to commit to a
specific convergence time, which would improve your position
contractually and might make you eligible for some compensation or
termination of the contract, but realistically every operator can run
into a situation where you will see what most would agree are
pathologically long convergence times.

The more BGP sessions and the more RIB entries you have, the higher the
probability that these issues manifest. Perhaps protocol-level work can
be justified as well. BGP doesn't have a concept of initial convergence:
if you have a lot of peers, your initial convergence contains a massive
amount of useless work, because you keep changing the best route while
you keep receiving new best routes. The higher the scale, the more
useless work you do and the longer the stability you require to
eventually ~converge. Practical devices operators run may require hours
during _normal operation_ to do the initial convergence.

RFC 7313 might show us a way to reduce the amount of useless work. You
might want to add a signal that initial convergence is done, or a signal
that no installation or best-path run happens until all routes are
loaded. This would massively improve scaled convergence, as you wouldn't
do that throwaway work, which ultimately inflates your work queue and
pushes your useful work far into the future.
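
As a rough illustration of the throwaway work (a toy Python model with
invented data structures, not any vendor's code), compare running
best-path selection plus a FIB install after every single announcement
against deferring that work until an end-of-RIB style "all routes
loaded" signal:

import random

def eager(announcements):
    # Run best-path and a "FIB install" after every single announcement.
    rib, fib_writes = {}, 0
    for prefix, peer, pref in announcements:
        rib.setdefault(prefix, {})[peer] = pref
        best = max(rib[prefix], key=lambda p: rib[prefix][p])  # best-path pass
        fib_writes += 1                                        # mostly throwaway
    return fib_writes

def deferred(announcements):
    # Load everything first, run best-path once per prefix after "EoR".
    rib = {}
    for prefix, peer, pref in announcements:
        rib.setdefault(prefix, {})[peer] = pref
    fib_writes = 0
    for prefix, paths in rib.items():
        best = max(paths, key=lambda p: paths[p])
        fib_writes += 1
    return fib_writes

random.seed(1)
# 10 peers each announcing the same 1000 prefixes, interleaved.
announcements = [(pfx, peer, random.random())
                 for pfx in range(1000) for peer in range(10)]
random.shuffle(announcements)
print("eager FIB churn:   ", eager(announcements))     # 10000 installs
print("deferred FIB churn:", deferred(announcements))  # 1000 installs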

The main thing I would ask as a customer is how we can fix it faster
than 5 hours in the future. Did we lose access to the control plane?
Could we reasonably avoid losing it?

❦ 2 September 2020 10:15 +03, Saku Ytti:

> RFC 7313 might show us a way to reduce the amount of useless work. You
> might want to add a signal that initial convergence is done, or a signal
> that no installation or best-path run happens until all routes are
> loaded. This would massively improve scaled convergence, as you wouldn't
> do that throwaway work, which ultimately inflates your work queue and
> pushes your useful work far into the future.

It seems BIRD contains an implementation of RFC 7313. From the source
code, it delays removal of stale routes until EoRR, but it doesn't seem
to delay the work of updating the kernel. Juniper doesn't seem to
implement it. Cisco seems to implement it, but only on refresh, not on
the initial connection. Is there some survey around this RFC?
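
For reference, the mechanism in question is roughly the following (a
Python sketch of RFC 7313's BoRR/EoRR stale-route handling; the class
and method names are mine, not taken from BIRD's source):

class AdjRibIn:
    def __init__(self):
        self.routes = {}   # prefix -> path attributes
        self.stale = set()

    def on_borr(self):
        # Begin-of-Route-Refresh: mark everything currently held as stale.
        self.stale = set(self.routes)

    def on_update(self, prefix, attrs):
        # Any prefix re-advertised during the refresh is no longer stale.
        self.routes[prefix] = attrs
        self.stale.discard(prefix)

    def on_eorr(self):
        # End-of-Route-Refresh: whatever is still stale was not re-advertised,
        # so it can finally be removed; until now stale routes stayed usable.
        for prefix in self.stale:
            del self.routes[prefix]
        self.stale.clear()

rib = AdjRibIn()
rib.on_update("192.0.2.0/24", {"as_path": [65001]})
rib.on_update("198.51.100.0/24", {"as_path": [65001]})
rib.on_borr()
rib.on_update("192.0.2.0/24", {"as_path": [65001]})  # re-learned during refresh
rib.on_eorr()
print(sorted(rib.routes))  # only 192.0.2.0/24 survives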

Correct, it doesn't do anything for the initial convergence, but I took
it as an example of how we might approach the problem of initial
convergence cost in scaled environments.

Sure, but I don’t care how busy your router is, it shouldn’t take hours to withdraw routes.

Quite, the discussion is less about how we feel about it and more about
why it happens and what could be done about it.

I am not buying it. No normal implementation of BGP stays online, replying to heartbeats and accepting updates from eBGP peers, yet after 5 hours still fails to process withdrawals from customers.

On Wed, 2 Sep 2020 at 14:11, Saku Ytti <saku@ytti.fi> wrote:

I can imagine writing a BGP implementation like this:

a) own queue for keepalives, which I always serve first, fully
b) own queue for updates, which I serve second
c) own queue for withdraws, which I serve last

Why might I think this makes sense? Perhaps I just received from RR2
the prefix I'm pulling from RR1; if I don't handle all my updates
first, I'm causing an outage that should not happen, because I already
received the update telling me I don't need to withdraw the route.

Is this the right way to do it? Maybe not, but it's easy to imagine
why it might seem like a good idea.
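
A toy sketch of that hypothetical scheduler (Python, all names invented,
not any real BGP stack) shows how the withdraw queue gets starved
whenever session churn keeps the update queue from ever draining:

from collections import deque

# Strict-priority queues: keepalives first, then updates, withdraws last.
keepalives, updates, withdraws = deque(), deque(), deque()
queues = {"keepalive": keepalives, "update": updates, "withdraw": withdraws}

def enqueue(kind, msg):
    queues[kind].append(msg)

def service_one():
    # Serve exactly one message per round, by strict priority.
    for q in (keepalives, updates, withdraws):
        if q:
            return q.popleft()
    return None

enqueue("withdraw", "withdraw 203.0.113.0/24")
for i in range(5):
    enqueue("update", "update batch %d" % i)

served = []
for _ in range(5):
    enqueue("update", "more churn")     # new update work arrives every round
    served.append(service_one())

print(served)           # only updates were serviced
print(list(withdraws))  # the withdraw is still waiting in its queue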

How well BGP works in the common case and how it works in pathologically
scaled and busy cases are two very different things.

I know that even in stable states, commonly run vendors on commonly run
hardware can take over 2 hours to finish converging iBGP on initial
turn-up.

Yeah. This actually would be a fascinating study to understand exactly what happened. The volume of BGP messages flying around because of the session churn must have been absolutely massive, especially in a complex internal infrastructure like 3356 has.

I would say the scale of such an event has to be many orders of magnitude beyond what anyone ever designed for, so it doesn’t shock me at all that unexpected behavior occurred. But that’s why we’re engineers; we want to understand such things.

Cisco had a bug a few years back that affected metro switches such that they would not withdraw routes upstream. We had an internal outage and one of my carriers kept advertising our prefixes even though we withdrew the routes. We tried downing the neighbor and even shutting down the physical interface to no avail. The carrier kept blackholing us until they shut down on their metro switch.

creative engineers can conjecturbate for days on how some turtle in the
pond might write code that did not withdraw for a month, or on other
delightful reasons why CL might have had this really really bad behavior.

the point is that the actual symptoms and cause really really should be
in the RFO

randy

Sure. But being good engineers, we love to exercise our brains by thinking about possibilities and probabilities.
For example, we don’t form disaster response plans by saying “well, we could think about what could happen for days, but we’ll just wait for something to occur”.

-A

> we don't form disaster response plans by saying "well, we could think
> about what *could* happen for days, but we'll just wait for something
> to occur".

from an old talk of mine, if it was part of the “plan” it’s an “event,”
if it is not then it’s a “disaster.”

I believe someone on this list reported that updates were also broken. They could not add prepending nor modify communities.

Anyway I am not saying it cannot happen because clearly something did happen. I just don’t believe it is a simple case of overload. There has to be more to it.

On Wed, 2 Sep 2020 at 15:36, Saku Ytti wrote:

Detailed explanation can be found below.

https://blog.thousandeyes.com/centurylink-level-3-outage-analysis/

❦ 2 September 2020 16:35 +03, Saku Ytti:

>> I am not buying it. No normal implementation of BGP stays online,
>> replying to heartbeats and accepting updates from eBGP peers, yet
>> after 5 hours still fails to process withdrawals from customers.
>
> I can imagine writing a BGP implementation like this:
>
> a) own queue for keepalives, which I always serve first, fully
> b) own queue for updates, which I serve second
> c) own queue for withdraws, which I serve last

Or maybe, graceful restart configured without a timeout on IPv4/IPv6?
The flowspec rule severed the BGP session abruptly, stale routes are
kept due to graceful restart (except the flowspec rules), the BGP
sessions are reestablished, but the flowspec rule is handled before
reaching EoR, and we loop from there.
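
A toy model of that loop (Python, purely hypothetical, not Level3's
actual configuration): with graceful restart and no stale-routes timer,
stale paths are only flushed at EoR, so if the session keeps getting
torn down before EoR arrives, the old customer routes are never
withdrawn:

class Session:
    def __init__(self):
        self.rib = {"203.0.113.0/24": "customer A"}  # learned before the event
        self.stale = set()

    def abrupt_close(self):
        # Peer vanished without notification: keep routes, mark them stale.
        self.stale = set(self.rib)

    def reestablish_and_replay(self, updates, eor_reached):
        for prefix, origin in updates:
            self.rib[prefix] = origin
            self.stale.discard(prefix)
        if eor_reached:
            # Only EoR flushes leftover stale routes; with no stale timer,
            # nothing else ever removes them.
            for prefix in self.stale:
                del self.rib[prefix]
            self.stale.clear()

s = Session()
for _ in range(10):                      # the flowspec rule keeps killing the session
    s.abrupt_close()
    s.reestablish_and_replay(updates=[], eor_reached=False)
print(s.rib)                             # customer prefix still present, hours later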

... or all routes are fed into some magic route optimization box which
is designed to keep things more stable and take advantage of cisco's
"step-10" to suck more traffic, or....

The root issue here is that the *public* RFO is incomplete / unclear.
"Something something flowspec, something blocked flowspec, no more
something" does indeed explain that something bad happened, but not
what caused the lack of withdraws / the cascading churn.
As with many interesting outages, I suspect that we will never get the
full story, and "Something bad happened, we fixed it, and now it's all
better and will never happen ever again, trust us..." seems to be the
new normal for public postmortems...

W

It's possible Level3's people don't fully understand what happened, or that the "bad flowspec rule" causing BGP sessions to repeatedly flap network-wide triggered software bugs on their routers. You've never seen rpd stuck at 100% CPU for hours, or an MX960 advertising history routes to external peers even after the internal session that had advertised the route to it has been cleared?

To quote Zaphod Beeblebrox: "Listen, three eyes, don't you try to outweird me. I get stranger things than you free with my breakfast cereal."

Kick a BGP implementation hard enough, and weird shit is likely to happen.

If only routers had feelings…

Mark.