Question about normal ops - BGP Flaps nightly

Christopher_Morrow · November 21, 2019, 9:45am

Howdy!
A question of interest to me, currently, is whether it's normal for
providers to cause BGP flaps to their customers nightly... This seems,
in my case, to be the provider PROBABLY updating prefix-filters on my
session(s).

Particularly AS56554 is currently getting v4/v6 transit from 2
providers, one of which we have 2 links toward. That provider appears
to flap both of our ipv6 (only) bgp peers each night at about the same
time. This smells like "filter updates", but something
that's different than the v4 filter update? (or perhaps they have no
v4 filtering to update?)

In the end, should customers expect nightly (or on a regular cadence)
to see their sessions bounce? It hasn't been my experience in other
situations...

-chris

Jared_Mauch · November 21, 2019, 9:47am

This seems unusual, perhaps a bug in their tooling or their config where it’s doing a hard clear vs soft clear on the session?
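
To make the hard vs soft distinction concrete, here is a minimal Python sketch of a toy session model. The `Session` class, its method names, the `new_filter` function and the documentation prefixes are all hypothetical and not any vendor's API. It only illustrates the idea that a soft clear re-applies policy while the session stays up, whereas a hard clear drops every route until the session comes back.

```python
class Session:
    """Toy model of one BGP session; not a real BGP implementation."""

    def __init__(self, advertised):
        self.advertised = list(advertised)  # prefixes the peer is sending
        self.accepted = list(advertised)    # prefixes that passed the filter
        self.resets = 0                     # session teardowns so far

    def soft_clear(self, prefix_filter):
        # Re-apply the new filter to what the peer sends; session stays up.
        self.accepted = [p for p in self.advertised if prefix_filter(p)]

    def hard_clear(self):
        # Tear the session down: every route is gone until it re-establishes.
        self.resets += 1
        self.accepted = []

    def reestablish(self, prefix_filter):
        # Peer re-sends its table and the filter is applied from scratch.
        self.accepted = [p for p in self.advertised if prefix_filter(p)]


def new_filter(prefix):
    # Hypothetical updated prefix filter: allow only this documentation prefix.
    return prefix == "2001:db8:1::/48"


soft = Session(["2001:db8:1::/48", "2001:db8:2::/48"])
soft.soft_clear(new_filter)

hard = Session(["2001:db8:1::/48", "2001:db8:2::/48"])
hard.hard_clear()
routes_during_reset = list(hard.accepted)
hard.reestablish(new_filter)
```

Both sessions end up with the same accepted routes, but only the hard clear passes through a state with no routes at all, which is the part a customer sees as a flap.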

- Jared

Christopher_Morrow · November 21, 2019, 9:57am

This was sort of my thinking, but I was unsure if there was some new
process and/or bug which other edge-y folk were dealing with of late.
I can/will ask the provider in question (a local apac provider) if
they are aware of the actions they are taking.

Mel_Beckman · November 21, 2019, 2:16pm

No. There should be no reason to bounce the session. Do you have soft updates turned on?

-mel via cell

Tom_Beecher · November 21, 2019, 4:54pm

I agree that this sounds like an automated process in some way.

I would suspect that either a vendor code update changed something, such that a given command that previously did not cause a session reset now does, or they changed their automation to include a command that causes a reset without realizing it, and it slipped through the cracks.

Baldur_Norddahl · November 21, 2019, 5:41pm

A BGP reset can cause routing trouble for as much as 15 minutes. Since you have two sessions that mitigates the problem somewhat. But nevertheless this will not be acceptable.

Regards

Baldur


Saku_Ytti1 · November 21, 2019, 5:59pm

As there are best path algorithms which consider route age, BGP reset
impact may be indefinite.
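
A small Python sketch of why the effect can persist, assuming a simplified best path order of local-pref, then AS-path length, then oldest route. The `Path` fields, the `best_path` and `flap` helpers and the neighbor names are hypothetical; real implementations have more steps and differ by vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    neighbor: str
    local_pref: int
    as_path_len: int
    learned_at: int  # seconds since some epoch; smaller means older


def best_path(paths):
    # Simplified tie-break: highest local-pref, shortest AS path, oldest route.
    return min(paths, key=lambda p: (-p.local_pref, p.as_path_len, p.learned_at))


def flap(paths, neighbor, now):
    # A session reset withdraws the path and relearns it with a fresh age.
    return [
        Path(p.neighbor, p.local_pref, p.as_path_len, now)
        if p.neighbor == neighbor
        else p
        for p in paths
    ]


paths = [
    Path("provider-1", 100, 3, learned_at=0),
    Path("provider-2", 100, 3, learned_at=500),
]
before = best_path(paths).neighbor
after = best_path(flap(paths, "provider-1", now=86400)).neighbor
```

In this toy case the path via `provider-1` is best before the flap and loses the age tie-break afterwards, so traffic stays on `provider-2` until something else changes.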

Christopher_Morrow · November 22, 2019, 12:20am

> I agree that this sounds like an automated process in some way.
>
> I would suspect that either a vendor code update changed something such that a given command that would not cause session reset now does, or they changed their automation to include a command that would cause a reset without realizing it/slipped through the cracks / etc.

thanks to some private chat with another nanog participant it was
noted the reason for failure is:
"Error event Operation timed out(60) for I/O session - closing it"

This is fine, I suppose, except that I have v4/v6 sessions on the same
ptp link/path. So, if v6 times out I'd have expected v4 to also
time out.
Strangely I had thought we were told the 2 links we have land on 2
different devices, but router-id tells me that's false as well.
The sessions appear to reset on both devices (according to syslog) at
the same time; I had thought (because our alerter is telling me so)
that there was a gap between the 2 drops.

The physical layer is some bidi fiber path across an L2 (ether)
network to the provider, so perhaps the problem isn't in the l3/bgp parts
here, but in the l2 network between. We are at the end of our time
here so I think I'll gather some logs and see if the provider can make
sense of the issues.

Christopher_Morrow · November 22, 2019, 12:21am

> A BGP reset can cause routing trouble for as much as 15 minutes. Since you have two sessions that mitigates the problem somewhat. But nevertheless this will not be acceptable.

> As there are best path algorithms which consider route age, BGP reset
> impact may be indefinite.

fortunately we have a second actual provider... so this all isn't
super impacting to us, just weird and unexpected on my part.

Baldur_Norddahl · November 22, 2019, 4:32am

No, that is not helping. When the BGP session flaps, your routes via that provider are withdrawn. Everyone out there who was using those routes will need to switch. But consider the following:

ISP A has routes from both of your providers
ISP B has A as uplink

BGP works so that ISP A announces to ISP B only the route that it is actually using. ISP B therefore does not have both of your routes. When the active route is withdrawn, ISP B will momentarily be without any route to your network. It can take some time after the withdrawal before ISP A announces that it is now using the alternative route. This gets worse with longer chains. Also, some ISPs use route flap limiting techniques that can prolong this process.

As I said, my experience is that you can expect as much as 15 minutes of flaky internet after a BGP reset. This is with multiple transit providers.

I can not say too much about why you have BGP resets, but I can say that you really want it fixed. It will affect your connectivity.
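
The ISP A / ISP B case can be sketched as a toy Python model. The `Router` class, the neighbor names and the AS-path lengths are hypothetical, and the model assumes ISP A passes the withdrawal on to ISP B before it announces the alternative route.

```python
class Router:
    """Toy router that keeps one candidate path per neighbor."""

    def __init__(self, name):
        self.name = name
        self.rib_in = {}  # neighbor name -> AS-path length

    def learn(self, neighbor, as_path_len):
        self.rib_in[neighbor] = as_path_len

    def withdraw(self, neighbor):
        self.rib_in.pop(neighbor, None)

    def best(self):
        # Shortest AS path wins; None means no route at all.
        if not self.rib_in:
            return None
        return min(self.rib_in, key=lambda n: (self.rib_in[n], n))


isp_a = Router("ISP A")
isp_b = Router("ISP B")

# ISP A hears the prefix from both providers, ISP B only hears ISP A's best.
isp_a.learn("provider-1", 2)
isp_a.learn("provider-2", 3)
isp_b.learn("ISP A", isp_a.rib_in[isp_a.best()] + 1)
seen_by_b = [isp_b.best()]

# The session to provider-1 flaps and the withdrawal reaches ISP B first.
isp_a.withdraw("provider-1")
isp_b.withdraw("ISP A")
seen_by_b.append(isp_b.best())

# Some time later ISP A announces the alternative it is now using.
isp_b.learn("ISP A", isp_a.rib_in[isp_a.best()] + 1)
seen_by_b.append(isp_b.best())
```

`seen_by_b` ends up as `["ISP A", None, "ISP A"]`: the `None` in the middle is the window in which ISP B has no route, even though ISP A had a second path the whole time.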

Regards,

Baldur

Christopher_Morrow · November 22, 2019, 4:39am

> > A BGP reset can cause routing trouble for as much as 15 minutes. Since you have two sessions that mitigates the problem somewhat. But nevertheless this will not be acceptable.
>
> As there are best path algorithms which consider route age, BGP reset
> impact may be indefinite.

>> fortunately we have a second actual provider... so this all isn't
>> super impacting to us, just weird and unexpected on my part.
>
> No, that is not helping. When the BGP session flaps, your routes via that provider are withdrawn. Everyone out there who was using those routes will need to switch. But consider the following:
>
> ISP A has routes from both of your providers
> ISP B has A as uplink
>
> BGP works so that ISP A announces to ISP B only the route that it is actually using. ISP B therefore does not have both of your routes. When the active route is withdrawn, ISP B will momentarily be without any route to your network. It can take some time after the withdrawal before ISP A announces that it is now using the alternative route. This gets worse with longer chains. Also, some ISPs use route flap limiting techniques that can prolong this process.
>
> As I said, my experience is that you can expect as much as 15 minutes of flaky internet after a BGP reset. This is with multiple transit providers.

Yup, I'm sensitive to flapping causing problems. This was why I
started the thread, which really should have been:
"Is there a well known bug people are working around? or is this a
new problem I should chase with the provider? or 'nah, everyone does
this, you just aren't normally paying attention'"

> I can not say too much about why you have BGP resets, but I can say that you really want it fixed. It will affect your connectivity.

fortunately 3am local time is not prime-internet-use time phew!
(not a great excuse though, of course)

I'll be chasing up the provider to see what's up.
thanks!
-chris

Warren_Kumari · November 22, 2019, 4:48am

>> > > A BGP reset can cause routing trouble for as much as 15 minutes. Since you have two sessions that mitigates the problem somewhat. But nevertheless this will not be acceptable.
>> >
>> > As there are best path algorithms which consider route age, BGP reset
>> > impact may be indefinite.
>>
>> fortunately we have a second actual provider... so this all isn't
>> super impacting to us, just weird and unexpected on my part.
>>
>
> No, that is not helping. When the BGP session flaps, your routes via that provider are withdrawn. Everyone out there who was using those routes will need to switch. But consider the following:
>
> ISP A has routes from both of your providers
> ISP B has A as uplink
>
> BGP works so that ISP A announces to ISP B only the route that it is actually using. ISP B therefore does not have both of your routes. When the active route is withdrawn, ISP B will momentarily be without any route to your network. It can take some time after the withdrawal before ISP A announces that it is now using the alternative route. This gets worse with longer chains. Also, some ISPs use route flap limiting techniques that can prolong this process.
>
> As I said, my experience is that you can expect as much as 15 minutes of flaky internet after a BGP reset. This is with multiple transit providers.

> Yup, I'm sensitive to flapping causing problems. This was why I
> started the thread, which really should have been:
> "Is there a well known bug people are working around? or is this a
> new problem I should chase with the provider? or 'nah, everyone does
> this, you just aren't normally paying attention'"

>
> I can not say too much about why you have BGP resets, but I can say that you really want it fixed. It will affect your connectivity.
>

> fortunately 3am local time is not prime-internet-use time phew!
> (not a great excuse though, of course)

The other saving grace / "meh" is that this is for a conference
network, and we are picking up sticks and leaving tomorrow... so, we
will let the provider know that there is something that should be
fixed, but a: our pain will have stopped and b: we won't really
have a good way to know if they have fixed the issue (other than
perhaps watching for a spike of withdrawals / reannouncements every 24
hours through this AS path)
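
A rough Python sketch of that kind of watching, assuming you already have timestamps of the withdraw/reannounce spikes from some route collector or monitoring feed. The `looks_nightly` helper, the 15-minute tolerance and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def looks_nightly(flap_times, tolerance=timedelta(minutes=15)):
    """Return True if consecutive flaps are roughly 24 hours apart."""
    times = sorted(flap_times)
    if len(times) < 3:
        # Too few events to call it a pattern.
        return False
    day = timedelta(hours=24)
    return all(abs((b - a) - day) <= tolerance for a, b in zip(times, times[1:]))


# Made-up spike times around 03:00 local on three consecutive nights.
nightly = [
    datetime(2019, 11, 19, 3, 2),
    datetime(2019, 11, 20, 3, 5),
    datetime(2019, 11, 21, 3, 1),
]
# Made-up spike times with no daily rhythm.
irregular = [
    datetime(2019, 11, 19, 3, 2),
    datetime(2019, 11, 19, 17, 40),
    datetime(2019, 11, 21, 9, 15),
]
```

If the pattern stops showing up for this AS path after the provider has been told, that is at least a hint that something was fixed.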

W

Yoni_Radzin · November 22, 2019, 2:33pm

“Someday we’ll find it: the stable connection, my providers, my routers and me.” - Kermit the Frog

https://networkphil.com/

Couldn’t resist, having read this and then almost immediately seen a link to the above “Rainbow Connection” remix via a LinkedIn post. Happy Friday!

Cheers,
Yoni

Mark_Tinka1 · November 27, 2019, 10:44am

A practical problem we've seen with Cisco's BGP-SD (Selective Route
Download) implementation is that 0/0 and ::/0, when learned via BGP,
are installed last.

So consider a situation where BGP flaps a session on IOS or IOS XE
running BGP-SD. Even though the full BGP table is being held in RIB only
(which can take about 10 minutes to fully download with the CPU
performance of, say, an ME3600X or an ASR920), a default route coming in
over an iBGP session will get loaded only after all more specific routes
have been installed and the best path algorithm has been run against them.

If you write only default into FIB on these platforms, you're basically
blackholing traffic for as long as it takes for BGP to reconverge.
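
A toy Python sketch of that ordering problem, not Cisco's actual code path. The `process_order` and `first_fib_install` helpers, the filter and the documentation prefixes are hypothetical; the only behaviour modelled is "default is processed after every more specific route".

```python
import ipaddress


def process_order(prefixes):
    # More specific routes first, default routes (prefix length 0) last.
    return sorted(prefixes, key=lambda p: ipaddress.ip_network(p).prefixlen == 0)


def first_fib_install(prefixes, fib_filter):
    """Return the 1-based step at which the first route reaches the FIB."""
    for step, prefix in enumerate(process_order(prefixes), start=1):
        if fib_filter(prefix):
            return step
    return None


def default_only(prefix):
    # Hypothetical selective-download policy: only default goes into the FIB.
    return ipaddress.ip_network(prefix).prefixlen == 0


table = ["0.0.0.0/0", "192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
step = first_fib_install(table, default_only)
```

With four routes the FIB stays empty until step 4; scale the table up to a full feed on a slow CPU and that empty period is the blackhole window.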

So yes, while the fundamental design for this by Cisco is inherently
flawed, unnecessary session resets are not ideal.

Mark.