BFD for routes learned through Route-servers in IXPs

From time to time, at some IXP in the world, an issue on the forwarding plane occurs.

When it occurs, this topic comes back.

The failures are not big enough to drop the BGP sessions between IXP participants and the route-servers.

But they are enough to harm traffic between participants.

And then the problem comes:
“How can I check if my communication towards the NextHop of the routes that I learn from the route-servers is OK?
If it is not OK, how can I remove it from my FIB?”

Some other possible causes of this symptom are:

  • ARP Resolution issues
    (CPU protection combined with lunatic Mikrotiks with a 30-second ARP timeout is a bombastic recipe)
  • MAC-Address Learning limitations on the transport link of the participants can be a pain in the a…rm.

So, I was searching for how to solve that, and I found a draft (8th revision) with the intention to solve it…
https://tools.ietf.org/html/draft-ietf-idr-rs-bfd-08

If I understood correctly, the effective implementation of it will depend on new code in any BGP engine that wants to do that check.
It is kind of frustrating… It could be at least 10 years between the release of the RFC and the refresh of every router involved in IXPs in the world.

Some questions come:
A) Is there anything that we can do to rush this?
B) Is there any other alternative to that?

P.S.1: I gave up on inventing crazy BGP filter policies to test NextHop reachability. Their effectiveness can’t even be compared to BFD, and they almost killed the processing capacity of my router.

P.S.2: IMHO, the biggest downside of those problems is some participants abandoning the route-servers when the issues described above occur.

So, I was searching for how to solve that, and I found a draft (8th revision)
with the intention to solve it...
draft-ietf-idr-rs-bfd-08

If I understood correctly, the effective implementation of it will depend on
new code in any BGP engine that wants to do that check.
It is kind of frustrating... It could be at least 10 years between the release
of the RFC and the refresh of every router involved in IXPs in the world.

you have a better (== easier to implement and deploy) signaling path?

the draft passed wglc in 1948. it is awaiting two implementations, as
is the wont of the idr wg.

randy

“How can I check if my communication towards the NextHop of the routes that I learn from the route-servers is OK? If it is not OK, how can I remove it from my FIB?”

Install a route optimizer that constantly pings next hops; when the drop threshold is met, remove the routes. No one is going to open BFD to whole subnets, especially those they don’t have peering agreements with, making this pointless.
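A route optimizer of this kind can be sketched in a few lines; the following is a hypothetical illustration of the "ping and withdraw on threshold" idea only (the probe method, threshold, and function names are assumptions, not any real product's API):

```python
import subprocess

DROP_THRESHOLD = 3  # assumed policy: consecutive failed probes before withdrawal

def probe(next_hop: str) -> bool:
    """Send one ICMP echo; True if the next hop answered within 1 s."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "1", next_hop],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def update_state(failures: dict, next_hop: str, alive: bool) -> bool:
    """Track consecutive failures per next hop; True means the routes
    via this next hop should now be withdrawn from the FIB."""
    failures[next_hop] = 0 if alive else failures.get(next_hop, 0) + 1
    return failures[next_hop] >= DROP_THRESHOLD
```

A real deployment would also need hold-down/dampening and a platform-specific hook to actually withdraw and restore the routes.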

  • ARP Resolution issues (CPU protection combined with lunatic Mikrotiks with a 30-second ARP timeout is a bombastic recipe)

CoPP is always important, and it’s not just Mikrotiks with default low ARP timeouts.

Linux - 1 minute
Brocade - 10 minutes
Cumulus - 18 minutes
BSD distros - 20 minutes
Extreme - 20 minutes
HP - 25 minutes

  • MAC-Address Learning limitations on the transport link of the participants can be a pain in the a…rm.

As you said, this issue doesn’t seem important enough to warrant significant action. For transport, colo a switch that can handle BGP announcements, routes, and ARPs, then transport that across with only 2 MACs and internal point-to-point IP assignments.

Ryan

Hi,

In some IXPs, getting a BFD-protected BGP session with their
route-servers is possible. However, it is usually optional, so there is
no way to discover which of your MLPA peering partners have their
sessions protected the same way and which don't.

You can also ask peers you have a session with to enable BFD there. If
they run carrier-grade border routers connected to IXP switches just with
fibers, it works pretty well.

So just try to talk with your peers about BFD.

Or, if you want a more reliable IXP experience, don't install a route optimiser, and if you do, don't make it ping next-hops.

- you're not guaranteed that the icmp reply back to the route optimiser will follow the forward path.

- you are guaranteed that icmp is heavily deprioritised on ixp routers

- the busier the IXP, the busier the control planes of all the IXP routers you're going to ping, and the more likely they are to drop your ping packets. This will lead to greater route churn. If this approach is widely deployed it will lead to wider-scale routing oscillations due to control plane mismanagement.

- route optimisers are associated with serious bgp leakage issues. if you're doing this at an IXP, the danger is significantly magnified because bi-lat peering sessions rarely, if ever, implement prefix filtering.

It is true that IXPs occasionally see forwarding plane failures. These tend to be pretty unusual these days.

Be careful about optimising edge cases like this. You'll often end up introducing new failure modes which may be more serious and which may occur more regularly.

Nick

CoPP is always important, and it's not just Mikrotiks with default low
ARP timeouts.

Linux - 1 minute
Brocade - 10 minutes
Cumulus - 18 minutes
BSD distros - 20 minutes
Extreme - 20 minutes

Juniper - 20 minutes

I think you also mean to say: "this is actually still a DRAFT and not
an RFC, so really no BGP implementor is beholden to this document,
unless they have coin bearing customers who wish to see this feature
implemented"

So, I was searching for how to solve that, and I found a draft (8th revision)
with the intention to solve it...
draft-ietf-idr-rs-bfd-08

If I understood correctly, the effective implementation of it will depend on
new code in any BGP engine that wants to do that check.
It is kind of frustrating... It could be at least 10 years between the release
of the RFC and the refresh of every router involved in IXPs in the world.

you have a better (== easier to implement and deploy) signaling path?

the draft passed wglc in 1948. it is awaiting two implementations, as
is the wont of the idr wg.

I think you also mean to say: "this is actually still a DRAFT and not
an RFC, so really no BGP implementor is beholden to this document,
unless they have coin bearing customers who wish to see this feature
implemented"

if i had meant to say that, i probably would have. no one on this
thread has called it anything other than a draft, so i am quite unsure
what your point is; and i will not put words in your mouth.

sadly, these years, vendors do not seem to care a lot about drafts,
rfcs, ... anything which sells.

randy

>>> So, I was searching for how to solve that, and I found a draft (8th revision)
>>> with the intention to solve it...
>>> draft-ietf-idr-rs-bfd-08
>>>
>>> If I understood correctly, the effective implementation of it will depend on
>>> new code in any BGP engine that wants to do that check.
>>> It is kind of frustrating... It could be at least 10 years between the release
>>> of the RFC and the refresh of every router involved in IXPs in the world.
>>
>> you have a better (== easier to implement and deploy) signaling path?
>>
>> the draft passed wglc in 1948. it is awaiting two implementations, as
>> is the wont of the idr wg.
>
> I think you also mean to say: "this is actually still a DRAFT and not
> an RFC, so really no BGP implementor is beholden to this document,
> unless they have coin bearing customers who wish to see this feature
> implemented"

if i had meant to say that, i probably would have. no one on this
thread has called it anything other than a draft, so i am quite unsure
what your point is; and i will not put words in your mouth.

I think the OP said:
"At least 10 years after the release of the RFC until the refresh of every router involved in IXPs in the world."

it's not an rfc yet.

sadly, these years, vendors do not seem to care a lot about drafts,
rfcs, ... anything which sells.

sure :frowning:

IOS - 4 hours

Why are these considered (by Ryan) low values? Does low have a
negative connotation here?

ARP timeout should be lower than MAC timeout, and MAC timeout usually
is 300 seconds. Anything above 300 seconds is probably a poor BCP for a
default value, as defaults should interoperate in a somewhat sane
manner.
Of course operators are free to configure very high ARP timeout, as
long as they also remember to equally configure higher MAC timeout.

From time to time, at some IXP in the world, an issue on the forwarding plane occurs.
When it occurs, this topic comes back.

The failures are not big enough to drop the BGP sessions between IXP participants and the route-servers.

But they are enough to harm traffic between participants.

And then the problem comes:
"How can I check if my communication towards the NextHop of the routes that I learn from the route-servers is OK?
If it is not OK, how can I remove it from my FIB?"

If the traffic is that important then the public internet is the wrong
way to transport it. The internet has convergence times up to multiple
minutes. Failures can occur everywhere.
Reacting to these changes comes at a global cost.

Some other possible causes of this symptom are:
- ARP Resolution issues
(CPU protection combined with lunatic Mikrotiks with a 30-second ARP timeout is a bombastic recipe)
- MAC-Address Learning limitations on the transport link of the participants can be a pain in the a..rm.

IXPs can and do limit the MAC addresses allowed on a participant port.
IXPs usually provide a sane config which includes ARP timeouts (which
can be checked, and an ARP sponge helps as well).
The same goes for all the other multicast/broadcast protocols.

So, I was searching for how to solve that, and I found a draft (8th revision) with the intention to solve it...
draft-ietf-idr-rs-bfd-08

If I understood correctly, the effective implementation of it will depend on new code in any BGP engine that wants to do that check.
It is kind of frustrating... It could be at least 10 years between the release of the RFC and the refresh of every router involved in IXPs in the world.

Some questions come:
A) Is there anything that we can do to rush this?
B) Is there any other alternative to that?

IXPs are not simple L2 switches anymore; forwarding is done with
LACP/MPLS/VXLAN/... over multiple paths. When A and B can reach a
route-server, that does not guarantee that A can reach B.
Using BFD between members might help, or might not, as you cannot check
the complete topology below.

The IXP should use BFD and maybe even compare interface counters on
both sides of a link in their infrastructure.

@past dayjob: We monitored IXP health by pinging our peers/next-hops
every X minutes and alerted the NOC when there were bigger changes,
like 10% of the peers/next-hops that responded before no longer
responding to ICMP.
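The alerting rule described above (flag when roughly 10% of previously responding next-hops go quiet) can be sketched like this; the set-based bookkeeping and threshold parameter are assumptions drawn from the description, not the actual tool:

```python
def reachability_dropped(previous: set, current: set, threshold: float = 0.10) -> bool:
    """True when more than `threshold` of the next hops that answered
    in the previous probing round have stopped answering ICMP."""
    if not previous:
        return False  # nothing to compare against yet
    lost = previous - current
    return len(lost) / len(previous) > threshold
```

The point of alerting a NOC rather than acting automatically is that a human can tell an IXP fabric event from control-plane ICMP deprioritisation, which a route optimizer cannot.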

P.S.1: I gave up on inventing crazy BGP filter policies to test NextHop reachability. Their effectiveness can't even be compared to BFD, and they almost killed the processing capacity of my router.

P.S.2: IMHO, the biggest downside of those problems is some participants abandoning the route-servers when the issues described above occur.

Route-servers caused some issues in the past, like not propagating the
revocation/timeout of prefixes.
Some peers like a more direct relationship.

If the traffic is that important then the public internet is the wrong
way to transport it.

Nonsense.

It is usually something said by those who do not know how to use the Internet as a transport in a reliable way between two endpoints.

In your book, what is the Internet good for? Torrents and porn?

The internet has convergence times up to multiple minutes.

It does not matter how long it takes to “converge” any single path.

Hint: Consider using multiple disjoint paths and you will see that for the vast majority of “Internet failures” the connectivity restoration time would be very close to the RTT between your endpoints.

Rgs,
R.

About this comparison between CAM-table timeout and ARP-table timeout:
I tend to partially agree with you…

Ethernet is a protocol widely used across very different scenarios.
We need to consider the different needs of each type of communication.

For example:
I’m not a big fan of Mikrotik/RouterOS.
But I know they are there, and like it or not, I need to accept that I will need to deal with them (as a peer or even as an operator).

One of the most common uses of Mikrotik is for HotSpot/Captive Portal.
And for that, an ARP timeout of 30 seconds is very OK!
It is a good way to check if the end user is still reachable on the network, and based on that, do the billing.

But 30 seconds for an IXP? It does not make any sense!
Those packets are stealing CPU cycles from the control plane of every router in the LAN.

Another example:
You suggested equalizing ARP timeout and MAC timeout.
For a campus LAN, with frequent topology changes and hosts being added/removed all the time…
That is perfect!

But talking about an IXP LAN:
In an ideal scenario, how often should topology changes happen on an IXP?
How often do hosts get in and out of an IXP LAN?

Why should we spend CPU cycles on 576K ARP requests a day (2K participants, 5-minute ARP timeout)
instead of 12K ARP requests a day (2K participants, 4-hour ARP timeout)?
I would prefer to use those CPU cycles to process other things like BGP messages, BFD, etc…
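As a sanity check of the arithmetic above, assuming each router re-resolves each of its ~2K neighbours once per ARP timeout interval (a simplified steady-state model):

```python
def arp_requests_per_day(neighbours: int, arp_timeout_s: int) -> int:
    """ARP requests one router sends per day, re-resolving each
    neighbour once per timeout interval (simplified steady state)."""
    return neighbours * (24 * 3600 // arp_timeout_s)

print(arp_requests_per_day(2000, 5 * 60))    # 5-minute timeout -> 576000
print(arp_requests_per_day(2000, 4 * 3600))  # 4-hour timeout   -> 12000
```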

Well…
My idea with the initial mail was:

a) Check if there is anything hindering the evolution of this draft to an RFC.

b) Try to make possible something that nowadays could be considered impossible, like:
“How to enable the BFD capability on a route-server with 2000 BGP sessions without crashing the box?”

And maybe:
c) How about suggesting a standard best practice for ARP timeout in IXPs,
and creating tools to measure the ARP timeout configuration of each participant, making this info available through standard protocols?
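One conceivable passive way to measure a participant's effective ARP timeout is from the spacing of the ARP requests their router emits for a stable neighbour. This is purely a hypothetical sketch; the estimation rule and names are invented for illustration, no such tool is being described in the thread:

```python
def estimate_arp_timeout(request_times: list) -> float:
    """Estimate an ARP timeout from the timestamps (in seconds) of the
    ARP requests a router sent for one stable, always-up neighbour:
    in steady state the smallest inter-request gap approximates it."""
    if len(request_times) < 2:
        raise ValueError("need at least two observed requests")
    gaps = [b - a for a, b in zip(request_times, request_times[1:])]
    return min(gaps)

# A router re-ARPing roughly every 30 s (Mikrotik-like behaviour):
print(estimate_arp_timeout([0, 31, 60, 92, 121]))  # -> 29
```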

I think this communication may not be very communicative.

How many more BGP messages per day can we process if we do 12k ARP
requests a day instead of 576k? How many more days of DFZ BGP UPDATE
growth is that?

If you look just at normal situations…
12K vs 576K may not represent very much.

But if you look at the ARP request graphs during a significant topology change on a big IXP, and also at the CPU-per-process graphs, maybe what I’m suggesting becomes more explicit.

I’m talking about good boxes freezing because of that.
Of course CoPP exists to avoid that. But vanilla CoPP configurations combined with lunatic ARP timeouts cause many day-to-day problems…

So, in this case, the solution would be a BCP with some "MUST"s defining acceptable rates.

And with that, everyone who doesn’t like to be woken up at dawn will become happy (at least for this reason).

Especially given how some exchanges lock the MAC address of participants, you could probably get away with ARP timeouts of a day, or even just permanent entries with manual clearing when you see a peer go down.

-Paul

a) Check if there is anything hindering the evolution of this draft to
an RFC.

was i unclear?

the draft passed wglc in 1948. it is awaiting two
implementations, as is the wont of the idr wg.

randy

Hello

ARP timeout should be lower than MAC timeout, but usually the default is the other way around, which is extremely stupid. To those who do not know why, let me give a simple example:

Router R1 is connected to switch SW1 with a connection to server SRV: R1 <-> SW1 <-> SRV
Router R2 is connected to switch SW2 with a connection to server SRV: R2 <-> SW2 <-> SRV

The server is using R1 as default gateway. Traffic is arriving from the internet through R2 towards the server. The server will however send replies back through the default gateway at R1. This is a usual case with redundant routers - only one will be used as a default gateway but traffic may come from both.

Initially all will be good. But SW2 is only seeing unidirectional traffic from R2. No traffic goes from SRV to R2 and thus, after some time, SW2 will expire the MAC learning for SRV. This has the unfortunate result that SW2 will start flooding traffic to SRV out through all ports.

Then after more time has passed, R2 will renew the ARP binding by sending out an ARP query to SRV. The server will send back an ARP reply to R2. This packet from SRV to R2 will pass SW2 and thus have the effect of renewing the MAC binding at SW2 too. The flooding stops and all is well again. Until the MAC binding expires and the story repeats.

If the MAC timeout is 5 minutes and the ARP timeout is 20 minutes, which is very usual, you will have flooding for 15 minutes out of every 20-minute interval! Stupid!
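Baldur's numbers can be checked with a minimal model of his simplified timer description: the switch relearns SRV's MAC only when the router re-ARPs, so everything in the ARP interval beyond the MAC timeout is flooded:

```python
def flooding_per_cycle(mac_timeout_s: int, arp_timeout_s: int) -> tuple:
    """For purely unidirectional traffic (the R2 -> SRV case): the MAC
    entry expires mac_timeout_s into each ARP cycle and is relearned
    only at the next re-ARP, so the remainder of the cycle floods."""
    flooded = max(0, arp_timeout_s - mac_timeout_s)
    return flooded, arp_timeout_s

flooded, cycle = flooding_per_cycle(5 * 60, 20 * 60)
print(f"{flooded // 60} of every {cycle // 60} minutes flooded")  # -> 15 of every 20
```

With the timeouts reversed (ARP below MAC), the flooded window collapses to zero, which is the interoperability point made earlier in the thread.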

Why have vendors not fixed their defaults for this case?

Regards,

Baldur

Hi,

Douglas Fisher wrote:

B) Is there any other alternative to that?

Don't connect to IXPs with very, very large and complicated topologies. Connect to local IXPs where the design makes the kind of forwarding plane failure you describe less likely.

Andy