rfd

Bandy_Rush1 · December 18, 2018, 5:40pm

do you have rfd on? with what parms?

randy

Job_Snijders3 · December 18, 2018, 5:43pm

I assume rfd in this context means "Route Flap Dampening".

NTT / AS 2914 does *not* have Route Flap Dampening configured, as is
documented here
https://us.ntt.net/support/policy/routing.cfm#routedampening

Kind regards,

Job

Andrew_Latham · December 18, 2018, 5:45pm

Route Flap Damping via https://tools.ietf.org/html/rfc2439 for everyone.

Mark_Tinka1 · December 18, 2018, 6:45pm

We don't do it (SEACOM, AS37100).

Mark.

Saku_Ytti1 · December 18, 2018, 7:00pm

I always wondered why it has to be so binary.

I don't want to decide for my customers if partial visibility is
better than busy CPU, but I do appreciate stability. Why can't we have
a local-pref penalty for a flapping route? If it's the only option, keep
offering it; if there are other, more stable options, offer those.

Jared_Mauch · December 18, 2018, 7:01pm

Similarly 20940 does not use it. I find it hard to see a case where we would turn it on.

- jared

Naslund_Steve · December 18, 2018, 8:55pm

Mainly because propagating a flapping route across the entire Internet is damaging to the performance of things other than your own equipment and that of your customer. It is just "bad manners" to propagate a flapping route to your peers, and suppressing it helps maintain the minimum level of stability that is required to keep you "on the Internet". Imagine a table where 1000s of providers are each sending 100s of unstable routes, and that those unstable routes might be redistributed into various IGPs that may not respond very gracefully to rapid table changes (like most distance vector IGPs). Also think of this scenario: your link to your customer might be flapping, but that same customer might have other carriers advertising the same address space over a stable link. In that case you would be doing a disservice by not withdrawing that route, and having a local-pref does not help since you don't necessarily have visibility into all of your customer's other carrier networks.

You do have the ability to clear the RFD timers for a route if you need to manually intervene for example when you know for a fact that you fixed the problem. That means that if no one is watching or intervening the network will "do the safe thing".
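
For readers who have not tuned RFD before, here is a rough sketch of the penalty model that RFC 2439 describes: each flap adds a penalty, the penalty decays exponentially with a half-life, the route is suppressed above one threshold and reused below another. The class name `FlapDamper` and all parameter values are illustrative assumptions, not a recommendation and not any vendor's implementation. The `clear` method stands in for the manual "clear the timers" intervention mentioned above.

```python
class FlapDamper:
    """Toy RFC 2439-style penalty tracker for a single route (illustrative only)."""

    def __init__(self, half_life=900.0, suppress=2000.0, reuse=750.0,
                 per_flap=1000.0, ceiling=12000.0):
        # All values are assumptions chosen for the example (seconds / penalty units).
        self.half_life = half_life
        self.suppress = suppress
        self.reuse = reuse
        self.per_flap = per_flap
        self.ceiling = ceiling
        self.penalty = 0.0
        self.updated = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Exponential decay: the penalty halves every `half_life` seconds.
        elapsed = now - self.updated
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.updated = now

    def flap(self, now):
        # A flap adds a fixed penalty on top of whatever has not decayed yet.
        self._decay(now)
        self.penalty = min(self.penalty + self.per_flap, self.ceiling)
        if self.penalty >= self.suppress:
            self.suppressed = True

    def is_suppressed(self, now):
        # A suppressed route is reused only once the penalty drops below `reuse`.
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False
        return self.suppressed

    def clear(self):
        # Manual intervention: forget the flap history for this route.
        self.penalty = 0.0
        self.suppressed = False


damper = FlapDamper()
for t in (0, 10, 20):          # three flaps within 20 seconds
    damper.flap(t)
print(damper.is_suppressed(20))
```

With these example values two quick flaps stay just under the suppress threshold and the third crosses it; the route then stays suppressed for roughly half an hour unless someone calls `clear`.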

Steven Naslund
Chicago IL

Job_Snijders3 · December 18, 2018, 8:59pm

Hi Steve,

Lowering the LP would achieve the outcome you desire, provided there are (stable) alternative paths.

What you advocate results in absolute outages in what may already be precarious situations (natural disasters?) - what Saku Ytti suggests seems like a less painful alternative with desirable properties.

Kind regards,

Job

Naslund_Steve · December 18, 2018, 9:24pm

Remember always that the local pref is just that, YOUR local preference. Sending that flapping route upstream does not give your peer the option to ignore it. In any case, the downside is that you have to process that route and then choose whether or not to use it. It’s like saying “now that you have processed this unstable route and burned your CPU cycles, I am now giving you the option not to install it into your table”. Remember also that we are only talking about default behavior here. You always have the option to override it by changing timers or penalties, or by shutting down RFD altogether. We are only talking about day-to-day operation here.

Also, keep in mind that when we are talking about alternative stable paths we are only talking about what your network sees, not the entire Internet. If you as a service provider are experiencing major issues, you may see a route to me as stable or unstable, but making global routing decisions based on that is not sound. What might be best for your customer or your business might not be best for the Internet community as a whole. It is a matter of scale: how many service providers can allow how many unstable routes before the entire network becomes regionally or globally unstable? It’s important to remember that flapping routes leave a certain amount of data in flight with no destination, which is detrimental to overall performance. As we move into a V6 world we are again worried about the size of the global routing tables and pushing routing performance. Instability of routes is dangerous to systems running near their limits. Propagating a known unstable route would be a major shift in routing policy. Today, you either say you can reach something or you don’t say anything. Using the suggested alternative adds the option of “I might be able to reach this but not reliably”, which then brings about metrics of “how reliably?”, and that is a huge shift in how global routing works. We have been struggling with a backbone routing protocol that does not really do a good job of understanding bandwidth and multiple paths, so I would suggest that adding “maybe” routes is not a good idea.

At least using RFD you can explain to your customer why they are not reachable rather than explaining how you made a manual decision to dump them for the “good of the Internet”. There is also a business penalty to the service provider that exposes instability to the network. People don’t want to peer with or send traffic through unstable network regions.

Steve

Mark_Tinka1 · December 18, 2018, 9:32pm

What would really be of interest to me would be for those that run RFD to measure its impact on their network (positive or otherwise) so we have something scientific to base this on.

The theory (and practice of old) tells us that RFD is either very good, or very bad. There are probably more folk that have turned it off than run it, or vice versa. Ultimately, if we can get the state of RFD’s performance in 2018 on an axis, our words will likely carry more weight.

Mark.

_Job_Snijders · December 18, 2018, 9:54pm

Dear Steve,

No worries, I have not forgotten the transitive properties of the
LOCAL_PREF BGP Path Attribute! You are right that any LOCAL_PREF
modifications (and the attribute itself), are local to the Autonomous
System in which they were set, but the effects of such settings can
percolate further into the routing system.

A great example is the "BGP Graceful Shutdown" mechanism (science
partially documented in https://tools.ietf.org/html/rfc6198, actual
specification here https://tools.ietf.org/html/rfc8326). What is
interesting is that by considering a path (any path, could be flapping)
less preferable, my network will propagate alternative paths to my
neighboring networks, or possibly even *withdraw* my announcement in
favor of alternative (stable?) paths via competitors.

By attaching a lower LOCAL_PREF value to a given path for a period of
time as a 'penalty' for flapping, I suspect the visibility of that
flapping will be greatly reduced. This of course doesn't hold true when
the only origin of the path is flapping, but in many flapping cases I
triaged it was clear that only one out of many links was the root of the
flapping.
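
To make the difference between the two penalty types concrete, here is a small sketch of best-path selection under "suppress" versus a LOCAL_PREF penalty. It is a deliberately simplified model (highest LOCAL_PREF, then shortest AS path), and the names `Path`, `FLAP_LP_PENALTY` and `best_path` are hypothetical, made up for this illustration rather than taken from any router implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

FLAP_LP_PENALTY = 50  # hypothetical amount subtracted from LOCAL_PREF while flapping


@dataclass
class Path:
    neighbor: str
    local_pref: int
    as_path_len: int
    flapping: bool = False


def effective_local_pref(path: Path) -> int:
    # Penalty type "local-pref": the path stays usable but becomes less preferred.
    return path.local_pref - FLAP_LP_PENALTY if path.flapping else path.local_pref


def best_path(paths: List[Path], suppress_flapping: bool = False) -> Optional[Path]:
    # Penalty type "suppress": flapping paths are removed from consideration.
    candidates = [p for p in paths if not (suppress_flapping and p.flapping)]
    if not candidates:
        return None
    # Simplified decision process: highest LOCAL_PREF, then shortest AS path.
    return max(candidates, key=lambda p: (effective_local_pref(p), -p.as_path_len))


flappy = Path("customer-link-a", local_pref=200, as_path_len=1, flapping=True)
stable = Path("peer-b", local_pref=180, as_path_len=3)

# With a stable alternative, both penalty types end up on the stable path.
print(best_path([flappy, stable]).neighbor)
print(best_path([flappy, stable], suppress_flapping=True).neighbor)

# With no alternative, suppress yields an outage; the LOCAL_PREF penalty does not.
print(best_path([flappy], suppress_flapping=True))
print(best_path([flappy]).neighbor)
```

The last two calls are the case I care about: when the flapping path is the only path, lowering its preference keeps the destination reachable, while suppression does not.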

I'm not sure I share your concerns about scale, it appears that so far
we seem to be doing just fine without "route flap dampening, penalty
type: suppress". No customers ask for it, in fact many are relieved we
don't use it. None of our peering partners ask for it either. When we
see oscillating paths we reach out to the offending party and ask them
to fix it, or take unilateral action within a specific time frame.

Kind regards,

Job

Naslund_Steve · December 18, 2018, 10:01pm

I think you will find that very hard to evaluate since the value of RFD will be different in different network regions. For example, it is probably good practice to run RFD toward a customer on an unstable access link. It might not be a good idea to run it on a major backbone link that could possibly flap a large number of times in a very short period due to something like a maintenance activity. Also, areas that are largely on fiber infrastructure will see RFD in a much different light than areas on largely wireless infrastructure that might be subject to momentary interference or interruptions. I think it is safest to say that RFD needs to be evaluated and tuned for what you want it to do. Penalties are never a pleasant thing but they prevent lawlessness. That is exactly what RFD does. You are the cop that decides how to enforce the laws.

In fact in my experience people could also get much better network performance overall by properly tuning BGP timers but very few actually do it. I bet you could improve the Internet stability way more by doing that.

Steven Naslund

Chicago IL

Naslund_Steve · December 18, 2018, 10:10pm

I will grant you that no customer ever asked for route dampening. I also realize that RFD is much less important now than in the past. I come from the ARPANET/DDN ages of the Internet and can tell you that RFD was absolutely critical in the days of very underpowered routers and very unstable data links. I remember when it was quite hard to maintain a 64k link to some locations at all. There might be less of a need for such a simple RFD but it did serve its purpose. In fact, my main argument on this whole topic is that RFD is not relevant enough to waste a lot of effort on a globally accepted mechanism. It is just not the low-hanging fruit of routing performance improvements. I see two major improvements to global routing: congestion avoidance (which goes a little bit with bandwidth awareness but not exactly) and multipath load balancing (which kind of requires a congestion avoidance awareness). Both of these are going to be extremely difficult issues on a global scale of adoption, but that's what is needed.

Steven Naslund
Chicago IL

stillwaxin · December 19, 2018, 5:29am

In general I agree with the idea here, but I would also be interested in the possibility of running the local route policy engine against routes that are locally detected to meet a damping condition (user-configurable, of course). This would potentially yield the ability to change local_pref as well as other attributes that may be useful, such as MED/metric (which can be transitive) and/or communities.
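
As a rough sketch of what I mean, assuming the damping logic exposes a per-route penalty and a configurable threshold: instead of suppressing, the router would hand the route to an ordinary policy function. Everything here (`Route`, `demote_flapping`, `apply_damping_policy`, the community value `65000:999`, the numbers) is hypothetical and only meant to show the shape of the idea.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Route:
    prefix: str
    local_pref: int = 100
    med: int = 0
    communities: FrozenSet[str] = frozenset()


Policy = Callable[[Route], Route]


def demote_flapping(route: Route) -> Route:
    # Example policy: make the route less attractive and tag it,
    # rather than dropping it. All values are made up for illustration.
    return replace(
        route,
        local_pref=route.local_pref - 50,
        med=route.med + 100,
        communities=route.communities | {"65000:999"},
    )


def apply_damping_policy(route: Route, penalty: float,
                         threshold: float, policy: Policy) -> Route:
    # Run the user-supplied policy only when the damping condition is met.
    return policy(route) if penalty >= threshold else route


route = Route("192.0.2.0/24")
print(apply_damping_policy(route, penalty=2500, threshold=2000, policy=demote_flapping))
print(apply_damping_policy(route, penalty=500, threshold=2000, policy=demote_flapping))
```

The point of passing the policy in as a function is that the operator, not the damping code, decides what "penalized" means for a given session.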

adamv0025 · December 27, 2018, 4:33pm

If I remember correctly, the industry has gone back and forth on this
several times now.
First it was deemed good, then some studies came out proving the penalty is
worse than the crime. A couple of years later another study came out
suggesting that if correct parameters are used it should be alright, but I
guess by that time no one could have cared less, having already switched it
on and off and on again...

With regard to the comments made here on the number of unstable routes until
the whole system or significant parts of it collapse, I could easily reverse
that argument and ask how many badly configured rfd deployments it takes
until the whole system shuts/dampens itself down... (positive vs negative
feedback loop) I guess the ideal solution is somewhere in between.

Personally I think rfd is just the aspirin, i.e. not treating the cause but
merely helping with the headaches.
And I suspect that Interface State Dampening would address 80% of the
route-flaps out there (it works exactly like rfd but treats the cause),
with the remainder being true protocol flaps, caused either by
misconfiguration of the max prefix limit (sessions should stay down) or by
BGP error handling, which again can be solved by the enhanced BGP error
handling, or by genuine bugs.

adam