Route Suppression Problem

Unless useful to others, feel free to just reply off-list.

Background:

Tuesday (yesterday) morning around 1am, I got a phone call from one of my
transit customers (which seems more like a dream). I, sadly, didn't have the
router they are on logging to a server, so it's impossible for me to see
exactly what happened. Here's what I have. They received a minor spike in
traffic going to them. My router shows the last BGP peer reset about that
time, so this could be me sending the global table. His bandwidth then drops
to 0 for almost exactly 30 minutes (MRTG isn't an exact graph). My guess
(authoritative answer) was the customer flapped their routes once too many
times and was suppressed by both of my providers, as I seem to recall the
penalty heal rate is in 30 minute increments.

First issue is, am I right? If I am, then I need to develop ways to limit
the damage done to my customer. Is there a way to set up route suppression
just under what most people use so that I can have the client fix the problem
and then clear the suppress on my network to allow them to come back up
immediately just under the suppress threshold? Another possibility, although
I've not seen reference to it, since the customer only transits through my
network and depends on my redundancy, is it possible to hold his routes in
the tables and keep advertising them out unless they are down for a set time
period (i.e., ignore flaps, but drop them if he's down 15-30 minutes)?

I've never seen this issue. I was aware suppression was possible when I first
started learning BGP, and so I have never risked bouncing my peers more than
three times in a day, and at that point usually quit playing until the next
week. When my peers flap due to DDoS attacks, BGP never stabilizes fully or
my providers have protected my networks (though I haven't seen how 69.8/18
will react in this scenario which doesn't have a shorter prefix at the
peer).

My customer is thinking of multi-homing again after this. Of course, it
wouldn't have saved the customer. The reason they left multi-homing is that
their network is in the same building and they only have one BGP router. I
don't think multiple paths would have saved them.

Opinions? Suggestions? Options?

-Jack

~We now return you to the 69/8 threads

> traffic going to them. My router shows the last BGP peer reset about that
> time, so this could be me sending the global table. His bandwidth then drops
> to 0 for almost exactly 30 minutes (MRTG isn't an exact graph). My guess
> (authoritative answer) was the customer flapped their routes once too many
> times and was suppressed by both of my providers, as I seem to recall the
> penalty heal rate is in 30 minute increments.

Were there more flaps than just that last one before everything became
very quiet? A flap (up->down transition) has a penalty of 1000. By
default (if dampening is enabled), the dampen threshold is 2000. You
need at least three flaps to trigger dampening.
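
The arithmetic behind "at least three flaps" can be sketched as follows. The 1000-point flap penalty and the 2000 suppress threshold come from the paragraph above; the 15-minute half-life and the reuse threshold of 750 are assumptions (common vendor defaults, not something confirmed for Jack's upstreams), and the function names are made up for illustration. This is a rough model, not any router's actual implementation, and it ignores the maximum suppress time.

```python
import math

FLAP_PENALTY = 1000.0    # per up->down transition (from the message above)
SUPPRESS_LIMIT = 2000.0  # dampen threshold (from the message above)
REUSE_LIMIT = 750.0      # ASSUMPTION: common default, not from this thread
HALF_LIFE_MIN = 15.0     # ASSUMPTION: common default, not from this thread


def decay(penalty, minutes):
    """Exponentially decay a penalty over the given number of minutes."""
    return penalty * 0.5 ** (minutes / HALF_LIFE_MIN)


def penalty_after(flap_times):
    """Penalty right after the last flap; flap_times are sorted minutes."""
    penalty = 0.0
    last = None
    for t in flap_times:
        if last is not None:
            penalty = decay(penalty, t - last)
        penalty += FLAP_PENALTY
        last = t
    return penalty


def is_suppressed(flap_times):
    """A route is suppressed once its penalty exceeds the suppress limit."""
    return penalty_after(flap_times) > SUPPRESS_LIMIT


def minutes_until_reuse(penalty):
    """Minutes of quiet needed for a penalty to decay to the reuse limit."""
    if penalty <= REUSE_LIMIT:
        return 0.0
    return HALF_LIFE_MIN * math.log2(penalty / REUSE_LIMIT)


if __name__ == "__main__":
    print(is_suppressed([0.0, 1.0]))        # two flaps, one minute apart
    print(is_suppressed([0.0, 1.0, 2.0]))   # three flaps, one minute apart
    print(minutes_until_reuse(3000.0))
```

Because the penalty starts decaying immediately, two flaps can never push it above 2000, while three flaps in quick succession can. Under the assumed defaults, a penalty of 3000 needs two half-lives (30 minutes) to fall back to 750, which would at least be consistent with the outage described, but that is a guess.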

> First issue is, am I right? If I am, then I need to develop ways to limit
> the damage done to my customer.

Yell at your upstreams.

> Is there a way to set up route suppression
> just under what most people use so that I can have the client fix the problem
> and then clear the suppress on my network to allow them to come back up
> immediately just under the suppress threshold?

Dampening doesn't work on direct eBGP sessions: when the session is lost
the dampening info is removed from memory. So dampening your own
customers doesn't really do anything. For this reason, it seems curious
to me that both your upstreams use rather aggressive dampening. (See
RIPE-229 for some considerations on good dampening practices.)

> Opinions? Suggestions? Options?

If this happens again you can simply reset your sessions to your
upstreams (one at a time of course) to get rid of the dampening IN THE
NEXT HOP AS. However, if the trouble is further upstream this only makes
matters worse.

> traffic going to them. My router shows the last BGP peer reset about that

[...]

> I've not seen reference to it, since the customer only transits through my
> network and depends on my redundancy, is it possible to hold his routes in
> the tables and keep advertising them out unless they are down for a set time
> period (i.e., ignore flaps, but drop them if he's down 15-30 minutes)?

While perhaps not always an ideal solution, is it possible for the
customer to set default to you rather than having to use BGP? You
could in turn use static routing back to them for their netblock(s).

John

you might want to look at <http://psg.com/~randy/021028.zmao-nanog.pdf>.
then again, you may not. it's depressing.

randy

> You need at least three flaps to trigger dampening.

i guess you really need to look at that pdf.

randy

You are right, it is depressing. However, I don't see how the penalty
multiplication could happen here, you need a few hops in between for
that.

Iljitsch van Beijnum wrote:

> > You need at least three flaps to trigger dampening.

> i guess you really need to look at that pdf.

You are right, it is depressing. However, I don't see how the penalty
multiplication could happen here, you need a few hops in between for
that.

  Ah, but this is the Internet. Jack's two upstreams likely have direct
or indirect links between them where they will also receive the route
updates in question.
  Should we change the subject (back) to "BGP to doom us all"?

Peter E. Fry

(I believe I said that without swearing. $#&%*@!)

> You are right, it is depressing. However, I don't see how the penalty
> multiplication could happen here, you need a few hops in between for
> that.

> Ah, but this is the Internet. Jack's two upstreams likely have direct
> or indirect links between them where they will also receive the route
> updates in question.

Dampening is done on the eBGP router where the route enters the AS, and,
unless I'm mistaken, per route/path and not per prefix. So the flapping
that ISP A sees from ISP B is a completely separate thing from the
flapping that ISP A sees from its customer's customer as far as the
dampening algorithm is concerned.
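
To make the per-path claim concrete, here is a minimal sketch assuming the dampening state is keyed by (neighbor, prefix) rather than by prefix alone, as described above. The class name, the neighbor labels, and the documentation prefix 192.0.2.0/24 are all hypothetical, and decay is left out to keep the point visible.

```python
from collections import defaultdict

FLAP_PENALTY = 1000.0  # per flap, as quoted earlier in the thread


class DampeningTable:
    """Toy penalty table keyed per path, i.e. per (neighbor, prefix)."""

    def __init__(self):
        # (neighbor, prefix) -> accumulated penalty; decay is not modelled
        self.penalty = defaultdict(float)

    def record_flap(self, neighbor, prefix):
        """Add one flap penalty to the path learned from this neighbor."""
        self.penalty[(neighbor, prefix)] += FLAP_PENALTY
        return self.penalty[(neighbor, prefix)]


if __name__ == "__main__":
    table = DampeningTable()
    # The same prefix flaps once as seen via each of two neighbors.
    table.record_flap("customer", "192.0.2.0/24")
    table.record_flap("ISP-B", "192.0.2.0/24")
    for path, penalty in table.penalty.items():
        print(path, penalty)
```

With state kept this way, one flap seen via each neighbor leaves two paths at 1000 each instead of one prefix at 2000, so the penalties do not add up across neighbors.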

> Should we change the subject (back) to "BGP to doom us all"?

For all the criticism that BGP is subjected to, I find it curious that
nobody has proposed a replacement protocol (that I'm aware of).

"Better Algorithms" --

http://www.kotovnik.com/~avg/flap-rfc.txt
http://www.kotovnik.com/~avg/flap-rfc.ps

I didn't publish that one because I wanted to compare it with
penalty-based dampening on historical (pre-dampening) flap records, but
then got distracted by other projects. Preliminary data (from frequency
analysis) indicates that "unwarranted downtime" (defined as suppression
after the last flap prior to entering stable state) is reduced by a factor
of 3 to 4 compared with a penalty-based algorithm tuned to produce the same
post-dampening flap rate.

--vadim