with a flap flap here and a flap flap there...

We were testing route advertisement via a new box. Simple enough: stop
advertising some unimportant block from where it is normally advertised,
then start advertising it from the other box, in the same AS, etc.

So I withdraw the first advertisement.

15 minutes later, I am still waiting for it to completely disappear from
the net. My reference points are route-views.oregon-ix.net (multihop
route-only peering to over a dozen ASes), route-server.cerf.net and
nitrous.digex.net. At this point it has flapped up to 7 times on some
paths.

We are AS6171. Right now, we only advertise routes to AS852. They
advertise to 1691, 577 and 1239.

An example advertisement that we were seeing (which flapped enough that it
was dampened!) came through the path (from route-views):

  3333 1103 3300 7018 6478 1 701 814 7189 1691 852 6171, (suppressed due to dampening)
    193.0.0.244 (inaccessible) from 193.0.0.242
      Origin IGP, metric 20, localpref 100, valid, external
      Dampinfo: penalty 2729, flapped 5 times in 00:03:32, reuse in 00:27:50

The paths for some of the routes got even longer than this as time went
on before they finally stopped being advertised. When it was all done,
and the route was finally withdrawn everywhere I could see, an example
history entry was:

  3333 6905 5623 1136 3300 7018 6478 1 701 814 7189 1691 852 6171 (history entry)
    193.0.0.244 from 193.0.0.242
      Origin IGP, metric 20, localpref 100, external
      Dampinfo: penalty 2952, flapped 7 times in 00:12:09

Am I the only one that has a problem with this? You withdraw a
route once, and it has flapped enough to be dampened in numerous
places. I could have been half convinced that we had gone back to
distance vector routing seeing this going on...

Right after it was finally withdrawn everywhere, I started advertising
the route again from the other box and did not have reasonably
complete connectivity until 45 minutes after the start of this
whole thing due to dampening.
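
As a rough check on the "reuse in 00:27:50" figure in the dampened entry above, here is a minimal sketch of exponential penalty decay. The 15-minute half-life and the reuse limit of 750 are assumptions (common Cisco defaults); the actual settings on the router that produced that output are not known, and the function names are hypothetical.

```python
import math

# Assumed values (common Cisco defaults, not taken from the router above).
HALF_LIFE_MIN = 15.0   # penalty halves every 15 minutes
REUSE_LIMIT = 750.0    # route becomes usable again at or below this penalty

def minutes_until_reuse(penalty, half_life=HALF_LIFE_MIN, reuse=REUSE_LIMIT):
    """Minutes for an exponentially decaying penalty to fall to the reuse limit."""
    if penalty <= reuse:
        return 0.0
    return half_life * math.log2(penalty / reuse)

def penalty_after(penalty, minutes, half_life=HALF_LIFE_MIN):
    """Penalty remaining after `minutes` of decay with no further flaps."""
    return penalty * 0.5 ** (minutes / half_life)

if __name__ == "__main__":
    wait = minutes_until_reuse(2729)
    print(f"penalty 2729 -> reusable in about {wait:.1f} minutes")
```

Under those assumed defaults, a penalty of 2729 takes roughly 28 minutes to decay to the reuse limit, which is in line with the figure the router reported.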

In fact, the act of dampening it causes significantly more
flapping in some cases, although it certainly saves a lot in others.

While I'm not a complete moron and understand the basic concept
behind why this happens, I'm not really involved with Internet
backbone engineering or operations right now and I wasn't aware
that the Internet had grown enough to have this effect become so
significant...

Are there some networks around with very high advertisement intervals
that are making this effect more pronounced, or is the Internet just
that big now, or is something else going on?

Comments?

Marc Slemko <marcs@znep.com> writes:

> 3333 6905 5623 1136 3300 7018 6478 1 701 814 7189 1691 852 6171 (history entry)
> Am I the only one that has a problem with this? You withdraw a
> route once, and it has flapped enough to be dampened in numerous
> places. I could have been half convinced that we had gone back to
> distance vector routing seeing this going on...

BGP *is* a distance-vector routing protocol.

What you were witnessing was a counting-to-infinity
problem that is bounded by timers and topological constraints.

Welcome to the wonderful world of BGP == RIP.

This is precisely why a mechanism which constrains
the amount of prefix transitions is so important:
magnification of flapping is predictable.

However, welcome to the wonderful world of lack of
filtering, too. There is no apparent reason why a number
of those ASes should have reannounced your routes to each
other. Each BGP peering (customer and peer alike) should
either explicitly allow announcements only from a list of
appropriate ASes, or explicitly disallow announcements
which have been through peers and (at least large) other
BGP-talking customers. This is an Incredibly Smart Thing To Do.
On a Cisco, the _ operator is your best regexp friend...
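
To illustrate why the `_` operator matters in AS-path filters, here is a small sketch that emulates it in Python. It assumes `_` matches the start or end of the path or a delimiter (space, comma, brace, parenthesis), and it handles no other Cisco-specific regexp behaviour. The helper names are hypothetical; the sample path is the dampened one quoted earlier in this thread.

```python
import re

def cisco_to_python(pattern):
    # Assumption: '_' is the only Cisco-specific token translated here.
    return pattern.replace("_", r"(?:^|$|[ ,{}()])")

def path_matches(pattern, as_path):
    """True if the AS path (space-separated AS numbers) matches the pattern."""
    return re.search(cisco_to_python(pattern), as_path) is not None

PATH = "3333 1103 3300 7018 6478 1 701 814 7189 1691 852 6171"

if __name__ == "__main__":
    # Without '_' the pattern also hits AS 7018; with it, only AS 701 matches.
    print(path_matches("701", "3300 7018 6478"))    # substring match on 7018
    print(path_matches("_701_", "3300 7018 6478"))  # whole-AS match only
    print(path_matches("_701_", PATH))
```

A filter built this way can test whether an announcement has passed through a particular AS without accidentally matching a longer AS number that merely contains the same digits.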

  Sean.

Marc Slemko <marcs@znep.com> writes:

> > 3333 6905 5623 1136 3300 7018 6478 1 701 814 7189 1691 852 6171 (history entry)
> > Am I the only one that has a problem with this? You withdraw a
> > route once, and it has flapped enough to be dampened in numerous
> > places. I could have been half convinced that we had gone back to
> > distance vector routing seeing this going on...
>
> BGP *is* a distance-vector routing protocol.
>
> What you were witnessing was a counting-to-infinity
> problem that is bounded by timers and topological constraints.

But the loop avoidance from having paths constrains you to the width of
the Internet in terms of ASes. It isn't completely a distance vector
protocol.

[...]

> However, welcome to the wonderful world of lack of
> filtering, too. There is no apparent reason why a number
> of those ASes should have reannounced your routes to each
> other. Each BGP peering (customer and peer alike) should
> either explicitly allow announcements only from a list of
> appropriate ASes, or explicitly disallow announcements
> which have been through peers and (at least large) other
> BGP-talking customers. This is an Incredibly Smart Thing To Do.
> On a Cisco, the _ operator is your best regexp friend...

I guess I'm just surprised at how wide the Internet has grown and at the
lack of noticeable public acknowledgment of the resulting problems.

Marc Slemko <marcs@znep.com> writes:

> But the loop avoidance from having paths constrains you to the width of
> the Internet in terms of ASes. It isn't completely a distance vector
> protocol.

Yakov Rekhter also made a similar observation.
I was caught up in rhetoric and was imprecise.
(I could also be wrong; it happens.)

BGP records paths and the loop avoidance scheme prevents
a count-to-infinity problem in a way slightly better than
RIP with split horizons does.

In a network like this:

              C
             /|
        A--B< |
             \|
              D

if A-B goes down in a split-horizons RIP network, there
can still be a count-to-infinity problem between C and D
announcing reachability to A.

BGP does not have this problem.

However, a withdrawal of a network from D can
cause A to see a transition from ABD to ABCD
to unreachable. This effect is what was complained
about in the message I initially followed up to.
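
The ABD to ABCD to unreachable transition can be reproduced with a toy path-vector simulation of the four-node diagram above. This is only a sketch under simplifying assumptions: shortest AS path wins, updates are delivered in FIFO order, and there are no timers or policy. It is not a model of any real BGP implementation, and all names in it are hypothetical.

```python
from collections import deque

# Topology from the diagram above: A-B, B-C, B-D, C-D.
LINKS = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B", "D"], "D": ["B", "C"]}

def observe_paths(origin="D", observer="A"):
    """Announce then withdraw a prefix at `origin`; return what `observer` sees."""
    heard = {n: {} for n in LINKS}    # heard[n][peer] = path last sent by peer
    best = {n: None for n in LINKS}   # best[n] = chosen AS path, own AS first
    queue = deque()                   # FIFO of (sender, receiver, path or None)
    seen = []                         # every best-path change at the observer

    def flood(node):
        for peer in LINKS[node]:
            queue.append((node, peer, best[node]))

    def run():
        while queue:
            sender, node, path = queue.popleft()
            # Loop avoidance: a path that already contains our AS is unusable.
            heard[node][sender] = None if path is None or node in path else path
            if node == origin:
                continue  # the origin never selects a learned path
            options = [(node,) + p for p in heard[node].values() if p]
            new = min(options, key=lambda p: (len(p), p)) if options else None
            if new != best[node]:
                best[node] = new
                if node == observer:
                    seen.append(" ".join(new) if new else "unreachable")
                flood(node)

    best[origin] = (origin,)   # announce
    flood(origin)
    run()
    best[origin] = None        # withdraw
    flood(origin)
    run()
    return seen

if __name__ == "__main__":
    print(observe_paths())
```

In this sketch B briefly falls back to the stale path it learned from C before C's own withdrawal arrives, so one withdrawal at D shows up at A as two transitions. The recorded path keeps this from looping forever, but it does not stop the intermediate announcement.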

BGP is also not formally a distance vector protocol
because the AS_PATH attribute is a trail of breadcrumbs
rather than a distance. However, I argue that in common
practice, it is a distance, and offer up AS-path
prepending to affect path selection remotely as evidence.

> I guess I'm just surprised at how wide the Internet has grown and at the
> lack of noticeable public acknowledgment of the resulting problems.

I'm not so surprised by the first bit, and I think
I have become jaded about the second.

  Sean.