RE: The Gorgon's Knot. Was: Re: Verio Peering Question

Sean_M_Doran · September 28, 2001, 10:44pm

> But, we all do, or we aren't talking BGP. The requirements here are not that
> large. A Cisco 2651 with 128 MB is a valid BGP speaker, these days. That's a
> cheap router, indeed. And, router memory is dirt cheap.

BGP is based on TCP and thus has the fun property that a big
set of changes will pile up in front of a connection to a peer
that is slow at processing inbound announcements & withdrawals.
The slower you are at processing updates, the more likely you
are to be out of sync with reality in such a way that you will
begin to notice that you are forwarding some packets the wrong
direction into loops or black holes. The slower you are, the
greater the backlog you have to chug through to catch up,
making you busier for longer periods, which in turn leads to
greater backlogs. Slow down too much and the other side
will help you out by resetting the session.
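A toy sketch of the backlog effect described above, assuming a fixed number of updates processed per tick and a made-up arrival pattern. The function name and all numbers are hypothetical and are not taken from any real BGP implementation; the sketch also leaves out the session reset and the way a busy router slows down further.

```python
def simulate_backlog(arrivals, service_rate):
    """Return the queue depth after each tick for a peer that can
    process at most `service_rate` updates per tick."""
    backlog = 0
    history = []
    for arriving in arrivals:
        # Whatever cannot be processed this tick waits in the TCP queue.
        backlog = max(0, backlog + arriving - service_rate)
        history.append(backlog)
    return history


# Hypothetical load: a burst of 500 updates per tick for 3 ticks,
# followed by a steady trickle of 50 updates per tick.
arrivals = [500, 500, 500] + [50] * 7

fast = simulate_backlog(arrivals, service_rate=200)
slow = simulate_backlog(arrivals, service_rate=60)

print("fast peer:", fast)
print("slow peer:", slow)
```

With these made-up numbers the fast peer drains its queue within the ten ticks shown, while the slow peer is still more than a thousand updates behind, and therefore still forwarding on stale information.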

We've seen this in the past - it's caused MASSIVE outages
affecting nearly EVERYONE for hours at a time.

Or you can say "smd is protecting his own personal interests"
and carry on arguing the equivalent of "ANYBODY can build
a modern router using a sufficient amount of ROM" which simply
underlines the point that dynamic global routing is an expensive
luxury that many people have gotten used to.

> The common good is
> promoted by allowing these folks to multihome, which would be effectively
> prohibited if all networks implemented Verio-style filter policies.

Think of it as a catalyst for more experimentation with alternative
ways of multihoming without the use of BGP. There are several which
exist now, and several which are being discussed in multi6 which
could be made to exist now without universal software changes.
Some brainstorming could result in several other approaches, more
or less generalized, but what's the point when the normal cheap-seeming
thing to do is to announce CIDR holes to the world?

> The number of folks who multihome is large and growing. We should support
> this by promoting relatively open filtering policies and allowing /24s to be
> truly, globally routable.

I think we should encourage people to introduce individual /32s
into the network and flap them around a bit, to force some issues
which have been avoided because first Sprint and then Verio have
been willing to take a bunch of negative PR in the act of self-protection
(which has the side-effect of protecting a lot of people who generate
the negative PR, and everyone else).

Sean.

E.B_Dreger · September 28, 2001, 11:17pm

Date: Fri, 28 Sep 2001 15:44:38 -0700 (PDT)
From: Sean M. Doran <smd@clock.org>

[ snip ]

> I think we should encourage people to introduce individual /32s
> into the network and flap them around a bit, to force some

I've not seen anyone suggest allowing longer than /24 in this
thread. However, I'll definitely admit that, with name-based
hosting, some webhosts most certainly could want to announce long
prefixes.

> issues which have been avoided because first Sprint and then
> Verio have been willing to take a bunch of negative PR in the
> act of self-protection (which has the side-effect of protecting

So allow le 24 at the border. Allow le <whatever> internally,
and tag so it doesn't redistribute. Apply appropriate dampening.
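A minimal sketch of the "le 24 at the border, longer prefixes internal only" policy, assuming Python's standard `ipaddress` module stands in for a router's prefix filter. The function name, the two labels, and the example prefixes (documentation address space) are hypothetical; dampening is not modelled here.

```python
import ipaddress

MAX_BORDER_LEN = 24  # "le 24": accept prefix lengths up to and including /24


def classify(prefix):
    """Return 'global' if the prefix may cross the border, or
    'internal-only' if it is longer than /24 and should be tagged
    so that it is not redistributed to external peers."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen <= MAX_BORDER_LEN:
        return "global"
    return "internal-only"


for candidate in ["198.51.100.0/24", "192.0.2.128/25", "192.0.2.1/32"]:
    print(candidate, "->", classify(candidate))
```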

> a lot of people who generate the negative PR, and everyone
> else).

I guess that someone who never hears a route is certainly safe
from flappage.

I guess that we can:

1. Continue arguing over right/wrong (nanog-l as a whole; I'm not
_quite_ crazy enough to try taking on Sean publicly *grin*)

2. See which approach works in the long run (the network that
dies with the most money wins)

3. Establish guidelines on what is "acceptable" table size, CPU
utilization, etc., and then decide how to get there.

Consider that, with providers being pushed to use name-based
hosting, NAT, etc., it's very desirable to "basement multihome".

<conspiracy_theory>
Are big providers so desperate for business that they want to
prevent customers from multihoming, attempting to be the sole
vendor of bandwidth?
</conspiracy_theory>

All that said, I _do_ favor IP allocation based on region. Say
I connect to KSCYMO, which connects to CHCGIL or DLLSTX... IP
allocation would be from a sub-ARIN entity in one of those
regions. Make space portable between providers...

Wait a second. All of this sounds vaguely familiar...

Eddy

Valdis_Kletnieks · September 29, 2001, 1:51am

Oh, that one's EASY.

The global routing table is hereby capped at 125K routes. After that,
if you want a route, you have to pay somebody to give up theirs.

Problem solved

This will have some advantages - it will make companies that want to multi-home
calculate the actual benefit of doing so ("we should multihome" becomes "it would
cost an estimated $nnK a year in downtime/unreachability/lost sales") so they
know how much they want to bid for a routing table entry. For many companies,
it may not actually make as much business sense to multihome as they thought.

ISPs will have a new thing to market - premium services to enhance reliability
and uptime without a route announcement (more aggressive marketing of multihoming
to 2 POPs of the same ISP for a discount off the normal price for 2 pipes?)

In the dot-bombed crash, a large number of companies will probably be willing
to sell off their route for a quick infusion of cash.

Route squatters will probably not be as big an issue as domain squatters.

Disadvantages? ARIN and company are unpopular enough without acting as
a commodity trade market for buying and selling routes.

And the SEC will of course be on the lookout for insider trading in route futures -
expect investigations the first time somebody shorts on a future.

It would be a strange new world - but at least the routing table wouldn't be
growing.

/Valdis

ianai · September 29, 2001, 2:09am

There are many prefixes which are demonstrably useless in the global table (e.g. most of 7046). If you are really interested in reducing the size of the table, why not go after those? Instant gratification and all that.

But I thought the problem is not *size* of the table, but *changes* in the table. With zero updates, a BGP table of double the current size would not bother any core router on the 'Net today.

So, let's work on things like progressive flap dampening. I believe you suggested this before, Sean?

Sean_Donelan · September 29, 2001, 2:12am

> The global routing table is hereby capped at 125K routes. After that,
> if you want a route, you have to pay somebody to give up theirs.

Of course, we could adopt geographic allocations. North America is
still working on (+1) in e.164 space. We could shrink the global
route table to a few thousand routes.

Sean_Donelan · September 29, 2001, 2:40am

> But I thought the problem is not *size* of the table, but *changes* in the
> table. With zero updates, a BGP table of double the current size would not
> bother any core router on the 'Net today.

The global telephone table takes months to change. Would that be slow
enough?

Do networks worldwide really need to know every time a router in East
Nowhere reboots?

> So, let's work on things like progressive flap dampening. I believe you
> suggested this before, Sean?

Flap dampening measured by the symptom is fine. The more you flap, the
more you are dampened. But as at least one group of academics found, the
length of the prefix is not related to its propensity to flap. Proposing
that a /24 be dampened for a longer period than a /8, for the same number
of flaps, is misguided.
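A rough sketch of dampening "measured by the symptom": each flap adds a penalty, the penalty decays exponentially, and the route is suppressed while the penalty is high. The class name and the four constants are illustrative assumptions for this sketch, not values stated in this thread. Note that the prefix length is not an input anywhere, so a /24 and a /8 with the same flap history are treated identically.

```python
PENALTY_PER_FLAP = 1000.0   # assumed penalty added per flap
SUPPRESS_LIMIT = 2000.0     # assumed: suppress at or above this penalty
REUSE_LIMIT = 750.0         # assumed: unsuppress once penalty decays below this
HALF_LIFE_MIN = 15.0        # assumed half-life of the penalty, in minutes


class DampenedRoute:
    """Tracks the flap penalty for one prefix. Times are in minutes and
    must be passed in non-decreasing order."""

    def __init__(self):
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay_to(self, now):
        # Exponential decay: the penalty halves every HALF_LIFE_MIN minutes.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / HALF_LIFE_MIN)
        self.last_update = now
        if self.suppressed and self.penalty < REUSE_LIMIT:
            self.suppressed = False

    def flap(self, now):
        self._decay_to(now)
        self.penalty += PENALTY_PER_FLAP
        if self.penalty >= SUPPRESS_LIMIT:
            self.suppressed = True

    def is_suppressed(self, now):
        self._decay_to(now)
        return self.suppressed


route = DampenedRoute()
for minute in (0, 1, 2):          # three flaps in quick succession
    route.flap(minute)
print(route.is_suppressed(3))     # shortly after the third flap
print(route.is_suppressed(40))    # well after the flapping has stopped
```

The more a route flaps, the higher the penalty climbs and the longer it takes to decay back under the reuse limit, which is the progressive behaviour discussed above.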

Alex_Bligh1 · September 29, 2001, 10:09am

We have this at a continental level. At less than a continental
level the argument against this is that at lower distances there
is a poorer and poorer map between geographic proximity and
(network) topological proximity. Pick any major US city without
a popular peering point / private peering for a trivial example.

Alex Bligh
Personal Capacity

Iljitsch_van_Beijnum · October 3, 2001, 9:02am

> > I think we should encourage people to introduce individual /32s
> > into the network and flap them around a bit, to force some
>
> I've not seen anyone suggest allowing longer than /24 in this
> thread. However, I'll definitely admit that, with name-based
> hosting, some webhosts most certainly could want to announce long
> prefixes.

Filtering on prefix size is a pretty absurd idea. A network is not
automatically unimportant because it has few addresses. A.ROOT-SERVERS.NET
has a single address and www.cnn.com several within something that could
be a /25 and a /27. I sure want to be able to reach those as efficiently
and reliably as possible. Why should they announce 4000 extra unused
addresses just to avoid filtering?

On the other hand, filterers do have a point: why are there so many /24s
in the global routing table? But then again, this also happens to a lesser
degree for larger blocks. Have a look at the 24.x.y.z space, this is
pretty ridiculous.

Obviously, some networks don't care about the size of the routing table
and announce hundreds of routes. Other networks do, and filter the easy
targets. (And some networks manage to fall into both categories.) The
result being that a group that didn't cause the problem suffers and the
problem is not really solved.

> 3. Establish guidelines on what is "acceptable" table size, CPU
> utilization, etc., and then decide how to get there.

I don't think this is going to happen. Even if we can agree on these
things _today_, everybody has a different view of what is going to happen
in the future and how we should prepare for that.

> <conspiracy_theory>
> Are big providers so desperate for business that they want to
> prevent customers from multihoming, attempting to be the sole
> vendor of bandwidth?
> </conspiracy_theory>

I don't think they are actively doing this, but if filtering is "sound
engineering" and it happens to make life harder for a lot of those
annoying small competitors, well, they can't help that, can they?

> All that said, I _do_ favor IP allocation based on region. Say
> I connect to KSCYMO, which connects to CHCGIL or DLLSTX... IP
> allocation would be from a sub-ARIN entity in one of those
> regions. Make space portable between providers...
>
> Wait a second. All of this sounds vaguely familiar...

The problem with this and many other good ideas is that they can only work
well if they are widely adopted. And there are always people who have
reasons (legitimate or otherwise) why they want another solution or keep
things as they are.

But I agree that some form of regional filtering and/or addressing could
be beneficial. I live 30 miles from a major interconnect point. I would
rather have 30k prefixes up to /24 or even larger that are reachable over
this exchange point in my routing table and have a default for the rest
of the world, than run full routing but only for RIR assigned blocks. But
then, I buy transit so I don't have to be defaultless. But a defaultless
network could accept large prefixes at exchange points but keep them local
and only propagate RIR block filtered routes throughout the network. This
would work better if routes were colored with information about their
origin region, though.

Even better would be if the RIRs would divvy up the world in 10 - 20
regions, and allocate a /8 - /10 to each. That way, the routers don't have
to know all individual routes to some remote region, but they can
simply forward the traffic to a part of the network that does know the
region-specific routes.

If anybody bothers to reply to this, you will see that there are numerous
reasons why this isn't "the" solution. However, it may help some people
some of the time, and it doesn't impact those who don't want to use it.
And, more importantly: it doesn't require universal cooperation. Just the
RIR's.

Iljitsch van Beijnum

E.B_Dreger · October 3, 2001, 2:42pm

Date: Wed, 3 Oct 2001 11:02:44 +0200 (CEST)
From: Iljitsch van Beijnum <iljitsch@muada.com>

> Filtering on prefix size is a pretty absurd idea. A network is

[ snip anecdotes: a.root-servers.net & www.cnn.com ]

And eBay's /24, /23, and /22 blocks; see one of my earlier posts.
Yes, I agree... I probably should reverse myself: filtering > /24
is _not_ acceptable. My reasoning was that, if some idiot
announces each dialup /32, their upstream would want to use
filters. Prefixes shorter than /24 could quickly chew up 100K
routes (I'm too lazy to do the math, but anyone following this
thread is more than capable), but I'd consider the probability to
be rather low.

Providers are pretty good about distribute-list and filter-list
checks... if you want to advert more than space they provide, you
contact them out-of-band. Maybe we do something similar re
prefix length.

Sure, I'd love to move the route count problem to the edge. But
the whole reason that places filter is because they think that
the edge _isn't_ doing a good enough job. Perhaps we should
enforce prefix length adverts top-down, in the same manner that
IP space utilization is enforced?

Now, Verio's policy wouldn't be so bad if it correlated with
something official from RIRs. e.g.:

* Globally-routable /32 in 126/8
* Globally-routable /27-/30 in 125/8
* ... /24-/26 in 124/8

My complaint is that Verio is taking rather great liberties in
their filtering that, IMHO, _do not_ correlate well with
allocations. Good idea in an ideal world, but they need to
operate in reality.

If we can change reality... all the power to them. I'll turn
totally pro-filtering if we can accurately say "this is the
shortest globally-routable prefix allowed in _this_ netblock".

Wait a second... swamp /24s haven't all been returned yet. No,
the above paragraph just won't work. We could require
justification of existing netblocks... no, that would mean pain
for everyone, not just new allocations.

I guess that we'll keep going along Status Quo Road until we
run out of IPv4 space, then go on a big witch-hunt. Anyone care
to mark my words on this? (I only hope that I'm wrong!)

> On the other hand, filterers do have a point: why are there so
> many /24s in the global routing table? But then again, this
> also happens to a lesser degree for larger blocks. Have a look
> at the 24.x.y.z space, this is pretty ridiculous.

No kidding.

FWIW, we advert three routes: /22, /22, /23. A couple of /24s
will be announced soon. Like Jeff, we'd gladly renumber, into a
single /20 in our case. When we're through getting beaten up by
ARIN and get a PI /20, we'll do just that. In the mean time,
we'll keep advertising 3x as many routes as we should.

I know of another place that is renumbering into a PI /19, and
will finally give up about 8-10 longer prefixes. More table
pollution.

Current IP allocation policies are just not conducive to
efficient routing tables. We little guys can't get portable
space, and upstreams must use theirs efficiently. Result?
Routing table fragmentation. Pre-CIDR days, anyone?

> Obviously, some networks don't care about the size of the
> routing table and announce hundreds of routes. Other networks
> do, and filter the easy targets. (And some networks manage to
> fall into both categories.) The result being that a group that
> didn't cause the problem suffers and the problem is not really
> solved.

Yup.

[ snip regional-routing discourse ]

> Even better would be if the RIRs would divvy up the world in 10
> - 20 regions, and allocate a /8 - /10 to each. That way, the
> routers don't have to know all individual routes to some remote
> region, but they can simply forward the traffic to a part of
> the network that does know the region-specific routes.

Aggregation at its finest.

Furthermore, if one could _know_ that a given netblock was in a
specific geographical location, one could more easily correlate
latency with netblocks. Sure, 202/7 is APNIC territory. Alas,
that's just a best case under the current scenario.

> If anybody bothers to reply to this, you will see that there
> are numerous reasons why this isn't "the" solution. However, it

Anyone with The Solution is free to flame anything I have said.
All public lartings will be accepted.

> may help some people some of the time, and it doesn't impact
> those who don't want to use it. And, more importantly: it
> doesn't require universal cooperation. Just the RIR's.

Even that could be difficult. But it's definitely orders of
magnitude better than universal cooperation.

> Iljitsch van Beijnum

Eddy

Jeff_McAdams · October 3, 2001, 4:25pm

Also sprach E.B. Dreger

> FWIW, we advert three routes: /22, /22, /23. A couple of /24s will be
> announced soon. Like Jeff, we'd gladly renumber, into a single /20 in
> our case. When we're through getting beaten up by ARIN and get a PI
> /20, we'll do just that. In the mean time, we'll keep advertising 3x
> as many routes as we should.

*IF* you get a /20. I don't know how anal ARIN is about it, but looking
at the rough numbers you posted, I'm not sure you technically qualify.
I could be wrong... I'm at a remote POP at the moment, waiting for Cisco TAC
to call me back, so I can't double-check.

Our experience was that we had a couple of /24's, a couple of /23's and
a /20. ARIN gave us another /20 without any requirements to renumber
out of any of our existing blocks (which, to be quite honest, we were
expecting to have to do). If they had given us a /19, we'd have been
fine (not looking *forward* to the process of renumbering, but perfectly
willing to do so) with the process of renumbering out of one or more of
our older blocks.

Again, not only is the incentive to renumber into more aggregatable
blocks not there, there's actually a *dis*incentive to do so.

Simon_Lyall · October 3, 2001, 7:01pm

I'm afraid that doesn't work. It's great when there is exactly one
provider and nobody multihomes. As soon as people start multihoming then
they have to start announcing smaller prefixes everywhere. Then people
will no longer have circuits to the previous monopoly provider so even if
you routed to the /8 it won't get through.

Sift things around for a few years and you have people in that region
connecting to every possible backbone provider plus most of the 2nd tiers
and misc other countries.

Take a look at 203.0.0.0/10 (from memory) which is Telstra's allocation
for Australia. Almost every single ip in that range is in Australia but
there are hundreds of different paths as ISPs in that range have switched
providers and circuits over the years.

Didn't we have this argument with 8+8 ?

Iljitsch_van_Beijnum · October 7, 2001, 8:38pm

> > Even better would be if the RIRs would divvy up the world in 10 - 20
> > regions, and allocate a /8 - /10 to each.
>
> I'm afraid that doesn't work. It's great when there is exactly one
> provider and nobody multihomes. As soon as people start multihoming then
> they have to start announcing smaller prefixes everywhere.

Only when multihomers routinely connect to networks that only interconnect
outside the region. In other words: as long as there is at least one
widely-used interconnect point in the region, this should not be a
problem. (There are some (rare, IMHO) failure modes that are not fatal
with current practice but would be in this scenario, though.)

10 to 20 regions means about three regions to a continent. That's not too
unreasonable.

> Sift things around for a few years and you have people in that region
> connecting to every possible backbone provider plus most of the 2nd tiers
> and misc other countries.

But Asian/Australian networks tend to connect to the US west coast,
European networks to the US east coast. And even if a relatively large
number of exceptions exist, savings are possible.

> Didn't we have this argument with 8+8 ?

I wasn't there... But the argument shouldn't be about how much this will
help, but about how much it will hurt. I don't think it will hurt anyone,
so even if there is just a chance that it will help, we should do it.

E.B_Dreger · October 7, 2001, 10:21pm

Date: Sun, 7 Oct 2001 22:38:44 +0200 (CEST)
From: Iljitsch van Beijnum <iljitsch@muada.com>

[ snip ]

> 10 to 20 regions means about three regions to a continent.
> That's not too unreasonable.

Furthermore, nothing says that there must be a mapping stating
"this IP space is for this one region".

Let's say that, in the U.S., CHI is the base for "north", DFW for
"south", D.C. for "east", and Bay area for "west". All except
E/W are valid combos. (e.g.: being in KS, I could be in "north"
or "south", connected to CHI or DFW.)

The number of region combos is "4 choose 2 minus 1", or 5:

+ N/E 126.0.0.0/11
+ N/W 126.32.0.0/11
+ N/S 126.64.0.0/11
+ E/S 126.96.0.0/11
+ W/S 126.128.0.0/11

Assign IP space based on one of those regions...

> > Sift things around for a few years and you have people in that region
> > connecting to every possible backbone provider plus most of the 2nd tiers
> > and misc other countries.

...rinse and repeat for E-US/W-EU, W-US/E-JP, etc.
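The "4 choose 2 minus 1" count and the /11 assignments listed above can be reproduced with a short sketch. It assumes the blocks are simply handed out in order from 126.0.0.0/8, as in the list; the variable names are hypothetical.

```python
import ipaddress
from itertools import combinations

# Region order chosen so the pairs come out in the same order as the list above.
regions = ["N", "E", "W", "S"]
invalid = {frozenset(("E", "W"))}  # E/W is the one combination ruled out

# All two-region pairs, minus the invalid one: C(4, 2) - 1 = 5.
combos = [pair for pair in combinations(regions, 2)
          if frozenset(pair) not in invalid]

# Hand out consecutive /11 blocks from 126.0.0.0/8, one per valid pair.
blocks = ipaddress.ip_network("126.0.0.0/8").subnets(new_prefix=11)
plan = {"/".join(pair): str(block) for pair, block in zip(combos, blocks)}

for name, block in plan.items():
    print(name, block)
```

A /8 holds eight /11s, so five region pairs leave three /11s unassigned under these assumptions.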

> But Asian/Australian networks tend to connect to the US west
> coast, European networks to the US east coast. And even if a
> relatively large number of exceptions exist, savings are
> possible.

I agree. Any comments on my above overlapping system? It's
virtually impossible for one to no longer connect to one's "home"
region. If "two closest points" isn't flexible enough, we can
move to three closest points: "N choose 3 minus invalid_combos"
is still fewer routes by far than the status quo.

Let's take this a step further. Say that we divide the US into
these "major hubs":

Seattle, SF Bay, LA, San Diego, Phoenix, Salt Lake, Denver, DFW,
Kansas City, Saint Louis, Chicago, Atlanta, Miami, D.C., NYC,
Boston, Philadelphia, Twin Cities.

Yes, I'm ignoring many cities. So what. This is an example...
everyone feel free to tear it apart and improve upon it.

I count 18 different hubs. Now let's say that we divide address
space such that a given netblock can be native to any of five
different hubs -- "18 choose 5" different netblocks = 8568
netblocks. Now consider how many are invalid... the actual
number is much lower. Using this logic, we can divide the entire
CONUS into a few thousand netblocks.
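The "18 choose 5" figure can be checked directly. The sketch below also shows how the count grows with the number of hubs per netblock and how many bits would be needed just to number the blocks; that last step is an added illustration, and the helper names are hypothetical. None of the counts remove the invalid combinations, so they are upper bounds.

```python
import math

HUBS = 18  # number of hubs in the example list above


def netblock_count(hubs, native_to):
    """Number of ways to pick `native_to` hubs out of `hubs` (invalid
    combinations are NOT removed, so this is an upper bound)."""
    return math.comb(hubs, native_to)


def bits_to_number(count):
    """Smallest number of bits that can give `count` blocks distinct IDs."""
    return (count - 1).bit_length()


for k in (2, 3, 5):
    n = netblock_count(HUBS, k)
    print(f"{HUBS} choose {k} = {n}, needs {bits_to_number(n)} bits")
```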

Let's say that I use 125.100.75.50/24. Let's further assume that
this is in 125.96.0.0/11, which is "KC+STL+CHI+DFW+DEN". Any
backbone provider servicing me in Wichita probably will connect
to one of those hubs.

I announce my /24 to Savvis and GBLX. They announce to peers.
Peers can agg geographic traffic as they please. Someone in
NYC who uses Sprint only sees 125.96.0.0/11, and knows that
Sprint can get there... and that's all that matters. To get
from NYC to Wichita, Sprint will interconnect with Savvis or GBLX
in KC, STL, CHI, DFW, or DEN.
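A small sketch of why the NYC side only needs the /11: forwarding uses the longest matching prefix, so a distant network can carry just the regional aggregate while networks near the hubs carry the /24. The routing tables and next-hop labels below are hypothetical, and the customer network is written as 125.100.75.0/24, the network that contains 125.100.75.50.

```python
import ipaddress


def best_route(table, destination):
    """Longest-prefix match over a {prefix: next_hop} dict.
    Returns (prefix, next_hop), or None if nothing matches."""
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return None if best is None else (str(best[0]), best[1])


# Hypothetical view from NYC: only the regional aggregate is known.
nyc_view = {"125.96.0.0/11": "Sprint, toward KC/STL/CHI/DFW/DEN"}

# Hypothetical view inside the region: the more specific /24 is also known.
regional_view = {
    "125.96.0.0/11": "regional aggregate",
    "125.100.75.0/24": "Savvis or GBLX",
}

print(best_route(nyc_view, "125.100.75.50"))
print(best_route(regional_view, "125.100.75.50"))
```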

I know that this creates peering problems, and the system won't
quite work as stated... but I'm trying to brainstorm to the
list in hopes that _something_ will come of it.

> > Didn't we have this argument with 8+8 ?
>
> I wasn't there... But the argument shouldn't be about how much
> this will help, but about how much it will hurt. I don't think
> it will hurt anyone, so even if there is just a chance that it
> will help, we should do it.

Sort of... renumbering for naught is a bad thing. However, using
a new, even marginally better, policy on new IP space would help.

Back to server building...
Eddy

Iljitsch_van_Beijnum · October 8, 2001, 8:38pm

> > 10 to 20 regions means about three regions to a continent.
> > That's not too unreasonable.
>
> Furthermore, nothing says that there must be a mapping stating
> "this IP space is for this one region".
>
> Let's say that, in the U.S., CHI is the base for "north", DFW for
> "south", D.C. for "east", and Bay area for "west". All except
> E/W are valid combos. (e.g.: being in KS, I could be in "north"
> or "south", connected to CHI or DFW.)

There are many ways this could work.

I think a system where addresses are used in a smaller area would probably
be better: you can always decide to accept Kansas addresses in Chicago or
New York or Madrid if you want (as long as ISPs announce customer routes
everywhere), but once some people in Kansas only connect to Dallas and
others only to Chicago, some of the advantage is lost: you have to connect
to both.

> > But Asian/Australian networks tend to connect to the US west
> > coast, European networks to the US east coast. And even if a
> > relatively large number of exceptions exist, savings are
> > possible.
>
> I agree. Any comments on my above overlapping system? It's
> virtually impossible for one to no longer connect to one's "home"
> region. If "two closest points" isn't flexible enough, we can
> move to three closest points: "N choose 3 minus invalid_combos"
> is still fewer routes by far than the status quo.

Suppose we are both global networks, but we interconnect only in a few
places. Suppose my idea of Kansas is "north" and yours is "south".
Obviously, if we could agree on an interconnect point where routes to
Kansas belong, we both wouldn't have to carry more specifics than the
regional aggregate outside this region. But if I accept your Kansas routes
in Dallas and you mine in Chicago, everything still works, there are just
no savings.

> Let's take this a step further. Say that we divide the US into
> these "major hubs":

[...]

> Let's say that I use 125.100.75.50/24. Let's further assume that
> this is in 125.96.0.0/11, which is "KC+STL+CHI+DFW+DEN". Any
> backbone provider servicing me in Wichita probably will connect
> to one of those hubs.

Yes, but what if there is no overlap? If two networks only know those
routes in that part of the country, but don't interconnect in that region,
there is a problem and more specifics have to be carried throughout a
larger part of the networks. However, a network that doesn't interconnect
in this region can accept Denver routes in the Bay area and Chicago routes
at the Sprint NAP, if Denver and Chicago use different regional prefixes.

> > > Didn't we have this argument with 8+8 ?
>
> > I wasn't there... But the argument shouldn't be about how much
> > this will help, but about how much it will hurt. I don't think
> > it will hurt anyone, so even if there is just a chance that it
> > will help, we should do it.
>
> Sort of... renumbering for naught is a bad thing. However, using
> a new, even marginally better, policy on new IP space would help.

I don't think many people will renumber for this. And it only applies to
multihomers anyway, routes from well-aggregated PA space will presumably
still be carried world wide.

As long as we're on the subject: it wouldn't hurt if the regional
registries looked at allocating bigger chunks of address space to large
ISPs. There are ASes that announce hundreds of routes. That's not good. It
seems like the RIRs are afraid to carve off big chunks of address space.
Why? Assigning someone a /16 and keeping the next 15 /16s free in case he
comes back soon doesn't mean those 15 /16s can never be allocated to
anyone else any more. But giving someone a /16 and the next person the
next /16 DOES mean the first one will never be able to aggregate two /16s
into a /15.
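The aggregation point can be illustrated with Python's standard `ipaddress` module. The prefixes below are placeholders from private address space, not real allocations, and `aggregate` is a hypothetical helper name.

```python
import ipaddress


def aggregate(prefixes):
    """Collapse a list of prefixes into the fewest routes that cover
    exactly the same addresses."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


# The ISP comes back and gets the neighbouring /16: one /15 route.
contiguous = aggregate(["10.0.0.0/16", "10.1.0.0/16"])

# The neighbouring /16 went to someone else, so the second block is
# elsewhere: two routes, forever.
scattered = aggregate(["10.0.0.0/16", "10.37.0.0/16"])

# A /16 plus the next 15 /16s held in reserve is a /12; if the ISP
# eventually fills all 16, they still announce a single route.
reserved = [str(n) for n in
            ipaddress.ip_network("10.0.0.0/12").subnets(new_prefix=16)]
filled = aggregate(reserved)

print(contiguous, scattered, filled)
```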

Iljitsch van Beijnum