Scalability issues in the Internet routing system

As of the last time I looked at it (admittedly quite a while ago), something like 80% of the forwarding table had at least one hit per minute. This may well have changed given the number of traffic-engineering prefixes that are circulating.

Tony

Hi
Thanks to everyone who sent helpful advice for tracking this down.
I wanted to follow up and say that I tried every test imaginable and nothing
was found.

Finally I got a period where I could do without Skype, and found that when
Skype was off... no more pings were going out at random times to networks all
over.

So... I have no idea why Skype is doing that. I even upgraded it (as it
requested, due to a newer version being out when I logged back in)
and once again the pings go out.

Seems damn odd for Skype to be doing that.

  Nicole

As of the last time I looked at it (admittedly quite a while ago), something like 80% of the forwarding table had at least one hit per minute. This may well have changed given the number of traffic-engineering prefixes that are circulating.

Tony

Yeah, but that's just me pinging everything, and Google and Yahoo fighting over who has the most complete list of X-rated sites.

And this probably depends greatly on the network, user population, and
business involved. Is it even a metric worth tracking?

It's a fight for eyeballs, isn't it? Routing table hits caused by spidering
from search engines will give a good indication of what percent of the
address space the spiders are covering. Of course, you need views from
a number of places, and some adjusting for the fact that the webservers are
usually clumped in very small pockets of address space.

On the other hand, if it can be established that 80% of the routing table
is hit every <N minutes> (which would tend to argue against caching a very
small subset), but the vast majority of the routing table hits are just spiders,
that may mean that a cache miss isn't as important as we thought...

Anybody got actual measured numbers on how much of the hits are just spiders
and Microsoft malware scanning for vulnerable hosts?

Oops, I should not have replied to Blaine/Tony but directly to Tony's
message :( The real question was:

"If the percentage of hits is dependent on 'user population', 'business',
'network' is it even worth metricing for the purpose of design of the
device/protocol?"

Unless of course you want a 'sport' and 'offroad' switch on your router :)

Assume you have determined that some percentage (20%, 80%, whatever) of
the routing table is really used over a given time period. If you
design a forwarding system that can only sustain its packets-per-second rating
for those most-used routes, all you need to DDoS it is a zombie network
that sends packets to all the other destinations... Rate-limiting and
dampening would probably come into play, and a new arms race would
start, killing operators' ability to quickly renumber sites or entire
networks and creating new troubleshooting issues for network operators.

Isn't it simpler just to forward at line rate? IP lookups are fast
nowadays, thanks to algorithmic and architectural improvements... even
packet classification (which is the n-tuple version of the IP lookup
problem) is not that hard anymore. Algorithms can be updated on
software-based routers, and performance gains far exceed Moore's Law
and projected prefix growth rates... and routers that cannot cope with
that can always be repurposed to carry IGP-only routes plus a default
gateway pointing at a router that can keep up with full routing.
(Actually, hardware-based routers built around limited-size CAMs are more
vulnerable to obsolescence from routing table growth than software-based ones.)

Let's celebrate the death of "ip route-cache", not hellraise this fragility.

Rubens
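
As an aside on the point above that IP lookups are fast thanks to algorithmic improvements: the core operation is a longest-prefix match, and even a naive, uncompressed binary trie makes the mechanics concrete. The sketch below is a toy with made-up prefixes and interface names, not how any production FIB is built (real forwarding planes use compressed/multibit tries or TCAMs).

```python
# Toy longest-prefix-match FIB built as a one-bit-at-a-time binary trie.
# Illustrative only; prefixes and nexthop names are invented.
import ipaddress

class TrieNode:
    def __init__(self):
        self.children = [None, None]   # branch on the next address bit (0/1)
        self.nexthop = None            # set if a prefix terminates here

class Fib:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix, nexthop):
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (bits >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.nexthop = nexthop

    def lookup(self, addr):
        bits = int(ipaddress.ip_address(addr))
        node, best = self.root, None
        for i in range(32):
            if node.nexthop is not None:
                best = node.nexthop          # remember the longest match so far
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                break
        else:
            if node.nexthop is not None:     # walked all 32 bits (a /32 route)
                best = node.nexthop
        return best

fib = Fib()
fib.insert("10.0.0.0/8", "ge-0/0/0")         # covering prefix
fib.insert("10.1.0.0/16", "ge-0/0/1")        # more-specific route
print(fib.lookup("10.1.2.3"))                # -> ge-0/0/1 (longest match wins)
print(fib.lookup("10.9.9.9"))                # -> ge-0/0/0
```

In a structure like this, each lookup touches at most 32 nodes regardless of table size; what grows with the table is memory and update cost, not per-packet lookup time.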

Vice versa. A DDoS attack will never work this way, because such a router
will (de facto) prioritize
long-established streams over new and random ones, so it will not notice the DDoS
attack at all - just some DDoS packets will be delayed or lost.

You do not need to forward 100% of packets at line-card rate; forwarding 95%
of packets at card rate and handling the rest (with possible delays) through the
central CPU can work well enough.

It is all about tricks and optimizations - fast routing is not state of the art
and can be optimized in many ways.
So far it has not been necessary; when it becomes necessary, it will be done
within half a year.
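
To make the trade-off in this sub-thread concrete, here is a rough sketch of a partial fast table with a slow-path fallback. Everything here (class names, cache size, the dict-keyed lookup instead of a real longest-prefix match) is invented for illustration and is not any vendor's design; the comments note the miss-flood failure mode Rubens raised above.

```python
# Sketch of a "95% fast path / 5% slow path" forwarding model.
# Hypothetical structure; destinations are treated as simple keys
# rather than doing a real longest-prefix match, to keep it short.
from collections import OrderedDict

class PartialFib:
    def __init__(self, full_fib, cache_size=1000):
        self.full_fib = full_fib            # complete table held by the central CPU
        self.cache = OrderedDict()          # small, fast line-card table (LRU)
        self.cache_size = cache_size
        self.fast_hits = 0
        self.slow_punts = 0

    def forward(self, dest):
        if dest in self.cache:              # fast path: handled at line rate
            self.cache.move_to_end(dest)
            self.fast_hits += 1
            return self.cache[dest]
        # Slow path: punt to the central CPU, then install the result.
        # Traffic deliberately spread across uncached destinations turns
        # almost every packet into a punt, which is the DDoS concern above.
        self.slow_punts += 1
        nexthop = self.full_fib[dest]
        self.cache[dest] = nexthop
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return nexthop
```

For ordinary traffic most lookups hit the cache; a botnet sweeping the address space drives the punt counter instead, which is exactly the disagreement between the two positions in this thread.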

Alexei Roudnev wrote:

You do not need to forward 100% of packets at line-card rate; forwarding 95%
of packets at card rate and handling the rest (with possible delays) through the
central CPU can work well enough.

heh.
in the words of Randy, "i encourage my competitors to build a router this way".

reality is that any "big, fast" router is forwarding in hardware - typically an ASIC or some form of programmable processor.
the lines here are getting blurry again .. Moore's Law means that packet-forwarding can pretty much be back "in software" in something which almost resembles a general-purpose processor - or maybe more than a few of them working in parallel (ref: Technology Consulting Services | IBM).

if you've built something to be 'big' and 'fast' it's likely that you're also forwarding in some kind of 'distributed' manner (as opposed to 'centralized').

as such - if you're building forwarding hardware capable of (say) 25M PPS and line-rate is 30M PPS, it generally isn't that much of a jump to build it for 30M PPS instead.

i don't disagree that interfaces / backbones / networks are getting faster - but i don't think it's yet a case of "Moore's law" becoming a problem - all that happens is one architects a system far more modular than before - e.g. ingress forwarding separate from egress forwarding.

likewise, "FIB table growth" isn't yet a problem either - generally that just means "put in more SRAM" or "put in more TCAM space".

IPv6 may change the equations around .. but we'll see ..

cheers,

lincoln.

likewise, "FIB table growth" isn't yet a problem either - generally that
just means "put in more SRAM" or "put in more TCAM space".

IPv6 may change the equations around .. but we'll see ..

IPv6 will someday account for as many networks as IPv4 would have by then,
and IPv6 prefixes are twice as large as IPv4 (a 64-bit prefix vs. a
32-bit prefix+address; the remaining 64 bits of an IPv6 address are
strictly local), so despite a 3x cost increase in memory structures (one
32-bit table for IPv4, two for IPv6) and a 2x increase in the lookup
engine (two engines would be used for IPv6, one for IPv4), the same
technology that can run IPv4 can do IPv6 at the same speed. As this
is not a usual demand today, even hardware routers limit the
forwarding table to the sum of IPv4 and IPv6 prefixes, and forward
IPv6 at half the rate of IPv4.

Rubens

s/64/128/

...and total, complete nonsense. Please educate yourself more on the reality
of inet6 unicast forwarding before speculating. Thank you.

James

Forwarding is in line cards not because of CPU issues, but because of bus
issues. That means a card can easily be software-based.

Anyway, as I said - it is only a small, minor engineering question how to
forward with 2,000,000 routes. If the Internet requires such a router, it
will be created easily. Today we need 160,000 routes - and it works (line
cards, software, etc. - it DOES work).

Forwarding packets is only half the story. Building a routing table is
the other half.

Route flaps. Even if you have an algorithm that's O(n), 2M routes will take
12.5 times as long to crunch as 160K. If your routing protocol is O(n**2) on
number of routes, that's about 150 times as much.
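
Spelling out the arithmetic behind those factors:

$$
\frac{2{,}000{,}000}{160{,}000} = 12.5
\qquad
\left(\frac{2{,}000{,}000}{160{,}000}\right)^{2} = 156.25 \approx 150
$$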

Such a router is probably buildable. I'm not at all convinced that it's "easy"
to do so at a price point acceptable for most sites that currently have full
routing tables.

There are definitely performance challenges to overcome. Of course, most route processors are underpowered compared to the existing state of the art for processors so there is some wiggle room.

With both Cisco and Juniper we have a nice period of hang time as "brand new" routes get installed. Both vendors are playing with layers of abstraction to improve things once up and operational, but increasing the amount of time to bring a device "online" is a factor which influences purchasing decisions as well.

It does seem appropriate to consider Gigabit sized routing/forwarding table interconnects and working on TCP performance optimization for BGP specifically, if any improvement remains. Combine those things with a chunky CPU and you are left with pushing data as fast as possible into the forwarding plane (need speedy ASIC table updates here).

Another thing, it would be interesting to hear of any work on breaking the "router code" into multiple threads. Being able to truly take advantage of multiple processors when receiving 2M updates would be the cats pajamas. Has anyone seen this? I suppose MBGP could be rather straightforward, as opposed to one big table, in a multi-processor implementation.

Regards,

Blaine

Blaine Christian wrote:

It does seem appropriate to consider Gigabit sized routing/forwarding table interconnects and working on TCP performance optimization for BGP specifically, if any improvement remains. Combine those things with a chunky CPU and you are left with pushing data as fast as possible into the forwarding plane (need speedy ASIC table updates here).

I guess you got something wrong here. Neither BGP nor TCP is (nor has ever been)
a bottleneck regarding the subject of this discussion.

Another thing, it would be interesting to hear of any work on breaking the "router code" into multiple threads. Being able to truly take advantage of multiple processors when receiving 2M updates would be the cats pajamas. Has anyone seen this? I suppose MBGP could be rather straightforward, as opposed to one big table, in a multi-processor implementation.

You may want to read this thread from the beginning. The problem is not
the routing plane or routing protocol but the forwarding plane or ASICs
or whatever. Both have very different scaling properties. The forwarding
plane is at a disadvantage here because at the same time it faces growth
in table size and less time to perform a lookup. With current CPUs you
can handle a 2M prefix DFZ quite well without killing the budget. For the
forwarding hardware this ain't the case, unfortunately.

Blaine Christian wrote:
> It does seem appropriate to consider Gigabit sized routing/forwarding
> table interconnects and working on TCP performance optimization for BGP
> specifically, if any improvement remains. Combine those things with a
> chunky CPU and you are left with pushing data as fast as possible into
> the forwarding plane (need speedy ASIC table updates here).

I guess you got something wrong here. Neither BGP nor TCP is (nor has ever been)
a bottleneck regarding the subject of this discussion.

i think he's describing initial table gather/flood and later massage of
that into FIB on cards ... which relates to his earlier comment about
'people still care about how fast initial convergence happens' (which is
true)

> Another thing, it would be interesting to hear of any work on breaking
> the "router code" into multiple threads. Being able to truly take
> advantage of multiple processors when receiving 2M updates would be the
> cats pajamas. Has anyone seen this? I suppose MBGP could be rather
> straightforward, as opposed to one big table, in a multi-processor
> implementation.

You may want to read this thread from the beginning. The problem is not
the routing plane or routing protocol but the forwarding plane or ASICs

it's actually both... convergence is very, very important. Some of the
conversation (which I admit I have only watched spottily) has covered this
too.

or whatever. Both have very different scaling properties. The forwarding
plane is at a disadvantage here because at the same time it faces growth
in table size and less time to perform a lookup. With current CPUs you
can handle a 2M prefix DFZ quite well without killing the budget. For the

really? are you sure about that? are you referring to linecard CPU or
RIB->FIB creation CPU? (be it monolithic design or distributed design)

forwarding hardware this ain't the case, unfortunately.

this could be... I'm not sure I've seen a vendor propose the cost
differentials though.

Another thing, it would be interesting to hear of any work on breaking the "router code" into multiple threads. Being able to truly take advantage of multiple processors when receiving 2M updates would be the cats pajamas. Has anyone seen this? I suppose MBGP could be rather straightforward, as opposed to one big table, in a multi-processor implementation.

You may want to read this thread from the beginning. The problem is not
the routing plane or routing protocol but the forwarding plane or ASICs
or whatever. Both have very different scaling properties. The forwarding
plane is at a disadvantage here because at the same time it faces growth
in table size and less time to perform a lookup. With current CPUs you
can handle a 2M prefix DFZ quite well without killing the budget. For the
forwarding hardware this ain't the case, unfortunately.

Hi Andre...

I hear what you are saying but don't agree with the above statement. The problem is with the system as a whole, and I believe that was the point Valdis, and others, were making as well. The forwarding plane is only one part of the puzzle. How do you get the updates into the forwarding plane? How do you get the updates into the router in the first place, and how fast can you do that? I have seen at least one case where the issue did not appear to be the ASICs but getting the information into them rapidly. If you go and create a new ASIC without taking into account the manner in which you get the data into it, you probably won't sell many routers <grin>.

BTW, I do agree that spinning new ASICs is a non-trivial task and is certainly the task you want to get started quickly when building a new system.

I did read your comment on BGP lending itself to SMP. Can you elaborate on where you might have seen this? It has been a pretty monolithic implementation for as long as I can remember. In fact, that was why I asked the question, to see if anyone had actually observed a functioning multi-processor implementation of the BGP process.

Regards,

Blaine

Blaine Christian wrote:

Another thing, it would be interesting to hear of any work on breaking the "router code" into multiple threads. Being able to truly take advantage of multiple processors when receiving 2M updates would be the cats pajamas. Has anyone seen this? I suppose MBGP could be rather straightforward, as opposed to one big table, in a multi-processor implementation.

You may want to read this thread from the beginning. The problem is not
the routing plane or routing protocol but the forwarding plane or ASICs
or whatever. Both have very different scaling properties. The forwarding
plane is at a disadvantage here because at the same time it faces growth
in table size and less time to perform a lookup. With current CPUs you
can handle a 2M prefix DFZ quite well without killing the budget. For the
forwarding hardware this ain't the case, unfortunately.

Hi Andre...

I hear what you are saying but don't agree with the above statement. The problem is with the system as a whole, and I believe that was the point Valdis, and others, were making as well. The forwarding plane is only one part of the puzzle. How do you get the updates into the forwarding plane? How do you get the updates into the router in the first place, and how fast can you do that? I have seen at least one case where the issue did not appear to be the ASICs but getting the information into them rapidly. If you go and create a new ASIC without taking into account the manner in which you get the data into it, you probably won't sell many routers <grin>.

Sure, if you have a bottleneck at FIB insertion you fail much earlier.
I'd say if that happens it's an engineering oversight or a design
tradeoff. However I don't think this is the choke point in the entire
routing table size equation.

Depending on the type of prefix churn you don't have that many transactions
reaching the FIB. Most far-away churn doesn't change the next hop, for
example. Local churn, when direct neighbors flap, mostly just changes the
nexthop (egress interface). In a highly performant ASIC/TCAM/whatever FIB
a nexthop change can be done quite trivially. A prefix drop can be handled
by marking it invalid and garbage collecting it later. Prefix insertions
may either salvage an invalidated prefix or have to be re-inserted. The
insertion time depends on the algorithms of the FIB table implementation.

For all practical purposes a FIB can be designed to be quite speedy in this
regard without busting the budget.
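
A rough model of the update handling described above, with invented names (this is not any particular ASIC or TCAM driver's interface): nexthop changes are rewritten in place, withdrawals only mark an entry invalid for a later sweep, and insertions try to salvage an invalidated slot before allocating a new one.

```python
# Illustrative FIB update model: cheap in-place nexthop rewrites,
# deferred deletion via invalidation + garbage collection.
class FibEntry:
    def __init__(self, nexthop):
        self.nexthop = nexthop
        self.valid = True

class FibTable:
    def __init__(self):
        self.entries = {}                   # prefix -> FibEntry

    def change_nexthop(self, prefix, nexthop):
        # Local churn (neighbor flap): just rewrite the nexthop in place.
        self.entries[prefix].nexthop = nexthop

    def withdraw(self, prefix):
        # Prefix drop: mark invalid now, reclaim the slot later, so the
        # expensive table reorganization stays off the update path.
        self.entries[prefix].valid = False

    def insert(self, prefix, nexthop):
        entry = self.entries.get(prefix)
        if entry is not None:               # salvage an invalidated entry
            entry.nexthop = nexthop
            entry.valid = True
        else:                               # cost depends on the table algorithm
            self.entries[prefix] = FibEntry(nexthop)

    def garbage_collect(self):
        # Periodic sweep of entries that were marked invalid.
        self.entries = {p: e for p, e in self.entries.items() if e.valid}
```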

The link speed between two DFZ routers has seldom been the limit for initial
routing table exchanges. Neither has TCP. It is mostly dominated by the
algorithm choice and CPU of the RIB processor on both ends.

BTW, I do agree that spinning new ASICs is a non-trivial task and is certainly the task you want to get started quickly when building a new system.

It is non-trivial because of the prefix storage size and ultra-fast lookup times.
Longest prefix match is probably the most difficult thing to scale properly,
as a search must always be done over a number of overlapping prefixes.

To scale this much better and remove the bottleneck you may drop the 'overlapping'
part or the 'longest-match' part and the world suddenly looks much brighter.

This is the crucial thing that got forgotten during the IPng design phase
which brought us IPv6.
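
To illustrate what dropping the 'longest-match' (or 'overlapping') property would buy: with fixed-length, non-overlapping identifiers a lookup collapses to a single exact-match probe, whereas longest-prefix match has to consider every possible covering prefix of an address. A toy comparison with invented prefixes (the linear scan over prefix lengths below just stands in for whatever search a real FIB performs):

```python
# Exact match vs. longest-prefix match, in miniature.
import ipaddress

fib = {
    "10.0.0.0/8":  "ge-0/0/0",
    "10.1.0.0/16": "ge-0/0/1",
}

def exact_lookup(key):
    # One hash probe; no notion of "more specific" vs "less specific".
    return fib.get(key)

def lpm_lookup(addr):
    # Must consider all covering prefixes (up to 33 lengths for IPv4)
    # and return the longest one present in the table.
    ip = int(ipaddress.ip_address(addr))
    for plen in range(32, -1, -1):
        mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
        net = ipaddress.ip_network((ip & mask, plen))
        nexthop = fib.get(str(net))
        if nexthop is not None:
            return nexthop
    return None

print(exact_lookup("10.1.0.0/16"))   # -> ge-0/0/1, single probe
print(lpm_lookup("10.1.2.3"))        # -> ge-0/0/1, longest match wins
print(lpm_lookup("10.9.9.9"))        # -> ge-0/0/0
```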

So far we have learned that limiting the number of IPv[46] prefixes in the
DFZ is not an option for commercial and socio-technical reasons. That leaves
only the other option of changing the routing lookup to something with better
scaling properties.

I did read your comment on BGP lending itself to SMP. Can you elaborate on where you might have seen this? It has been a pretty monolithic implementation for as long as I can remember. In fact, that was why I asked the question, to see if anyone had actually observed a functioning multi-processor implementation of the BGP process.

I can make the SMP statement with some authority as I have done the internal
design of the OpenBGPd RDE and my co-worker Claudio has implemented it. Given
proper locking of the RIB, a number of CPUs can crunch on it and handle neighbor
communication independently of each other. If you look at Oracle databases, they
manage to scale performance by a factor of 1.9-1.97 per CPU. There is no reason
to believe we can't do this with the BGP 'database'.
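
A very rough structural sketch of that split, in generic Python threading (this is not OpenBGPd's code or API, and CPython's GIL means it only illustrates the structure, not real parallel speedup): one worker per neighbor handles that session's updates independently, and only the brief RIB write is serialized behind a lock.

```python
# Per-neighbor BGP update workers sharing a locked RIB -- structural
# sketch only, with an invented message format and no real BGP parsing.
import threading
import queue

class Rib:
    def __init__(self):
        self._routes = {}                   # prefix -> (peer, attributes)
        self._lock = threading.Lock()

    def update(self, peer, prefix, attrs):
        # Only the table write is serialized; parsing and policy run
        # in the per-peer workers without holding the lock.
        with self._lock:
            self._routes[prefix] = (peer, attrs)

def peer_worker(peer, updates, rib):
    # One thread per neighbor: drains that neighbor's queue
    # independently of every other session.
    while True:
        msg = updates.get()
        if msg is None:                     # shutdown sentinel
            break
        prefix, attrs = msg                 # stand-in for real update parsing
        rib.update(peer, prefix, attrs)

rib = Rib()
queues = {peer: queue.Queue() for peer in ("peerA", "peerB")}
workers = [threading.Thread(target=peer_worker, args=(peer, q, rib))
           for peer, q in queues.items()]
for w in workers:
    w.start()

queues["peerA"].put(("192.0.2.0/24", {"as_path": [65001]}))
queues["peerB"].put(("198.51.100.0/24", {"as_path": [65002]}))
for q in queues.values():
    q.put(None)
for w in workers:
    w.join()
print(sorted(rib._routes))                  # both peers' routes are present
```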

I did read your comment on BGP lending itself to SMP. Can you elaborate on where you might have seen this? It has been a pretty monolithic implementation for as long as I can remember. In fact, that was why I asked the question, to see if anyone had actually observed a functioning multi-processor implementation of the BGP process.

I can make the SMP statement with some authority as I have done the internal
design of the OpenBGPd RDE and my co-worker Claudio has implemented it. Given
proper locking of the RIB, a number of CPUs can crunch on it and handle neighbor
communication independently of each other. If you look at Oracle databases, they
manage to scale performance by a factor of 1.9-1.97 per CPU. There is no reason
to believe we can't do this with the BGP 'database'.

Neat! So you were thinking you would leave the actual route selection process monolithic and create separate processes per peer? I have seen folks doing something similar with separate MBGP routing instances. Had not heard of anyone attempting this for a "global" routing table with separate threads per neighbor (as opposed to per table). What do you do if you have one neighbor who wants to send you all 2M routes though? I am thinking of route reflectors specifically but also confederation EIBGP sessions.

I think you hit the nail on the head regarding record locking. This is the thing that is going to bite you if anything will. I have heard none of the usual suspects speak up so I suspect that either this thread is now being ignored or no one has heard of an implementation like the one you just described.

Alexei Roudnev wrote:

Forwarding is in line cards not because of CPU issues, but because of bus
issues.

i respectfully disagree.
"centralized forwarding" only gets you so far on the performance scale. "distributed forwarding" is a (relatively) simple way to scale that performance.

just take a look at any 'modern' router (as in, something this century) with a requirement of (say) >10M PPS.

sure - there are reasons why one DOES have to go to a distributed model - 'bus limitations' as you say .. but i'd more classify those as physical chip-packaging limitations - how many pins you can put on a chip, how 'wide' the memory-bus needs to be as the PPS goes up.

That means a card can easily be software-based.

once again - disagree. it _may_ be that it means that forwarding can be in software - but for the most part the determining factor here is what is the PPS required for the function.

i've previously posted a categorization of requirements in a router based on their function -- see <http://www.merit.edu/mail.archives/nanog/2005-09/msg00635.html>

i think _software-based_ works for /some/ types of router functions - but nowhere near all - and certainly not a 'core' router this century.

Anyway, as I said - it is only a small, minor engineering question how to
forward with 2,000,000 routes. If the Internet requires such a router, it
will be created easily. Today we need 160,000 routes - and it works (line
cards, software, etc. - it DOES work).

if you're looking at routers based on their classification, clearly there isn't a requirement for all types of routers to have a full routing table.

but for a 'core router' and 'transit/peering routers', the ability to work with a full routing-table view is probably a requirement - both now, and into the future.

there have been public demonstrations of released routers supporting upwards of 1.5M IPv4+IPv6 prefixes and demonstrations of routing churn convergence time. <http://www.lightreading.com/document.asp?doc_id=63606> contains one such public test.

cheers,

lincoln.