Cogent service

> The hop count question is interesting. Is the consensus that it's
> mostly a customer service issue, where latency isn't affected but
> customer perception is? Or is it a real latency issue, as more
> routers take a few CPU cycles to make a routing decision?

To some extent, that's an "apples or oranges" statement of the
problem. Longer paths present more opportunities for latency to be
caused by queueing delays on potentially congested interfaces and
such. Customer service is always going to be an issue because the
details that lead to a path with more segments providing better
service are hard to explain, and real side-by-side comparison is often
difficult.

Regarding CPU cycles for route lookups:

Back in the mid-1990s, route lookups were expensive. There was a lot
of hand-wringing, in fact, about how doing route lookups at every hop
in larger and larger FIBs had a negative impact on end-to-end
performance. One of the first reasons for the existence of MPLS was to
solve this problem.

In the late-1990s, though, ASIC-based route lookup made the problem go
away - route lookups could be done in just a couple memory cycles,
with retrieval times comparable to looking up MPLS labels. Problem
solved (and the MPLS people moved on to try the TE problem space to
justify their existence).

In the case of modern hardware, forwarding delay due to route lookup
times is probably not a contributing factor to latency. "More hops"
can mean many other things in terms of delay, though - in both the
"more delay" and "less delay" directions, and countering the "more
hops means more delay" perception will (I think) always be a
challenge.

Stephen

> Regarding CPU cycles for route lookups:
>
> Back in the mid-1990s, route lookups were expensive. There was a lot
> of hand-wringing, in fact, about how doing route lookups at every hop
> in larger and larger FIBs had a negative impact on end-to-end
> performance. One of the first reasons for the existence of MPLS was to
> solve this problem.

This was a totally bogus reason from the very beginning. Given that real
backbones carry no prefixes longer than 24 bits, the "long" lookup in a
16-4-4-4-4 bit radix tree takes 5 memory reads. Given that a lot of
routes are aggregated, and the ubiquity of fast data caches (which can
safely hold the top 3 levels of a full FIB), the average number of memory
reads needed by a general-purpose CPU available in the mid-90s to do a
route lookup is (surprise) about 1.2.
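
For concreteness, here is a minimal sketch of that kind of stride-based
lookup. The node layout and field names are my own illustrative
assumptions, not any real FIB format; the point is just that a full-depth
lookup is at most 5 table reads.

```c
/* Sketch of a 16-4-4-4-4 stride trie lookup, as described above.
 * Node layout is an invented example, not a real FIB format.
 * Worst case: 5 reads for a /32; most lookups stop much earlier. */
#include <stdint.h>

struct trie_entry {
    uint32_t child;    /* offset of the child table, 0 = no longer prefixes */
    uint32_t nexthop;  /* best-matching next hop at this expanded entry, 0 = none */
};

uint32_t lookup(const struct trie_entry *tables, uint32_t daddr)
{
    static const int stride[] = { 16, 4, 4, 4, 4 };
    uint32_t base = 0;          /* root table: the 2^16-entry first level */
    uint32_t nexthop = 0;       /* default route */
    int shift = 32;

    for (int i = 0; i < 5; i++) {
        shift -= stride[i];
        uint32_t idx = (daddr >> shift) & ((1u << stride[i]) - 1);
        struct trie_entry e = tables[base + idx];   /* one memory read */
        if (e.nexthop)
            nexthop = e.nexthop;                    /* longest match so far */
        if (!e.child)
            break;                                  /* nothing more specific below */
        base = e.child;
    }
    return nexthop;
}
```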

In fact, full-featured IP forwarding (including Fair Queueing and packet
classification) at 120 kpps was demonstrated using a 133MHz Pentium MMX.
Did beat crap out of 7000's SSE. Wasn't that hard, too; after all it is
more than 1000 CPU cycles per packet. The value proposition of ASIC-based
packet routing (and use of CAMs) was always quite dicey.

For the arithmetically inclined, I'd offer to do the same calculation for a P-4
running at 2 GHz, assuming 1000 cycles per packet and an average packet size
of 200 bytes. Then estimate the cost of that hardware and spend some time
wondering about the prices of OC-192 cards and the virtues of duopoly :)
Of course, doing "line rate" forwarding of no-payload Christmas-tree
packets is neat, but does anyone really need it?
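
Working that invitation through with the numbers given above (my
arithmetic, nothing more):

```c
/* Back-of-the-envelope throughput of a 2 GHz P-4 software forwarder,
 * using the assumptions stated above: 1000 cycles per packet and a
 * 200-byte average packet. */
#include <stdio.h>

int main(void)
{
    double clock_hz   = 2e9;     /* 2 GHz */
    double cycles_pkt = 1000.0;  /* cycles spent per packet */
    double pkt_bytes  = 200.0;   /* average packet size */

    double pps = clock_hz / cycles_pkt;   /* packets per second */
    double bps = pps * pkt_bytes * 8;     /* bits per second */

    printf("%.1f Mpps, %.1f Gbps\n", pps / 1e6, bps / 1e9);
    /* prints: 2.0 Mpps, 3.2 Gbps */
    return 0;
}
```

Under those assumptions, one commodity CPU is good for roughly 2 Mpps,
or about 3 Gbps of average-sized traffic.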

> In the case of modern hardware, forwarding delay due to route lookup
> times is probably not a contributing factor to latency.

In real life (as opposed to benchmark tests) nobody cares about
forwarding delay. Even with dumb implementations it is two orders of
magnitude smaller than the speed-of-light delays in long-haul circuits (and on
short hops end-host performance is limited by bus speed anyway).

"More hops" can mean many other things in terms of delay, though - in
both the "more delay" and "less delay" directions, and countering the
"more hops means more delay" perception will (I think) always be a
challenge.

"More hops" means more queues, i.e. potential congestion points. With max
queue size selected to be RTT*BW, the maximal packet latency is,
therefore, Nhops*RTT. In a steady mode, with all network uniformly loaded
the mean latency is, therefore, (Nhops*MeanQueueLength + 1)*RTT. (MQL is
from 0 (empty queue) to 1 (full queue)). In the real networks, queueing
delays tend to contribute mostly to delay variance, not to the mean delay
(in an underloaded network the average queue size is < 1 packet).
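
As a quick illustration of those two expressions, plugging in made-up but
plausible example values (the RTT, hop count and mean queue occupancy are
mine, chosen only to show the shape of the result):

```c
/* Illustrative numbers plugged into the latency expressions above.
 * RTT, hop count and mean queue occupancy are example values. */
#include <stdio.h>

int main(void)
{
    double rtt   = 0.070;  /* 70 ms RTT, which also sizes each queue (RTT*BW) */
    int    nhops = 12;     /* queues traversed end to end */
    double mql   = 0.02;   /* mean queue length: 0 = empty .. 1 = full */

    double max_latency  = nhops * rtt;              /* every queue full        */
    double mean_latency = (nhops * mql + 1) * rtt;  /* steady, uniform loading */

    printf("worst case: %.0f ms, mean: %.0f ms\n",
           max_latency * 1000, mean_latency * 1000);
    /* prints: worst case: 840 ms, mean: 87 ms */
    return 0;
}
```

That is the point about variance: the mean barely moves off the base RTT,
but the worst case grows linearly with the hop count.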

In other words, a large number of hops makes life a lot less pleasant for
interactive applications; file transfers generally don't care.

--vadim

Thus spake "Vadim Antonov" <avg@exigengroup.com>

> Regarding CPU cycles for route lookups:
>
> Back in the mid-1990s, route lookups were expensive. There was a lot
> of hand-wringing, in fact, about how doing route lookups at every hop
> in larger and larger FIBs had a negative impact on end-to-end
> performance. One of the first reasons for the existence of MPLS was to
> solve this problem.

FIBs did not exist (in production routers) at the time MPLS aka tag switching
was invented. The problem was that the day's cache-based routers could not
handle the growing number of destinations on the Internet and crumbled under the
load of creating and aging cache entries. In networks with limited
destinations, this technology still works well in the tens of Mpps range.
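
A toy version of that per-destination cache scheme, just to make the
failure mode concrete (the structure and aging policy here are invented
for illustration, not anyone's actual fast-switching code):

```c
/* Toy per-destination route cache of the kind described above.
 * Every previously unseen destination costs a slow-path lookup plus
 * cache maintenance, so the overhead grows with the number of active
 * destinations rather than with the traffic rate. */
#include <stdint.h>
#include <time.h>

#define CACHE_SLOTS 8192

struct cache_entry {
    uint32_t daddr;     /* destination address, 0 = empty slot */
    uint32_t nexthop;
    time_t   last_used;
};

static struct cache_entry cache[CACHE_SLOTS];

/* Slow path: full routing-table lookup (stubbed out here). */
static uint32_t slow_path_lookup(uint32_t daddr) { (void)daddr; return 1; }

uint32_t forward(uint32_t daddr)
{
    struct cache_entry *e = &cache[daddr % CACHE_SLOTS];

    if (e->daddr == daddr) {            /* hit: cheap */
        e->last_used = time(NULL);
        return e->nexthop;
    }
    /* miss: slow-path lookup plus cache insertion -- the work that piles
     * up when the set of active destinations is large */
    e->daddr     = daddr;
    e->nexthop   = slow_path_lookup(daddr);
    e->last_used = time(NULL);
    return e->nexthop;
}

/* Periodic ager: invalidate idle entries (and, in routers of that era,
 * entries affected by routing changes). */
void age_cache(time_t now, time_t max_idle)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].daddr && now - cache[i].last_used > max_idle)
            cache[i].daddr = 0;
}
```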

It's arguable that MPLS hasn't really caught on yet precisely because FIBs (aka
CEF) removed the cache problems from the equation.

> This was a totally bogus reason from the very beginning. Given that real
> backbones carry no prefixes longer than 24 bits the "long" lookup in
> 16-4-4-4-4 bit radix tree takes 5 memory reads.

What routers use a 16-4-4-4-4 radix tree? Vendor C uses a 256-way radix tree
for its RIB and a 16-8-8 mtrie+ for its FIB. Vendor J uses a 20-1-1-1-1-1-1...
mtrie for its FIB. Or so I've read.

> Given that a lot of
> routes are aggregated, and the ubiquity of fast data caches (which can
> safely hold 3 top levels of full FIB) the average number of memory reads
> needed by a general-purpose CPUs available in mid-90s to do route lookup
> is (surprise) - about 1.2

Let's see:

250kpps
1.2rds/packet
128bytes/FIB entry
8bytes/read
1read/4cycles

That comes out to 4*16*1.2*250k = 19.2M cycles on the FSB just doing route
lookups. CPUs of the day had 33MHz and slower FSB's. Not going to happen.
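
The same figures worked through step by step (my arithmetic, using the
numbers listed above):

```c
/* The FSB budget above, worked through with the listed figures. */
#include <stdio.h>

int main(void)
{
    double pps             = 250e3;  /* packets per second            */
    double reads_per_pkt   = 1.2;    /* FIB entries touched per packet */
    double bytes_per_entry = 128.0;  /* assumed FIB entry size         */
    double bytes_per_read  = 8.0;    /* 64-bit bus                     */
    double cycles_per_read = 4.0;

    double bus_reads  = pps * reads_per_pkt * (bytes_per_entry / bytes_per_read);
    double bus_cycles = bus_reads * cycles_per_read;

    printf("%.1fM bus cycles/sec just for lookups\n", bus_cycles / 1e6);
    /* prints: 19.2M bus cycles/sec just for lookups --
     * well over half of a 33 MHz front-side bus */
    return 0;
}
```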

You're also way off on how much FIB fits in cache; the first level of vendor C's
FIB is in the *megs* and no CPU of that era had that much on-die cache; it's
rare even on CPUs today.

Sure, you can make the FIB a lot smaller to fit the cache, but then you have to
turn the mtrie+ into a plain mtrie, throw out load-sharing, etc. I've done it,
and it doesn't bear any resemblance to a *real* FIB implementation you can sell.

> In fact, full-featured IP forwarding (including Fair Queueing and packet
> classification) at 120 kpps was demonstrated using a 133MHz Pentium MMX.
> Did beat crap out of 7000's SSE. Wasn't that hard, too; after all it is
> more than 1000 CPU cycles per packet. The value proposition of ASIC-based
> packet routing (and use of CAMs) was always quite dicey.

If you think you can make a gigabit router with PC parts, feel free.

S

Stephen Sprunk wrote:

> FIBs did not exist (in production routers) at the time MPLS aka tag switching
> was invented. The problem was that the day's cache-based routers could not
> handle the growing number of destinations on the Internet and crumbled under the
> load of creating and aging cache entries. In networks with limited
> destinations, this technology still works well in the tens of Mpps range.

Yes, however to get the full picture one must understand that C shipped tag-switching
with the requirement of running CEF. So the cache problem was conveniently solved
at the same time tag-switching came in.

> Given that a lot of
> routes are aggregated, and the ubiquity of fast data caches (which can
> safely hold 3 top levels of full FIB) the average number of memory reads
> needed by a general-purpose CPUs available in mid-90s to do route lookup
> is (surprise) - about 1.2

> Let's see:
>
> 250kpps
> 1.2rds/packet
> 128bytes/FIB entry
> 8bytes/read
> 1read/4cycles
>
> That comes out to 4*16*1.2*250k = 19.2M cycles on the FSB just doing route
> lookups. CPUs of the day had 33MHz and slower FSB's. Not going to happen.

What bits in the 128 bytes of a FIB entry do you actually have to read? It sounds
like a 1-1-1-1-... tree, which needs only one word read per node, would be more
optimal than the 16-reads-per-node tree you're proposing.
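
A sketch of the kind of one-word-per-node structure being alluded to here
(the bit packing is my own invented example, not a real FIB layout); the
trade-off is one cheap read per node, but potentially many nodes for a
deep prefix:

```c
/* Unibit (1-1-1-...) trie where each node fits in one 64-bit word, so
 * each step of the lookup is exactly one memory read.  Field widths are
 * an invented example: 24-bit left/right child indexes, 15-bit next-hop
 * index, 1-bit "prefix ends here" flag. */
#include <stdint.h>

#define LEFT(n)    ((uint32_t)((n) & 0xFFFFFF))
#define RIGHT(n)   ((uint32_t)(((n) >> 24) & 0xFFFFFF))
#define NEXTHOP(n) ((uint32_t)(((n) >> 48) & 0x7FFF))
#define HASNH(n)   ((uint32_t)((n) >> 63))

uint32_t lookup(const uint64_t *nodes, uint32_t daddr)
{
    uint32_t idx = 0;       /* root node; child index 0 means "no child" */
    uint32_t nexthop = 0;   /* default route */

    for (int depth = 0; depth <= 32; depth++) {
        uint64_t n = nodes[idx];                 /* the single read per node */
        if (HASNH(n))
            nexthop = NEXTHOP(n);                /* longest match so far */
        if (depth == 32)
            break;                               /* consumed all 32 bits */
        uint32_t next = ((daddr >> (31 - depth)) & 1) ? RIGHT(n) : LEFT(n);
        if (next == 0)
            break;                               /* no longer prefixes below */
        idx = next;
    }
    return nexthop;
}
```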

> You're also way off on how much FIB fits in cache; the first level of vendor C's
> FIB is in the *megs* and no CPU of that era had that much on-die cache; it's
> rare even on CPUs today.

And cache does not help you that much anyway because usually you end up doing
queue management on the same CPU and keeping the "interesting stuff" in the cache
is challenging at best.

> Sure, you can make the FIB a lot smaller to fit the cache, but then you have to
> turn the mtrie+ into a plain mtrie, throw out load-sharing, etc. I've done it,
> and it doesn't bear any resemblance to a *real* FIB implementation you can sell.

Is lookup performance consistency or the depth of host-prefix lookups the main
reason why you say it wouldn't work? In any case, you usually look up
less than 1% of your nodes >90% of the time. Having a root node of 13-16 bits
cuts your lookups to less than half without increasing cache pollution.

> In fact, full-featured IP forwarding (including Fair Queueing and packet
> classification) at 120 kpps was demonstrated using a 133MHz Pentium MMX.
> Did beat crap out of 7000's SSE. Wasn't that hard, too; after all it is
> more than 1000 CPU cycles per packet. The value proposition of ASIC-based
> packet routing (and use of CAMs) was always quite dicey.

> If you think you can make a gigabit router with PC parts, feel free.

I think somebody just announced that they've shipped 250,000 routers built out
of PC parts :)

Pete

You may be surprised to learn that the BBN folks did practically that
(different CPU) with their 50 Gbps box (MGR). They had OC-48C line cards
and used an Alpha 21164 CPU, with a pretty small 8KB/96KB on-chip cache, to do
packet forwarding.

See C. Partridge et al., "A 50-Gb/s IP Router," IEEE/ACM Transactions on
Networking, 6(3):237-248, June 1998.

CPUs are quite a bit faster nowadays, and you can get things like a _quad_
300MHz PPC core on an FPGA plus 20+ 3.2Gbps serdes I/Os - all on one chip.
So building a multigigabit software router is a no-brainer.
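
For scale, the same 1000-cycles-per-packet yardstick applied to that part
(my arithmetic; it counts only the CPU cores and ignores whatever
forwarding work could be pushed into the FPGA fabric or the serdes I/O):

```c
/* The 1000-cycles-per-packet yardstick applied to a quad 300 MHz
 * PPC-on-FPGA part, with the same 200-byte average packet as before. */
#include <stdio.h>

int main(void)
{
    double cores      = 4;
    double clock_hz   = 300e6;
    double cycles_pkt = 1000;
    double pkt_bytes  = 200;

    double pps = cores * clock_hz / cycles_pkt;
    printf("%.1f Mpps, %.1f Gbps\n", pps / 1e6, pps * pkt_bytes * 8 / 1e9);
    /* prints: 1.2 Mpps, 1.9 Gbps */
    return 0;
}
```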

(16-4-4-4-4 was in the Pluris proof-of-concept; the smaller branching factor
in the radix tree was chosen to better match the Pentium's L1 cache line
size, and to make prefix insertion/deletion faster.)

--vadim

PS. We were talking _mid_ 90s. Back then the SSE did about 110 kpps (not the
     advertised 250 kpps) in real life, and 166MHz Pentiums
     were quite available at Fry's.

PPS. I had exactly that argument with the ex-cisco hardware folks who came to
     Pluris; they prevailed, and fat gobs of luck it brought them. They
     ended up building a box with exactly the same fabric and comparable
     mechanical and power dissipation parameters as the concept I drew as
     a starting point in '98. They wasted time (and $260M) on developing those
     silly ASICs when CPUs and some FPGAs could have done just as nicely. I'm
     glad that I wasn't around to participate in that folly.