RE: too many routes

L2 switches are available today that reliably receive OC-12 SONET
circuits.

These can be disaggregated into OC3 ATM pipes that can be fed into
many routers with proven technology and reliability. Granted, the
disproportionality of the edge ckts to the backbone ckts provides
for interesting flow aggregation dynamics, but it does work.

Your disdain for ATM does not stop its existence and use by the
larger NSPs.

Just think, there are people out there "throwing away" an OC-3 worth of
bandwidth to IP over ATM overhead. Must be nice to live in a world of
capitalization where one could do such a thing. We use ATM for two
reasons, 1) it's still significantly cheaper than long-haul circuits of the
same capacity, 2) it provides some interesting abilities that are only
now beginning to show up in the mainstream IP hardware.

-- additional commentary by yours truly removed by BS filter --

- Chris

"Chris A. Icide" <chris@nap.net> writes:

We use ATM for two
reasons, 1) it's still significantly cheaper than long-haul circuits of the
same capacity,

My canonical explanation for this is that people are
actually deluding themselves into thinking that ABR will
work and the "quiet moments" across a large number of VCs
can effectively be statmuxed out of existence without
hurting goodput.

The apocryphal reason is that people with too much
influence in carriers' decision making processes are
desperately trying to gain enough revenue to justify
the ridiculously large amount of money spent on deploying
ATM and convincing everyone it was the way, the truth, and
the light of the future, even if that revenue isn't as
profitable as selling raw bandwidth. (cf. the canonical explanation)

There are cases, however, involving inter-carrier
handoffs where muxing at the virtual tributary/virtual
container level doesn't work particularly well end-to-end,
thus making ATM an alternative to SDH<>PDH<>SDH
conversions. These cases are becoming rarer over time as
people deploy modern SONET/SDH muxing and terminal equipment.

2) it provides some interesting abilities that are only
now beginning to show up in the mainstream IP hardware.

Ok, I'll bite: which ones?

The only ones I can think of right off the top of my head
involve the counting problem. (Modulo easy deployment of
cisco's rate limiting and/or the ability to make tunnels fast).

Rather, I guess the question is, which of the "interesting
abilities" (which I agree are interesting in a theoretical
sense) are actually practically useful when running part
of the Internet?

  Sean.

Ok. I will bite, although I hate to open my mouth, as my shoe always
seems to bee-line for it.. ;}

Sean M. Doran wrote:

"Chris A. Icide" <chris@nap.net> writes:

> We use ATM for two
> reasons, 1) it's still significantly cheaper than long-haul circuits of the
> same capacity,

   Yup.

My canonical explanation for this is that people are
actually deluding themselves into thinking that ABR will
work and the "quiet moments" across a large number of VCs
can effectively be statmuxed out of existence without
hurting goodput.

   I don't think so.... how about the ability to mix voice, MPEG, and
IP on the same pipe? Or, how about that with ABR my delay across the
ATM fabric is reduced when I have more bandwidth open? (POTS is low on
utilization during this "theoretical moment in time".) A couple of
milliseconds and a few extra Mb/s can count :wink:

people deploy modern SONET/SDH muxing and terminal equipment.

> 2) it provides some interesting abilities that are only
> now beginning to show up in the mainstream IP hardware.

Ok, I'll bite: which ones?

  Oh, 2 things come to mind, my variability throughout an ATM cloud is
greatly reduced versus a routing cloud, a cell requires WAY less time to
cross a switch's backplane, versus a packet through a router. And
seriously less time to determine where to send it...

   Ok. So, maybe Cisco's Flow Switching approaches VBR having a bad hair
day. (and tuned for SERIOUS tolerance, CDVT=10,000), but certainly not
traditional routing.

  And, on ATM, my neighbor's traffic never bothers ME. Unless I am
sending to him and he is running lossy, and then it affects him ONLY...
Most ATM switches have massive backplanes; the problem is usually the
port/pipe of the greedy carrier, and it does not affect a neighbor. The
greed mongers can trash their own ports/pipes, but not yours... (Now, if
you happen to have paths through a monger.... sigh...)

I can't really remember the last time I experienced HOL on my ATM ports.
(Historical jibe :wink: )

  On ATM QOS is available now. IP is getting there. The only REAL
problem with ATM's QOS, at this time, is the ability for IP to allocate
it ...... (At least for those who run the latest spec ATM nets) Legacy
switches
are not being brought into this.....

I wouldn't mind if you weren't my (ATM) neighbor. :wink: (And a GOOD one at
that....)

Rather, I guess the question is, which of the "interesting
abilities" (which I agree are interesting in a theoretical
sense) are actually practically useful when running part
of the Internet?

   See above.

      Richard.

mailto://rirving@onecall.net
http://www.onecall.net/

      A technician with too much influence in a carrier's decision-making
process, desperately trying to gain enough revenue to justify the
ridiculously large amount of money he spent on deploying ATM, and
convincing everyone it IS the way, the truth, and the light of the
future, even if it isn't CHEAPER than selling raw bandwidth ;>

     Quality Rules.

Richard Irving <rirving@onecall.net> writes:

Ok. I will bite, although I hate to open my mouth, as my shoe always
seems to bee-line for it.. ;}

Hehe.

   I don't think so.... how about the ability to mix
voice, MPEG, and IP on the same pipe ?

Um, I do this now with IP.

Admittedly with parallel circuits (virtual or hard) I
could send such traffic down different pipes to partition
congestion effects, however to do this right I'd really
want to use MPLS/tag switching anyway.

When I make the decision to use MPLS/tag switching I also
have to consider that there is decent queuing available in
modern IP routers (that will effectively become hybrid IP
routers and MPLS/tag switches) and that I can neatly
partition the traffic without using actual or virtual
circuits.

Or, how about that with ABR my delay across the ATM
fabric is reduced when I have more bandwidth open. (POTS
is low on utilization, during this "theoretical moment
in time") A couple milliseconds and a few extra Mbs can
count :wink:

You want bounded delay on some traffic profiles that
approach having hard real time requirements. (Anything
that has actual hard real time requirements has no
business being on a statistically multiplexed network, no
matter what the multiplexing fabric is). This can be
implemented in routers now, with or without MPLS/tag
switching, although having the latter likely makes
configuration and maintenance easier.

ABR is another attempt to do statistical multiplexing over a
substrate that is not well geared to anything other than
TDM. It interacts poorly with any protocol that is
developed to run over a statistically-multiplexed network
(e.g. TCP) and there are little demons with respect to the
way RMs are handled that can lead to nasty cases where you
really don't get the bandwidth you ought to.

The problem again is that mixing TCP and other
statmux-smart protocols with ABR introduces two parallel
control loops that have no means of communication other
than the interaction of varying traffic load, varying
delay, and lost data. This often leads to the correct
design of additive increase/multiplicative decrease
traffic rate response to available bandwidth leading to a
stair-step or vacillation as more bandwidth becomes
available to each control loop, and a rather serious
backing off at the higher level when available bandwidth
is decreased even fairly gently.
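
A toy model of the two-control-loop interaction (purely illustrative;
the AIMD sender and the ABR-like allowed rate are simulated with
invented parameters, not taken from any real stack or switch):

# Toy model: a TCP-like AIMD sender riding on top of an ABR-like allowed rate.
# When the lower loop cuts the allowed rate, the upper loop only finds out via
# loss, overshoots, and backs off multiplicatively -- the stair-step effect.
def simulate(rounds=40):
    cwnd = 1.0           # sender's window, in segments per RTT
    allowed = 20.0       # ABR-style allowed rate, same units
    for t in range(rounds):
        allowed = 10.0 if t >= 20 else allowed + 0.5   # lower loop shifts under us
        if cwnd <= allowed:
            cwnd += 1.0                    # additive increase while under the cap
        else:
            cwnd = max(1.0, cwnd / 2.0)    # inferred loss -> multiplicative decrease
        print("t=%2d allowed=%5.1f cwnd=%5.1f goodput=%5.1f"
              % (t, allowed, cwnd, min(cwnd, allowed)))

simulate()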

Delay across any fabric of any decent size is largely
determined by the speed of light. Therefore, unless ABR
is deliberately inducing queueing delays, there is no way
your delay can be decreased when you send lots of traffic
unless the ATM people have found a way to accelerate
photons given enough pressure in the queues.

  Oh, 2 things come to mind, my variability throughout an ATM cloud is
greatly reduced versus a routing cloud, a cell requires WAY less time to
cross a switch's backplane, versus a packet through a router. And
seriously less time to determine where to send it...

Um, you need to be going very slowly and have huge packets
for the passage through a backplane to have any meaning
compared to the centisecond propagation delays observed
on long distance paths.

I know of no modern router that delays packets for
anything approaching a handful of microseconds on fast interfaces
except in the presence of outbound queues being congested,
where if you're running TCP you really want to induce
delay anyway, so that the transmitter will slow
down.

   Ok. So, maybe Cisco's Flow Switching approaches VBR having a bad hair
day. (and tuned for SERIOUS tolerance, CDVT=10,000), but certainly not
traditional routing.

The analogy between VBR and flow switching confuses me.
Could you explain this a bit? Actually, maybe you could
explain the rest of the message too, because I think we
have a disconnect in terms of vocabulary. :frowning:

  Sean.

Hmm... I thought we went over this fallacy not that long ago on NANOG.
Please look up the past NANOG thread with subject _Internet Backbone
Index_.

-dorian

Sean M. Doran wrote:

You want bounded delay on some traffic profiles that
approach having hard real time requirements. (Anything
that has actual hard real time requirements has no
business being on a statistically multiplexed network, no
matter what the multiplexing fabric is).

Hey, if you make sure that _class_ of traffic has enough
bandwidth (no matter how overloaded the network is by other
classes of traffic), you're pretty much ok with statistical
muxing.

This can be
implemented in routers now, with or without MPLS/tag
switching, although having the latter likely makes
configuration and maintenance easier.

...or harder :slight_smile: Adding more complexity seldom makes life
easier.

Delay across any fabric of any decent size is largely
determined by the speed of light. Therefore, unless ABR
is deliberately inducing queueing delays, there is no way
your delay can be decreased when you send lots of traffic
unless the ATM people have found a way to accelerate
photons given enough pressure in the queues.

Well, it's not the absolute delay which is interesting (can't
do anything about it anyway, short of exploiting the EPR paradox or
drilling a hole all the way to Australia), but rather the variance
of delay.

Having large variance is equivalent to adding that variance to
the propagation time :frowning: Given that in single-class tail-drop
IP networks the variance can be as large as RTT*number_of_hops
(if the network has the buffers of just right size), the effect
can be significant.

What worries me about ABR and other traffic shaping stuff more
is a) scalability and b) it encourages real-time transmission of
non-realtime "canned" contents.

I would expect that 99% of all A/V feeds over the network will
be unidirectional (tv, movies, web, etc), and so can be delayed
by hundreds of ms without any ill effects. The only really
interactive kind of A/V content (telephony / video telephony) is
so low-bandwidth compared to full-motion movies that we probably
shouldn't care about it.

--vadim

> I don't think so.... how about the ability to mix
> voice, MPEG, and IP on the same pipe ?

Um, I do this now with IP.

  Do you? When I mean voice, I mean true POTS. You present dialtone over
IP? (It could be done, btw.) But POTS over ATM can exit right into the
DEXX's. (DEXX = Big telco switch) Let's see you allocate an ESF B8ZS
Clear Channel T1 over IP....

Admittedly with parallel circuits (virtual or hard) I
could send such traffic down different pipes to partition
congestion effects, however to do this right I'd really
want to use MPLS/tag switching anyway.

  Ahhh, tag switching, I am on that particular holy grail as well.....
How many parallel paths have you run on layer 3? Ever watched the
variability? ( * shiver * ) Now, tell me parallel paths on IP are smooth
with today's technology! Audio sounds great with lots of variability
..... not. However, Stratacom can bond 3 DS3's into 1 OC3, and you would
never know the difference.

When I make the decision to use MPLS/tag switching I also
have to consider that there is decent queuing available in
modern IP routers (that will effectively become hybrid IP
routers and MPLS/tag switches) and that I can neatly
partition the traffic without using actual or virtual
circuits.

   Hold it. No actual or virtual circuits... Not even SVC's ? :wink:
OK. So there is a new name for the flow paths that the TAG switches
allocate,
what, pray tell, is the new name for these SVC's ?

You want bounded delay on some traffic profiles that
approach having hard real time requirements. (Anything
that has actual hard real time requirements has no
business being on a statistically multiplexed network, no
matter what the multiplexing fabric is).

  Such as voice? Why do you think SDM was created in the first place?
Or do you mean like a military application, 2ms to respond to a nuke....
That is when channel priorities come into play.

This can be
implemented in routers now, with or without MPLS/tag
switching, although having the latter likely makes
configuration and maintenance easier.

and troubleshooting infinitely harder :wink:
   
   CBR in a "worst case scenario" ATM net, IS TDM, not SDM. NO variance
against TAT allowed. You might as well say "Real Time" has no business
on clear channel T1's.

TDM = Time Division Multiplexing
SDM = Statistical Division Multiplexing.
TAT = Theoretical Arrival Time. (Sort of an ATM cell's time slot, like
in TDM.)

and there are little demons with respect to the
way RMs are handled that can lead to nasty cases where you
really don't get the bandwidth you ought to.

   There are LOTS of demons hanging in ATM, and IP, and Multicast, and
.... However, I have NEVER failed to get the bandwidth "promised" in our
nets. (Knock on wood.) However, I have tried to configure for more than
was there, and it told me to recalculate and try again .... And, in some
cases when running BEYOND the SCR, I lost the extra BW, and received
FECN's ...... Slowing up the IP. But, doesn't that same thing happen
when you over-run the receiving router ???

SCR = Sustained Cell Rate
BW = Bandwidth
FECN = Forward Explicit Congestion Notification
BECN = Backwards Explicit Congestion Notification

The problem again is that mixing TCP and other
statmux-smart protocols with ABR introduces two parallel
control loops that have no means of communication other
than the interaction of varying traffic load, varying
delay, and lost data.

   Ahhh.. We await the completion, and proper interaction of RM, ILMI,
and OAM.
These will, (and in some cases already DO), provide that information
back to the router/tag switch.
Now do they use it well ???
That is a different story....

RM = Remote Management
ILMI = a link management interface for ATM
OAM = Operation / Administration Management Cells.

Delay across any fabric of any decent size is largely
determined by the speed of light.

   Where in the world does this come from in the industry?
Maybe I am wrong, but guys, do the math. The typical run across the
North American Continent is timed at about 70ms. This is NOT being
limited by the speed of light.

Light can travel around the world 8 times in 1 second. This means it
can travel once around the world (full trip) in ~ 120 ms. Milliseconds,
not micro.... So, why does one trip across North America take 70ms...

186,000 miles a second = 1 mile in 5.38 x 10^-6 seconds (1 mile =
.00000538 seconds). Now, the North American Continent is about a 4000
mile trip .... This is a VERY ROUGH estimate.

4000 x .00000538 = .020 of a second, or 20 ms, not 70ms. Guess where the
rest comes from. Hint, it is not the speed of light. Time is incurred
encoding, decoding, and routing.

BTW this (70ms median across the US) comes from a predominantly ATM
network. Actually, I
am quoting Pac-Bell.

Therefore, unless ABR
is deliberately inducing queueing delays, there is no way
your delay can be decreased when you send lots of traffic
unless the ATM people have found a way to accelerate
photons given enough pressure in the queues.

   More available bandwidth = quicker transmission.

Ie: at 1000kb/s available, how long does it take to transmit 1000kb ? 1
second.
Now, at 2000kb/s available, how long does it take ? 1/2 second.
What were you saying ?

PS. ABR CAN induce queue delays, and often will (and in comes QOS.)
IF the traffic is flagged as delay tolerant, i.e. ABR by
definition......

> Oh, 2 things come to mind, my variability throughout an ATM cloud is
> greatly reduced versus a routing cloud, a cell requires WAY less time to
> cross a switch's backplane, versus a packet through a router. And
> seriously less time to determine where to send it...

Um you need to be going very slowly and have huge packets
for the passage through a backplane to have any meaning
compared to the centisecond propagation delays observed
on long distance paths.

   Why do you think you have "centi"-second delays in the first place?

   I would check yours, but I find time for a packet to cross a router
backplane to be < 1ms; route determination in a traditional router can
take up to 20 ms (or more), and slightly less than 1 ms if it is in
cache. When I said cross a backplane, I meant "from hardware ingress to
egress", i.e. to be delivered.

This delay is incurred for every packet, in TRADITIONAL routers!

It is not so much the path across the backplane, as it is the time to
ascertain the destination path. In switches, the route is determined
ONCE for an entire flow. From there on out, it is microseconds. Let me
give you an example....... you.

traceroute www.clock.org

traceroute to cesium.clock.org (140.174.97.8), 30 hops max, 40 byte packets
 1 OCCIndy-0C3-Ether-OCC.my.net (-.7.18.3) 4 ms 10 ms 10 ms
 2 core0-a0-14-ds3.chi1.mytransit.net (-.227.0.173) 16 ms 9 ms 10 ms
 3 core0-a3-6.sjc.mytransit.net (-.112.247.145) 59 ms 58 ms 58 ms
 4 mae-west.yourtransit.net (-.32.136.36) 60 ms 61 ms 60 ms
 5 core1-hssi2-0.san-francisco.yourtransit.net (-.174.60.1) 75 ms 71 ms 76 ms
 6 core2-fddi3-0.san-francisco.yourtransit.net (-.174.56.2) 567 ms 154 ms 292 ms
  

  Tell me this is a speed of light issue.
  From the FDDI to the HSSI on the same router.

7 gw-t1.toad.com (-.174.202.2) 108 ms 85 ms 83 ms
8 toad-wave-eth.toad.com (-.174.2.184) 79 ms 82 ms 74 ms
9 zen-wave.toad.com (-.14.61.19) 84 ms 99 ms 75 ms
10 cesium.clock.org (140.174.97.8) 76 ms 83 ms 80 ms
cerebus.my.net> ping www.clock.org

PING cesium.clock.org (140.174.97.8): 56 data bytes
64 bytes from 140.174.97.8: icmp_seq=0 ttl=243 time=93 ms
64 bytes from 140.174.97.8: icmp_seq=1 ttl=243 time=78 ms
64 bytes from 140.174.97.8: icmp_seq=2 ttl=243 time=79 ms
64 bytes from 140.174.97.8: icmp_seq=3 ttl=243 time=131 ms
64 bytes from 140.174.97.8: icmp_seq=4 ttl=243 time=78 ms
64 bytes from 140.174.97.8: icmp_seq=5 ttl=243 time=81 ms
64 bytes from 140.174.97.8: icmp_seq=6 ttl=243 time=75 ms
64 bytes from 140.174.97.8: icmp_seq=7 ttl=243 time=93 ms

Nice and stable, huh. If this path were ATM switched (Dorian, I will
respond to you in another post)
it would have settled to a stable latency.

The analogy between VBR and flow switching confuses me.
Could you explain this a bit? Actually, maybe you could
explain the rest of the message too, because I think we
have a disconnect in terms of vocabulary. :frowning:

  Flow switching does a route determination once per flow, after that
the packets are switched down a predetermined path "The Flow". Hence the
term "flow switching". This reduces the variability of the entire flow.
Great for Voice over IP, etc. However, I should note that the initial
variability to ascertain the flow
is increased. But, not by as much as is being incurred by routing over
the course of the entire flow.

  However, I should also point out that much of your argument is based
in TCP.
Most multimedia (Voice/Audio/Video) content does not focus on TCP, but
UDP/Multicast.
What does your slow start algorithm get you then ?

        Sean.

PS MAC Layer Switching, and ATM switching are apples and oranges.
Although, one could be used to do the other.
(Told you Dorian)

Dorian R. Kim wrote:

> Oh, 2 things come to mind, my variability throughout an ATM cloud is
> greatly reduced versus a routing cloud, a cell requires WAY less time to
> cross a switch's backplane, versus a packet through a router. And
> seriously less time to determine where to send it...

Hmm... I thought we went over this fallacy not that long ago on NANOG.
Please look up the past NANOG thread with subject _Internet Backbone
Index_.

-dorian

  Dorian, I could be wrong about a LOT of things. However, your thread
on NANOG is about network switching vs network routing, not ATM
switching. This is apples to oranges.

   Network Switching: Using MAC address to determine next hop.
   Network Routing: Using layer three address to determine next hop.
   ATM Switching: A path set up once, by one or more of the above (or
none),
                  to deliver cells across a multi hop, or point, path.

   Although comparisons can be made between "Network Switching" and
"ATM Switching", they are not the
same thing. (Although some elements are in common) One is a methodology
to determine a path. The other is a method of delivering a requested
path. Never forget, A full NxN-point Matrix (non-folded) is almost
always faster than a BUS.
                  
   Although, many would call a Matrix a BUS :wink:

Sigh.....

   It reminds me of a local newspaper that was so kind as to point out
that Multicast is Infinitely Inferior to HDTV. And they had the GUTS to
declare that Multicast would not make it because of this. Can someone
point me to the multi-user conferencing specs for HDTV? And what API set
do I use to access it? :wink:

Richard Irving wrote:

Light can travel around the world 8 times in 1 second. This means it
can travel
once around the world (full trip) in ~ 120 ms. Milliseconds, not
micro....

You've got faster light than anybody else. The speed of light
is about 300000 km/s _in vacuum_; that gives 134 ms around the
planet's equator.

So, why does one trip across North america take 70ms...

a) light is slower in dense media
b) fibers are not laid out in straight lines (in fact, I saw
   a circuit going from Seattle to Vancouver via Fort Worth :slight_smile:

70 ms RTT = 35 ms one way. Given that the U.S. is about 50 deg. wide,
it is about 0.7ms/degree; or 250 ms around the world.

Less than 2 times slower than light in vacuum.

Hint, it is not the speed of light. Time is incurred encoding,
decoding, and routing.

Hint: have a look at a telco's fiber map before spreading
nonsense.

--vadim

Do you? When I mean voice, I mean true POTS. You present dialtone over
IP ?
(It could be done, btw), but POTS over ATM can exit right into the
DEXX's.
(DEXX = Big telco switch) Let's see you allocate an ESF B8ZS Clear
Channel T1 over IP....

Are you sure you aren't referring to a DACS or DCS here? (Digital Access
and Cross-connect System). It's not quite the same as what is usually
referred to as a telco switch, i.e. 5ESS et al.

  CBR in a "worst case scenario" ATM net, IS TDM, not SDM. NO variance
against TAT allowed.

You missed CBR (Constant Bit Rate) in your definitions. CBR is an ATM
service used to do circuit emulation, i.e. deliver the same bits that a TDM
T1 would deliver.

TDM = Time Division Multiplexing
SDM = Statistical Division Multiplexing.

Stat muxing is also known as X.25, frame relay, IP, ATM, or packet
switching. In other words, rather than dividing the bitstream into fixed
timeslots (TDM), which guarantees that bits will arrive at the other end
in a known time interval, we try to avoid sending bits with no
information content and let another bitstream use the wire; therefore
the arrival time is not known but is a statistical probability yadda
yadda. TDM is DS3's divided into 28 DS1 timeslots which can each be
divided into 24 DS0 timeslots.
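
For reference, the arithmetic behind that hierarchy (nominal figures;
the difference between payload and line rate is framing overhead):

# North American TDM hierarchy, nominal rates in kbit/s.
DS0 = 64
DS1_payload = 24 * DS0           # 1536 kbit/s of payload in a 1544 kbit/s DS1
DS3_payload = 28 * DS1_payload   # 43008 kbit/s of payload in a 44736 kbit/s DS3
print(DS1_payload, DS3_payload)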

when running
BEYOND the SCR, I lost the extra BW, and received FECN's ...... Slowing
up the IP.
SCR = Sustained Cell Rate

Sustainable Cell Rate. This is an average rate, not a sustained flow of
cells. This is something like a frame CIR in that you can burst beyond the
SCR up to the Peak Cell Rate. It is used in ATM's VBR (Variable Bit Rate)
service.

FECN = Forward Explicit Congestion Notification
BECN = Backwards Explicit Congestion Notification

These are used in Frame Relay. ATM has no BECN but has EFCI (Explicit
Forward Congestion Indication) instead of FECN.

  Ahhh.. We await the completion, and proper interaction of RM, ILMI,
and OAM.

Like IPv4, ATM is also evolving. Which one will win the evolutionary race?

RM = Remote Management

Resource Management. Handled by special RM cells in the data stream.

ILMI = a link management interface for ATM

Integrated Local Management Interface based on SNMP running through a
reserved VBR VC on the network.

Try to remember that there are many people on the list who are not that
familiar with ATM. And given that ATM switches can't route IP packets that
is probably OK. Only a few companies need to give serious thought to the IP
and ATM together.

Cool, I love being talked down to by old guys. It's
refreshing and doesn't happen nearly frequently enough.

I'm almost at a loss to figure out where to begin with the
scattershot flamefest you sent. Almost. Let's start
here:

Let's see you allocate an ESF B8ZS Clear Channel T1 over
IP....

PDH is dead.

POTS is only alive because you can emulate PDH still, and
extracting a single DS0 from SDH is easy, and because the
POTS user interface is well understood by a very large
installed user base. I don't expect POTS as perceived by
the end-user to change much over time.

End-to-end POTS is already dying. Worldcom is making a
big deal over relatively simple technology which shuffles
fax traffic over the Internet. There goes a lot of
long-haul POTS right there. Deutsche Telekom is tight
with Vocaltec and already has a tariff for
voice-over-the-Internet. It's crusty in implementation
because you end up dialling a local access number in DT
land and talk to another dialler on the remote end which
makes a more local phone call. However, there are neat
plans for SS7 and neater plans for doing clever things
with interpreting DTMF.

There is a local distribution plant problem however there
are a number of people working on aggregating up local
access lines into VC11/VC12, dropping that into a
POP-in-a-box at STM-16 and pulling out STM-16c to a big
crunchy IP router. In this model POTS and historical
telco voice and data schemes become services rather than
infrastructure.

However, emulating the incredibly ugly phone network is
secondary to enabling the evolution of more and more
complicated and interesting applications; it's also less
cost-effective for the moment than running parallel
networks but converting away from PDH (which, remember, is
dead).

BTW, your formatting sucks to the point that your note is
unreadable and unquotable without fmt(1) or fill-paragraph.

Ahhh, tag switching, I am on that particular holy grail
as well.....

You might want to examine my comments on the mpls list at
some point. "Holy Grail"? No. That was Noel. I know we
look alike and stuff, but he sees a great deal of promise
in MPLS while I am somewhat sceptical about the
implementation and utility in practice.

How many parallel paths have you ran on
layer 3 ? Ever watched the variability ? ( * shiver * )
Now, tell me parallel paths on IP are smooth with todays
technology!

Hum, not more than six hours ago I believe I was
telling Alan Hannan about the various interim survival
techniques in the migration path from 7k+SSP -> decent
routers.

I guess you must have me beat experientially.

All I ever did was sit down with pst and tli and hack and
slash at the 10.2-viktor caching scheme to try to get
traffic to avoid moving over to the stabler of the two
lines between ICM-DC and RENATER.

Oh that and helping beat on CEF/DFIB packet-by-packet load
balancing before my last retirement.

So unfortunately I'm really not in a position to comment
on today's technology or the variability of parallel paths
with Cisco routers using any of the forwarding schemes
from fast to cbus to route-cache to flow to fib. (To be
honest I never really figured out wtf optimum was :slight_smile: ).

The reason that you see "strange" or at least "unsmooth"
load balancing along parallel paths is that except with
fib and slow switching cisco had always forwarded packets
towards the same destination out the same interface, and
load balancing was performed by assigning (upon a cache
fault) particular destinations to particular interfaces.

(Leon was invented to blow away cached entries so that
over time prefixes would slosh about from one interface to
another as they were re-demand-filled into the cache.

Points if you know who Leon and Viktor are. Hint: they're
both as dead as PDH.)

With CEF these days you can load-balance on a per-packet
basis. This has the side effect that you cannot guarantee
that packets will remain in sequence if the one way delay
across the load-balanced paths is off by more than about
half a packet transmission time. However, you also get
much more even link utilization and no ugly
cache/uncache/recache at frequent intervals (which really
sucks because unfortunately you have to push a packet
through the slow path at every recache).

So anyway, as I was saying, I'm ignorant about such
matters.
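
A sketch of the difference, for the curious (this is the idea, not
cisco's code; the interface names and the hash choice are made up):

import itertools

interfaces = ["Serial0", "Serial1"]

# Demand-cached, per-destination: the outgoing interface is picked once, on a
# cache fault, and every later packet to that destination follows it until the
# entry is aged out (hence the uneven, "sloshing" link utilization).
route_cache = {}
def forward_cached(dst):
    if dst not in route_cache:
        route_cache[dst] = interfaces[hash(dst) % len(interfaces)]
    return route_cache[dst]

# Per-packet (CEF-style): alternate interfaces packet by packet, at the cost of
# possible reordering when the paths' one-way delays differ enough.
round_robin = itertools.cycle(interfaces)
def forward_per_packet(dst):
    return next(round_robin)

for _ in range(4):
    print(forward_cached("10.0.0.1"), forward_per_packet("10.0.0.1"))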

Audio sounds great with lots of variability

So if you aren't holding a full-duplex human-to-human
conversation you introduce delay on the receiver side
proportional to something like the 95th percentile and
throw away outliers. If you're holding a full-duplex
long-distance human-to-human conversation you can use POTS
(which is dying but which will live on in emulation) and
pay lots of money or you can use one of a number of rather
clever VON member packages and pay a lot less money but put
up with little nagging problems. For local or toll-free
stuff, to expect better price-performance from an end-user
perspective now would require taking enormous doses of
reality-altering drugs.

> You want bounded delay on some traffic profiles that
> approach having hard real time requirements. (Anything
> that has actual hard real time requirements has no
> business being on a statistically multiplexed network, no
> matter what the multiplexing fabric is).

  Such as voice? Why do you think SDM was created in the first place?
Or do you mean like a military application, 2ms to respond to a nuke....
That is when channel priorities come into play.

I wasn't around for the invention of statistical muxing,
but I'm sure there are some people here who could clarify
with first-hand knowledge (and I'll take email from you,
thanks :slight_smile: ). If it was created for doing voice, I'm going
to be surprised, because none of the voice literature I've
ever looked at was anything but circuit-modeled with TD
muxing of DS0s because that is how God would design a
phone network.

Um, ok, why is it my day for running into arguments about
real time. Hmm...

"Real time" events are those which must be responded to
by a deadline, otherwise the value of the response decays.
In most real time applications, the decay curve varies
substantially with the value of a response dropping to
zero after some amount of time. This is "soft real
time". "Hard real time" is used when the decay curve is
vertical, that is, if the deadline is passed the response
to the event is worthless or worse.

There are very few hard real time things out there.
Anything that is truly in need of hard real time response
should not be done on a statmuxed network or on a wide PDH
network (especially not since PDH is dead in large part
because the propagation delay is inconsistent and
unpredictable thanks to bitstuffing) unless variance in
propagation delay is less than the window for servicing
the hard real time event.

Soft real time things can be implemented across a wide
variety of unpredictable media depending on the window
available to service the real time events and the slope of
the utility decay function.

For instance, interactive voice and video have a number of
milliseconds leeway before a human audience will notice
lag. Inducing a delay to avoid missing the end of the
optimal window for receiving in-sequence frames or blobs
of compressed voice data is wise engineering, particularly
if the induced delay is adjusted to avoid it itself
leading to loss of data utility.
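
A sketch of that receiver-side trick (numbers invented): size the
playout delay off something like the 95th percentile of the observed
one-way delays and treat anything later as an outlier to be discarded.

# Fixed playout ("jitter") buffer sizing, in miniature.  The delay samples are
# invented one-way delays in milliseconds.
delays_ms = [42, 45, 44, 47, 51, 43, 46, 90, 44, 48, 45, 47, 46, 49, 44, 120]

def playout_delay(samples, percentile=0.95):
    ordered = sorted(samples)
    return ordered[int(percentile * (len(ordered) - 1))]

budget = playout_delay(delays_ms)
late = [d for d in delays_ms if d > budget]   # outliers: thrown away, not waited for
print("playout delay %d ms, %d of %d frames dropped"
      % (budget, len(late), len(delays_ms)))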

However, I have NEVER failed to get the bandwidth
"promised" in our nets.

Sure, the problem is with mixing TCP and other window-based
congestion control schemes which rely on implicit feedback
with a rate-based congestion control scheme, particularly
when it relies on explicit feedback. The problem is
exacerbated when the former overlaps the latter, such that
only a part of the path between transmitter and receiver
is congestion controlled by the same rate-based explicit
feedback mechanism.

What happens is that in the presence of transient
congestion unless timing is very tightly synchronized
(Van Jacobson has some really entertaining rants about
this) the "outer loop" will react by either hovering
around the equivalent of the CIR or by filling the pipe
until the rate based mechanism induces queue drops.
In easily observable pathological cases there is a stair
step or vacillation effect resembling an old TCP sawtooth
pattern rather than the much nicer patterns you get from a
modern TCP with FT/FR/1321 stamps/SACK.

In other words your goodput suffers dramatically.

But, doesn't that same thing happen when you over-run the receiving
router ???

Yes, and with OFRV's older equipment the lack of decent
buffering (where decent output buffering is, per port,
roughly the bandwidth x delay product across the network)
was obvious as bandwidth * delay products increased.

With this now fixed in modern equipment and WRED
available, the implicit feedback is not so much dropped
packets as delayed ACKs, which leads to a much nicer
subtractive slow-down by the transmitter, rather than a
multiplicative backing off.

So, in other words, in a device properly designed to
handle large TCP flows, you need quite a bit of buffering
and benefit enormously from induced early drops.

As a consequence, when the path between transmitter and
receiver uses proper, modern routers, buffer overruns
should never happen in the face of transient congestion.
Unfortunately this is easily seen with many popular
rate-based congestion-control schemes as they react to
transient congestion.

Finally another ABR demon is in the decay of the rate at
which a VS is allowed to send traffic, which in the face
of bursty traffic (as one tends to see with most TCP-based
protocols) throttles goodput rather dramatically. Having
to wait an RTT before an RM cell returns tends to produce
unfortunate effects, and the patch around this is to try
to adjust the scr contract to some decent but low value
and assure that there is enough buffering to allow a VS's
burst to wait to be serviced and hope that this doesn't
worsen the bursty pattern by bunching up a lot of data
until an RM returns allowing the queue to drain suddenly.

   Ahhh.. We await the completion, and proper interaction of RM, ILMI,
and OAM.
These will, (and in some cases already DO), provide that information
back to the router/tag switch.
Now do they use it well ???
That is a different story....

The problem is that you need the source to slow
transmission down, and the only mechanism to do that is to
delay ACKs or induce packet drops. Even translating
FECN/BECN into source quench or a drop close to the source
is unhelpful since the data in flight will already lead to
feedback which will slow down the source.

The congestion control schemes are essentially
fundamentally incompatible.

> Delay across any fabric of any decent size is largely
> determined by the speed of light.

   Where in the world does this come from in the industry?
Maybe I am wrong, but guys, do the math. The typical run across the
North American Continent is timed at about 70ms. This is NOT being
limited by the speed of light.

That would be round-trip time.

Light can travel around the world 8 times in 1 second. This means it
can travel
once around the world (full trip) in ~ 120 ms. Milliseconds, not
micro....
So, why does one trip across North America take 70ms...

Light is slower in glass.

Hint, it is not the speed of light. Time is incurred encoding, decoding,
and routing.

Kindly redo your calculation with a decent speed of light
value. Unfortunately there is no vacuum between something
in NYC and something in the SF Bay area.

BTW this (70ms median across the US) comes from a
predominantly ATM network. Actually, I am quoting
Pac-Bell.

Oh now THERE's a reliable source. "Hi my name is Frank
and ATM will Just Work. Hi my name is Warren and ATM is
fantastic.". Ugh. (kent bait kent bait kent bait)

> Therefore, unless ABR
> is deliberately inducing queueing delays, there is no way
> your delay can be decreased when you send lots of traffic
> unless the ATM people have found a way to accelerate
> photons given enough pressure in the queues.
>
   More available bandwidth = quicker transmission.

Ie: at 1000kb/s available, how long does it take to transmit 1000kb ? 1
second.
Now, at 2000kb/s available, how long does it take ? 1/2 second.
What were you saying ?

At higher bandwidths bits are shorter not faster.

Repeat that several times.

Whether you are signalling at 300bps or at
293875983758917538924372589bps, the start of the first bit
arrives at the same time.
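
A quick illustration of "shorter, not faster" (example figures only; it
assumes roughly 4000 miles of fiber and light at about two thirds of c
in glass): the link rate changes how long it takes to clock the packet
out, not when its first bit arrives.

# Serialization (clock-out) time vs. propagation time for one 1500-byte packet.
packet_bits = 1500 * 8
propagation_s = (4000 * 1609) / 2.0e8          # ~32 ms one way, independent of rate
for rate_bps in (1.544e6, 45e6, 622e6):        # T1, DS3, OC-12 (approximate rates)
    serialization_ms = packet_bits / rate_bps * 1000
    print("%7.1f Mb/s: serialization %6.3f ms, first bit arrives after %.0f ms"
          % (rate_bps / 1e6, serialization_ms, propagation_s * 1000))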

   Why do you think you have "centi"-second delays in the first place?

Because photons and electrons are slow in glass and copper.

   I would check yours, but I find time for a packet to cross a router
   backplane to be < 1ms; route determination in a traditional router
   can take up to 20 ms (or more), and slightly less than 1 ms if it is
   in cache. When I said cross a backplane, I meant "from hardware
   ingress to egress", i.e. to be delivered.

You are still stuck thinking of routers as things which
demand-fill a cache by dropping a packet through a slow
path. This was an artefact of OFRV's (mis)design, and the
subject of many long and interesting rants by Dennis
Ferguson on this list a couple of years ago.

Modern routers simply don't do this, even the ones from OFRV.

traceroute to cesium.clock.org (140.174.97.8), 30 hops max, 40 byte packets

 6 core2-fddi3-0.san-francisco.yourtransit.net (-.174.56.2) 567 ms 154 ms 292 ms

   Tell me this is a speed of light issue.
   From the FDDI to the HSSI on the same router.

This has nothing to do with the router's switching or
route lookup mechanism. Router requirements allow routers
to be selective in generating ICMP messages, and cisco's
implementation on non-CEF routers will hand the task of
generating ICMP time exceededs, port unreachables and echo
replies to the main processor, which gets to the task as a
low priority when it's good and ready. If the processor
is doing anything else at the time you get rather long
delays in replies, and if it's busy enough to start doing
SPD you get nothing.

This gets talked about quite frequently on the NANOG
list. I suggest you investigate the archives. I'm sure
Michael Dillon can point you at them. He's good at that.

PING cesium.clock.org (140.174.97.8): 56 data bytes
64 bytes from 140.174.97.8: icmp_seq=0 ttl=243 time=93 ms
64 bytes from 140.174.97.8: icmp_seq=1 ttl=243 time=78 ms
64 bytes from 140.174.97.8: icmp_seq=2 ttl=243 time=79 ms
64 bytes from 140.174.97.8: icmp_seq=3 ttl=243 time=131 ms
64 bytes from 140.174.97.8: icmp_seq=4 ttl=243 time=78 ms
64 bytes from 140.174.97.8: icmp_seq=5 ttl=243 time=81 ms
64 bytes from 140.174.97.8: icmp_seq=6 ttl=243 time=75 ms
64 bytes from 140.174.97.8: icmp_seq=7 ttl=243 time=93 ms

Nice and stable, huh. If this path were ATM switched (Dorian, I will
respond to you in another post)
it would have settled to a stable latency.

There is extraordinary congestion in the path between your
source and cesium.clock.org, and cesium is also rather
busy being CPU bound on occasion. There is also a
spread-spectrum radio link between where it lives (the
land of Vicious Fishes) and "our" ISP (Toad House), and
some of the equipment involved in bridging over that is
flaky.

If you were ATM switching over that link you would see the
same last-hop variability because of that physical level
instability.

It works great for IP though, and I quite happily am
typing this at an emacs thrown up onto my X display in
Scandinavia across an SSH connection.

  Flow switching does a route determination once per flow, after that
the packets are switched down a predetermined path "The Flow". Hence the
term "flow switching". This reduces the variability of
the entire flow.

Um, no it doesn't. As with all demand-cached forwarding
schemes you have to process a packet heavily when you have
a cache miss. Darren Kerr did some really neat things to
make it less disgusting than previous demand-cached
switching schemes emanating out of OFRV, particularly
with respect to gleaning lots of useful information out of
the side-effects of a hand-tuned fast path that was
designed to account for all the header processing one
could expect.

Flow switching does magic matching of related packets to
cache entries which describe the disposition of the packet
that in previous caching schemes could only be determined
by processing individual packets to see if they matched
various access lists and the like. Its principal neat
feature is that less per-packet processing means more pps
throughput.

MPLS is conceptually related, btw.

Flow switching does not improve queueing delays or speed
up photons and electrons, however, nor does it worsen
them, therefore the effect of flow switching on
variability of normal traffic is nil.

Flow switching has mechanisms cleverer than Leon the
Cleaner to delete entries from the cache and consequently
there are much reduced odds of a cache fault during a
long-lived flow that is constantly sending at least
occasional traffic. You may see this as reducing
variability. I see it as fixing an openly-admitted design flaw.
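
For anyone who hasn't stared at one, a minimal sketch of what a
demand-filled flow cache does (illustrative only, not Darren Kerr's
code): the first packet of a flow takes the slow path, the result is
cached against the flow key, and subsequent packets just hit the cache.

# Demand-filled flow cache, in miniature.  slow_path() stands in for the full
# routing-table lookup plus access-list evaluation; the key is the usual 5-tuple.
flow_cache = {}

def slow_path(pkt):
    # imagine: longest-prefix match, access lists, accounting side effects, ...
    return {"out_if": "atm2/0", "permit": True}

def forward(pkt):
    key = (pkt["src"], pkt["dst"], pkt["proto"], pkt["sport"], pkt["dport"])
    entry = flow_cache.get(key)
    if entry is None:                # cache fault: heavy per-packet processing
        entry = slow_path(pkt)
        flow_cache[key] = entry
    return entry["out_if"] if entry["permit"] else None

pkt = {"src": "10.0.0.1", "dst": "140.174.97.8",
       "proto": 6, "sport": 1024, "dport": 80}
print(forward(pkt), forward(pkt))    # first call misses, second hits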

  However, I should also point out that much of your
argument is based in TCP. Most multimedia
(Voice/Audio/Video) content does not focus on TCP, but
UDP/Multicast. What does your slow start algorithm get
you then ?

WRED and other admissions control schemes are being
deployed that will penalize traffic that is out of
profile, i.e., that doesn't behave like a reasonable TCP
behaves. Most deployed streaming technologies have taken
beatings from ISPs (EUNET, for example with CUSEEME,
Vocaltec and Progressive with a wide range of ISPs) and
have implemented congestion avoidance schemes that closely
mimic TCP's, only in some cases there is no retransmission
scheme.

PS MAC Layer Switching, and ATM switching are apples and oranges.
Although, one could be used to do the other.
(Told you Dorian)

Huh?

  Sean.

P.S.: You have some very entertaining and unique expansion
  of a number of acronyms in general and relating to
  ATM in particular. I'm curious how you expand "PDH".
  Personally I favour "Pretty Damn Historical",
  although other epithets come to mind.

P.P.S.: It's dead.

Quote from Jim Steinhardt's <jsteinha@cisco.com>
personal message:

a) light is slower in dense media

  The index of refraction of glass is 1.5 vs 1.0 for
a vacuum. Hence, the speed of light in glass is 2 * 10 **8 m/s.

That gives 60 ms RTT on 4000 mile line.

Case closed.

Thanks, Jim!

--vadim
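
Spelled out (rounded figures; real fiber routes are longer than the
great-circle path, so this is a lower bound):

# Propagation-only RTT for a ~4000 mile path with light in glass.
c_vacuum = 3.0e8                    # m/s
v_fiber = c_vacuum / 1.5            # index of refraction ~1.5 -> ~2e8 m/s
path_m = 4000 * 1609                # one-way distance, metres
rtt_ms = 2 * path_m / v_fiber * 1000
print("propagation-only RTT: %.0f ms" % rtt_ms)   # ~64 ms, before any queueing

That lands right on the 64-65 ms figures measured below.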

Yep, but you also need to add a few ms for electronics in that 4000 mile
line. We tend to see around 64 - 65 ms delay between our DCA and PAL
routers.

rt1.DCA.netrail.net# ping rt1.PAL
ICMP ECHO rt1.PAL.netrail.net (205.215.45.33): 56 data bytes
64 bytes from 205.215.45.33: icmp_seq=0 ttl=255 time=65.764 ms
64 bytes from 205.215.45.33: icmp_seq=1 ttl=255 time=64.851 ms
64 bytes from 205.215.45.33: icmp_seq=2 ttl=255 time=65.053 ms
64 bytes from 205.215.45.33: icmp_seq=3 ttl=255 time=64.994 ms
^C
--- rt1.PAL.netrail.net ICMP ECHO statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 64.851/65.160/65.764 ms
rt1.DCA.netrail.net#

Nathan Stratton President, CTO, NetRail,Inc.

Nathan Stratton wrote:

> Quote from Jim Steinhardt's <jsteinha@cisco.com>
> personal message:
>
> >>a) light is slower in dense media
>
> > The index of refraction of glass is 1.5 vs 1.0 for
> > a vacuum. Hence, the speed of light in glass is 2 * 10 **8 m/s.
>
> That gives 60 ms RTT on 4000 mile line.
>
> Case closed.

Yep, but you also need to add a few ms for electronics in that 4000 mile
line.

  My original point. (Concerning latency)

We tend to see around 64 - 65 ms delay between our DCA and PAL
routers.

rt1.DCA.netrail.net# ping rt1.PAL
ICMP ECHO rt1.PAL.netrail.net (205.215.45.33): 56 data bytes
64 bytes from 205.215.45.33: icmp_seq=0 ttl=255 time=65.764 ms
64 bytes from 205.215.45.33: icmp_seq=1 ttl=255 time=64.851 ms
64 bytes from 205.215.45.33: icmp_seq=2 ttl=255 time=65.053 ms
64 bytes from 205.215.45.33: icmp_seq=3 ttl=255 time=64.994 ms
^C

   How many hops is that ping? I am curious, this is interesting.
(Even if I did get mail bombed by responses...... :wink: )

   You realize we need to be comparing a *series* of routers vs a
*series* of ATM switches. (This is the real world, and we are modelling
delivery to anywhere, not just 1 hop across the continent.)

  It would be interesting to see someone set up a performance test.

Latency/Variability with no hops,
Latency/Variability with 1 router vs 1 switch
Latency/Variability with 2 routers vs 2 switches
Up to about 10. I wonder what that curve would look like ?
Media would have to be consistent in size (i.e. DS3, all the way through).

Interesting, no?
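
If someone does set that up, a first-order model to compare against
might be as simple as propagation plus a per-hop cost (the per-hop
numbers below are placeholders, not measurements):

# First-order model for the proposed test: delay = propagation + hops * per-hop.
def total_delay_ms(hops, per_hop_ms, propagation_ms=20.0):
    return propagation_ms + hops * per_hop_ms

for hops in range(0, 11):
    routed = total_delay_ms(hops, per_hop_ms=1.0)      # hypothetical router cost/hop
    switched = total_delay_ms(hops, per_hop_ms=0.05)   # hypothetical switch cost/hop
    print("%2d hops: routed %5.1f ms, switched %5.2f ms" % (hops, routed, switched))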

--- rt1.PAL.netrail.net ICMP ECHO statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 64.851/65.160/65.764 ms
rt1.DCA.netrail.net#

Nathan Stratton President, CTO, NetRail,Inc.
------------------------------------------------------------------------
Phone (888)NetRail NetRail, Inc.
Fax (404)522-1939 230 Peachtree Suite 500
WWW http://www.netrail.net/ Atlanta, GA 30303
------------------------------------------------------------------------
"No king is saved by the size of his army; no warrior escapes by his
great strength. - Psalm 33:16

Richard

Yep, but you also need to add a few ms for electronics in that 4000 mile
   line. We tend to see around 64 - 65 ms delay between our DCA and PAL
   routers.

The delays in the electronics in the repeaters (no DACSes in a big ISP
pipe) and muxes along the way don't add up to more than a few tens of
microseconds. Disappears in the noise compared to other factors on
the line. In fact, I daresay the jitter in interrupt service time on
the host on the other end (even on a unix box, let alone a cisco
serving icmp replies in its copious spare time) exceeds the electronic
delays by an order of magnitude or more. Then there's the delay in
getting your packets onto and off of the CSMA/CD network that you
indubitably have at the far end....

                                        ---rob

Let's see you allocate an ESF B8ZS Clear Channel T1 over
IP....

PDH is dead.

POTS is only alive because you can emulate PDH still, and
extracting a single DS0 from SDH is easy, and because the
POTS user interface is well understood by a very large
installed user base.

PDH = Plesiochronous Digital Hierarchy
This refers to muxing up DS0's into DS1's into DS2's into DS3's, i.e. TDM
stuff over twisted copper pairs.

SDH = Synchronous Digital Hierarchy and refers to optical fiber links like
OC3, OC12 and OC48. In North America the SDH standard is called SONET
(Synchronous Optical Network).

However, there are neat
plans for SS7 and neater plans for doing clever things
with interpreting DTMF.

Signalling System Number 7 is the packet switched control protocol used by
the telephone networks to exchange control and billing and caller-id and
similar info. DTMF (Dual Tone Multi Frequency) is an in-band signalling
mechanism more commonly called "touch tones".

on aggregating up local
access lines into VC11/VC12, dropping that into a
POP-in-a-box at STM-16 and pulling out STM-16c to a big
crunchy IP router.

I don't know everything. Sean will have to translate VC11 and VC12.
STM means Synchronous Transport Module. An STM-1 has the same bandwidth as
OC3, so an STM-16 will have the same bandwidth as OC48. Therefore STM-16c
will be like OC48c in that the bandwidth is "concatenated" into a single
big pipe rather than a bunch of smaller channels.

All I ever did was sit down with pst and tli and hack and
slash at the 10.2-viktor caching scheme to try to get
traffic to avoid moving over to the stabler of the two
lines between ICM-DC and RENATER.

Again, Sean will have to identify pst, but tli is Tony Li of Juniper
formerly of Cisco and 10.2-viktor is one of many IOS versions. I suppose
most people realize that ICM-DC is ICM in Washington DC and RENATER is a
large network in France.

Oh that and helping beat on CEF/DFIB packet-by-packet load
balancing before my last retirement.

Once again, I'm at a loss as to what CEF/DFIB stands for.

Points if you know who Leon and Viktor are. Hint: they're
both as dead as PDH.)

Thinking of dead philosophies made me think of Leon Trotsky, but Viktor
escapes me. Obviously not omniscient, eh? :slight_smile:

Finally another ABR demon is in the decay of the rate at
which a VS is allowed to send traffic, which in the face

ABR (Available Bit Rate) is another ATM service something like VBR with
flow control and guarantees. The VC is a bunch of concatenated
subconnections and a VS (Virtual Source) is the end point of one of these
subconnections.

This gets talked about quite frequently on the NANOG
list. I suggest you investigate the archives. I'm sure
Michael Dillon can point you at them. He's good at that.

*grin*

Actually I *WAS* about to suggest this very thing.

You see folks, Sean packs a whole lot of information into these messages
and whether you agree with his conclusions or not it would pay off to
review the meanings of the various acronyms he uses and then head off to
the archives conveniently located at http://www.nanog.org and re-read his
recent postings. Even if he is wrong about some of these things he has
clearly done a lot of research and thinking about the problems we all face
in deploying a global ubiquitous always-on network.

...

Now, the North American Continent is about a 4000 mile trip .... This is
a VERY ROUGH estimate.

DARN! And here United has been crediting me with just 24xx Frequent Flier
miles for the trip. I wanna recount!

--SG

> on aggregating up local
>access lines into VC11/VC12, dropping that into a
>POP-in-a-box at STM-16 and pulling out STM-16c to a big
>crunchy IP router.

I don't know everything. Sean will have to translate VC11 and VC12.

VC = Virtual Containers. Basically a way of repackaging, or adapting, PDH
signals to an STM-1 frame. There are many different VCs, and I don't know
which ones 11 and 12 are without looking them up.

Again, Sean will have to identify pst, but tli is Tony Li of Juniper

Pst is Paul Traina, formerly of Cisco and currently Juniper.

>Oh that and helping beat on CEF/DFIB packet-by-packet load
>balancing before my last retirement.

Once again, I'm at a loss as ti what CEF/DFIB stands for.

Cisco Express Forwarding/Distributed Forwarding Information Base. The
former is another stupid marketing stunt while the latter is a more
descriptive name. Basically it builds a full forwarding table that's
downloaded to the individual linecards, freeing the RP to do what it's
supposed to do, i.e. route calculations, and freeing network operators
from the evils of the cache.

Some of the neat things that come with this trick are per-packet load
balancing, and per-adjacency and per-prefix accounting, among others.

-dorian

Again, Sean will have to identify pst, but tli is Tony Li of Juniper
formerly of Cisco and 10.2-viktor is one of many IOS versions. I suppose
most people realize that ICM-DC is ICM in Washington DC and RENATER is a
large network in France.

So far three people have informed me that pst is Paul S. Traina currently
with Shockwave.

Oh that and helping beat on CEF/DFIB packet-by-packet load
balancing before my last retirement.

Once again, I'm at a loss as to what CEF/DFIB stands for.

And a couple of replies on this one... here's the OFRV employee version:

CEF (Cisco Express Forwarding) is the marketing name of what
we originally called FIB (Forwarding Information Base, where
the FIB is initially completely populated, instead of traffic-
driven cached). The dCEF/dFIB reference is where the FIB is
pushed onto the VIP (Versatile Interface Processor) cards,
otherwise known as the linecards.
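
The contrast with the demand-filled cache, sketched (illustrative only;
the prefixes and interface names are made up): the forwarding table is
built completely from the routing table up front, so a lookup is just a
longest-prefix match with no per-packet cache fill.

import ipaddress

# Fully populated forwarding table: built once from the RIB, pushed to every
# forwarding engine, no traffic-driven cache misses.
rib = {
    "0.0.0.0/0":       "Serial0",
    "140.174.0.0/16":  "Hssi1/0",
    "140.174.97.0/24": "Fddi3/0",
}
fib = sorted(((ipaddress.ip_network(p), ifc) for p, ifc in rib.items()),
             key=lambda entry: entry[0].prefixlen, reverse=True)

def lookup(dst):
    addr = ipaddress.ip_address(dst)
    for net, ifc in fib:             # list is pre-sorted, so longest prefix wins
        if addr in net:
            return ifc
    return None

print(lookup("140.174.97.8"), lookup("4.2.2.2"))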

Richard Irving wrote:

Nathan Stratton wrote:
>
>
> > Quote from Jim Steinhardt's <jsteinha@cisco.com>
> > personal message:
> >
> > >>a) light is slower in dense media
> >
> > > The index of refraction of glass is 1.5 vs 1.0 for
> > > a vacuum. Hence, the speed of light in glass is 2 * 10 **8 m/s.
> >
> > That gives 60 ms RTT on 4000 mile line.
> >
> > Case closed.
>
> Yep, but you also need to add a few ms for electronics in
> that 4000 mile line.

  My original point. (Concerning latency)

You didn't understand, did you? All your calculations were
to disprove Sean's point that the electronics and switching
delays are pretty small (so as to be insignificant) as compared
to the signal propagation delay. Your (wildly inaccurate)
estimate was that more than 50% of time is spent in electronics.

The real figures, however, show that it is at most 5%. I strongly
suspect that if you figure in that fibers aren't going in straight
line, you'll get that down to 0.1-0.5%.

--vadim