packet reordering at exchange points

Hmmmm. You're right. I lost sight of the original thread...
GigE inter-switch trunking at PAIX. In that case, congestion
_should_ be low, and there shouldn't be much queue depth.

indeed, this is the case. we keep a lot of headroom on those trunks.

But this _does_ bank on current "real world" behavior. If
endpoints ever approach GigE speeds (of course requiring "low
enough" latency and "big enough" windows)...

Then again, last mile is so slow that we're probably a ways away
from that happening.

my expectation is that when the last mile goes to 622Mb/s or 1000Mb/s,
exchange points will all be operating at 10Gb/s, and interswitch trunks
at exchange points will be multiples of 10Gb/s.

Of course, I'd hope that individual heavy pairs would establish
private interconnects instead of using public switch fabric, but
I know that's not always { an option | done | ... }.

individual heavy pairs do this, but as a long term response to growth,
not as a short term response to congestion. in the short term, the
exchange point switch can't present congestion. it's just not on the
table at all.

Date: Tue, 09 Apr 2002 11:16:24 -0700
From: Paul Vixie <paul@vix.com>

my expectation is that when the last mile goes to 622Mb/s or 1000Mb/s,
exchange points will all be operating at 10Gb/s, and interswitch trunks
at exchange points will be multiples of 10Gb/s.

I guess Moore's Law comes into play again. One will need some
pretty hefty TCP buffers for a single stream to hit those rates,
unless latency _really_ drops. (Distributed CDNs, anyone? Speed
of light ain't getting faster any time soon...)

Of course, IMHO I expect DCDNs to become increasingly common...
but that topic would warrant a thread fork.

Looks like round-robin ISLs are feasible between GigE+ core switches...

To transfer 1Gb/s across 100ms I need to be prepared to buffer at least
25MB of data. According to pricewatch, I can pick up a high density 512MB
PC133 DIMM for $70, and use $3.50 of it to catch that TCP stream. Throw in
$36 for a GigE NIC, and we're ready to go for under $40. Yeah, I know that's
the cheapest garbage you can get, but this is just to prove a point. :) I
might only be able to get 800Mbit across a 32-bit/33MHz PCI bus, but
whatever.
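
[ A quick back-of-the-envelope version of that arithmetic, as an
illustrative sketch only: one bandwidth-delay product at 1Gb/s over 100ms
is 12.5MB, so the 25MB figure corresponds to keeping roughly two BDPs on
hand (or to a 200ms round trip). The DIMM fraction works out the same way. ]

/* bdp.c -- back-of-the-envelope buffer sizing; illustrative numbers only */
#include <stdio.h>

int main(void)
{
    double rate_bps   = 1e9;     /* 1 Gb/s */
    double rtt_s      = 0.100;   /* 100 ms */
    double dimm_mb    = 512.0;   /* 512 MB PC133 DIMM */
    double dimm_price = 70.0;    /* USD, per the post */

    double bdp_mb    = rate_bps * rtt_s / 8.0 / 1e6;   /* one BDP: 12.5 MB */
    double buffer_mb = 2.0 * bdp_mb;                   /* ~25 MB with 2x margin */
    double cost      = buffer_mb / dimm_mb * dimm_price;

    printf("one BDP       : %.1f MB\n", bdp_mb);
    printf("2x BDP buffer : %.1f MB\n", buffer_mb);
    printf("DIMM fraction : $%.2f\n", cost);           /* about $3.42 */
    return 0;
}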

The problem isn't the lack of hardware, it's a lack of good software (both
on the receiving side and probably more importantly the sending side), a
lot of bad standards coming back to bite us (1500-byte packets are about as
far from efficient as you can get), a lack of people with enough know-how
to actually build a network that can transport it all (heck they can't
even build decent networks to deliver 10Mbit/s, @Home was the closest),
and just a general lack of things for end users to do with that much
bandwidth even if they got it.

Date: Tue, 9 Apr 2002 16:03:53 -0400
From: Richard A Steenbergen <ras@e-gerbil.net>

To transfer 1Gb/s across 100ms I need to be prepared to buffer at least
25MB of data. According to pricewatch, I can pick up a high density 512MB

[ snip ]

The problem isn't the lack of hardware, it's a lack of good software (both

[ snip ]

But how many simultaneous connections? Until TCP stacks start
using window autotuning (of which I know you're well aware), we
must either use suboptimal windows or chew up ridiculous amounts
of memory. Yes, bad software, but still a limit...

It would be nice to allocate a 32MB chunk of RAM for buffers,
then dynamically split it between streams. Fragmentation makes
that pretty much impossible.

OTOH... perhaps that's a reasonable start:

1. Alloc buffer of size X
2. Let it be used for Y streams
3. When we have Y streams, split each stream "sub-buffer" into Y
   parts, giving capacity for Y^2 streams.

Aggregate transmission can't exceed line rate. So instead of
fixed-size buffers for each stream, perhaps our TOTAL buffer size
should remain constant.

Use PSC-style autotuning to eke out more capacity/performance,
instead of using fixed value of "Y" or splitting each and every
last buffer. (Actually, I need to reread/reexamine the PSC code
in case it actually _does_ use a fixed total buffer size.)

This shouldn't be terribly hard to hack into an IP stack...
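
[ A minimal userland sketch of that idea, under the assumption that the
total pool stays fixed and only the per-stream split changes; all names
here are hypothetical, and this is not the PSC autotuning code itself. ]

/* pool_split.c -- toy model of a fixed total buffer split across streams. */
#include <stdio.h>

#define TOTAL_POOL (32u * 1024 * 1024)   /* 32 MB total, tuned to line rate */
#define MIN_WINDOW (64u * 1024)          /* floor so new streams get something */

/* Equal split: the "primitive" division mentioned above. */
static unsigned equal_window(unsigned nstreams)
{
    unsigned w = nstreams ? TOTAL_POOL / nstreams : TOTAL_POOL;
    return w < MIN_WINDOW ? MIN_WINDOW : w;
}

/* Demand-weighted split: a crude stand-in for PSC-style tuning, giving each
 * stream a share proportional to the buffer it has actually been using. */
static unsigned weighted_window(unsigned my_use, unsigned total_use)
{
    if (total_use == 0)
        return MIN_WINDOW;
    unsigned long long w = (unsigned long long)TOTAL_POOL * my_use / total_use;
    return w < MIN_WINDOW ? MIN_WINDOW : (unsigned)w;
}

int main(void)
{
    printf("equal split, 4 streams: %u bytes each\n", equal_window(4));
    printf("weighted, busy stream : %u bytes\n",
           weighted_window(20u * 1024 * 1024, 24u * 1024 * 1024));
    return 0;
}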

But how many simultaneous connections? Until TCP stacks start
using window autotuning (of which I know you're well aware), we
must either use suboptimal windows or chew up ridiculous amounts
of memory. Yes, bad software, but still a limit...

That's precisely what I meant by bad software, as well as the server code
that pushes the data out in the first place. And for that matter, the
receiver side is just as important.

It would be nice to allocate a 32MB chunk of RAM for buffers,
then dynamically split it between streams. Fragmentation makes
that pretty much impossible.

OTOH... perhaps that's a reasonable start:

1. Alloc buffer of size X
2. Let it be used for Y streams
3. When we have Y streams, split each stream "sub-buffer" into Y
   parts, giving capacity for Y^2 streams.

You don't actually allocate the buffers until you have something to put in
them, you're just fixing a limit on the maximum you're willing to
allocate. The problem comes from the fact that you're fixing the limits on
a "per-socket" basis, not on a "total system" basis.

Aggregate transmission can't exceed line rate. So instead of
fixed-size buffers for each stream, perhaps our TOTAL buffer size
should remain constant.

Use PSC-style autotuning to eke out more capacity/performance,
instead of using fixed value of "Y" or splitting each and every
last buffer. (Actually, I need to reread/reexamine the PSC code
in case it actually _does_ use a fixed total buffer size.)

This shouldn't be terribly hard to hack into an IP stack...

Actually here's an even simpler one. Define a global limit for this,
something like 32MB would be more than reasonable. Then instead of
advertising the space "remaining" in individual socket buffers, advertise
the total space remaining in this virtual memory pool. If you overrun your
buffer, you might have the other side send you a few unnecessary bytes
that you just have to drop, but the situation should correct itself very
quickly. I don't think this would be "unfair" to any particular flow,
since you've eliminated the concept of one flow "hogging" the socket
buffer and leave it to TCP to work out the sharing of the link. Second
opinions?
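
[ A sketch of what the receive-side advertisement might look like under
that scheme; the structure and names are hypothetical, and a real stack
would do this where it builds the window field of the TCP header. If
several sockets see the same "remaining" figure at once the pool can be
briefly oversubscribed, which is exactly the "few unnecessary bytes you
just have to drop" case described above. ]

/* global_rwnd.c -- advertise the remaining global pool, not per-socket space. */
#include <stdint.h>
#include <stdio.h>

#define GLOBAL_POOL (32u * 1024 * 1024)   /* 32 MB system-wide receive pool */

static uint32_t pool_used;                /* bytes queued across all sockets */

/* Window to advertise: whatever is left in the shared pool, clamped to the
 * largest window the (scaled) window field can express. */
static uint32_t advertised_window(uint32_t max_scaled_window)
{
    uint32_t left = GLOBAL_POOL - pool_used;
    return left < max_scaled_window ? left : max_scaled_window;
}

int main(void)
{
    pool_used = 30u * 1024 * 1024;        /* pool nearly full */
    printf("advertise %u bytes\n", advertised_window(1u << 24));
    pool_used = 0;                        /* pool empty: advertise the max */
    printf("advertise %u bytes\n", advertised_window(1u << 24));
    return 0;
}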

Date: Tue, 9 Apr 2002 19:17:44 -0400
From: Richard A Steenbergen <ras@e-gerbil.net>

[ snip beginning ]

Actually here's an even simpler one. Define a global limit for
this, something like 32MB would be more than reasonable. Then
instead of advertising the space "remaining" in individual

My static buffer presumed that one would regularly see line rate;
that's probably an invalid assumption.

socket buffers, advertise the total space remaining in this

Why bother advertising space remaining? Simply take the total
space -- which is tuned to line rate -- and divide equitably.
Equal division is the primitive way. Monitoring actual buffer
use, a la PSC window-tuning code, is more efficient.

To respect memory, sure, you could impose a global limit and
alloc as needed. But on a "busy enough" server/client, how much
would that save? Perhaps one could allocate 8MB chunks at a
time... but fragmentation could prevent the ability to have a
contiguous 32MB in the future. (Yes, I'm assuming high memory
usage and simplistic paging. But I think that's plausible.)

Honestly... memory is so plentiful these days that I'd gladly
devote "line rate"-sized buffers to the cause on each and every
server that I run.

virtual memory pool. If you overrun your buffer, you might have
the other side send you a few unnecessary bytes that you just
have to drop, but the situation should correct itself very

By allocating 32MB, one stream could achieve line rate with no
wasted space (assuming latency is exactly what we predict, which
we all know won't happen). When another stream or two are
opened, we split the buffer into four. Maybe we drop, like you
suggest, in a RED-like manner. Maybe we flush the queue if it's
not "too full".

Now we have up to four streams, each with an 8MB queue. Need
more streams? Fine, split { one | some | all } of the 8MB
windows into 2MB segments. Simple enough, until we hit the
variable bw*delay times... then we should use intelligence when
splitting, probably via mechanisms similar to the PSC stack.

Granularity of 4 is for example only. I know that would be
non-ideal. One could split 32 MB into 6.0 MB + 7.0 MB + 8.5 MB +
10.5 MB, which would then be halved as needed. Long-running
sessions could be moved between buffer clumps as needed. (i.e.,
if 1.5 MB is too small and 2.0 MB is too large, 1.75 MB fits
nicely into the 7.0 MB area.)

quickly. I don't think this would be "unfair" to any particular
flow, since you've eliminated the concept of one flow
"hogging" the socket buffer and leave it to TCP to work out the
sharing of the link. Second opinions?

Smells to me like ALTQ's TBR (token bucket regulator).

Perhaps also have a dynamically-allocated "tuning" buffer:
Imagine 2000 dialups and 10 DSL connections transferring over a
DS3... use a single "big enough" buffer (few buffers?) to sniff
out each stream's capability, to determine which stream can use
how much more space.

My static buffer presumed that one would regularly see line rate;
that's probably an invalid assumption.

Indeed. But that's why it's not an actual allocation.

Why bother advertising space remaining? Simply take the total
space -- which is tuned to line rate -- and divide equitably.
Equal division is the primitive way. Monitoring actual buffer
use, a la PSC window-tuning code, is more efficient.

Because then you haven't accomplished your goal. If you have 32MB of buffer
memory available, and you open 32 connections and share it equally for
1MB/ea, you could have 1 connection that is doing no bandwidth and one
connection that wants to scale to more than 1MB of packets in flight. Then
you have to start scanning all your connections on a periodic basis
adjusting the socket buffers to reflect the actual congestion window, a
la PSC.

My suggestion was to cut out all that nonsense by simply removing the
receive window limits altogether. Actually you could accomplish this
goal by just advertising the maximum possible window size and rely on
packet drops to shrink the congestion window on the sending side as
necessary, but this would be slightly less efficient in the case of a
sender overrunning the receiver.

But alas we're both forgetting the sender side, which controls how quickly
data moves from userland into the kernel. This part must be set by looking
at the sending congestion window. And I thought of another problem as
well. If you had a receiver which made a connection, requested as much
data as possible, and then never did a read() on the socket buffer, all
the data would pile up in the kernel and consume the total buffer space
for the entire system.

To respect memory, sure, you could impose a global limit and
alloc as needed. But on a "busy enough" server/client, how much
would that save? Perhaps one could allocate 8MB chunks at a
time... but fragmentation could prevent the ability to have a
contiguous 32MB in the future. (Yes, I'm assuming high memory
usage and simplistic paging. But I think that's plausible.)

You're missing the point, you don't allocate ANYTHING until you have a
packet to fill that buffer, and then when you're done buffering it, it is
free'd. The limits are just there to prevent you from running away with a
socket buffer.

Date: Tue, 9 Apr 2002 20:39:34 -0400
From: Richard A Steenbergen <ras@e-gerbil.net>

My suggestion was to cut out all that nonsense by simply removing the
receive window limits altogether. Actually you could accomplish this
goal by just advertising the maximum possible window size and rely on
packet drops to shrink the congestion window on the sending side as
necessary, but this would be slightly less efficient in the case of a
sender overrunning the receiver.

But alas we're both forgetting the sender side, which controls how quickly
data moves from userland into the kernel. This part must be set by looking
at the sending congestion window. And I thought of another problem as

Actually, I was thinking more in terms of sending than receiving.
Yes, your approach sounds quite slick for the RECV side, and I
see your point. But WND info will be negotiated for sending...
so why not base it on "splitting the total pie" instead of
"arbitrary maximum"?

well. If you had a receiver which made a connection, requested as much
data as possible, and then never did a read() on the socket buffer, all
the data would pile up in the kernel and consume the total buffer space
for the entire system.

Unless, again, there's some sort of limit. 32 MB total, 512
connections, each socket gets 64 kB until it proves its worth.
Sockets don't get to play the RED-ish game until they _prove_
that they're serious about sucking down data.

Once a socket proves its intentions (and periodically after
that), it gets to use a BIG buffer, so we find out just how fast
the connection can go.

You're missing the point, you don't allocate ANYTHING until you have a
packet to fill that buffer, and then when you're done buffering it, it is
free'd. The limits are just there to prevent you from running away with a
socket buffer.

No, I understand your point perfectly, and that's how it's
currently done.

But why even bother with constant malloc(9)/free(9) when the
overall buffer size remains reasonably constant? i.e., kernel
allocation to IP stack changes slowly if at all. IP stack alloc
to individual streams changes regularly.

That doesn't prevent an intentional local DoS though.

Rough attempt at processing rules:

1. If "enough" buffer space, let each stream have its fill.
    DONE.

2. Not "enough" buffer space, so we must invoke limits.

3. If new connection, impose low limit until socket proves its
    intentions... much like not allocating an entire socket
    struct until TCP handshake is complete, or TCP slow start.
    DONE.

4. It's an existing connection.

5. Does it act like it could use a smaller window? If so,
    shrink the window. DONE.

6. Stream might be able to use a larger window.

7. Is it "tuning time" for this stream according to round robin
    or random robin? If so, use BIG buffer for a few packets,
    measuring the stream's desires.

8. Does the stream want more buffer space? If not, DONE.

9. Is it fair to other streams to adjust window? If not, DONE.

10. Adjust appropriately.

I guess this shoots my "split into friendly fractions" approach
out of the water... and we're back to "standard" autotuning (for
sending) once we enforce minimum buffer size.
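
[ A compressed, hypothetical rendering of rules 1-10 as a single decision
function; the thresholds and field names are made up, and the measurement
hooks ("measured_need", "tuning_time") are stubs for whatever a PSC-style
estimator would provide. ]

/* window_policy.c -- rough sketch of the buffer-adjustment rules above. */
#include <stdbool.h>
#include <stdio.h>

#define POOL_TOTAL    (32u * 1024 * 1024)
#define PROBATION_WIN (64u * 1024)

struct stream {
    bool     is_new;          /* rule 3: still on probation */
    unsigned window;          /* current window, bytes */
    unsigned measured_need;   /* demand observed during its tuning turn */
    bool     tuning_time;     /* rule 7: this round-robin slot is its turn */
};

static unsigned adjust_window(const struct stream *s, unsigned pool_free)
{
    /* 1. Enough buffer space: let the stream have its fill. */
    if (pool_free > POOL_TOTAL / 4)
        return s->measured_need > s->window ? s->measured_need : s->window;

    /* 3. New connection: low limit until it proves its intentions. */
    if (s->is_new)
        return PROBATION_WIN;

    /* 5. Acts like it could use a smaller window: shrink it. */
    if (s->measured_need < s->window)
        return s->measured_need;

    /* 7-8. Only probe for growth on this stream's tuning turn. */
    if (!s->tuning_time || s->measured_need == s->window)
        return s->window;

    /* 9-10. Grow only as far as the free pool (fairness) allows. */
    unsigned grow = s->measured_need - s->window;
    return s->window + (grow < pool_free ? grow : pool_free);
}

int main(void)
{
    struct stream s = { false, 256 * 1024, 512 * 1024, true };
    printf("new window: %u bytes\n", adjust_window(&s, 128 * 1024));
    return 0;
}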

Major differences:

+ We're saying to approach memory usage macroscopically instead
  of microscopically. i.e., per system instead of per stream.

+ We're removing upper bounds when bandwidth is plentiful.

+ Receive like you suggested, save for the "low memory" start
  phase.

Date: Tue, 9 Apr 2002 21:12:30 -0400
From: Richard A Steenbergen <ras@e-gerbil.net>

That doesn't prevent an intentional local DoS though.

And the current stacks do? (Note that my 64 kB figure was an
example, for an example system that had 512 current connections.)

Okay, how about new sockets split "excess" buffer space, subject
to certain minimum size restrictions? New sockets do not impact
established streams, unless we have way too many sockets or too
little buffer space.

If way too many sockets, it's just like current stacks, although
hopefully ulimit would prevent this scenario.

If we're out of buffer space, then we're going to have even more
problems when the sockets are actually passing data.

Yes, I'm still thinking about carving up a 32 MB chunk of RAM,
shrinking window sizes when we need more buffers.

Of course, we probably should consider AIO, too... if we can
have buffers in userspace instead of copying from kernel to
user via read(), that makes memory issues a bit more pleasant.

To transfer 1Gb/s across 100ms I need to be prepared to buffer at least
25MB of data. According to pricewatch, I can pick up a high density 512MB

Why ?

I am still waiting (after many years) for anyone to explain to me the issue
of buffering. It appears to be completely unnecessary in a router.

Everyone seems to answer me with 'bandwidth x delay product' and similar,
but think about IP routeing. The intermediate points are not doing any form
of per-packet ack etc. and so do not need to have large windows of data etc.

I can understand the need in end-points and networks (like X.25) that do
per-hop clever things...

Will someone please point me to references that actually demonstrate why an
IP router needs big buffers (as opposed to lots of 'downstream' ports) ?

Peter

OK, what am I missing? Unless I'm misunderstanding your question, this
seems relatively simplistic and the need for buffers on routers is
actually quite obvious.

Imagine a router with more than 2 interfaces, each interface being of
the same speed. Packets arrive on 2 or more interfaces and each needs to
be forwarded onto the same outbound interface. Imagine packets arrive
at exactly or roughly the same time. Since the bits are going out
serially, you're gonna need to buffer packets one behind the others on
the egress interface.

Similar scenarios occur when the egress interface capacity is less than the
rate (or aggregate rate) of traffic wanting to exit via that interface.

John
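
[ A toy illustration of the fan-in case described above: two ports bursting
into one equal-speed egress port means the excess has to sit in a buffer
(or be dropped) while the egress port serializes it. The numbers are made
up. ]

/* fanin.c -- two 1 Gb/s ingress ports bursting toward one 1 Gb/s egress. */
#include <stdio.h>

int main(void)
{
    double egress_bps  = 1e9;
    double ingress_bps = 2e9;     /* two ports sending at line rate */
    double burst_s     = 0.005;   /* a 5 ms simultaneous burst */

    /* Bytes arriving faster than the egress port can serialize them. */
    double backlog = (ingress_bps - egress_bps) * burst_s / 8.0;

    printf("backlog after a %.0f ms burst: %.0f KB buffered or dropped\n",
           burst_s * 1e3, backlog / 1e3);   /* 625 KB */
    return 0;
}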

Note that the previous example was about end-to-end systems achieving line
rate across a continent; nothing about routers was mentioned.

Why ?

I am still waiting (after many years) for anyone to explain to me the issue
of buffering. It appears to be completely unnecessary in a router.

Well, that's some challenge but I'll have a go :-/

As far as I can tell, the use of buffering has to do with traffic shaping vs. rate limiting. If you have a buffer on the interface, you are doing traffic shaping -- whether or not your vendor calls it that. That's because when the rate at which traffic arrives at the queue exceeds the rate that it leaves the queue, the packets get buffered for transmission some time later. In effect, the queue buffers traffic bursts and then spreads transmission of the buffered packets over time.

If you have no queue or a very small queue (relative to the Rate x Average packet size) and the arrival rate exceeds transmission rate, you can't buffer the packet to transmit later, and so simply drop it. This is rate limiting.
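
[ A minimal sketch of that distinction, slot by slot: the shaper queues the
excess and drains it later, while the policer with essentially no queue
just drops it. The packet counts are arbitrary. ]

/* shape_vs_police.c -- buffering (shaping) vs. dropping (rate limiting). */
#include <stdio.h>

struct port {
    int queue_limit;   /* packets the interface may hold */
    int queued, dropped, sent;
};

/* One time slot: 'arrivals' packets show up, the line can send 'capacity'. */
static void slot(struct port *p, int arrivals, int capacity)
{
    p->queued += arrivals;
    int tx = p->queued < capacity ? p->queued : capacity;
    p->queued -= tx;
    p->sent += tx;
    if (p->queued > p->queue_limit) {              /* excess beyond the buffer */
        p->dropped += p->queued - p->queue_limit;
        p->queued = p->queue_limit;
    }
}

int main(void)
{
    struct port shaper  = { 64, 0, 0, 0 };   /* a decent buffer */
    struct port policer = {  0, 0, 0, 0 };   /* "no queue or a very small queue" */

    /* A burst of 20 packets/slot for 5 slots against a 10 packet/slot line,
     * then 5 idle slots that let the shaper drain its backlog. */
    for (int t = 0; t < 10; t++) {
        int arrivals = t < 5 ? 20 : 0;
        slot(&shaper, arrivals, 10);
        slot(&policer, arrivals, 10);
    }
    printf("shaper : sent %d, dropped %d\n", shaper.sent, shaper.dropped);
    printf("policer: sent %d, dropped %d\n", policer.sent, policer.dropped);
    return 0;
}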

That's my theory, but what's the effect?

I have seen the difference in effect on a real network running IP over ATM. The ATM core at this large European service provider was running equipment from "Vendor N". N's ATM access switches have very small cell buffers -- practically none, in fact.

When we connected routers to this core from "vendor C" that didn't have much buffering on the ATM interfaces, users saw very poor e-mail and HTTP throughput. We discovered that this was happening because during bursts of traffic, there were long trains of sequential packet loss -- including many TCP ACKs. This caused the TCP senders to rapidly back off their transmit windows. That and the packet loss were the major causes of poor throughput. Although we didn't figure this out until much later, a side effect of the sequential packet loss (i.e. no drop policy) was to synchronize all of the TCP senders -- i.e. the "burstiness" of the traffic got worse because now all of the TCP senders were trying to increase their send windows at the same time.

To fix the problem, we replaced the ATM interface cards on the routers -- it turns out Vendor C has an ATM interface with lots of buffering, configurable drop policy (we used WRED) and a cell-level traffic shaper, presumably to address this very issue. The users saw much improved e-mail and web performance and everyone was happy, except for the owner of the routers who wanted to know why they had to buy the more expensive ATM card (i.e. why the ATM core people couldn't put more buffering on their ATM access ports).

Hope this helps,

Mathew

Note that the previous example was about end-to-end systems achieving line
rate across a continent; nothing about routers was mentioned.

Fair enough - for that I can see the point. Maybe I need to read more though
:)

Peter

Thus spake "Mathew Lodge" <mathew@cplane.com>

>Why ?
>
>I am still waiting (after many years) for anyone to explain to me
>the issue of buffering. It appears to be completely unnecessary
>in a router.

Well, that's some challenge but I'll have a go :-/

As far as I can tell, the use of buffering has to do with traffic
shaping vs. rate limiting. If you have a buffer on the interface,
you are doing traffic shaping -- whether or not your vendor calls
it that. ... If you have no queue or a very small queue ... This is
rate limiting.

Well, that's implicit shaping/policing if you wish to call it that. It's
only common to use those terms with explicit shaping/policing, i.e. when you
need to shape/police at something other than line rate.

except for the owner of the routers who wanted to know why
they had to buy the more expensive ATM card (i.e. why
the ATM core people couldn't put more buffering on
their ATM access ports).

The answer here lies in ATM switches being designed primarily for carriers
(and by people with a carrier mindset). Carriers, by and large, do not want
to carry unfunded traffic across their networks and then be forced to buffer
it; it's much easier (and cheaper) to police at ingress and buffer nothing.

It would have been nice to see a parallel line of switches (or cards) with
more buffers. However, anyone wise enough to buy those was wise enough to
ditch ATM altogether :)

S

Thus spake "Peter Galbavy" <peter.galbavy@knowtion.net>

Why ?

I am still waiting (after many years) for anyone to explain to me
the issue of buffering. It appears to be completely unnecessary in
a router.

Routers are not non-blocking devices. When an output port is blocked,
packets going to that port must be either buffered or dropped. While it's
obviously possible to drop them, like ATM/FR carriers do, ISPs have found
they have much happier customers when they do a reasonable amount of
buffering.

S

> To transfer 1Gb/s across 100ms I need to be prepared to buffer at least
> 25MB of data. According to pricewatch, I can pick up a high density 512MB

Why ?

I am still waiting (after many years) for anyone to explain to me the issue
of buffering. It appears to be completely unnecessary in a router.

Everyone seems to answer me with 'bandwidth x delay product' and similar,
but think about IP routeing. The intermediate points are not doing any form
of per-packet ack etc. and so do not need to have large windows of data etc.

I can understand the need in end-points and networks (like X.25) that do
per-hop clever things...

Will someone please point me to references that actually demonstrate why an
IP router needs big buffers (as opposed to lots of 'downstream' ports) ?

Sure, see the original Van Jacobson-Mike Karels paper "Congestion Avoidance
and Control", at http://www-nrg.ee.lbl.gov/papers/congavoid.pdf. Briefly,
TCP end systems start pumping packets into the path until they've gotten
about RTT*BW worth of packets "in the pipe". Ideally these packets are
somewhat evenly spaced out, but in practice in various circumstances they can
get clumped together at a bottleneck link. If the bottleneck link router
can't handle the burst then some get dumped.

  -- Jim