RE: 923Mbits/s across the ocean

> Also, as shipped, OSes come with small default maximum window sizes (I
> think Linux is typically 64KB and Solaris is 8K), so one has to get the
> sysadmin with root privs to change this.

This is related to how the kernel/user model works in relation to TCP.
TCP itself happens in the kernel, but the data comes from userland through
the socket interface, so there is a "socket buffer" in the kernel which
holds data coming from and going to the application. TCP cannot release
data from its buffer until it has been acknowledged by the other side,
in case it needs to retransmit. This means TCP performance is limited by
the smaller of either the congestion window (determined by measuring
conditions along the path) or the send/recv window (determined by local
system resources).
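
To put rough numbers on it: for the send/recv window not to be the
limiting factor, the socket buffers need to cover at least the
bandwidth-delay product of the path. A minimal sketch, with illustrative
numbers and error handling omitted (not a tuned production setting):

    /* Rough sketch: size the socket buffers to the path's bandwidth-delay
     * product so the send/recv window doesn't cap throughput before the
     * congestion window does.  Numbers are illustrative only. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        double bits_per_sec = 1e9;      /* assumed ~1 Gbit/s bottleneck  */
        double rtt_sec      = 0.100;    /* assumed ~100 ms round trip    */
        int bdp = (int)(bits_per_sec / 8.0 * rtt_sec);   /* ~12.5 MB     */
        printf("bandwidth-delay product: %d bytes\n", bdp);

        int s = socket(AF_INET, SOCK_STREAM, 0);

        /* Set before connect()/listen() so the window scale negotiated
         * at SYN time can cover the buffer; the kernel may silently
         * clamp these to its configured maximums. */
        setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp));
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp));

        /* ... connect() and transfer as usual ... */
        return 0;
    }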

However, you can't just blindly turn up your socket buffers to large
values and expect good results.

On the send side, the application transmitting is guaranteed to fill
the buffers immediately (ever seen a huge jump in speed at the beginning
of a transfer? That is the local buffer being filled, and the application
has no way to know whether this data is going out onto the wire or just
into the kernel). Then the network must drain the packets onto the wire,
sometimes very slowly (think about a dialup user downloading from your
GigE server). Setting the socket buffers too high can potentially result
in an incredible waste of resources, and can severely limit the number of
simultaneous connections your server can support. This is precisely why
OSes cannot ship with huge default values: what may be appropriate for
your one-user GigE-connected box might not be appropriate for someone
else's 100BASE-TX web server (and guess which setup has more users :P).
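
To make that concrete with some hypothetical numbers: the ~12.5 MB of
buffer that suits a single 1 Gbit/s, 100 ms flow could pin on the order
of 125 GB of non-swappable kernel memory if a web server handed it to
10,000 concurrent connections with full buffers; even a modest 256 KB
per socket is about 2.5 GB at that connection count. The right default
depends entirely on the workload, which the OS vendor can't know in
advance.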

On the receive side, the socket buffers must be large enough to
accommodate all the data received between application read()s, as well
as having enough available space to hold future data in the event of a
"gap" due to loss and the need for retransmission. However, if the
application fails to read() the data from the socket buffer, it will
sit there forever. Large socket buffers also open the server up to
malicious attacks in which non-swappable kernel memory consumes all
available resources, either locally (someone dumping data over lots of
connections, or running an application which intentionally fails to read
data from the socket buffer) or remotely (think of someone opening a
bunch of rate-limited connections from your "high speed server"). It can
even be unintentional, but just as bad (think of a million confused
dialup users accidentally clicking on your high speed video stream).

Some of this can be worked around by implementing what are called
auto-tuning socket buffers. In this case, the kernel limits the amount
of data allowed into the buffer by looking at the TCP session's observed
congestion window. This allows you to define large send buffers without
applications connected to slow receivers sucking up unnecessary
resources. PSC has had example implementations for quite a while, and
recently FreeBSD even added this (sysctl net.inet.tcp.inflight_enable=1
as of 4.7). Unfortunately, there isn't much you can do to prevent
malicious receive-side buffer attacks, short of limiting the overall max
buffer (FreeBSD implements this as an rlimit, "sbsize").
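
The details differ between PSC's patches and FreeBSD's inflight limiter,
but the core idea can be sketched roughly like this (illustrative only,
not either implementation's actual code): only admit into the send
buffer what the observed congestion window can plausibly use.

    /* Illustrative sketch of send-side auto-tuning: cap the data
     * admitted to the socket buffer at a small multiple of the observed
     * congestion window instead of letting the application fill a huge
     * static buffer.  Not PSC's or FreeBSD's actual algorithm. */
    #include <stddef.h>
    #include <stdio.h>

    struct tcp_snd_state {
        size_t cwnd_bytes;   /* current congestion window estimate    */
        size_t sb_max;       /* administrative ceiling on the buffer  */
        size_t sb_used;      /* bytes queued but not yet acknowledged */
    };

    /* How many more bytes may write()/send() queue right now? */
    static size_t sndbuf_room(const struct tcp_snd_state *tp)
    {
        size_t target = 2 * tp->cwnd_bytes;  /* headroom for one RTT of growth */
        if (target > tp->sb_max)
            target = tp->sb_max;
        return (tp->sb_used >= target) ? 0 : target - tp->sb_used;
    }

    int main(void)
    {
        struct tcp_snd_state tp = { 65536, 4 * 1024 * 1024, 100000 };
        printf("room: %zu bytes\n", sndbuf_room(&tp));
        return 0;
    }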

Of course, you need a few other things before you can start getting into
end-to-end gigabit speeds. If you're transferring a file, you probably
don't want to be reading it from disk via the kernel just to send it back
to the kernel again for transmission, so things like sendfile() and
zero-copy implementations help you get the performance you need locally.
Jumbo frames help too, but their real benefit is not the simplistic "hey
look, there's 1/3rd the number of frames/sec" view that many people take.
The good stuff comes from techniques like page flipping, where the NIC
DMAs data into a memory page which can be flipped through the system
straight to the application, without copying it along the way. Some day
TCP may just be implemented on the NIC itself, with ALL work offloaded,
and the system doing nothing but receiving nice page-sized chunks of data
at high speed. IMHO the 1500-byte MTU of Ethernet will continue to
prevent good end-to-end performance like this for a long time to come.
But alas, I digress...
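
For what it's worth, the in-kernel file-to-socket path looks roughly like
this with Linux's sendfile(2); FreeBSD's sendfile() takes different
arguments, and error handling is trimmed here:

    /* Sketch of avoiding the user-space copy: hand the kernel a file
     * descriptor and let it feed the connected socket directly. */
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* send the whole file at 'path' out over connected socket 'sock' */
    int send_whole_file(int sock, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        fstat(fd, &st);

        off_t off = 0;
        while (off < st.st_size) {
            ssize_t n = sendfile(sock, fd, &off, st.st_size - off);
            if (n <= 0)
                break;                  /* error or peer went away */
        }
        close(fd);
        return (off == st.st_size) ? 0 : -1;
    }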

> On the send side, the application transmitting is guaranteed to fill
> the buffers immediately (ever seen a huge jump in speed at the beginning
> of a transfer? That is the local buffer being filled, and the application
> has no way to know whether this data is going out onto the wire or just
> into the kernel). Then the network must drain the packets onto the wire,
> sometimes very slowly (think about a dialup user downloading from your
> GigE server).

Actually this is often way too fast, as the congestion window doubles
every round trip during slow start. This means that with a large buffer
(= large window) and a bottleneck somewhere along the way, you are almost
guaranteed to have some serious congestion in the early stages of the
session, and lower levels of congestion periodically later on, whenever
TCP tries to figure out how large the congestion window can get without
losing packets.

This is the part about TCP that I've never understood: why does it send
large numbers of packets back-to-back? This is almost never a good idea.

> On the receive side, the socket buffers must be large enough to
> accommodate all the data received between application read()s,

That's not true. It's perfectly acceptable for TCP to stall when the
receiving application fails to read the data fast enough. (TCP then
simply announces a window of 0 to the other side so the communication
effectively stops until the application reads some data and a >0 window
is announced.) If not, the kernel would be required to buffer unlimited
amounts of data in the event an application fails to read it from the
buffer for some time (which is a very common situation).
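
In window terms, the receiver simply advertises whatever buffer space is
currently free, something like the following (a toy illustration, not
any particular stack's code):

    /* Toy illustration of receive-side flow control: the advertised
     * window is whatever space the socket buffer has left after data
     * the application hasn't read yet.  When the application stops
     * reading, this decays to 0 and the sender stalls. */
    #include <stdio.h>

    static unsigned int advertised_window(unsigned int rcvbuf_bytes,
                                          unsigned int unread_bytes)
    {
        return (unread_bytes >= rcvbuf_bytes) ? 0
                                              : rcvbuf_bytes - unread_bytes;
    }

    int main(void)
    {
        printf("%u\n", advertised_window(65536, 16384));  /* 49152      */
        printf("%u\n", advertised_window(65536, 65536));  /* 0: stalled */
        return 0;
    }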

> locally. Jumbo frames help too, but their real benefit is not the
> simplistic "hey look, there's 1/3rd the number of frames/sec" view that
> many people take. The good stuff comes from techniques like page
> flipping, where the NIC DMAs data into a memory page which can be
> flipped through the system straight to the application, without copying
> it along the way. Some day TCP may just be implemented on the NIC
> itself, with ALL work offloaded, and the system doing nothing but
> receiving nice page-sized chunks of data at high speed.

Hm, I don't see this happening to a usable degree as TCP has no concept
of records. You really want to use fixed size chunks of information here
rather than pretending everything's a stream.

> IMHO the 1500-byte MTU of Ethernet will continue to prevent good
> end-to-end performance like this for a long time to come. But alas, I
> digress...

Don't we all? I'm afraid you're right. Anyone up for modifying IPv6 ND
to support a per-neighbor MTU? This should make backward-compatible
adoption of jumboframes a possibility. (Maybe retrofit ND into v4 while
we're at it.)

Iljitsch van Beijnum

> > On the receive side, the socket buffers must be large enough to
> > accommodate all the data received between application read()s,

> That's not true. It's perfectly acceptable for TCP to stall when the
> receiving application fails to read the data fast enough. (TCP then
> simply announces a window of 0 to the other side so the communication
> effectively stops until the application reads some data and a >0 window
> is announced.) If not, the kernel would be required to buffer unlimited
> amounts of data in the event an application fails to read it from the
> buffer for some time (which is a very common situation).

Ok, I think I was unclear. You don't NEED to have buffers large enough to
accommodate all the data received between application read()s, unless
you are trying to achieve maximum performance. I thought that was the
general framework we were all working under. :)

> > locally. Jumbo frames help too, but their real benefit is not the
> > simplistic "hey look, there's 1/3rd the number of frames/sec" view that
> > many people take. The good stuff comes from techniques like page
> > flipping, where the NIC DMAs data into a memory page which can be
> > flipped through the system straight to the application, without copying
> > it along the way. Some day TCP may just be implemented on the NIC
> > itself, with ALL work offloaded, and the system doing nothing but
> > receiving nice page-sized chunks of data at high speed.

> Hm, I don't see this happening to a usable degree as TCP has no concept
> of records. You really want to use fixed size chunks of information here
> rather than pretending everything's a stream.

We're talking optimizations for high performance transfers... It can't
always be a stream.

> > IMHO the 1500-byte MTU of Ethernet will continue to prevent good
> > end-to-end performance like this for a long time to come. But alas, I
> > digress...

> Don't we all? I'm afraid you're right. Anyone up for modifying IPv6 ND
> to support a per-neighbor MTU? This should make backward-compatible
> adoption of jumboframes a possibility. (Maybe retrofit ND into v4 while
> we're at it.)

Not necessarily sure that's the right thing to do, but SOMETHING has got
to be better than what passes for path MTU discovery now. :)

> > > On the receive side, the socket buffers must be large enough to
> > > accommodate all the data received between application read()s,

> > That's not true. It's perfectly acceptable for TCP to stall when the
> > receiving application fails to read the data fast enough.

> Ok, I think I was unclear. You don't NEED to have buffers large enough to
> accommodate all the data received between application read()s, unless
> you are trying to achieve maximum performance. I thought that was the
> general framework we were all working under. :)

You got me there. :)

It seemed that you were talking about more general requirements at this
point, though, what with the upper and lower limits for kernel buffer
space and all.

> > Hm, I don't see this happening to a usable degree as TCP has no concept
> > of records. You really want to use fixed size chunks of information here
> > rather than pretending everything's a stream.

> We're talking optimizations for high performance transfers... It can't
> always be a stream.

Right. But TCP is a stream protocol. This has many advantages, nearly
all of which are irrelevant for high-volume, high-bandwidth bulk data
transfer.

I can imagine a system that only works in one direction and where the
data is split into fixed-size records (which would ideally fit into a
single packet), where each record is acknowledged independently (but
certainly not with an ACK for each individual packet). I would also want
to take advantage of traffic classification mechanisms: first the data
is flooded at maximum speed at the lowest possible traffic class.
Everything that doesn't make it to the other end is then resent more
slowly at a higher traffic class. If the network supports priority
queuing, this would effectively sponge up all free bandwidth without
impacting regular interactive traffic. If after a few retries some data
still didn't make it, simply skip it for now (but keep a record of the
missing bits) and keep going. Many applications can live with some lost
data, and for others it's probably more efficient to keep running at
high speed and repair the gaps afterwards.
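
Purely as a thought experiment (no such protocol exists, and the helpers
below are hypothetical stubs), the sender side of such a scheme might
look something like:

    /* Hypothetical sketch of the scheme described above: fixed-size,
     * independently acknowledged records, flooded at a low traffic
     * class first, with only the missing ones retried at higher
     * classes. */
    #include <stdio.h>

    enum { MAX_PASSES = 3 };

    /* hypothetical stand-ins for real network send/ack machinery */
    static void send_record(unsigned int rec, int traffic_class)
    {
        printf("send record %u at class %d\n", rec, traffic_class);
    }
    static int record_acked(unsigned int rec) { (void)rec; return 0; }

    static void transfer(unsigned char *acked, unsigned int nrecords)
    {
        for (int pass = 0; pass < MAX_PASSES; pass++) {
            for (unsigned int r = 0; r < nrecords; r++)
                if (!acked[r])
                    send_record(r, pass);   /* later passes: higher class */
            for (unsigned int r = 0; r < nrecords; r++)
                if (record_acked(r))        /* hypothetical ack feedback */
                    acked[r] = 1;
        }
        /* anything still unacked is noted as a gap: skip it for now and
           repair afterwards, or live without it */
    }

    int main(void)
    {
        unsigned char acked[4] = { 0 };
        transfer(acked, 4);
        return 0;
    }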

> > > IMHO the 1500-byte MTU of Ethernet will continue to prevent good
> > > end-to-end performance like this for a long time to come. But alas, I
> > > digress...

> > Don't we all? I'm afraid you're right. Anyone up for modifying IPv6 ND
> > to support a per-neighbor MTU? This should make backward-compatible
> > adoption of jumboframes a possibility. (Maybe retrofit ND into v4 while
> > we're at it.)

> Not necessarily sure that's the right thing to do, but SOMETHING has got
> to be better than what passes for path MTU discovery now. :)

We can't replace path MTU discovery (though hopefully people will start
to realize ICMP messages were invented for reasons other than job
security for firewalls). But what we do need is a way for 10/100 Mbps,
1500-byte hosts to live with 1000 Mbps, 9000-byte hosts on the same
subnet. I thought IPv6 neighbor discovery supported this, because ND can
communicate the MTU between hosts on the same subnet, but unfortunately
this is a subnet-wide MTU and not a per-host MTU, which is what we
really need.
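
For reference, the MTU option that does exist today (RFC 2461, carried
in Router Advertisements) is one value for the whole link, which is
exactly the limitation above; on the wire it is just:

    /* The existing IPv6 ND MTU option (RFC 2461, option type 5).  It is
     * sent in Router Advertisements, so every host on the link gets the
     * same value -- there is no per-neighbor form, which is the gap
     * discussed above. */
    #include <stdint.h>

    struct nd_opt_mtu_wire {
        uint8_t  type;       /* 5                             */
        uint8_t  length;     /* 1, in units of 8 octets       */
        uint16_t reserved;
        uint32_t mtu;        /* link MTU, network byte order  */
    };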

Iljitsch

Thus spake "Iljitsch van Beijnum" <iljitsch@muada.com>

> This is the part about TCP that I've never understood: why does it
> send large numbers of packets back-to-back? This is almost never a
> good idea.

Because until you congest the network to the point of dropping packets, a
host has no idea how much bandwidth is actually available. Exponential
rate growth finds this value very quickly.
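
As a rough calculation (assumed numbers), doubling once per round trip
means even a transoceanic gigabit path is probed in a handful of RTTs:

    /* Back-of-the-envelope: how many round trips does exponential
     * growth need to reach a given window?  Illustrative numbers only. */
    #include <stdio.h>

    int main(void)
    {
        double target = 12.5e6;    /* ~1 Gbit/s x 100 ms path, in bytes */
        double cwnd   = 1460.0;    /* start at one ~1500-byte segment   */
        int rtts = 0;

        while (cwnd < target) {
            cwnd *= 2;             /* roughly doubles once per RTT      */
            rtts++;
        }
        printf("~%d round trips, ~%.1f s at a 100 ms RTT\n",
               rtts, rtts * 0.1);
        return 0;
    }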

> Hm, I don't see this happening to a usable degree as TCP has no
> concept of records. You really want to use fixed size chunks of
> information here rather than pretending everything's a stream.

A record-oriented, reliable transport would make many protocols much easier
to implement. Too bad SCTP hasn't seen wider use.
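
To illustrate the record orientation (a minimal sketch assuming a kernel
with SCTP support; the address and port are made up): each send() on a
one-to-one SCTP socket is delivered to the peer as a distinct message,
boundaries preserved, rather than being folded into a byte stream.

    /* Minimal sketch of SCTP's message orientation.  Assumes the kernel
     * has SCTP support; the peer address/port below are hypothetical. */
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
        if (s < 0)
            return 1;                       /* no SCTP in this kernel */

        struct sockaddr_in peer;
        memset(&peer, 0, sizeof(peer));
        peer.sin_family      = AF_INET;
        peer.sin_port        = htons(5001);            /* hypothetical */
        peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        if (connect(s, (struct sockaddr *)&peer, sizeof(peer)) == 0) {
            /* two send()s arrive as two separate records, not one stream */
            send(s, "record one", 10, 0);
            send(s, "record two", 10, 0);
        }
        return 0;
    }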

S

Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

[just discovered in my unsent messages queue from offline composition,
probably not timely, but...]

Iljitsch van Beijnum wrote:

> We can't replace path MTU discovery (though hopefully people will start
> to realize ICMP messages were invented for reasons other than job
> security for firewalls). But what we do need is a way for 10/100 Mbps,
> 1500-byte hosts to live with 1000 Mbps, 9000-byte hosts on the same
> subnet. I thought IPv6 neighbor discovery supported this, because ND can
> communicate the MTU between hosts on the same subnet, but unfortunately
> this is a subnet-wide MTU and not a per-host MTU, which is what we
> really need.

A decade ago, when I designed SIPP Neighbor Discovery, it saved a
per-destination "maximum unfragmented datagram size" in the route cache,
and each I-Am-Here message heard specified a Maximum Receive Unit (MRU)
per host. Thus, once upon a time, IPv6 had what you need.

Unfortunately, the IPv6 group stripped out such innovative features.
I stopped paying attention after the new editor stated something like
"it worked for Ethernet, we really don't need any more than that."

Well, we used IPv4 from '83, and designed SIPP (cum IPv6) in '93.

IPv6 is a failure -- maybe it's time for this decade's design?

Or maybe even some of the features some of us thought we needed a
decade ago?