from the academic side of the house

For the first set of IPv6 records, a team from the University of Tokyo, WIDE
Project, NTT Communications, JGN2, SURFnet, CANARIE, Pacific Northwest
Gigapop and other institutions collaborated to create a network path more
than 30,000 kilometers long, crossing six international networks - over
three quarters of the circumference of the Earth. Over this path, the team
successfully transferred data in the single- and multi-stream categories at
a rate of 7.67 Gbps, equivalent to 230,100 terabit-meters per second
(Tb-m/s). This record-setting attempt used standard TCP to achieve the new
mark.

The next day, the team used a modified version of TCP to achieve an even
greater record. Using the same 30,000 km path, they achieved a throughput of
9.08 Gbps, equivalent to 272,400 Tb-m/s, in both the IPv6 multi- and
single-stream categories. In doing so, the team surpassed the current IPv4
records, demonstrating that IPv6 networks can deliver performance equal to,
if not better than, IPv4.
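
For those checking the arithmetic, the distance-bandwidth figures work out
as advertised; a quick back-of-the-envelope sketch in Python, assuming the
nominal 30,000 km path length:

    # Land-speed-record metric: throughput times path length,
    # expressed in terabit-meters per second (Tb-m/s).
    def tb_m_per_s(gbps, km):
        bits_per_s = gbps * 1e9
        meters = km * 1e3
        return bits_per_s * meters / 1e12

    print(tb_m_per_s(7.67, 30000))  # ~230,100 Tb-m/s (standard TCP run)
    print(tb_m_per_s(9.08, 30000))  # ~272,400 Tb-m/s (modified TCP run)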

--bill

As one of the poor bastards still involved in rolling out VoIP over
satellite-delivered IP at the moment, I can safely say I'm (currently) happy
no one's trying to push H.323 over IPv6 over these small-sized satellite
links. Lord knows we have enough trouble getting concurrent calls through
20 + 20 bytes of overhead when the voice payload's -20- bytes.

(That said, I'd be so much happier if the current trend here wasn't to
-avoid- delivering serial ports for the satellite service so we can run
VoFR or PPP w/ header compression - instead we're presented with IP
connectivity only at either end. But you can't have everything...)

Adrian

Adrian Chadd wrote:

bmanning@karoshi.com writes:

The next day, the team used a modified version of TCP to achieve an even
greater record. Using the same 30,000 km path, they achieved a throughput of
9.08 Gbps, equivalent to 272,400 Tb-m/s, in both the IPv6 multi- and
single-stream categories. In doing so, the team surpassed the current IPv4
records, demonstrating that IPv6 networks can deliver performance equal to,
if not better than, IPv4.

Good job. Two questions, though:

(1) Do the throughput figures count only the data payload (i.e.,
anything above the TCP layer), or all the bits from the protocol
stack? If the latter, it seems a little unreasonable to credit
IPv6 with its own extra overhead -- though I'll concede that with
jumbo datagrams, that's not all that much.

(2) Getting this kind of throughput seems to depend on a fast
physical layer, plus some link-layer help (jumbo packets), plus
careful TCP tuning to deal with the large bandwidth-delay product.
The IP layer sits between the second and third of those three items.
Is there something about IPv6 vs. IPv4 that specifically improves
performance on this kind of test? If so, what is it?

Jim Shankland

Jim Shankland wrote:

bmanning@karoshi.com writes:

The next day, the team used a modified version of TCP to achieve an even
greater record. Using the same 30,000 km path, they achieved a throughput of
9.08 Gbps, equivalent to 272,400 Tb-m/s, in both the IPv6 multi- and
single-stream categories. In doing so, the team surpassed the current IPv4
records, demonstrating that IPv6 networks can deliver performance equal to,
if not better than, IPv4.
    
Good job. Two questions, though:

(1) Do the throughput figures count only the data payload (i.e.,
anything above the TCP layer), or all the bits from the protocol
stack? If the latter, it seems a little unreasonable to credit
IPv6 with its own extra overhead -- though I'll concede that with
jumbo datagrams, that's not all that much.

(2) Getting this kind of throughput seems to depend on a fast
physical layer, plus some link-layer help (jumbo packets), plus
careful TCP tuning to deal with the large bandwidth-delay product.
The IP layer sits between the second and third of those three items.
Is there something about IPv6 vs. IPv4 that specifically improves
performance on this kind of test? If so, what is it?

Jim Shankland
  
Also, it's a "modified" TCP, not just a tuned one. I wonder how modified it is? Will it talk to an unmodified TCP stack (whatever that really is)?

I wonder if the routers forward v6 as fast.

    --Steve Bellovin, http://www.cs.columbia.edu/~smb

Steven M Bellovin writes:

(2) Getting this kind of throughput seems to depend on a fast
physical layer, plus some link-layer help (jumbo packets), plus
careful TCP tuning to deal with the large bandwidth-delay product.
The IP layer sits between the second and third of those three items.
Is there something about IPv6 vs. IPv4 that specifically improves
performance on this kind of test? If so, what is it?

I wonder if the routers forward v6 as fast.

In the 10 Gb/s space (sufficient for these records, and I'm not
familiar with 40 Gb/s routers), many if not most of the current gear
handles IPv6 routing lookups "in hardware", just like IPv4 (and MPLS).

For example, the mid-range platform that we use in our backbone
forwards 30 Mpps per forwarding engine, whether based on IPv4
addresses, IPv6 addresses, or MPLS labels. 30 Mpps at 1500-byte
packets corresponds to 360 Gb/s. So, no sweat.

Routing table lookups(*) are what's most relevant here, because the other
work in forwarding is identical between IPv4 and IPv6. Again, many
platforms are able to do line-rate forwarding between 10 Gb/s ports.
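
To put those forwarding numbers in perspective, here is a rough
packets-per-second sketch (it ignores Ethernet preamble and inter-frame
gap, so the small-packet figures are slightly optimistic):

    # Packet rate needed to fill a 10 Gb/s port at various packet sizes,
    # counting only the bits in each packet (no preamble/inter-frame gap).
    LINE_RATE = 10e9  # bits per second

    for size in (64, 1500, 9000):  # bytes
        pps = LINE_RATE / (size * 8)
        print(f"{size:5d}-byte packets: {pps / 1e6:6.2f} Mpps")

    # Even the worst case here (~19.5 Mpps at 64 bytes, or the classic
    # ~14.88 Mpps once framing overhead is counted) fits well within a
    # 30 Mpps forwarding engine, and 30 Mpps of 1500-byte packets is the
    # 360 Gb/s quoted above.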

Tony Li writes:

Routing table lookups(*) are what's most relevant here, [...]

Actually, what's most relevant here is the ability to get end-hosts
to run at rate. Packet forwarding at line rate has been
demonstrated for quite a while now.

That's true (although Steve's question was about the routers).

The host bottleneck for raw 10Gb/s transfers used to be bus bandwidth.
The 10GE adapters in most older land-speed record entries used the
slower PCI-X, while this entry was done with PCI Express (x8) adapters.
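
The bus arithmetic is roughly as follows, assuming 64-bit/133 MHz PCI-X and
first-generation PCIe lanes at 2.5 GT/s with 8b/10b encoding:

    # Approximate usable bus bandwidth in Gb/s.
    pci_x_133 = 64 * 133e6 / 1e9          # 64-bit @ 133 MHz: ~8.5 Gb/s, shared/half-duplex
    pcie_x8_gen1 = 8 * 2.5e9 * 0.8 / 1e9  # 8 lanes @ 2.5 GT/s, 8b/10b: ~16 Gb/s per direction

    print(f"PCI-X 133 MHz: ~{pci_x_133:.1f} Gb/s")
    print(f"PCIe x8 gen1:  ~{pcie_x8_gen1:.1f} Gb/s")
    # PCI-X tops out below 10 Gb/s before any protocol overhead, while
    # PCIe x8 leaves headroom for a 10GE stream plus descriptor/DMA traffic.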

Another host issue would be interrupts and CPU load for checksumming, but
most modern 10GE (and also GigE!) adapters offload segmentation and
reassembly, as well as checksum computation and validation, to the adapter
if the OS/driver supports it.

The adapters used in this record (Chelsio S310E) contain a full TOE
(TCP Offload Engine) that can run the entire TCP state machine on the
adapter, although I'm not sure whether they made use of that.
Details are at

http://data-reservoir.adm.s.u-tokyo.ac.jp/lsr-200612-02/

Mind you, those crazy Japanese do this every year between Christmas
and New Year... ;) Most of the pipes they used also carry other
research traffic throughout most of the year... This year was even
more cumbersome because of some issues with the OC192s between
Amsterdam and the USA...

Kind regards,
JP Velders

Date: Tue, 24 Apr 2007 09:24:13 -0700
From: Jim Shankland <nanog@shankland.org>
Subject: Re: from the academic side of the house

(1) Do the throughput figures count only the data payload (i.e.,
anything above the TCP layer), or all the bits from the protocol
stack? If the latter, it seems a little unreasonable to credit
IPv6 with its own extra overhead -- though I'll concede that with
jumbo datagrams, that's not all that much.

Data payload is counted as bytes transmitted and received by iperf. So
application layer all the way.
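
For a sense of how little header overhead goes uncounted that way, a rough
calculation with base IP and TCP headers only (real packets carry a few
more bytes of TCP options):

    # Fraction of each packet spent on IP + TCP headers.
    def overhead(ip_hdr, mtu, tcp_hdr=20):
        return (ip_hdr + tcp_hdr) / mtu

    for mtu in (1500, 9000):
        print(f"MTU {mtu}: IPv4 {overhead(20, mtu):.2%}, "
              f"IPv6 {overhead(40, mtu):.2%}")
    # MTU 1500: IPv4 ~2.7%, IPv6 ~4.0%
    # MTU 9000: IPv4 ~0.4%, IPv6 ~0.7%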

(2) Getting this kind of throughput seems to depend on a fast
physical layer, plus some link-layer help (jumbo packets), plus
careful TCP tuning to deal with the large bandwidth-delay product.

That last part has been researched for quite some time already, though
mainly with "long" transatlantic layer 2 (Ethernet) paths.
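
The numbers that make that tuning hard are easy to sketch; assuming roughly
5 microseconds of one-way fiber latency per kilometer, a 30,000 km path
gives a round-trip time on the order of 300 ms:

    # Bandwidth-delay product for a long fat pipe: roughly the TCP window
    # (and socket buffer) needed to keep the path full.
    rate_bps = 9.08e9  # the record's throughput
    rtt_s = 0.300      # ~300 ms round trip over ~30,000 km of fiber

    bdp_bytes = rate_bps * rtt_s / 8
    print(f"BDP ~= {bdp_bytes / 2**20:.0f} MiB")  # on the order of 300+ MiB
    # Keeping a window that large full across loss events is where the
    # congestion-control behaviour matters at least as much as link speed.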

The IP layer sits between the second and third of those three items.
Is there something about IPv6 vs. IPv4 that specifically improves
performance on this kind of test? If so, what is it?

Not that was specifically mentioned for this test, I believe...

Kind regards,
JP Velders

Date: Thu, 26 Apr 2007 19:11:37 +0200
From: Simon Leinen <simon@limmat.switch.ch>
Subject: Re: from the academic side of the house

[ ... ]
Another host issue would be interrupts and CPU load for checksumming, but
most modern 10GE (and also GigE!) adapters offload segmentation and
reassembly, as well as checksum computation and validation, to the adapter
if the OS/driver supports it.

Correct, though sometimes the performance can vary wildly depending on
driver and firmware. I had to turn off all the checksumming with
certain driver versions for the Myrinet 10GE cards; later drivers
performed OK.

Also, having multi-CPU and multi-core machines helps enormously: binding
a NIC/driver to a certain core and having the receiving process on the
neighbouring core can dramatically increase throughput.
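
A minimal sketch of the process-pinning side of that on Linux, assuming
(purely for illustration) that the NIC's interrupts land on core 2 and we
want the receiver on the adjacent core 3:

    import os

    # Pin the current (receiving) process to core 3, next to the core that
    # is assumed to service the NIC's interrupts; the IRQ side itself is
    # steered separately (e.g. via /proc/irq/<n>/smp_affinity).
    os.sched_setaffinity(0, {3})
    print("running on cores:", os.sched_getaffinity(0))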

But per-adapter/driver differences in which offloading actually
increases performance mean this is still something of a black art,
necessitating painstaking testing...

The adapters used in this record (Chelsio S310E) contain a full TOE
(TCP Offload Engine) that can run the entire TCP state machine on the
adapter, although I'm not sure whether they made use of that.

I believe that they did not use it. In the past we also ran into
problems and performance dips when using the accelerated Chelsio
cards. Without TOE acceleration, things worked much more smoothly.

Kei's team did modify iperf to use mmap() to gain more zero-copy
behaviour; depending on drivers this can also give much better
performance, akin to, say, Myricom's claims for Myrinet...

Kind regards,
JP Velders

we -love- the crazy Japanese doing this kind of stuff.
  the US folks seem to have lost momentum in the past decade.
  while the pipes do get re-purposed on a regular basis, they
  do tend to shake out interoperability problems, as you note above.

  me, i await the spiral loop that includes the southern
  hemisphere ...

--bill