Software router state of the art

Date: Wed, 23 Jul 2008 16:51:50 -0400
From: "William Herrin" <herrin-nanog@dirtside.com>
Sender: wherrin@gmail.com

>> The first bottleneck is the interrupts from the NIC. With a generic
>> Intel NIC under Linux, you start to lose a non-trivial number of
>> packets around 700mbps of "normal" traffic because it can't service
>> the interrupts quickly enough.
>
> Most modern high-performance network cards support MSI (Message Signaled
> Interrupts), which generate real interrupts only on an intelligent basis
> and only at a controlled rate. Windows, Solaris and FreeBSD have
> support for MSI and I think Linux does, too. It requires both hardware
> and software support.

"ethtool -c". Thanks Sargun for putting me on to "I/O Coalescing."

But cards like the Intel Pro/1000 have 64k of memory for buffering
packets, both in and out. Few have very much more than 64k. 64k means
32k for tx and 32k for rx. That means you darn well better generate an
interrupt when you get near 16k, so that you don't fill the buffer
before the 16k you generated the interrupt for has been cleared. And
that means you're generating an interrupt at least once for every 10
or so 1500-byte packets.
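
As a back-of-the-envelope check on that arithmetic, here is a small
sketch. The 32k receive FIFO and the roughly-half-full interrupt
threshold come from the paragraph above; the 1 Gbit/s line rate and the
neglect of framing overhead are my own simplifying assumptions.

# Rough interrupt-rate estimate for a NIC with a small on-card rx FIFO.
# Assumed for illustration: 1 Gbit/s line rate, 1500-byte frames, 32 KB
# rx FIFO with an interrupt fired when it is about half full (~16 KB).
# Framing overhead (preamble, inter-frame gap) is ignored.

LINE_RATE_BPS = 1_000_000_000              # assumed 1 Gbit/s
FRAME_BYTES = 1500
RX_FIFO_BYTES = 32 * 1024
INTERRUPT_THRESHOLD = RX_FIFO_BYTES // 2   # fire at roughly half full

frames_per_interrupt = INTERRUPT_THRESHOLD // FRAME_BYTES
frames_per_second = LINE_RATE_BPS / (FRAME_BYTES * 8)
interrupts_per_second = frames_per_second / frames_per_interrupt

print(f"frames per interrupt:  {frames_per_interrupt}")         # ~10
print(f"frames per second:     {frames_per_second:,.0f}")       # ~83,333
print(f"interrupts per second: {interrupts_per_second:,.0f}")   # ~8,333

That works out to thousands of interrupts a second at line rate, which
is why coalescing matters so much.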

You have just hit on a huge problem with most (all?) 1G and 10G
hardware. The buffers are way too small for optimal performance in any
case where the RTT is anything more than half a millisecond: you exhaust
the window and stall the stream.

I need to move multi-gigabit streams across the country and between the
US and Europe. Those are a bit too far apart for those tiny buffers to
be of any use at all. This would require 3 GB of buffers. This same
problem also makes TCP off-load of no use at all.
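
To put numbers on why on-card buffers in the tens of kilobytes fall
short at those distances, here is a bandwidth-delay-product sketch. The
rates and round-trip times are assumed example figures, not taken from
the thread.

# Bandwidth-delay product: how much data is in flight on a path, and
# therefore roughly how much buffering is needed to keep it full.
# The rates and round-trip times below are assumed example figures.

def bdp_bytes(rate_bps, rtt_seconds):
    """Return the bandwidth-delay product in bytes."""
    return rate_bps * rtt_seconds / 8

examples = [
    ("1 Gbit/s coast-to-coast, ~70 ms RTT",  1e9, 0.070),
    ("10 Gbit/s US-Europe, ~100 ms RTT",    10e9, 0.100),
]

for label, rate_bps, rtt in examples:
    mb = bdp_bytes(rate_bps, rtt) / 2**20
    print(f"{label}: about {mb:,.0f} MB in flight")

Even the smaller of those figures is a few hundred times larger than a
32 KB on-card buffer, which is the mismatch being described here.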

3 Gigabyte? Why?

The newer 40G platforms on the market seem to have abandoned the 600ms buffers typical in the 10G space, in favour of 50-200ms of buffers (I don't remember exactly).

Aren't there TCP implementations that don't use exponential window increase, but instead make smaller increments? I would have thought that would enable routers to still do well with ~50ms of buffering.
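
A toy comparison of what that question trades off, assuming a
hypothetical 10 Gbit/s path with a 100 ms RTT and 1500-byte segments;
this contrasts plain exponential growth with plain additive increase,
not any specific real-world TCP stack.

# Toy comparison: RTTs for a congestion window to reach the
# bandwidth-delay product under exponential growth (doubling each RTT,
# as in classic slow start) versus additive increase (one extra segment
# each RTT). All figures are assumed, for illustration only.

RATE_BPS = 10e9      # assumed 10 Gbit/s path
RTT_S = 0.100        # assumed 100 ms round-trip time
MSS = 1500           # segment size in bytes

target = RATE_BPS * RTT_S / 8 / MSS   # window (in segments) that fills the pipe

def rtts_to_reach(target, grow):
    """Count RTTs until the window reaches the target under a growth rule."""
    cwnd, rtts = 1.0, 0
    while cwnd < target:
        cwnd = grow(cwnd)
        rtts += 1
    return rtts

print(f"target window: {target:,.0f} segments")
print(f"doubling each RTT:   {rtts_to_reach(target, lambda w: w * 2)} RTTs")
print(f"+1 segment each RTT: {rtts_to_reach(target, lambda w: w + 1):,} RTTs")

Gentler growth is easier on shallow buffers because the sender never
doubles its burst in a single RTT, but the second number (tens of
thousands of RTTs, i.e. hours of wall-clock time at 100 ms each) shows
why purely linear growth can't fill a long fat pipe on its own.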

High-speed memory is very expensive, and a lot of applications today would prefer to have their packets dropped instead of being queued for hundreds of milliseconds. Finding a good tradeoff between the demands of different traffic types is quite hard...

Also, DWDM capacity seems to get cheaper all the time, so if you really need to move data at multi-gigabit speeds, it might make sense to just rent that 10G wave and put your own equipment there that does what you want.

Now, there is an exploit for it.

http://www.caughq.org/exploits/CAU-EX-2008-0002.txt

Robert D. Scott Robert@ufl.edu
Senior Network Engineer 352-273-0113 Phone
CNS - Network Services 352-392-2061 CNS Receptionist
University of Florida 352-392-9440 FAX
Florida Lambda Rail 352-294-3571 FLR NOC
Gainesville, FL 32611 321-663-0421 Cell

For anyone looking to use it, you MUST update the framework's
libraries. Some of the code it needs only came out ~5 hours ago.

    Tuc/TBOH