Linux Router: TCP slow, UDP fast

Hi All,

I'm losing the will to live with this networking headache ! Please feel free
to point me at a Linux list if NANOG isn't suitable. I'm at a loss where
else to ask.

I've diagnosed some traffic oddities and after lots of head-scratching,
reading and trial and error I can say with certainty that:

With and without shaping and over different bandwidth providers using the
e1000 driver for an Intel PRO/1000 MT Dual Port Gbps NIC (82546EB) I can
replicate full, expected throughput with UDP but consistently only get
300kbps - 600kbps throughput _per connection_ for outbound TCP (I couldn't
find a tool I trusted to replicate ICMP traffic). Multiple connections are
cumulative, each adding roughly another 300kbps - 600kbps. Inbound TCP is
slightly erratic at holding a consistent speed but manages the expected
15Mbps, a far cry from 300kbps - 600kbps.

The router is a quad-core box sitting at essentially no load and there's very
little traffic being forwarded back and forth. The NIC driver is compiled
into the kernel ('built-in') with its parameters at defaults. NAPI is not
enabled though (enabling it requires a reboot, which is a problem as this box
is in production).

The only other change to the box is that over Christmas IPtables
(ip_conntrack and its associated modules mainly) was loaded into the kernel
as 'built-in'. There's no sign of packet loss on any tests and I upped the
conntrack max_connections size suitably for the amount of RAM. Has anyone
come across IPtables without any rules loaded causing throughput issues ?
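
For reference, the conntrack table size is a sysctl whose exact name depends
on the kernel version; the value below is purely illustrative, not what I
actually set:

  # older 2.6 kernels with ip_conntrack built in
  echo 131072 > /proc/sys/net/ipv4/ip_conntrack_max
  # newer kernels using nf_conntrack
  sysctl -w net.netfilter.nf_conntrack_max=131072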

I've also changed the following kernel parameters with no luck:

  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216

  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216

  net.ipv4.tcp_no_metrics_save = 1

  net.core.netdev_max_backlog = 2500

  echo 0 > /proc/sys/net/ipv4/tcp_window_scaling
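
(If anyone wants to reproduce these, the values above can be made persistent
by adding them to /etc/sysctl.conf and reloading with:

  sysctl -p

instead of setting them one by one at runtime.)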

It feels to me like a buffer limit is being reached 'per connection'. The
throughput spikes at around 1.54Mbps and TCP backs off to about 300kbps -
600kbps or so. What am I missing ? Is NAPI that essential for such low
traffic ? A very similar build moved far higher throughput on cheap NICs.
MTU is at 1500, txqueuelen is 1000.

Any help would be massively appreciated !

Chris

Try enabling window scaling:
  echo 1 > /proc/sys/net/ipv4/tcp_window_scaling
or, if you really want it disabled, configure a larger minimum window size:
  net.ipv4.tcp_rmem = 64240 87380 16777216

HTH,
Lee

I'm losing the will to live with this networking headache ! Please feel free
to point me at a Linux list if NANOG isn't suitable. I'm at a loss where
else to ask.

The linux-net list might be more appropriate indeed.

With and without shaping and over different bandwidth providers using the
e1000 driver for an Intel PRO/1000 MT Dual Port Gbps NIC (82546EB) I can
replicate full, expected throughput with UDP but consistently only get
300kbps - 600kbps throughput _per connection_ for outbound TCP

I've seen this behavior as the result of duplex mismatches.

(The tcp settings are end system matters and do not affect how the
router forwards traffic.)

Without window scaling, you're limited to 64k window size anyway.

Chris, what is the round trip delay between the machines involved in your TCP session?

Thanks loads for the quick replies. I'll try and respond individually.
Lee > I recently disabled tcp_window_scaling and it didn't solve the
problem. I don't know enough about it. Should I enable it again ? Settings
differing from defaults are copied in my first post.

Mike > Strangely I'm not seeing any errors on either the ingress or egress
NICs:

          RX packets:3371200609 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3412500706 errors:0 dropped:0 overruns:0 carrier:0

The only errors I see anywhere are similar on both NICs. Both connect to the
same model of switch with the same default config:

     rx_long_byte_count: 1396158525465
     rx_csum_offload_good: 3341342496
     rx_csum_offload_errors: 89459

It may also be worth noting that flow control is on. Is the count below a
reasonable level of pause frames to be seeing ? They seem to be higher on
non-routing boxes.

  Total bytes (TX):              2466202288
  Unicast packets (TX):          3436389971
  Multicast packets (TX):        213310
  Broadcast packets (TX):        4952902
  Single Collision Frames (TX):  0
  Late Collisions (TX):          0
  Excessive Collisions (TX):     0
  Transmitted Pause Frames (TX): 27806
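
On the Linux side, rough per-NIC equivalents of those counters come from the
driver statistics; a sketch, assuming eth0:

  ethtool -S eth0 | egrep -i 'pause|err|drop'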

Florian > They're running without obvious errors. Auto Neg has taken 1Gbps,
Full. Can Auto Neg cause these symptoms do you think ?

Thanks again,

Chris

Hello, Chris,

So, as it seems you have a problem with TCP and not UDP, maybe this is
something to do with TCP segmentation offloading.

It could be a total shot in the dark, but can you see what
ethtool -k <devname> says?

Then you can have a look at 'man ethtool' and turn on/off the appropriate
stuff.
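
For example (eth0 and the specific offloads here are just the usual
suspects, not a definitive prescription):

  ethtool -K eth0 tso off
  ethtool -K eth0 gso off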

Thanks, Nickola.

What's your opinion on these settings ? Do you recommend switching off "tcp
segmentation offload" ?

Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on

Thanks again,

Chris

Thanks loads for the quick replies. I'll try and respond individually.
Lee > I recently disabled tcp_window_scaling and it didn't solve the
problem. I don't know enough about it. Should I enable it again ? Settings
differing from defaults are copied in my first post.

I don't know if the tcp window size makes any difference when the box
is acting as a router. But when UDP works as expected & each
additional TCP connection gets 300-600kbps the first thing I'd look at
is the window size. If it was a duplex mismatch, additional TCP
connections would make things worse instead of each getting 300-600kbps of
bandwidth.

The only other change to the box is that over Christmas IPtables
(ip_conntrack and its associated modules mainly) was loaded into the kernel

If all else fails, backing out recent changes usually works :)

Regards,
Lee

Thanks very much, Lee. My head's whirring. Am I right in thinking that by
turning on scaling (which I just did) the window size is set automatically ?
I'll do some more reading.
I'm looking at TSO too as above, mentioned by Nickola. I'll maybe risk
changing it with ethtool during a quiet network moment.
I've just discovered the netstat -s command which gives loads more info than
anything else I've come across. Any pointers about window size or TSO from
the output appreciated :)
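
A first rough pass over that output, as a sketch, is just to look at the
retransmission and timeout lines:

  netstat -s | egrep -i 'retrans|timeout'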
Thanks again,

Chris

No. Scaling just allows you to have a window size larger than 64KB.

These might help
  http://www-didc.lbl.gov/TCP-tuning/troubleshooting.html
  http://www-didc.lbl.gov/TCP-tuning/linux.html

Regards,
Lee

Thanks a lot, Lee.

Hi Mikael,
I just realised that I didn't respond to your post.

The RTTs vary massively because the router is forwarding from websites on
the LAN to visitors worldwide. Is that what you meant ?

Disabling TSO didn't work unfortunately.

Thanks again,

Chris

I'm looking at TSO too as above, mentioned by Nickola. I'll maybe risk
changing it with ethtool during a quiet network moment.

Turning off offloading might be something to try indeed.

Regarding the negotiation issue, can you look at the other end of the
link and check what it's saying?
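
On the Linux side that is just, for example:

  ethtool eth0    # negotiated speed/duplex and link partner advertisements

with the switch port checked from the switch's own management interface.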

Looking at "netstat -s" statistics at the endpoint (not the router)
could be illuminating, too. I haven't got any expertise in this area,
but TCP problems can often be diagnosed by looking at tcpdump/packet
captures and analyzing them using tcptrace (and the special xplot
variant which can plot tcptrace output).
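
A minimal capture-and-analyse sketch (interface, host and filename are just
placeholders):

  tcpdump -i eth0 -s 0 -w slow-tcp.pcap host 192.0.2.10 and tcp
  tcptrace -l slow-tcp.pcap     # per-connection summary
  tcptrace -G slow-tcp.pcap     # emit graph files for xplot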

And your TCP speed when doing testing is always 300-600 kilobyte/s regardless of RTT between the boxes with which you're testing?

Without TCP window scaling turned on on the boxes doing TCP with each other, you're always limited to roughly 64kB/RTT (window divided by round-trip time) of transfer speed.
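
To put numbers on that, assuming a full 65535-byte window and no scaling:

  100 ms RTT:  65535 B / 0.1 s  ~= 640 KB/s  ~= 5.2 Mbit/s
  300 ms RTT:  65535 B / 0.3 s  ~= 213 KB/s  ~= 1.7 Mbit/s

so the ceiling only falls to the 300-600kbps range at RTTs approaching a
second, or with a much smaller effective window.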

Changing window scaling on the Linux router will of course not change the behaviour of the traffic going through it, only the TCP sessions that the router itself originates or terminates.

Thanks for all the answers. I'm currently going down the path of looking at
IPtables' conntrack slowing the forwarding rate.

If I can't find any more docs then I'll boot the router with a kernel
without any IPtables built-in and see if that's it.

As Lee said "rollback" ! That's the last change to the box. If I can rule
out the logging of traffic from conntrack is slowing down
the forwarding then I can look into hardware further :wink:

Chris

If this router is not doing some kind of proxying, tuning tcp related kernel bits will not impact "long fat pipe" or "long fat network" issues.

The place that needs to be tuned for larger window sizes/scaling is the web server.

http://www.psc.edu/networking/projects/tcptune/#Linux

or search for "Linux TCP tuning" or "large fat pipes"

Also make sure your firewall isn't "helping" you out by "cleaning up" the TCP SYN/ACK sequence and fiddling with the window scaling stuff.

Also if you have load balancers, they might break this stuff as well.
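
One way to check that, as a sketch: capture just the handshake on each side
of the firewall (interface name is an example) and confirm the wscale option
survives in both the SYN and the SYN/ACK:

  tcpdump -vni eth0 'tcp[13] & 2 != 0'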

Good luck,
-Allen

Chris wrote:

Thanks, Karl, Allen and Nickola.

I failed over to another router last night and briefly had the full expected
throughput, but this morning, despite dropping providers and moving between
routers again for trial and error, I still see _outbound_ TCP at about the
same 300 - 600kbps per session.

I eliminated the conntrack modules first, then iptables as a whole. I've
eliminated TSO and checksumming (which caused very sticky connections) on
the e1000 NIC.

The failover router has a slightly older kernel and was working before
Christmas, so it's most likely not a kernel version issue. I've also tried
removing FIB_TRIE as a stab in the dark with no success. And the failover
router connects using FE not GE so I've eliminated NICs and connection
speeds to a front-facing switch.

The only constant is the front-facing switch (it's negotiating perfectly at
FD though) so all I can think of is removing that from the equation.

It's definitely only _outbound_ TCP getting buffered though ! I've pushed
92Mbps on a FE link with UDP and uploaded at 16Mbps on a 16Mbps link.

Any last ideas would be appreciated before I cause headaches by removing
switches.

Thanks,

Chris

Thanks for the suggestions everyone.
I've got to the bottom of the problem now (I'm sure there will be a
collective sigh of relief from the list because of the noise this thread
generated :-)).

I installed two brand new, low spec, 3Com switches one at the 'front' of the
network and one 'behind' the routers. They are the same model, same latest
firmware, same config (saved to and then copied off disk) and only their IPs
were different.

The front switch was the problem. As two final tests before removing it, we
switched off unicast/multicast/broadcast storm control and flow control
(and simultaneously on the corresponding port of the switch behind the
routers), because pause frames were showing, though not a massive amount in
percentage terms.
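
For anyone following along, flow control on the Linux side of such a link
can be checked and toggled per NIC (eth0 is just an example; the switch end
has to be changed from its own management interface):

  ethtool -a eth0                            # show current pause settings
  ethtool -A eth0 autoneg off rx off tx off  # disable pause frames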

The switch behind the routers, however, is serving the same bandwidth equally
well ! We've put an ancient switch in place of the front switch and it's been
working perfectly so far.

My lesson learned is to change as little as possible at once ! That said,
recent network changes were spread about a month apart, and this very odd
issue was far easier to dismiss than to believe due to its bizarre nature,
especially when providers' own network conditions were changing as testing
was being done.

I really appreciate all the input and have learnt loads, possibly just not
in the way I would have liked to :)

Doubles all round,

Chris

The TCP offloading should be suspect. Any current PC hardware should
be able to deal with huge amounts of traffic without any offloading.
Start with turning that off, so everything will be handled by Linux
directly. Even if you still have the problem, it's easier to troubleshoot
without magic black boxes (TOE) in between.

   -Geert