TCP Window Scaling issue

Hello,

I know this isn't precisely on topic but I'm having an issue that I could
use some assistance with.

I'm currently seeing a very interesting issue for a single server. File
transfers from Server A to Server B are relatively slow and not using up
much of the circuit. Upon further inspection the TCP window size remains at
default 65535 and window scaling doesn't negotiate. What's interesting is
this is only affecting a single server and only when traffic is going over
the WAN circuit. Testing from Server A to any server on its network shows
it is negotiating window scaling just fine.

Below I'll try and draw out a better idea of what is happening. Let the
letters represent the server in question and let the .# represent which
subnet they are on to show whether traversal of the WAN circuit is
occurring.

Server A.1 -> Server B.2 = No window scaling

Server A.1 -> Server C.1 = Window scaling

Server B.2 -> Server A.1 = Window scaling

Server C.1 -> Server B.2 = Window scaling

The net result here is that when window scaling is properly being used I'm
seeing about 30-40 Mbps of bandwidth usage; without scaling I'm only seeing
2.8 Mbps. Any thoughts?
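
The figures above can be sanity-checked with the usual window/RTT bound. This is only a rough sketch: the round-trip time is not stated anywhere in the thread, so it is inferred here on the assumption that the slow transfer is limited purely by the 65535-byte window. The function names are made up for illustration.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound for one TCP flow: at most one window per round trip."""
    return window_bytes * 8 / rtt_seconds


def implied_rtt_seconds(window_bytes: int, throughput_bps: float) -> float:
    """RTT that would explain a given throughput if the window is the only limit."""
    return window_bytes * 8 / throughput_bps


def window_needed_bytes(throughput_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: window required to sustain a target rate."""
    return throughput_bps * rtt_seconds / 8


UNSCALED_WINDOW = 65535  # largest window expressible without scaling

# Assumption: the 2.8 Mbps transfer is window-limited, nothing else.
rtt = implied_rtt_seconds(UNSCALED_WINDOW, 2.8e6)
needed = window_needed_bytes(40e6, rtt)

print(f"implied RTT: {rtt * 1000:.0f} ms")
print(f"window needed for 40 Mbps at that RTT: {needed:.0f} bytes")
```

If the measured RTT across the circuit is much lower than the implied value, the window alone does not explain the slow transfer and something else (loss, buffers on either host) is worth a look.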

Check your firewall isn't buggering about with TCP options.

Tony.

Hi Tony. No firewall in the way.

Physical flow is as below.

Server A -> Nexus 7k -> 3845 router -> Sprint MPLS -> 3845 router -> Cisco
3750x stack -> Server B

This, exactly. I diagnosed this issue a while back with our Checkpoint
firewall - it didn't understand TCP window scaling so it would blindly
zero out the field and cause nightmares.

M.

I blame the cloud.

Dump the actual packets as they leave Server A and arrive at Server B
(and vice-versa!). Does it get modified en route?

M.
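
When comparing the SYN as it leaves one side with the same SYN as it arrives at the other, the thing to look at is the TCP options area. The sketch below walks the options of a raw TCP header and reports the window scale shift, if present. The option kind numbers (2 = MSS, 3 = window scale, 4 = SACK permitted, 8 = timestamps) are the standard ones; the sample ports, sequence number and option values are invented for the example and are not taken from the captures in this thread.

```python
import struct

OPTION_NAMES = {2: "mss", 3: "window_scale", 4: "sack_permitted", 8: "timestamps"}


def parse_tcp_options(tcp_header: bytes) -> dict:
    """Return {option name or kind: raw value bytes} for a raw TCP header."""
    data_offset = (tcp_header[12] >> 4) * 4   # header length in bytes
    options = tcp_header[20:data_offset]      # everything after the fixed part
    found = {}
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                         # end of option list
            break
        if kind == 1:                         # NOP, single byte of padding
            i += 1
            continue
        if i + 1 >= len(options):
            break                             # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                             # malformed length, stop parsing
        found[OPTION_NAMES.get(kind, kind)] = options[i + 2:i + length]
        i += length
    return found


def window_scale_shift(tcp_header: bytes):
    """Shift count from the window scale option, or None if it is absent."""
    value = parse_tcp_options(tcp_header).get("window_scale")
    return value[0] if value else None


def build_syn(options: bytes) -> bytes:
    """Build a made-up SYN header for testing (checksum left at zero)."""
    words = (20 + len(options)) // 4
    fixed = struct.pack("!HHIIBBHHH", 49152, 445, 1000, 0,
                        words << 4, 0x02, 65535, 0, 0)
    return fixed + options


# MSS 1460, NOP, window scale 7, NOP, NOP, SACK permitted
with_scaling = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 7, 1, 1, 4, 2])
# MSS only, as a SYN would look with the option missing or stripped
without_scaling = bytes([2, 4, 0x05, 0xB4])

print(window_scale_shift(build_syn(with_scaling)))
print(window_scale_shift(build_syn(without_scaling)))
```

Running the same check on the SYN captured at each end shows whether the option was never sent or was removed somewhere along the path.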

Hi Michael,

Let me set up another packet capture at each side to see if the initial
packets are being modified at all.

Thanks,

Also, just to reiterate: I would lean more heavily on something fishy in
the WAN cloud if all traffic from Site 1 to Site 2 were not seeing TCP
window scaling properly. However, only Server A is seeing this. Server A
is able to properly TCP window scale for any local traffic.

Remember, the WAN cloud is just that, a cloud;
it's not likely to be a single link underneath it all;
so one bad link/bad port/bad device in the cloud
can affect just a sub-portion of the traffic, depending
on the 5-tuple hashing that takes place.

An interesting test would be to give server A
a different address (secondary address should be
fine, all you need to do is source packets from a
different source address) and see if your scaling
suddenly reappears. If it does, it's definitely down
to the 5-tuple hashing happening within The Cloud(tm).

Matt
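
To illustrate the hashing point: a flow's 5-tuple is reduced to a number that picks one of several parallel links, so a different source address can land the same conversation on a different path. The sketch below uses CRC32 purely as a stand-in; real routers use vendor-specific hash functions and inputs, and the addresses, ports and link count here are made up (documentation ranges).

```python
import zlib


def pick_link(src_ip: str, dst_ip: str, proto: int,
              src_port: int, dst_port: int, link_count: int) -> int:
    """Map a 5-tuple onto one of link_count parallel links.

    CRC32 is an assumption for illustration, not any vendor's algorithm.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % link_count


LINKS = 4                      # hypothetical number of parallel paths
DST = "198.51.100.20"          # stands in for Server B
PRIMARY = "192.0.2.10"         # stands in for Server A's normal address
SECONDARY = "192.0.2.11"       # stands in for a secondary address on Server A

for src in (PRIMARY, SECONDARY):
    link = pick_link(src, DST, 6, 49152, 445, LINKS)
    print(f"{src} -> {DST} uses link {link}")
```

The same tuple always maps to the same link, which is why a single bad path can consistently hurt one host while its neighbours look fine.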

*First round of packet captures*

Here are the snippets from a packet capture.

First is the SYN from Server A to Server B: http://i.imgur.com/E5cu4ev.png
Here is the SYN from Server B back: http://i.imgur.com/RRSAl8G.png

Second test, from Server C to Server B. First is the SYN from Server C to
Server B: http://i.imgur.com/Jc2K6bT.png
And the SYN from Server B to Server C: http://i.imgur.com/pbvx9jJ.png

I guess I'm at a loss as to why in scenario 1 neither side is sending window
scaling at all. Is it because Server A isn't attempting or initializing?

I'm in the process of setting up a VM that I can SPAN for a capture from
the source of Server A. This will allow me to compare packets at each side.

*Second round of packet captures*

Now I just don't even know what is going on...

Is this quantum physics now? Did the state just change by me looking at it?
Here are some new screencaps. The only change that's been made was a SPAN
port enabled on the Nexus7k sourced at Server A and destination for my new
tcpdump capture server.

Site 1 captures: 1 http://i.imgur.com/K5r7FaG.png 2
http://i.imgur.com/wfnfLyi.png

Site 2 capture: 1 http://i.imgur.com/vpY2lnh.png 2
http://i.imgur.com/UyL3V6L.png

Now they are both communicating a window size. Speed is still slow at
400-450 KBps.
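
For reading the window values in the captures: the 16-bit window field is only multiplied by the shift count when both SYNs carried the window scale option, and the window field in the SYN segments themselves is never scaled. A small sketch of that rule follows; the function name and the sample shift of 7 are illustrative and not taken from the screenshots.

```python
from typing import Optional


def effective_window_bytes(window_field: int,
                           advertiser_shift: Optional[int],
                           peer_shift: Optional[int]) -> int:
    """Receive window actually offered on a non-SYN segment.

    advertiser_shift: shift the advertising host sent in its SYN (None if absent)
    peer_shift:       shift the other host sent in its SYN (None if absent)
    """
    if advertiser_shift is None or peer_shift is None:
        # Scaling is only in effect when both sides sent the option.
        return window_field
    return window_field << advertiser_shift


print(effective_window_bytes(65535, 7, 7))      # both sides negotiated scaling
print(effective_window_bytes(65535, 7, None))   # peer's SYN lacked the option
```

This only models the advertised window; loss, congestion control and socket buffer sizes are not covered and can still hold throughput down after scaling is negotiated.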

Was this captured with tcpdump on Server A on its way out, or on Server B
on its way in, or at some other point using a span port? The answer matters
if we're suspecting that something along the way is stomping the option....

All are from SPAN ports at each end. So for the second round of packet
captures Site 1 is from a SPAN port off the NIC of Server A. Site 2 is from
a SPAN port off the NIC of the MPLS router.

The first round of packet captures are only from the SPAN port off the MPLS
router at Site 2.

I have to dash out for a few hours; but the short
answer is the first round of packet captures
are too far from the host to matter.

The second set is doing better, but it would
still be best to compare with tcpdumps
from server A itself, to see what it
thinks it's sending out, vs what is seen
upstream of it. Can you grab tcpdumps
from server A itself?

Thanks!

Matt
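
Once there is a capture from the host and one from further upstream, the comparison boils down to diffing the options on the same SYN at both points. A minimal sketch, assuming the options have already been decoded into name/value dictionaries (the names and values below are invented for the example):

```python
def diff_tcp_options(sent: dict, seen: dict) -> list:
    """Describe how a SYN's options differ between two capture points.

    sent: options as captured on the sending host
    seen: options for the same SYN as captured upstream
    """
    changes = []
    for name in sorted(set(sent) | set(seen)):
        if name not in seen:
            changes.append(f"{name}: stripped in transit")
        elif name not in sent:
            changes.append(f"{name}: added in transit")
        elif sent[name] != seen[name]:
            changes.append(f"{name}: changed from {sent[name]} to {seen[name]}")
    return changes


# Hypothetical values, not read from the captures in this thread.
at_host = {"mss": 1460, "window_scale": 7, "sack_permitted": True}
upstream = {"mss": 1380, "sack_permitted": True}

for line in diff_tcp_options(at_host, upstream):
    print(line)
```

An empty result means the path left the options alone and the host itself is the place to look.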

I don't have root access to that server, but I should be able to get it and
then get some tcpdumps.

[Sorry about the null reply.]

I don't have enough data to make a detailed guess, but the broad brush is that A, B, the router A talks to, the router B talks to, or a router in the path has an ACL that knows A's IP or MAC address and has an unplanned effect.

I'd turn off all ACLs on the path and check again and, if clear, turn them back on one at a time. It is not going to be something you intended to do.

One other possibility: traffic is not being routed by the most direct, firewall-free route, but is being detoured through a firewall.

Or a transparent layer-2 firewall that's in-line somewhere in the path . . .

Are people still using "traffic shapers"?