Stumper

I have run into a problem that has me completely stumped, so I'm tossing it
out to NANOG for some help.

Before I lay out the specifics, I'm not trying to point fingers at any
particular ISP or vendor here, but this problem only exhibits itself in very
specific configurations. Unfortunately, the configuration is common enough as
to get unwanted attention from the higher-ups.

Here's the particulars:

Users that have Verizon DSL and a Linksys cable/DSL router have difficulties
accessing sites on my network -- whether they are trying with http, https,
smtp, pop3, ssh, ftp, etc., etc. Oh, but pings seem to be fine. Low latency,
no loss. This is true even for access to a server brought up in the DMZ, to
keep the firewalls out of the equation.

Doing some packet sniffing on the ethernet side of my router, I could see
specific http requests never showed up (and the user saw the broken image
icon). This was for an mrtg graph page with +/- 30 images. I saw the request
for almost all the image files, save for one and the user reported the broken
image icon for the one. So this looks and smells like a packet loss
issue..... but who/where/how?

Taking the Linksys out of the picture (connecting their PC directly to the
Verizon DSL modem) makes the problem go away.

These same users report no trouble whatsoever accessing many other common
sites across the internet.

Here's another interesting data point: when one user runs Morpheus (on
any machine in his home network) he then has absolutely no problems accessing
servers/services on my network.

Other users with Linksys routers and, say, a cable modem do not have this
problem!

So I'm looking for some pointers. What could I have done to my edge router (a
Cisco 3640 if that helps any) that would make it drop packets from Verizon DSL
customers with Linksys routers so long as they aren't running Morpheus?

Mark J. Scheller (scheller@u1.net)

Are there sub-1500 byte MTUs anywhere and is one of the devices
(Linksys?) dropping the relevant ICMP "fragmentation needed" messages?

Morpheus might be working by not having DF bit set..

just a possibility

test by removing any filtering of icmp

Steve

Could this be a packet size issue ?
You might try

ping -s

and see if, say, 1500 byte and 4500 byte packets get through.

                                  Regards
                                  Marshall Eubanks

T.M. Eubanks
Multicast Technologies, Inc
10301 Democracy Lane, Suite 410
Fairfax, Virginia 22030
Phone : 703-293-9624 Fax : 703-293-9609
e-mail : tme@multicasttech.com
http://www.multicasttech.com

Test your network for multicast :
http://www.multicasttech.com/mt/
  Status of Multicast on the Web :
  http://www.multicasttech.com/status/index.html

:: Here's the particulars:
::
:: Users that have Verizon DSL and a Linksys cable/DSL router have
:: difficulties accessing sites on my network -- whether they are trying
:: with http, https, smtp, pop3, ssh, ftp, etc., etc. Oh, but pings
:: seem to be fine. Low latency, no loss. This is true even for access
:: to a server brought up in the DMZ, to keep the firewalls out of the
:: equation.
::

Have the user update their linksys firmware. I see this problem all the
time. Linksys soho gateways are notorious for their early firmware not
sending fragments with proper headers. Any acl that does not allow *all
frags* by default will deny their packets. There may be other issues as
well, but the firmware update tends to fix all of the problems.

-jba

Definitely sounds like an MTU problem. I have seen IPSEC break across
Verizon DSL with a Linksys router until the MTU on the PCs was dropped
to just under 1500 bytes to allow for the IPSEC header.

DJ

The Linksys does have an MTU setting, and I've had my users try some lower
settings to see if it made any differences. One user set the MTU on the
Linksys as low as 1200 with no noticeable improvement.

Anything else I should look at?

mS (scheller@u1.net)

If the MTU is not helping then go get the latest firmware. Also you cannot use port forwarding in most Linksys routers with DHCP enabled. For those routers you have to set everyone statically and turn off DHCP for port forwarding to work.

Mark J. Scheller wrote:

This would depend upon the direction of the packets that are dropped and where
the broken device is.

If the 1500 byte packets are coming in from the Internet and the Linksys needs
to forward them onto a smaller-MTU medium but finds the DF bit set, it will
return an ICMP "fragmentation needed" message. If this ICMP message is then
dropped on the way back, then you'll see what you describe.

If the Linksys, or the device in front of it, allows it, clear the DF bit on
inbound packets.

Steve

If you're using path MTU discovery (in other words, sending out packets
with the DF bit set), it works like this: The host on each end of the
connection has an MTU configured in its TCP stack, so on initial
connection (generally with very small syn/ack packets), the packet size
gets negotiated and set to the lower of those two numbers. If all the
router interfaces in between have an MTU equal or greater than the MTU
that gets negotiated between the hosts, packets will continue to flow at
that size without incident. In general, when you're dealing with two
ethernet connected hosts with MTUs of 1500 bytes, and a bunch of routers
in between with MTUs of greater than 1500, this is what happens. However,
if there's a network link in the path with an MTU smaller than the MTUs of
the two end devices, the large packets sent by the end devices won't be
able to pass through that link. Instead, the router with the small MTU
link sends an ICMP response back to the sending host, requesting smaller
packets. The sending host retries with progressively smaller packets
until arriving at a size that works.
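
The exchange described above can be sketched as a toy simulation. This is
only an illustration, not real packet code: the function names and the list
of link MTUs are made up for the example, and it assumes the router reports
the MTU of the too-small link in its ICMP "fragmentation needed" message (as
RFC 1191 routers do), so the sender can jump straight to that size.

```python
def send_with_df(packet_size, link_mtus):
    """Walk the path hop by hop; return (delivered, reported_mtu).

    If a link's MTU is smaller than the packet, the router in front of it
    drops the packet and reports that link's MTU back to the sender
    (ICMP type 3, code 4, "fragmentation needed and DF set").
    """
    for mtu in link_mtus:
        if packet_size > mtu:
            return False, mtu
    return True, None


def discover_path_mtu(initial_mtu, link_mtus, icmp_delivered=True, max_tries=10):
    """Return the packet size that gets through, or None for a black hole."""
    size = initial_mtu
    for _ in range(max_tries):
        delivered, reported_mtu = send_with_df(size, link_mtus)
        if delivered:
            return size
        if not icmp_delivered:
            # The ICMP message is filtered somewhere, so the sender never
            # learns about the drop and keeps retransmitting at the same
            # size until it gives up.
            continue
        size = reported_mtu
    return None


if __name__ == "__main__":
    path = [1500, 1492, 1500]  # hypothetical path with one PPPoE hop
    print(discover_path_mtu(1500, path))
    print(discover_path_mtu(1500, path, icmp_delivered=False))
```

With the ICMP messages getting through, the sender settles on the smallest
link MTU in the path. With them filtered, it never gets past the first
oversized packet, which is the "black hole" case.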

Therefore, I think the scenario that people are describing here is this:
The user's computer is talking to the Linksys across a regular ethernet
with an MTU of 1500. The host on your network probably also has an MTU of
1500. The Linksys is talking to the DSL provider via PPPoE, and thus has
an MTU of 1492. The connection starts out with an initial MTU of 1500 in
each direction, but 1500 byte packets can't pass through the 1492 byte MTU
of the connection between the Linksys and the DSL provider. Therefore,
the devices on the two ends of that link would be sending back ICMP
messages requesting smaller packets. If all ICMP were being blocked
somewhere, those ICMP messages wouldn't arrive, and the host that wasn't
receiving them would keep obliviously sending out 1500 byte packets until
the connection timed out. But, if you were plugging the client computers
directly into the DSL line and running PPPoE on them, you'd have the 1492
byte MTU negotiated from the start and everything would work. In this
scenario, decreasing the Linksys's MTU wouldn't help you, because the
problem would already be that the MTU on the Linksys was smaller than the
MTU on the end points. Decreasing the MTU on the end points would help.
What would help even more would be fixing the ICMP filtering.
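
The numbers in this scenario work out as follows. This is a small sketch
assuming plain IPv4 and TCP headers with no options (20 bytes each) and the
usual 8 bytes of PPPoE overhead; the helper names are made up for the
example.

```python
ETHERNET_MTU = 1500
PPPOE_OVERHEAD = 8   # 6-byte PPPoE header + 2-byte PPP protocol ID
IPV4_HEADER = 20     # assumes no IP options
TCP_HEADER = 20      # assumes no TCP options


def pppoe_mtu(ethernet_mtu=ETHERNET_MTU):
    """MTU left for IP packets once PPPoE is carried over Ethernet."""
    return ethernet_mtu - PPPOE_OVERHEAD


def tcp_mss(mtu):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fits(packet_size, link_mtu):
    """True if a packet of this size can cross the link unfragmented."""
    return packet_size <= link_mtu


if __name__ == "__main__":
    print(pppoe_mtu())                # MTU of the PPPoE link
    print(tcp_mss(ETHERNET_MTU))      # MSS a 1500-byte host advertises
    print(tcp_mss(pppoe_mtu()))       # MSS that would fit the PPPoE link
    print(fits(ETHERNET_MTU, pppoe_mtu()))
```

A host behind the Linksys advertises an MSS based on its own 1500 byte
Ethernet MTU, which is 8 bytes more than the PPPoE link can carry, so
full-sized packets with DF set depend on the ICMP messages getting through.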

Now that I've said all that, this scenario doesn't really fit what you're
seeing. You said your packet sniffer showed no packets coming across, but
TCP connections don't generally start out with 1500 byte packets. In
general, when you see a path MTU discovery issue, you see the connection
being successfully opened, small packets (containing such small bits of
data as "GET /") flowing freely, and then the connection freezing when a
big burst of data gets sent for the first time.

Since that's not what you're seeing here, I'm more inclined to agree with
those who have suggested upgrading the Linksys's firmware. I don't have
any experience with that -- The Linksys NAT box on my home network works
fine and I haven't had any reason to mess with it -- but it does seem like
a far more plausible explanation for what you're seeing.

-Steve

MTU on user-end shouldn't really be an issue here.. B/c if so, then (I am only assuming this) how could they access other sites like yahoo.com, etc? I am sure your web site is no different than other common ones.

Linksys routers have various issues. The best bet is to go after the firmware and make sure it's up to date. But yet they have no problems accessing other sites?? Hmm.

This is probably not the cause of the issue but just in case --- You may wanna check to make sure that your server does not have ECN enabled. I've experienced some firewalls/internet sharing devices misbehaving whenever trying to connect to an ECN-enabled server. Again, this is probably not it, but just one of the things to try out, if you run out of other clues...

-hc

Mark J. Scheller wrote:

Well, you're forgetting that odd things tend to happen if MTU on one
side of the connection doesn't agree with the other side of the
link. (MTU is not a function of the transmitter or the receiver but,
rather, a function of the link you're operating on.) We all know what
kind of screwy things can happen when they disagree. (Try holding a
BGP link up over a connection where they differ... t'ain't easy.)

One other possibility here:

I once had to deal with a problem where a particular link would
receive and send data, apparently, just fine. There were errors
counting up on one end, yes, but slowly enough that it didn't seem to
indicate a real source of a problem. Well, DNS queries would go out
just fine but the responses never made it back over this link. As I
said, once connected via IP instead of domain name, everything seemed
to progress just fine. After much head scratching and a complete
visual inspection of every device in the circuit, turns out that two
pieces of gear in the middle were misoptioned.

Very weird but it did happen.

Not that I suspect that to be the problem here (I'm firmly in the MTU
court on this one until evidence shows otherwise) but it is a
possibility..

-Wayne

We used to have that problem here. A big customer of ours runs many GRE
tunnels. The problem seemed to be that they were blocking ICMP, so routers
could not learn of any MTU change along the path, which made the far end
unreachable (we actually saw the packets just before they entered the
tunnel). Try this: ping with different packet sizes and you will find this
problem. The solution was to allow the ICMP destination-unreachable (dunr)
type packets.
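
One way to automate the "ping with different packet sizes" test is a binary
search over the payload size. The sketch below is a rough illustration and
rests on some assumptions: it builds a command line for the Linux iputils
ping (`-M do` sets the DF bit, `-s` sets the ICMP payload size), other
platforms use different flags, and the function names are invented for the
example. The search itself takes any probe function, so it can be tried
without touching the network.

```python
import subprocess

IP_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP echo header


def ping_command(host, payload):
    """Build a Linux iputils ping command: DF set, one probe, 2 s timeout."""
    return ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host]


def probe_with_ping(host):
    """Return a probe(payload) -> bool function that shells out to ping."""
    def probe(payload):
        result = subprocess.run(ping_command(host, payload),
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        return result.returncode == 0
    return probe


def largest_payload(probe, low=0, high=1472):
    """Binary-search the largest payload for which probe() succeeds.

    Returns None if even the smallest payload fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def path_mtu_from_payload(payload):
    """Convert the largest working ICMP payload into an IP packet size."""
    return payload + IP_ICMP_OVERHEAD
```

Against a real host this would be called as
`largest_payload(probe_with_ping("server.example.net"))`, where the host
name is a placeholder. A result well below 1472 points to a smaller MTU
somewhere in the path. Note that a single lost ping would skew the search,
so it is worth repeating the run.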

Sounds similar to my problem with the Linksys cable/DSL router.
My problem was that it would work perfectly with NAT enabled, but
the minute I turned NAT off, I couldn't get to a lot of sites.
I tried a number of firmwares. I even tried to get support from
Linksys. But, after a week without any returned phone calls,
I returned the unit. I do know that there is a working firmware
for this configuration, but there is no information that I could
find on down-reving the unit.

My solution was to get rid of the POS and use one of my Linux
servers to do the pppoe.

Thanks,
  Dennis