Consumer networking head scratcher

Hi everyone,

I've got a real head scratcher that I have come across after replacing
the router on my home network.

I thought I'd share because it is a fascinating issue to me.

At random times, my Windows machines (Win 7 and Win 10, attached to the
network via WiFi, 5GHz) lose connectivity to the Internet. They can
continue to access internal resources, such as the router's admin
interface. Other devices including Macs, iPhones, Android phones, and
Rokus never have this issue.

I realized that on the Windows machines, when the connection drops, if I
run a traceroute, it dies at the same hop every time (out in the network
of Comcast, my ISP), even though a Mac sitting right next to it is able
to go all the way through to the destination.

The even stranger thing I discovered last night is that if I trace to
the hop before the hop that it dies at, it then dies at the hop before
that (and as I trace to closer and closer hops, it dies the hop before
that!)

This is illustrated in the traces I've captured here:
http://pastebin.com/raw/R1UHLi0U

For what it's worth, the router is a Linksys EA7300 that I just picked
up.

I can't even imagine what would cause this issue at this point. If
anyone has any thoughts, I'd love to hear them!

I'm going to start studying some packet captures to see if I can spot an
issue.

Best,
Ryan

That's strange... it's like the TTL on all Windows IP packets is decrementing more and more as time goes on, causing you to get fewer and fewer hops into the Internet.

I wonder if it's a bug/virus/malware affecting only your windows computers.

-Aaron

The issue doesn't happen with my previous router, and I've tested
multiple computers (one that isn't mine.)

It doesn't seem like it decrements over time... it just dies sooner as I
trace further up the path. I can consistently die at the 7th hop if I
try to go to Google, but if I trace to the 6th hop, it'll die at the 5th
hop!

What's the old router make/model?
What's the new router make/model?

-Aaron

Hi Ryan,

Windows tracert uses ICMP echo-request packets to trace the path. It
expects either an ICMP time-exceeded message, an ICMP destination
unreachable message, or an ICMP echo-reply message to come back. The
final hop in the trace will return an ICMP echo-reply or an
unreachable-prohibited. The ones prior to the final hop will return a
time-exceeded if they return anything at all.

If the destination does not respond to ping, if those pings are
dropped, or if it responds with an unreachable that's dropped you will
not receive a response and the tracert will not find its end. That's
why you're seeing the "decrementing" behavior you describe.
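
To make that reasoning concrete, here is a minimal simulation in Python.
It sends no real packets. The hop names, and which hops answer what, are
invented for illustration, and simulated_tracert is a hypothetical
helper, not how tracert is actually implemented.

```python
# Each hop is (name, sends_time_exceeded, sends_echo_reply).
# All values below are made up; they are not taken from the real traces.
PATH = [
    ("home-router", True, True),
    ("isp-hop-2", True, False),   # reports TTL expiry, ignores pings to itself
    ("isp-hop-3", True, False),   # same
    ("destination", True, True),
]


def simulated_tracert(path, max_hops=10):
    """Return what a tracert-style tool would print for each TTL.

    The last entry of `path` is the trace target. Probes whose TTL runs
    out earlier get a time-exceeded (if that hop sends one); probes that
    reach the target only get an answer if the target replies to echo.
    """
    lines = []
    target_index = len(path) - 1
    for ttl in range(1, max_hops + 1):
        # A probe with spare TTL still stops at the target.
        hop = min(ttl - 1, target_index)
        name, sends_time_exceeded, sends_echo_reply = path[hop]
        if hop == target_index:
            if sends_echo_reply:
                lines.append(name)
                break  # trace complete
            lines.append("*")
        else:
            lines.append(name if sends_time_exceeded else "*")
    return lines


def last_visible_hop(lines):
    """Name of the last hop that answered, or None if nothing answered."""
    for line in reversed(lines):
        if line != "*":
            return line
    return None
```

Tracing to PATH as a whole finishes at "destination". Tracing to
PATH[:3] (so the target is "isp-hop-3", which ignores pings addressed to
it) shows "isp-hop-2" as the last hop that answers, and tracing to
PATH[:2] shows "home-router". Each trace appears to die one hop earlier
even though nothing in the path has changed.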

I have no information about whether comcast blocks pings to its routers.

Regards,
Bill Herrin

All the Comcast gear in the path from my home router to non-Comcast addresses
will quite cheerfully answer both pings and traceroutes, albeit rate-limited.

I see what you're saying, and that could explain the decrementing
behavior I'm seeing which ultimately is not a real indicator of the
problem I am having.

So in that case, I would be back to my original issue where I stop being
able to pass traffic to the Internet, and when that happens my
traceroute always dies at the same hop. After disconnecting and
reconnecting, the same traceroute will go all the way through.

Thanks for the thoughts.

Hi Ryan,

Next step: run Wireshark and see what you see during the traceroutes.
Are they leaving with a reasonable TTL? Is it certain that nothing
returns? Are the packets going to the ethernet MAC address you expect
them to?
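
If it helps to check those three things outside the Wireshark GUI, here
is a rough Python sketch that pulls the destination MAC, the IP TTL, and
the ICMP type out of one raw frame. It assumes an untagged Ethernet II
frame carrying IPv4; summarize_frame and the example frame are
hypothetical, with made-up addresses.

```python
import struct


def summarize_frame(frame: bytes) -> dict:
    """Extract the fields worth eyeballing from one Ethernet II / IPv4 frame."""
    if len(frame) < 34:  # 14-byte Ethernet header + minimum 20-byte IPv4 header
        raise ValueError("frame too short for Ethernet + IPv4")
    dst_mac, src_mac, ethertype = struct.unpack("!6s6sH", frame[:14])
    if ethertype != 0x0800:
        raise ValueError("not an IPv4 frame")
    ip = frame[14:]
    header_len = (ip[0] & 0x0F) * 4  # IHL field is in 32-bit words
    ttl = ip[8]
    protocol = ip[9]                 # 1 means ICMP
    icmp_type = None
    if protocol == 1 and len(ip) > header_len:
        icmp_type = ip[header_len]   # 8 = echo request, 0 = echo reply, 11 = time exceeded
    return {
        "dst_mac": ":".join(f"{b:02x}" for b in dst_mac),
        "src_mac": ":".join(f"{b:02x}" for b in src_mac),
        "ttl": ttl,
        "protocol": protocol,
        "icmp_type": icmp_type,
    }


# Hand-built example: an ICMP echo request leaving with TTL 3.
# MAC and IP addresses here are invented for illustration.
EXAMPLE_FRAME = (
    bytes.fromhex("001122334455")            # destination MAC (the router, hopefully)
    + bytes.fromhex("66778899aabb")          # source MAC
    + struct.pack("!H", 0x0800)              # EtherType: IPv4
    + struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 3, 1, 0,
                  bytes([192, 168, 1, 50]), bytes([8, 8, 8, 8]))
    + struct.pack("!BBHHH", 8, 0, 0, 1, 1)   # ICMP echo request (checksum left at 0)
)
```

Run summarize_frame over the outgoing probes and the destination MAC
should match the router's LAN interface, while the TTL should climb by
one per hop being probed.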

I had a fun problem once when I cloned some VMs but neglected to
change the source MAC address. They all seemed to work under light
load but get two downloading at once and suddenly they both
experienced major packet loss.

Regards,
Bill

Definitely the direction I'm going. Even aside from the traceroutes,
I'm going to capture some regular web traffic to see what is happening.
Planning to send traffic to a machine I control to see if any packets
are actually making it through at all.

I'm not sure if this new Linksys router has any packet capture ability
that is exposed to the end user, but I'd also love to be able to see what's
actually going through the router itself.

Thanks,
Ryan

On many non-Windows OSes (Mac OS X, Linux, FreeBSD, etc.) you can specify ICMP
traceroute using -I:

traceroute -I google.com

I wonder if this would replicate your experience with Windows tracert.

As for Windows reporting no Internet access: MS does two things to determine whether the machine has Internet access, as outlined here: https://technet.microsoft.com/en-us/library/cc766017(v=ws.10).aspx (I think that's still valid.)

From a console, can these two machines do the HTTP request and the DNS lookup when they tell you they're offline? Can the other machines do these two things when the Windows machines can't, or when the Windows machines report offline?
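
Here is a rough Python sketch of running those two checks by hand. The URL, expected body, hostname, and expected address are what I recall of the NCSI probes from the Vista/Win 7 era, so treat them as assumptions and verify them against the linked page (Windows 10 may use different hosts). The function names are mine, not Microsoft's.

```python
import socket
import urllib.request

# Assumed NCSI probe values (from memory; check the linked documentation).
NCSI_URL = "http://www.msftncsi.com/ncsi.txt"
NCSI_BODY = "Microsoft NCSI"
NCSI_DNS_NAME = "dns.msftncsi.com"
NCSI_DNS_ADDR = "131.107.255.255"


def http_get(url, timeout=5):
    """Fetch a URL and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii", errors="replace")


def resolve(name):
    """Resolve a hostname to a single IPv4 address string."""
    return socket.gethostbyname(name)


def ncsi_style_check(fetch=http_get, lookup=resolve):
    """Run both probes; any network error counts as a failed probe."""
    results = {}
    try:
        results["http"] = fetch(NCSI_URL).strip() == NCSI_BODY
    except OSError:  # URLError, timeouts and DNS failures are all OSError subclasses
        results["http"] = False
    try:
        results["dns"] = lookup(NCSI_DNS_NAME) == NCSI_DNS_ADDR
    except OSError:
        results["dns"] = False
    return results
```

Calling ncsi_style_check() with no arguments on an affected machine and on a working Mac at the same moment would show whether it is the HTTP probe, the DNS probe, or both that fail. The fetch and lookup parameters exist only so the logic can be exercised without touching the network.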

Way back when, I had a Netgear router. It ended up having a limit on its
NAT translation table, and when I had too many connections going at the
same time (or not yet timed out), I would lose connection. There was an
unofficial patch to the firmware (literally a patch to the code that
defined the table size) to increase that table to 1000, as I recall.

Does the Linksys have a means to display the NAT translation table and
see if maybe connections are lost when that table is full and lots of
connections have not yet timed out?
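
For anyone who hasn't run into this failure mode, here is a toy Python
model of it. The table size, the timeout, and the "refuse new flows when
full" behaviour are all assumptions for illustration; I have no idea how
the Linksys (or that old Netgear firmware) actually handles a full
table.

```python
class NatTable:
    """Toy fixed-size NAT table whose entries expire after being idle."""

    def __init__(self, max_entries, idle_timeout):
        self.max_entries = max_entries
        self.idle_timeout = idle_timeout  # seconds
        self.entries = {}                 # flow -> time it was last seen

    def _expire(self, now):
        # Drop every flow that has been idle for the full timeout.
        self.entries = {
            flow: seen for flow, seen in self.entries.items()
            if now - seen < self.idle_timeout
        }

    def open_flow(self, flow, now):
        """Return True if the flow gets translated, False if the table is full."""
        self._expire(now)
        if flow in self.entries or len(self.entries) < self.max_entries:
            self.entries[flow] = now      # new entry, or refresh an existing one
            return True
        return False                      # new connection silently fails
```

With NatTable(max_entries=2, idle_timeout=60), a third new flow is
refused while the first two are still within their timeout, yet existing
flows keep working. From the client side that looks a lot like "the
Internet stopped" until enough old entries age out.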

This all goes away when he reconnects his old router from what I remember...

If that is the case, then I would concentrate my effort on the new router,
and its functionality (or lack of). Could be something simple that you are
missing on it as a setting, or assuming it works a certain way when it does
not. Sometimes these devices can be counterintuitive.

Just a quick sanity check here, since I know we can occasionally overlook the simple things. You have updated the firmware to the latest available version, correct? Have you checked for any odd services like QoS, parental controls, or an IDS? Have you tried wiping it to factory default and reconfiguring it?

What happens if you give the affected machine a new IP? Could it be some service on the device affecting that specific IP?