Consumer networking head scratcher


You could have one or more botted machines launching outbound DDoS attacks, potentially filling up the NAT translation table and/or getting squelched by your broadband access provider with layer-4 granularity. And the boxes themselves could be churning away due to being compromised (look at CPU and memory stats over time). Aggressive horizontal scanning is often a hallmark of botted machines, and it can interrupt normal network access on the botted hosts themselves.

I don't actually think that's the case, given the symptomology you report, but just wanted to put it out there for the list archive.

What about DNS issues? Are you sure that you really have a networking issue, or are you having intermittent DNS resolution problems caused by flaky/overloaded/attacked recursors, EDNS0 problems (e.g., filtering on DNS responses > 512 bytes), or TCP/53 blockage? Different host OSes/browsers/apps exhibit differing re-query characteristics. Are the Windows boxes and the other boxes set to use the same recursors? Can you resolve DNS requests during the outages?
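One way to separate the two cases during an outage is to try a name lookup and a direct-to-IP TCP connection side by side. The following is only a rough sketch: `probe` and `diagnose` are names made up for illustration, and the hostname, IP, and port you feed it are whatever you know to be reachable when things are healthy.

```python
import socket


def probe(hostname, ip, port=443, timeout=3.0,
          resolver=socket.getaddrinfo,
          connector=socket.create_connection):
    """Return (dns_ok, ip_ok) for one lookup and one direct-to-IP connect.

    resolver and connector are injectable only so the logic can be
    exercised without touching the network.
    """
    try:
        resolver(hostname, port)
        dns_ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        dns_ok = False

    try:
        conn = connector((ip, port), timeout=timeout)
        conn.close()
        ip_ok = True
    except OSError:  # covers timeouts, resets, and unreachable errors
        ip_ok = False

    return dns_ok, ip_ok


def diagnose(dns_ok, ip_ok):
    """Turn the two booleans into a rough verdict."""
    if dns_ok and ip_ok:
        return "no fault seen"
    if ip_ok:
        return "DNS only: lookups fail but the path to the IP works"
    if dns_ok:
        return "path only: lookups work but the IP is unreachable"
    return "both fail: likely a general connectivity problem"
```

Calling `diagnose(*probe("example.com", "192.0.2.1"))` every few seconds and logging the result with a timestamp would show which of the two breaks first. Note that a cached answer in the local stub resolver can make DNS look healthier than it is.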

Are your boxes statically-addressed, or are they using DHCP? Periodically-duplicate IPs can cause intermittent symptoms, too. If you're using the consumer router as a DHCP server, DHCP-lease nonsense could be a contributing factor.
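If you suspect duplicate IPs, one cheap check is to sample the ARP table from another host on the LAN over time and look for an IP that shows up with more than one MAC. A minimal sketch, assuming you have already collected (ip, mac) pairs by whatever means; `find_ip_conflicts` is a hypothetical helper, not part of any tool:

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: [macs]} for every IP seen with more than one MAC.

    observations is an iterable of (ip, mac) pairs, e.g. sampled
    from the ARP table at intervals.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalise case so aa:bb and AA:BB count as the same MAC.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs)
            for ip, macs in macs_by_ip.items()
            if len(macs) > 1}
```

An IP legitimately moving between devices (a lease expiring and being handed to another box) will also show up here, so treat a hit as a pointer for further digging rather than proof.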

Are the Windows boxes running some common application/service which updates and/or churns periodically? Are they members of a Windows workgroup? All kinds of strange name-resolution stuff goes on with Windows-specific networking.

Also, be sure to use -n with traceroute. tcptraceroute is useful, too. netstat -rn should work on Windows boxes, IIRC.

This reminded me of another possibility related to NAT table
exhaustion. Are you running a full recursive resolver on a system
behind the NAT? Especially one like unbound, possibly with DNSSEC? I had
some strange issues while unbound was priming its cache from a cold
start...

NAT translation limits might not only be related to his first-hop NAT device
in the home. These days, with the exhaustion of IPv4, the second-hop
carrier-grade NAT (CGNAT) device at his upstream provider could be limiting too.

I run a CGNAT for an ISP and allow 2500 ports per customer private address,
and time out those translations at 120 seconds. It's possible to hit a
limit there. I see it sometimes.
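To put rough numbers on that: if every new flow pins a port for the full timeout (a worst-case assumption; real hold times depend on protocol and how the NAT ages entries), 2500 ports at 120 seconds works out to about 20.8 new flows per second sustained. A back-of-the-envelope sketch, with function names made up for illustration:

```python
def max_sustained_flow_rate(port_limit, hold_seconds):
    """New flows per second that can be sustained indefinitely,
    assuming each flow holds one port for hold_seconds."""
    return port_limit / hold_seconds


def seconds_to_exhaust(port_limit, new_flows_per_second, hold_seconds):
    """Seconds until the port block is full from a cold start,
    or None if the rate is low enough that it never fills."""
    if new_flows_per_second * hold_seconds <= port_limit:
        return None
    # Above the sustainable rate, the block fills before any entry
    # has had time to expire, so it is simply limit / rate.
    return port_limit / new_flows_per_second
```

With the figures above, `max_sustained_flow_rate(2500, 120)` is roughly 20.8, and a burst of 100 new flows per second would fill the block in 25 seconds. A recursive resolver priming a cold cache, or a few busy hosts behind one address, could plausibly get there.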


It isn't a DNS issue, as accessing resources directly by IP address
also shows the problem.

What became clear to me last night is that this actually also impacts my
Mac, and that it has to do with traffic not properly making it back to
my machines. When the issue occurs, my traffic makes it out to the
destination, the destination responds, but that packet never makes it to
my laptop, for example. I tested by sending traffic to a server I
control and doing PCAPs on both ends.
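For anyone wanting to repeat that kind of two-ended comparison, here is a minimal sketch of the matching step. It assumes the packets from each capture have already been exported to tuples of fields that survive NAT (for example TCP sequence number and payload length; addresses and ports will differ on either side of the translation). `missing_at_receiver` is a hypothetical helper, and how you extract the tuples from the PCAPs is up to you.

```python
from collections import Counter


def missing_at_receiver(sent, received):
    """Return the packets in sent that have no match in received.

    Both arguments are lists of hashable keys, e.g. (tcp_seq, payload_len).
    Duplicates are matched one-for-one, so a retransmission that arrived
    once only cancels one of the copies that were sent.
    """
    remaining = Counter(received)
    missing = []
    for pkt in sent:
        if remaining[pkt] > 0:
            remaining[pkt] -= 1
        else:
            missing.append(pkt)
    return missing
```

Feeding it the server's outbound packets as `sent` and the laptop's inbound packets as `received` lists the responses that were dropped somewhere in between. Some middleboxes rewrite TCP sequence numbers, in which case a different key would be needed.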