Intermittent incorrect DNS resolution?

Hi everyone,

I'm having an unusual DNS problem and would appreciate feedback.

For the zones in question, primary DNS is provided by GoDaddy and
secondary DNS by DNS Made Easy. Over a week ago we changed several
A records (including wildcards in two different zones), all of which
already had TTLs of no more than one hour.

The new IPs on those A records have received many millions of requests
since the changes. Occasionally, though, a small amount of traffic
still appears at the old IPs those A records used to have. This is
HTTP traffic, and packet captures of it show a variety of Host headers.

Resolving those Host headers from various networks in Canada, against
an assortment of random private and public resolvers as well as against
the authoritative NSes, yields correct results in every case (i.e., the
new IPs).

However, both GoDaddy and DNS Made Easy use anycast, which makes it
less likely that I can see the entire picture of what's happening.

I suspect that somewhere one of their servers has the wrong data, or
that some resolver is misbehaving; based on the pattern, volume, and
randomization of the hostnames, though, the misbehaving-resolver theory
seems less likely. I haven't yet analyzed the source IPs to see whether
they cluster in particular countries.

I've opened a ticket with DNS Made Easy and they replied very quickly
suggesting the problem is not with them. I've opened a ticket with
GoDaddy and...well, it's GoDaddy, so I don't expect much (no response yet).

Any ideas? Can folks try resolving eriktest.uberflip.com and post
here with details only if it resolves to an IP starting with 76.9 (old IPs)?

Thanks

Erik

The other likely cause of this is local caching nameservers somewhere
at an ISP or major site that do not respect TTL values for some
reason.

This is sadly a common problem. Not statistically (most nameservers
do the right thing), but if you run big sites and flip addresses,
there's always a long tail of people whose nameservers just didn't get
the update.

for d in $(seq 1 1000); do
  for ns in pdns01.domaincontrol.com. pdns02.domaincontrol.com. \
      ns5.dnsmadeeasy.com. ns6.dnsmadeeasy.com. ns7.dnsmadeeasy.com.; do
    dig @"$ns" eriktest.uberflip.com >> /tmp/tst
  done
done

You tried something like this already, I assume?

Also, client programs don't always honor TTLs either. For example,
Java defaults to ignoring TTLs and holding resolved IPs forever.

*networkaddress.cache.ttl (default: -1)*
Indicates the caching policy for successful name lookups from the name
service. The value is specified as an integer giving the number of
seconds to cache a successful lookup. A value of -1 indicates "cache
forever".
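If you control the Java clients, that default can be overridden. A
minimal sketch, assuming a Sun/Oracle JRE (the path of the security
properties file varies by JDK version and vendor, and the jar name is
hypothetical; `sun.net.inetaddr.ttl` is the legacy system-property
counterpart of the security property):

```shell
# Per-JVM override: cache successful lookups for 60 seconds instead of
# forever (sun.net.inetaddr.ttl is the unofficial -D equivalent of the
# networkaddress.cache.ttl security property).
java -Dsun.net.inetaddr.ttl=60 -jar yourapp.jar

# Or set the security property globally in the JRE's security
# properties file, e.g. $JAVA_HOME/lib/security/java.security:
#
#   networkaddress.cache.ttl=60
```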

Depending on who your clients are, your mileage may vary.

.r'

Yes, though I tried way less than 1000 in the loop.

:-)

given a large list of recursives you could even test resolution
through a bunch of recursive servers...
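For example (a sketch, assuming a hypothetical file resolvers.txt with
one recursive resolver IP per line, and treating the 76.9 prefix
mentioned above as the "old" range):

```shell
#!/bin/sh
# Classify an answer as "old" (the retired 76.9.x.x range) or "new".
check_answer() {
  case "$1" in
    76.9.*) echo old ;;
    *)      echo new ;;
  esac
}

# Query each recursive resolver and flag any that still return an old IP.
# resolvers.txt is a hypothetical file: one resolver IP per line.
if [ -f resolvers.txt ]; then
  while read -r resolver; do
    for ip in $(dig +short @"$resolver" eriktest.uberflip.com A); do
      [ "$(check_answer "$ip")" = old ] && echo "STALE: $resolver returned $ip"
    done
  done < resolvers.txt
fi
```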

Good point.

While I haven't checked the distribution of source IPs yet, I briefly
grepped the tcpdump output for User-Agent headers, and there's a
higher-than-expected bot presence, particularly Baidu.

That said, there are also "normal" UAs (whatever that means, with every device/software pretending to be something else these days).
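The quick-and-dirty tally was along these lines (a sketch; the sample
input is fabricated, and a real run would pipe in the output of
something like `tcpdump -A`):

```shell
#!/bin/sh
# Tally User-Agent headers from an ASCII HTTP dump on stdin,
# most frequent first.
count_uas() {
  grep -i '^User-Agent:' | sort | uniq -c | sort -rn
}

# Fabricated sample for illustration:
count_uas <<'EOF'
User-Agent: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
User-Agent: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
User-Agent: curl/7.29.0
EOF
```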

True...I did try 4.2.2.2 / 8.8.8.8 and some local ones here. All looked fine.

With anycast / DB and other backend clusters / load balancing / whatever else behind the scenes, it's hard to get a good idea of what's actually happening.

Might be stuck with running this infra for a while longer and seeing if the traffic disappears eventually.

Hi Erik,

Look up "DNS pinning." I can't rule out the possibility of a faulty
DNS server but it's far more likely your customers' web browsers are
simply ignoring the DNS TTL. Malfunctioning as designed. If you keep
it live, you'll notice a falling trickle of traffic to the old address
for the next year or so.

Regards,
Bill Herrin

Consider the possibility that some end users (or even corporate
networks) may have hardcoded your hosts' name-to-address mappings in
their hosts files, or may sit behind corporate proxy firewalls that
allow access only to whitelisted web sites.

They will continue to point to the old IP addresses until you shut down
the service and they call you to inquire why *you* broke your system :-)

If this HTTP service has a GUI (i.e., a web page rather than, say,
credit card transactions), you should put up a warning on the page
whenever it is being accessed via an old IP address.
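A sketch of that idea as a CGI-style check, assuming the old addresses
all fall under 76.9 as described earlier and that the server exposes
its own address in the standard SERVER_ADDR variable:

```shell
#!/bin/sh
# Print a warning banner when the request arrived on a retired
# 76.9.x.x address.
banner_if_old() {
  case "$1" in
    76.9.*) echo "WARNING: you reached this site via a retired IP address;" \
                 "please update your DNS cache or hosts file." ;;
  esac
}

# SERVER_ADDR is set by the web server for CGI scripts.
banner_if_old "${SERVER_ADDR:-}"
```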

I'm a touch surprised to find that no one has mentioned the facet of
Windows OSs that requires "ipconfig /flushdns" in some such circumstances...

Not only may *browsers* be caching DNS lookups without regard to TTLs,
the *OS* might be doing it to you too, in circumstances I was never quite
able to get a handle on.

XP was known to do this, as late as SP3; I'm not sure about Vista or 7.

Cheers,
-- jra

Winbloze has to be rebooted frequently enough that this is rarely an issue.
<wry grin>

I sent queries from 270+ different locations for the domains you
mentioned off-list and didn't see any inconsistencies. The persistent
host-caching/browser-caching theories seem like your best bet (or my
270+ locations weren't sufficiently diverse to catch a stale zone being
served by an anycast authority server).

Joe

Thanks Joe and thanks everyone else for the on and off-list replies. Quite insightful.

I think we've reached a consensus that the problem is clients ignoring
TTLs rather than misbehaving/stale authoritative servers. So for now I
shall wait.

To give an idea of the scale of the problem right now, I'm getting thousands of requests per minute to a new IP vs. about two requests per minute on the equivalent old IP, with over 60% of the latter being Baidu, but also a bit of Googlebot and other random bot and non-bot UAs.

Perhaps next week I'll unbind some old IPs for a few minutes to see what happens.

It's common for malware to spoof the Googlebot user-agent since they know
most webmasters won't block it. You might want to check whether the IPs
you're seeing it from are really allocated to us -- if so, I'd be
interested in tracking down why we're crawling your old IP.

Damian
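The check Damian describes is a reverse-then-forward lookup; a sketch
(the name test is split out as its own function so it can be eyeballed
without network access):

```shell
#!/bin/sh
# Is a PTR name one Google publishes for its crawlers?
is_google_crawler_name() {
  case "$1" in
    *.googlebot.com|*.googlebot.com.|*.google.com|*.google.com.) echo yes ;;
    *) echo no ;;
  esac
}

# Full verification (needs network): reverse-resolve the IP, check the
# name, then forward-resolve that name and confirm it includes the
# original IP.
verify_crawler_ip() {
  ip=$1
  name=$(dig +short -x "$ip")
  [ "$(is_google_crawler_name "$name")" = yes ] || { echo "no: $ip"; return 1; }
  if dig +short "$name" A | grep -qx "$ip"; then
    echo "verified: $ip is $name"
  else
    echo "forward mismatch: $ip"
    return 1
  fi
}
```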

Thanks Damian. I see four requests with Google UAs from actual Google IPs, 66.249.73.45 and 66.249.73.17 (PTR and rwhois seem yours for both), in a period of 30 minutes (compared to over 80 per minute on the new IPs). This is pretty low, so I'm not too worried.

Baidu is the main culprit now; there's little other traffic. In fact, we're getting no traffic from Baidu on the new IPs, only to the old ones. I've already e-mailed their spider help e-mail, but it's fallen on deaf ears.

Erik

Upon further investigation, this particular Google case turns out to be
a customer's CNAME pointing at a record of theirs that is an actual A
record for our old IP, contrary to our instructions (we tell everyone
to CNAME to us, so we can change IPs as we wish, which we've done for
the first time this year). So there is no Google problem.

Just an FYI...

Every version of Windows since Windows 2000 (sans Windows Me) has had
the DNS Client service, which maintains this caching function. This was
by design, due to Active Directory's heavy dependency on DNS resolution
since its creation: it greatly reduces the number of repetitive lookups
required, thereby speeding up AD-based functions and lessening the load
on DNS servers. It still exists today, up through Windows 8. You can
disable the service, but doing so will also break DDNS updates unless
your DHCP server registers hostnames on behalf of your clients.

-Vinny

Microsoft broke the Internet just to make their internal networking
work properly?

I'm shocked; *shocked* I tell... yes, just put the money right over there;
*shocked* I say.

You can't imagine how much time that lost me in diagnosis when it first
came out, until we finally located an explanation somewhere on the
Internet.

Cheers,
-- jra