How common is lack of DNS server diversity?

Men & Mice found that 38% of the .COM domains surveyed
had all their name servers on the same subnet, and 75%
had one or more configuration errors.

http://www.menandmice.com/dnsplace/healthsurvey.html

DNS, like most databases, suffers from information entropy.

In other words, it takes a lot of energy to keep information
correctly updated while it is being changed. Anyone who has
been Hostmaster for even a moderately sized ISP knows there
is an amazing number of ways for people to mess up any of the
pieces of data required to make the whole thing work.

As several people pointed out, you can't really assume close
IP addresses are in fact topologically close on the network.

For example, if you look at the name servers for GENUITY.NET

  Domain servers in listed order:

   DNSAUTH1.SYS.GTEI.NET 4.2.49.2
   DNSAUTH2.SYS.GTEI.NET 4.2.49.3
   DNSAUTH3.SYS.GTEI.NET 4.2.49.4

They appear to be closely related. However, the addresses are
in fact routed to very diverse locations on Genuity's network.

You will find the same thing if you look at the name servers
for UU.NET

Domain servers in listed order:

   AUTH00.NS.UU.NET 198.6.1.65
   AUTH60.NS.UU.NET 198.6.1.181

These servers are also geographically diverse.

So I'm not sure if the 38% number is a true indication of how
much diversity DNS servers have.
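
To make the point concrete, here is a minimal sketch of the kind of
"same subnet" test a survey might apply. It is an assumption that the
survey compared addresses at /24 granularity; the function names are
made up for illustration. The addresses are the ones listed above.

```python
import ipaddress
from collections import defaultdict

def group_by_prefix(addresses, prefix_len=24):
    """Group IPv4 addresses by the network of length prefix_len containing them."""
    groups = defaultdict(list)
    for addr in addresses:
        # strict=False lets us pass a host address and get its enclosing network
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[net].append(addr)
    return dict(groups)

def all_on_same_subnet(addresses, prefix_len=24):
    """True if every address falls inside one network of length prefix_len."""
    return len(group_by_prefix(addresses, prefix_len)) == 1

genuity = ["4.2.49.2", "4.2.49.3", "4.2.49.4"]
uunet = ["198.6.1.65", "198.6.1.181"]

print(all_on_same_subnet(genuity))    # all three share 4.2.49.0/24
print(all_on_same_subnet(uunet))      # both share 198.6.1.0/24
print(all_on_same_subnet(uunet, 25))  # a /25 boundary separates them
```

A test like this only compares address bits. It says nothing about
where the addresses are actually routed, which is why it would count
both examples above as non-diverse.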

And what happens if the 4.0.0.0/8 route is flapped from the
routing table? No more DNS. So you still want route diversity
that isn't in the same block or aggregated block.
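
A rough way to check for this shared-fate problem, assuming you
already know which aggregate is announced (here 4.0.0.0/8, taken from
this thread; in practice it would have to come from routing data).
The helper name is hypothetical.

```python
import ipaddress

def covered_by(addresses, aggregate):
    """True if every address falls inside the given aggregate prefix."""
    net = ipaddress.ip_network(aggregate)
    return all(ipaddress.ip_address(a) in net for a in addresses)

genuity = ["4.2.49.2", "4.2.49.3", "4.2.49.4"]

# All three servers depend on the same announced aggregate.
print(covered_by(genuity, "4.0.0.0/8"))

# Adding a server from an unrelated block removes the single dependency.
print(covered_by(genuity + ["198.6.1.65"], "4.0.0.0/8"))
```

If the first check comes back true for every name server of a zone,
losing that one route takes out DNS for the zone, however far apart
the machines are.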

Then I guess you try and get a bunch of /24's for your name servers
but they might get filtered elsewhere by someone else.

Thomas

Sean Donelan wrote:

For example, if you look at the name servers for GENUITY.NET

  Domain servers in listed order:

   DNSAUTH1.SYS.GTEI.NET 4.2.49.2
   DNSAUTH2.SYS.GTEI.NET 4.2.49.3
   DNSAUTH3.SYS.GTEI.NET 4.2.49.4

They appear to be closely related. However, the addresses are
in fact routed to very diverse locations on Genuity's network.

However, the 4/8 route is what is advertised to the world, and there
are certainly occasions where that route fails to be propagated.

It's more diverse than adjacent nodes on an ethernet,
but hardly as diverse as would be ideal.

Ideally, all DNS servers for a site shouldn't be in the same autonomous
system.

--jhawk
  (who recently made the observation that there are VBNS-connected root
   nameservers, but not VBNS-connected gtld servers, so a hypothetical
   site with a VBNS connection and a commodity connection has great
   difficulty using their VBNS connection to resolve VBNS names when
   the commodity connection goes down)

Then it probably doesn't matter if you resolve their DNS, because you won't be
getting to any of their services anyway.

Only if all of their services are in 4.0.0.0/8. What if they're providing
DNS services to customers who are not in 4.0.0.0/8 space, and whose route
hasn't flapped?

All:
  I have a related question that may be dated at this point but of which
I'm curious. Some time ago we had a problem with a DNS server we located
on a totally separate network to achieve DNS server diversity. At one
point there was a failure on that network so that our DNS server located
there could not be reached. It appeared from the reports/complaints we
received that a number of client systems/resolvers had decided to only
request data from the nonfunctional DNS server and despite failing on that
wouldn't ask our other listed DNS servers. They therefore could not
resolve addresses for otherwise functional network assets. I seem to
remember this was somehow related to systems running Microsoft OS's.
  Am I confused or could it be that Microsoft knows something about this?

Chuck Scott

I have heard from numerous sources that even if you provide multiple name
servers in windows 9x tcp/ip config, only the first is used. Not sure how
true it is currently. It seems from below you are talking about
resolvers, not authoritative name servers??

  Brian

Brian:
  Yes, talking about resolvers. If I remember the incident correctly, at
the time a number of other nearby providers were using NT servers and it
was those which failed to ask another authoritative name server for the
domain when a particular server was not reachable. In our case, it was
actually the second server listed for our domains.

Chuck

i experienced this exact same thing, and it was the secondary ns that
NT was "fixating" on when making queries. (the secondary was up and
down for a few weeks until a new one was shipped out--yes, off-site
and off-AS ;)

i had a VERY hard time explaining to NT professionals that their email
to our domains shouldn't be bouncing, and that 99% of the internet
could get mail to our domains just fine with one operating nameserver.
i also didn't have any proof that NT didn't do The Right Thing, and no
one wanted to help me prove it by hanging on the phone with me after
complaining that "your nameservers are down." is this misbehavior of
NT documented anywhere? is it fixable? i don't know d*ck about NT,
but i'd love to be able to at least suggest a fix and give someone a
URL.

thanks!

deeann m.m. mikula

network administrator
telerama internet -- http://www.telerama.com
abuse@telerama.com/spam@telerama.com
1.877.688.3200x501

The DNS resolver for normal run-of-the-mill lookups handles failover
properly. If anything, it is too ambitious. The algorithm suggested in RFC
1035 is to "wait 5 seconds" for a timeout before trying another server,
while with WinSock-2 resolvers, the timeout threshold is one second, and
then multiple unique queries are sent shotgun-fashion to ALL of the other
servers simultaneously. The aggressiveness level is a matter of
administrative taste: when a query is for a name in a slow remote zone,
the shotgun approach is annoying. When the server is kaput, five seconds
can be too long.
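
As a toy model of the trade-off just described (not a description of
any real resolver's code), the sketch below compares the two
strategies. Latencies are in seconds and None means the server never
answers. The function names and the returned (elapsed, queries sent)
pair are made up for illustration, and a real resolver would also
retry and eventually give up, which this ignores.

```python
def sequential_failover(latencies, timeout=5.0):
    """Ask servers one at a time, moving on after `timeout` seconds of silence.

    Returns (seconds until the first answer or None, queries sent).
    """
    elapsed = 0.0
    for queries, rtt in enumerate(latencies, start=1):
        if rtt is not None and rtt <= timeout:
            return elapsed + rtt, queries
        elapsed += timeout
    return None, len(latencies)

def shotgun_failover(latencies, timeout=1.0):
    """Ask the first server; after `timeout`, ask all the others at once.

    Returns (seconds until the first answer or None, queries sent).
    """
    first, rest = latencies[0], latencies[1:]
    if first is not None and first <= timeout:
        return first, 1
    # Answers from the other servers arrive `timeout` seconds late.
    candidates = [timeout + rtt for rtt in rest if rtt is not None]
    # A slow first server may still answer after the others were asked.
    if first is not None:
        candidates.append(first)
    return (min(candidates) if candidates else None), len(latencies)

dead_first = [None, 0.25, 0.5]  # first server is kaput
slow_first = [2.0, 0.25]        # first server is merely slow

print(sequential_failover(dead_first), shotgun_failover(dead_first))
print(sequential_failover(slow_first), shotgun_failover(slow_first))
```

With a dead first server the shotgun approach answers seconds sooner.
With a merely slow one it sends extra queries that patience would
have avoided.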

The NT4 DNS server is not this aggressive when it does failover queries
against remote zones. It waits a few seconds for responses to come back
and even ignores ICMP Destination Unreachable (Port Unreachable) errors
(generated when the DNS service is administratively down but the host is
still running). Note that ignoring ICMP errors is not uncommon; the stock
Linux resolver also does it, while Solaris and a few others do the right
thing.

Anyway, it is possible to get into a situation where the DNS resolver on a
WinSock-2 system aggressively fails out while the local DNS server is still
searching for an answer. In truth everything is doing what it is supposed
to do, just that the resolver does it too fast sometimes.

Deeann:
  Yep, that sounds right. In our case it was also the secondary and I'm
pretty sure it was in fact some NT servers and as you say they were
"fixating" on the secondary and wouldn't ask the primary even though they
had the ns data. Guess I wasn't losing it after all.

Chuck