Qwest is having some pretty nice DNS issues right now

Apparently they have lost two authoritative servers. ETA is unknown.

-Wil

You forgot to mention that they only have two authoritative servers for
most of their domains...

Well, that would explain it; makes me feel better that they took themselves out as well:

-bash-2.05b$ dig qwest.com
; <<>> DiG 9.3.1 <<>> qwest.com
;; global options: printcmd
;; connection timed out; no servers could be reached

-Wil

william(at)elan.net wrote:

Partially back up; I can resolve everything here... They tell me that it's not quite over yet.

-Wil

Wil Schultz wrote:

Well, that would explain it; makes me feel better that they took
themselves out as well:

-bash-2.05b$ dig qwest.com
; <<>> DiG 9.3.1 <<>> qwest.com
;; global options: printcmd
;; connection timed out; no servers could be reached

not anycasted then eh? bummer :(

I didn't look at this while it was happening, and haven't talked to anybody else about it, so I don't know if this was a systems or routing issue. But, in the spirit of trying to learn lessons from incomplete information...

Qwest.net and Qwest.com have two authoritative name server addresses listed, dca-ans-01.inet.qwest.net and svl-ans-01.inet.qwest.net. As the names imply, traceroutes to these two servers appear to go to somewhere in the DC area and somewhere in proximity to Sunnyvale, California. It appears they're really just two servers or single location load-balanced clusters, and not an anycast cloud with two addresses. It may be that two simultaneous server failures would take out the whole thing, or they may be in less visible load balancing configurations. Even if it's two individual servers, that's the standard n+1 redundancy that's generally considered sufficient for most things.
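(Repeating the check is simple enough; a rough sketch, with output omitted since the traceroute view depends on where you run it from:)

$ dig +short NS qwest.net
$ dig +short A dca-ans-01.inet.qwest.net
$ dig +short A svl-ans-01.inet.qwest.net
$ traceroute dca-ans-01.inet.qwest.net
$ traceroute svl-ans-01.inet.qwest.net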

There is a fair amount of geographic diversity between the two sites, which is a good thing.

The two servers have the IP addresses 205.171.9.242 and 205.171.14.195. These both appear in global BGP tables as part of 205.168.0.0/14, so any outage affecting that single route (flapping, getting withdrawn, getting announced from somewhere without working connectivity to the two name servers, etc.) would take out both of them.
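(For anyone who wants to verify the routing side, a public route server will show the covering announcement; this is only a sketch, and the exact prompt and login depend on the box:)

$ telnet route-views.routeviews.org
route-views> show ip bgp 205.171.9.242
route-views> show ip bgp 205.171.14.195

Both lookups should come back under the same 205.168.0.0/14 announcement.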

So from my uninformed vantage point, it looks like they started doing this more or less right -- two servers or clusters of servers in two different facilities, a few thousand miles apart on different power grids and not subject to the same natural disasters. In other words, they did the hard part. What they didn't do is put them in different BGP routes, which for a network with as much IP space as Qwest has would seem fairly easy. While it's tempting to make fun of Qwest here, variations on this theme -- working hard on one area of design while ignoring another that's also critical -- are really common. It's something we all need to be careful of.

Or, not having seen what happened here, the problem could have been something completely different, perhaps even having nothing to do with routing or network topology. In that case, my general point would remain the same, but this would be a bad example to use.

-Steve

Apparently they have lost two authoritative servers. ETA is unknown.

You forgot to mention that they only have two authoritative servers for
most of their domains...

[snip]

So from my uninformed vantage point, it looks like they started doing this
more or less right -- two servers or clusters of servers in two different
facilities, a few thousand miles apart on different power grids and not
subject to the same natural disasters. In other words, they did the hard
part. What they didn't do is put them in different BGP routes, which for
a network with as much IP space as Qwest has would seem fairly easy.
While it's tempting to make fun of Qwest here, variations on this theme --
working hard on one area of design while ignoring another that's also
critical -- are really common. It's something we all need to be careful
of.

Or, not having seen what happened here, the problem could have been
something completely different, perhaps even having nothing to do with
routing or network topology. In that case, my general point would remain
the same, but this would be a bad example to use.

-Steve

At some point in a carrier's growth, anycast DNS has got to become a best
practice. Are there many major carriers that don't do it today, or am I just
a starry-eyed idealist?

- Dan

having authoritative data secondaried off-net is pretty important.

randy

Steve Gibbard wrote:

So from my uninformed vantage point, it looks like they started doing this more or less right -- two servers or clusters of servers in two different facilities, a few thousand miles apart on different power grids and not subject to the same natural disasters. In other words, they did the hard part. What they didn't do is put them in different BGP routes, which for a network with as much IP space as Qwest has would seem fairly easy.

I didn't get to play detective at the time of the outage, but configuration (which is automatically replicated) may also have been enough to take out both nameservers.

It also makes good management sense to run your nameservers with the same software and versions, but perhaps it doesn't make good continuity sense...?

cheers
-a

I'll happily make fun of them. If the authoritative DNS servers were in the
same logical network, even with one in Washington and one in California,
they'd deserve it.

I used to do basic network audits for end-user companies (and one small ISP
who bought the service), and this was a standard checklist item: literally,
are the authoritative name servers on different logical networks? GX Networks
did it. Demon Internet did it. We do it for our own hosting despite being a
relatively small company, and I'm sure most of the NANOG readership is careful
to do this.
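The check is easy to script, too; a rough sketch of the sort of thing that does the job (example.com is a placeholder, and the fields worth grepping for vary between registries' whois output):

for ns in $(dig +short NS example.com); do
  echo "== $ns"
  ip=$(dig +short A "$ns" | head -1)
  whois "$ip" | egrep -i 'netname|cidr|origin'
done

If every name server's address comes back under the same prefix or route object, that's the red flag.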

I think the comments on anycast are misplaced; most big ISPs use it, or
something similar, for internal recursive resolvers, but I don't think it is
that crucial for authoritative servers. Of course, placing all your
authoritative nameservers in the same anycast group is one of the things I've
complained about here before (not mentioning any TLD by name, since they seem
to have learnt from that one), so anycast of itself doesn't avoid the issue.
You can make the same mistake in many different systems.

There is also some scope for longer TTLs at Qwest, although I can't throw any
stones, as we have been busy migrating stuff to new addresses and using very
short TTLs ourselves at the moment. But we'll be back to 86400 seconds just as
soon as I finish the migration work.

I do agree the management issues with DNS are far harder, and here longer TTLs
are a double-edged sword. But it is hard to design a system where mistakes
don't propagate to every DNS server, although some of the common tools do make
it easier to check things are okay before updates are unleashed.
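(With BIND, for example, named-checkconf and named-checkzone will catch a lot before you reload; a minimal sketch, with made-up paths:)

$ named-checkconf /etc/named.conf
$ named-checkzone example.net /var/named/example.net.zone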

I think there is scope for saying that DNS TTLs should be related to (and
greater than) the time it takes to get clue onto any DNS problem.
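Checking what you are currently publishing is a one-liner; the second field of each answer line is the TTL in seconds (qwest.com used only because it is the example at hand):

$ dig +noall +answer qwest.com NS
$ dig +noall +answer qwest.com SOA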

What's interesting to me, at least, is that this is about the 5th time
someone has said similar things in the last 6 months: "DNS is harder than
I thought it was" (or something along those lines...)

So, do most folks think:
1) get domain name
2) get 2 machines for DNS servers
3) put IPs in the TLD system and roll!

It seems like maybe that is all too common. Are the 'best practices' for
authoritative DNS documented somewhere central? Are they just not well
publicized? Do registrars offer this information for end users/clients? Do
they show how their hosted solutions are better/work/are in compliance with
these best practices? (worldnic comes to mind)

Should this perhaps be better documented and presented at a future NANOG
meeting? (and thus placed online in presentation format)

-Chris

IETF tech transfer failure... see RFC 2870 (mislabeled as
  root-server) for TLD zone machine best practices from several
  years ago... for even older guidelines, see RFC 1219.

--bill

Will somebody who has the O'Reilly DNS book handy check and see if Chapter 8
doesn't already cover this?

http://www.oreilly.com/catalog/dns4/

If it doesn't, maybe we need to hint to the authors that an update is needed
for the 5th edition.

If it does, I suspect the basic problem runs much deeper, and can't be solved
by a NANOG presentation put online...

Perhaps this falls under "better documented" or "easy to find" or "not
publicized"? I'd be interested to see how many DNS hosting providers
actually follow these themselves. Take EasyDNS for example (since they are
on my mind, due to their GOOD service, actually):

easydns.com. 3600 NS ns1.easydns.com.
easydns.com. 3600 NS ns2.easydns.com.
easydns.com. 3600 NS remote1.easydns.com.
easydns.com. 3600 NS remote2.easydns.com.
NS1.easydns.com. 3600 A 216.220.40.243
NS2.easydns.com. 29449 A 209.200.151.4
remote1.easydns.com. 29434 A 209.200.131.4
remote2.easydns.com. 29428 A 205.210.42.20

CIDR: 205.210.42.0/24
NetName: SHMOOZE-NET

NetRange: 216.220.32.0 - 216.220.63.255
CIDR: 216.220.32.0/19
NetName: Q9-NET1

NetRange: 209.200.128.0 - 209.200.191.255
CIDR: 209.200.128.0/18
NetName: PROLEXIC
prolexic/Prime Communications Ltd. DONBEST (NET-209-200-131-0-1)
                                   209.200.131.0 - 209.200.131.255

So, 4 IPs, 3 ISPs, 3 route objects... they seem to at least follow some of
the requirements.

-Chris

What's interesting to me, at least, is that this is about the 5th time
someone has said similar things in the last 6 months: "DNS is harder than
I thought it was" (or something along those lines...)

So, do most folks think:
1) get domain name
2) get 2 machines for DNS servers
3) put IPs in the TLD system and roll!

It seems like maybe that is all too common. Are the 'best practices' for
authoritative DNS documented somewhere central? Are they just not well
publicized? Do registrars offer this information for end users/clients? Do
they show how their hosted solutions are better/work/are in compliance with
these best practices? (worldnic comes to mind)

Should this perhaps be better documented and presented at a future NANOG
meeting? (and thus placed online in presentation format)

Also, it should be noted that there's a general lack of understanding of how crucial DNS resolver performance is to the end user's/customer's perception of a network's performance. I can't tell you how many times I've used a local resolver, even on a modem mind you, and seen a dramatic improvement in the end-user experience, which is mostly the web browser. Other applications are pretty DNS-bound these days too. And many large ISPs overload their resolvers, or have resolvers not prepared/configured to handle the number of queries they're getting. I'm not saying I know the answers there; I'm just saying that I've seen quite a few times where DNS (or other central directories; LDAP and Active Directory come to mind) has been the 'bottleneck' from a user standpoint, since name resolution would take so long.
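If you want to see it for yourself, dig reports the elapsed time for each lookup; comparing a local cache against the ISP's resolver makes the point quickly (a sketch: 127.0.0.1 assumes you run a caching resolver locally, and 192.0.2.1 is just a stand-in for whatever resolver your ISP hands out):

$ dig www.example.com @127.0.0.1 | grep 'Query time'
$ dig www.example.com @192.0.2.1 | grep 'Query time'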

It seems like maybe that is all too common. Are the 'best practices' for
authoritative DNS documented somewhere central?

2182

yes, yes.. people who care (a lot) have read this, I'm sure... I was aiming
a little lower :) like folks that have enterprise networks :) Or maybe
even registrars offering 'authoritative DNS services', like say 'worldnic',
who had most of their DNS complex shot in the head for 3 straight days :(

in deference to the previous RFC Editor, who was particular about these
things, the proper form is: RFC 2182.

--bill

It is the old story of ignorance and cost, plus, with DNS, a "perceived loss
of control".

In the UK, many domains are registered with a couple of the cheapest providers,
who do not do off-network DNS, and in the past one offered non-RFC-compliant
mail forwarding as a bonus. I've seen people switch the DNS part of a hosting
arrangement to these guys to save about 10 USD a year. Of course, people
competing at those sorts of price levels offer practically no service
component, so even if nothing dreadful happens it still turns into a false
economy.

It reminds me of the firewall market, when the average punter had no idea how
to assess the "security" aspects of a firewall, and so firewall vendors ended
up pushing throughput and price as the major selling points. I know people
who bought firewalls capable of handling 160Mbps of traffic who still have
them filtering a 2Mbps Internet connection, badly.

By and large, the big ISPs do a good job with DNS; the end users do a terrible
job. I think once you get to the size where you need a person (or team) doing
DNS work full-time, it probably gets a lot easier to do it right.

Perhaps I should dust off my report on the quality of DNS configurations in
the South West of England and turn it into a buyers' guide?

That said I don't think doing DNS right is easy. I know pretty much exactly
what my current employer is doing wrong, but these failures to conform to
best practice aren't as much of a priority as the other things we are doing
wrong. At least in our case it is done with knowledge of what can (and likely
will eventually) go wrong.

Whatever the logical division of IP routing is.

On the Internet this is usually the AS number, but the network engineers might
know of linkages between different networks at the same organisation. So it
is probably wise to host one DNS server somewhere completely unrelated:
different IX, different continent, etc.
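A quick way to check which AS each server's address is actually originated from is an IP-to-ASN whois lookup; a sketch using Team Cymru's service and the two Qwest addresses mentioned earlier in the thread:

$ whois -h whois.cymru.com " -v 205.171.9.242"
$ whois -h whois.cymru.com " -v 205.171.14.195"

If all your name servers come back with the same origin AS, they are not on different logical networks in any sense that matters here.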