Spain was offline

Hi colleagues,

Spain (at least the .es part) was offline and nobody reported it...?
What's going on? In the past you were faster...

Gunther

<tongue in cheek>

Spain has no major IRC servers that I'm aware of, and most of the good
porn is hosted in the US. Link up a Spanish server to EFnet or start
hosting porn, it'll get reported here much faster.

</tic>

- billn

DNS operational problems were briefly discussed on the DNS operations
mailing list earlier.

Although there are regular attempts to tie network routing and domain
name services together, routing problems are still relatively independent
of naming problems. But it may be a good opportunity for name server
operators and network operators to double-check that the processes and
networks they use to fix name services and network services don't have
unexpected co-dependencies.

Do you know how to contact your network provider without looking up
e.g. www.example.com (your network provider's web site)? Is TCP Wrapper
configured to look up names before allowing an operator login on a
critical server? Do you know your name servers' IP addresses? If
the PSTN phone numbers don't work, do you have an INOC-DBA phone? If
the INOC-DBA phone numbers don't work, do you have a PSTN phone number?
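
One low-tech way to keep that information at hand is to snapshot it
while DNS is still working. A rough Python sketch (the hostnames below
are placeholders; substitute your own provider and name server names):

    # snapshot_ips.py -- record the IP addresses of critical hosts while
    # DNS still works, so the list is available locally if resolution fails.
    import json
    import socket

    CRITICAL_HOSTS = [
        "www.example.com",   # network provider web site (placeholder)
        "ns1.example.net",   # your name servers (placeholders)
        "ns2.example.net",
    ]

    def snapshot(path="critical-ips.json"):
        result = {}
        for host in CRITICAL_HOSTS:
            try:
                # gethostbyname_ex() returns (name, aliases, ip_list)
                result[host] = socket.gethostbyname_ex(host)[2]
            except socket.gaierror as exc:
                result[host] = ["lookup failed: %s" % exc]
        with open(path, "w") as fh:
            json.dump(result, fh, indent=2)
        return result

    if __name__ == "__main__":
        print(snapshot())

Run it from cron and keep a printed copy near the console, and the
answers to the questions above no longer depend on DNS being up.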

Do you have your own mirrors of TLDs that are
important to your users, e.g. .com, your .xx
country domain, etc.?
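
The mechanics of mirroring a zone are simple enough where transfers
are permitted; a minimal dnspython sketch (the master address and zone
name are placeholders, and as discussed below most TLD operators
refuse AXFR, in which case the transfer simply fails):

    # axfr_mirror.py -- illustrative only: pull a zone by AXFR and write
    # it to a local file that a stealth slave could serve.
    import dns.query
    import dns.zone

    MASTER = "192.0.2.1"   # placeholder master server address
    ZONE = "example."      # placeholder zone name; a TLD would be e.g. "es."

    def mirror(master=MASTER, zone_name=ZONE, out_file="zonefile.db"):
        # from_xfr() consumes the transfer and builds an in-memory zone
        zone = dns.zone.from_xfr(dns.query.xfr(master, zone_name))
        zone.to_file(out_file)
        return zone

    if __name__ == "__main__":
        z = mirror()
        print("transferred %d names" % len(z.nodes))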

--Michael Dillon

You seem to be suggesting that ISPs run stealth slaves for these kinds of zones. This may have been a useful pointer for ISPs in days gone by, but I think today it's impractical advice.

ccTLD managers these days either already restrict zone transfers for privacy reasons, or are being encouraged to do so as a matter of best practice. Established gTLD zones like COM are sufficiently large and are updated so frequently that even if they were made available for AXFR the chances are good that most ISPs would struggle to host the zone, and any local instance would provide degraded service to their customers instead of the improvements in performance that presumably were the point of the exercise.

Even where zone transfers are available and ISPs are able to run stealth servers there is always the risk that master server ACLs (or the master servers themselves) will change without warning, leaving the stealth slave serving authoritative but stale data, which is guaranteed to make the helpdesk phone ring sooner or later.

For zones that are being made available on anycast servers, ISPs may be able to lobby/pay the zone operator to install an anycast instance in their network. However, in general, the days of ISPs being able to set these things up on their own and see benefit from them are past, in my opinion.

Joe

You seem to be suggesting that ISPs run stealth slaves for these
kinds of zones.

Not really. In today's world such simplistic solutions
don't work.

For zones that are being made available on anycast servers, ISPs may
be able to lobby/pay the zone operator to install an anycast instance
in their network. However, in general, the days of ISPs being able to
set these things up on their own and see benefit from them are past,
in my opinion.

I believe that there are still some things that ISPs can
do which cannot simply be bought on the market. For instance,
most ISPs run simple caching servers for their DNS queries
where they keep any responses for a short time before deleting
them. It's so simple that it is built into DNS relays as an
option.

An ISP could run a modified DNS relay that replicates all
responses to a special cache server which does not time out
the responses and which is only used to answer queries when
specified domains are unreachable on the Internet.

For instance, if you specified that all .es responses were
to be replicated to the cache and that your DNS relay should
divert queries to the cache when .es nameservers are *ALL*
unreachable, then the impact of this type of outage is greatly
reduced. You could specify important TLDs to be cached this way
as well as important domains like google.com and yahoo.com.
The actual data cached would only be data that *YOUR* customers
are querying anyway. In fact, you could specify that any domain
which receives greater than x number of queries per day should
be cached in this way.
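
A rough sketch of what such a fallback cache might look like, assuming
dnspython 2.x; the watched suffixes, cache file name and example
hostname are placeholders, and a real deployment would sit in the
resolver path rather than being a standalone function:

    # fallback_cache.py -- sketch of a "never-expiring backup cache":
    # answers for selected domains are copied to a persistent store as
    # they pass through, and are only consulted when the live lookup
    # fails entirely.
    import json
    import dns.exception
    import dns.resolver

    WATCHED_SUFFIXES = (".es.", ".com.")   # zones worth protecting (example)
    CACHE_FILE = "fallback-cache.json"

    def _load():
        try:
            with open(CACHE_FILE) as fh:
                return json.load(fh)
        except (IOError, ValueError):
            return {}

    def _save(cache):
        with open(CACHE_FILE, "w") as fh:
            json.dump(cache, fh)

    def lookup(name, rdtype="A"):
        cache = _load()
        key = "%s/%s" % (name, rdtype)
        try:
            answer = dns.resolver.resolve(name, rdtype)
            records = [rr.to_text() for rr in answer]
            # replicate answers for watched domains into the long-lived cache
            if name.endswith(WATCHED_SUFFIXES):
                cache[key] = records
                _save(cache)
            return records
        except (dns.exception.Timeout, dns.resolver.NoNameservers):
            # all name servers unreachable: fall back to the stale copy
            if key in cache:
                return cache[key]
            raise

    if __name__ == "__main__":
        print(lookup("www.example.es."))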

The volume of data cached would be so small in today's terms that
it only needs a low-end 1U (or single blade) server to handle
this.

Since nothing like this exists on the market, the only way
for ISPs to do this is to roll their own. Of course, it is
likely that eventually someone will productize this and then
you simply buy the box and plug it in. But for now, this is the
type of thing that an ISP has to set up on their own.

--Michael Dillon

* Michael Dillon:

The volume of data cached would be so small in today's terms that
it only needs a low-end 1U (or single blade) server to handle
this.

The working set is larger than you think, I fear. I've been running
something like this since summer 2004, and the gigabytes pile up
rather quickly if you start with an empty database. If you restrict
yourself to A records for plain SLDs and SLDs prefixed with "www.",
the task becomes somewhat easier (because you get rid of all that
PTR-related stuff, and the NS RRs take their share, too). Of course,
you can squeeze quite a bit of RAM into one rack unit, so your comment
probably isn't that far off in the end. 8-)

Since nothing like this exists on the market, the only way
for ISPs to do this is to roll their own. Of course, it is
likely that eventually someone will productize this and then
you simply buy the box and plug it in. But for now, this is the
type of thing that an ISP has to set up on their own.

Well, the data I collect is not authoritative enough for that purpose.
My intent was to capture everything that could be served to some host
on the network, while taking the possibility of broken resolvers into
account. That's why I store the data without verifying its
authenticity (which is generally very hard to do because DNS is not
globally consistent). Plugging things directly into the caching
resolver would give you access to its verification logic, but ISPs
aren't really fond of doing this to their resolvers.

[snip]

An ISP could run a modified DNS relay that replicates all
responses to a special cache server which does not time out
the responses and which is only used to answer queries when
specified domains are unreachable on the Internet.

For instance, if you specified that all .es responses were
to be replicated to the cache and that your DNS relay should
divert queries to the cache when .es nameservers are *ALL*
unreachable, then the impact of this type of outage is greatly
reduced. You could specify important TLDs to be cached this way
as well as important domains like google.com and yahoo.com.
The actual data cached would only be data that *YOUR* customers
are querying anyway. In fact, you could specify that any domain
which receives greater than x number of queries per day should
be cached in this way.

From what I've inferred from some other comments about the
failure, it seems that the DNS servers were not unreachable
or otherwise unavailable, but rather that the .es zone data
was corrupted.

Such a failure wouldn't trip the backup system you describe.
How does an automated system tell the difference between
a "real" NXDOMAIN and a erroneous one? It would take human
intervention to turn it on in many potential failure modes.
How much of a window is there where the ISP can positively
identify a failure and start up their backup before the TLD,
or whatever external DNS entity, gets its own act together?
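
One way around that gap is not to automate the switch at all, only the
alarm: watch the NXDOMAIN rate per TLD and page a human when it jumps
well above its usual level, then let the operator decide whether to
enable the backup. A minimal sketch (the baseline, multiplier and
window size are made-up numbers, not recommendations):

    # nxdomain_monitor.py -- heuristic only: flag a suspicious jump in the
    # NXDOMAIN rate for one TLD so a human can decide whether the zone
    # data is actually broken.
    from collections import deque

    class NxdomainMonitor:
        def __init__(self, baseline_rate=0.05, multiplier=10.0, window=1000):
            self.baseline_rate = baseline_rate  # assumed long-term NXDOMAIN fraction
            self.multiplier = multiplier        # how far above baseline is "suspicious"
            self.samples = deque(maxlen=window) # 1 = NXDOMAIN, 0 = anything else

        def record(self, is_nxdomain):
            self.samples.append(1 if is_nxdomain else 0)

        def suspicious(self):
            if len(self.samples) < self.samples.maxlen:
                return False                    # not enough data yet
            rate = sum(self.samples) / float(len(self.samples))
            return rate > self.baseline_rate * self.multiplier

    # Example: feed it the rcode of every .es response seen at the resolver;
    # if suspicious() trips, page an operator rather than switching over
    # automatically.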