That MIT paper

what i meant by "act globally, think locally" in connection with That
MIT Paper is that the caching effects seen at mit are at best
representative of that part of mit's campus for that week, and that
even a variance of 1% in caching effectiveness at MIT that's due to
generally high or low TTL's (on A, or MX, or any other kind of data)
becomes a huge factor in f-root's load, since MIT's load is only one
drop in a larger ocean. see duane's paper, which is more of a "think
globally, act locally" kind of thing. how much of the measured traffic
was due to bad logic in caching/forwarding servers, or in clients? how
will high and low ttl's affect bad logic that's known to be in wide
deployment? what if 20,000 enterprise networks the size of MIT all
saw a 1% decrease in caching effectiveness due to generally low TTL's?
(what if a general decline in TTL's resulted from publication of That
MIT Paper?)
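
for scale, here's a back-of-envelope sketch of that hypothetical. every
number in it is an assumption, not a measurement from the paper or from
f-root:

# back_of_envelope.py -- all figures below are guesses, for scale only
sites = 20_000               # enterprise networks "the size of MIT"
client_qps_per_site = 300    # stub-resolver queries hitting each site's caches
hit_rate_drop = 0.01         # caching effectiveness falls by 1%
root_fraction = 0.05         # guess: share of the new misses that reach a root server

extra_outbound = sites * client_qps_per_site * hit_rate_drop   # 60,000 new misses/sec
extra_at_roots = extra_outbound * root_fraction                # 3,000 qps, all concentrated
print(extra_outbound, extra_at_roots, extra_at_roots / 13)     # ~230 extra qps per root letter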

here's a snapshot of f-root's life. That MIT Paper not only fails to
address it or take it into account, it fails to identify the global
characteristic of the variables under study. caching
performance is not simply a local issue. everyone connected to the
internet acts globally. it is wildly foolish to think locally.

16:44:35.118922 208.139.64.98.12978 > 192.5.5.241.53: 16218 AAAA? H.ROOT-SERVERS.NET. (36)
16:44:35.121171 208.139.64.98.12978 > 192.5.5.241.53: 10080 A6? H.ROOT-SERVERS.NET. (36)
16:44:35.124668 208.139.64.98.12978 > 192.5.5.241.53: 1902 AAAA? C.ROOT-SERVERS.NET. (36)
16:44:35.127544 208.139.64.98.12978 > 192.5.5.241.53: 10098 AAAA? G.ROOT-SERVERS.NET. (36)
16:44:35.130185 208.139.64.98.12978 > 192.5.5.241.53: 6010 A6? C.ROOT-SERVERS.NET. (36)
16:44:35.133828 208.139.64.98.12978 > 192.5.5.241.53: 1920 A6? G.ROOT-SERVERS.NET. (36)
16:44:35.136286 208.139.64.98.12978 > 192.5.5.241.53: 12169 AAAA? F.ROOT-SERVERS.NET. (36)
16:44:35.139433 208.139.64.98.12978 > 192.5.5.241.53: 3988 A6? F.ROOT-SERVERS.NET. (36)
16:44:35.142324 208.139.64.98.12978 > 192.5.5.241.53: 10140 A6? B.ROOT-SERVERS.NET. (36)
16:44:35.145453 208.139.64.98.12978 > 192.5.5.241.53: 14244 AAAA? B.ROOT-SERVERS.NET. (36)
16:44:35.149344 208.139.64.98.12978 > 192.5.5.241.53: 16297 A6? J.ROOT-SERVERS.NET. (36)
16:44:35.151674 208.139.64.98.12978 > 192.5.5.241.53: 1968 AAAA? J.ROOT-SERVERS.NET. (36)

On Wed, Aug 11, 2004 at 04:49:18PM +0000, Paul Vixie scribed:

what i meant by "act globally, think locally" in connection with That
MIT Paper is that the caching effects seen at mit are at best
representative of that part of mit's campus for that week, and that

  Totally agreed. The paper was based upon two traces, one from
MIT LCS, and one from KAIST in Korea. I think that the authors
understood that they were only looking at two sites, but their
numbers have a very interesting story to tell -- and I think that
they're actually fairly generalizable. For instance, the poorly-behaved
example from your f-root snapshot is rather consistent
with one of the findings in the paper:

  [Regarding root and gTLD server lookups] "...It is likely that
many of these are automatically generated by incorrectly implemented
or configured resolvers; for example, the most common error 'loopback'
is unlikely to be entered by a user"

even a variance of 1% in caching effectiveness at MIT that's due to
generally high or low TTL's (on A, or MX, or any other kind of data)
becomes a huge factor in f-root's load, since MIT's load is only one

  But remember - the only TTLs that the paper suggested could be reduced
were those on non-nameserver A records. You could drop those all to zero
and not affect f-root's load one bit. In fairness, I think this is
jumbled together with NS record caching in the paper, since most
responses from the root/gTLD servers include both NS records and
A records in the additional section.
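
As a toy illustration (the numbers below are made up, and this is not how
the paper modeled it): a cache only goes back to the root/gTLD servers when
its copy of the delegation expires, so the leaf A-record TTL never enters
into it.

# ttl_sketch.py -- rough sketch with made-up numbers
DAY = 86400

def upstream_queries(period, ns_ttl, a_ttl):
    """Refetches a busy cache performs during `period` seconds, assuming it
    is asked about the zone continuously."""
    to_parent = period / ns_ttl   # delegation (NS + glue) refetches
    to_zone = period / a_ttl      # leaf A-record refetches
    return to_parent, to_zone

# halving the A TTL doubles the load on the zone's own servers,
# but the root/gTLD side is untouched:
print(upstream_queries(DAY, ns_ttl=2 * DAY, a_ttl=3600))   # (0.5, 24.0)
print(upstream_queries(DAY, ns_ttl=2 * DAY, a_ttl=1800))   # (0.5, 48.0)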

Global impact is greatest when the resulting load changes are
concentrated in one place. The most clear example of that is changes
that impact the root servers. When a 1% increase in total traffic
is instead spread among hundreds of thousands of different, relatively
unloaded DNS servers, the impact on any one DNS server is minimal.
And since we're talking about a protocol that variously occupies less than
3% of all Internet traffic, the packet count / byte count impact is
negligible (unless it's concentrated, as happens at root and
gtld servers).

The other questions you raise, such as:

how much of the measured traffic was due to bad logic in
caching/forwarding servers, or in clients? how
will high and low ttl's affect bad logic that's known to be in wide
deployment?

are equally important questions to ask, but... there are only so many
questions that a single paper can answer. This one provides valuable
insight into client behavior and when and why DNS caching is effective.
There have been other papers in the past (for instance, Danzig's 1992
study) that examined questions closer to those you pose. The results from
those papers were useful in an entirely different way (namely, that almost
all root server traffic was totally bogus because of client errors).

It's clear that from the perspective of a root name server operator,
the latter questions are probably more important. But from the
perspective of, say, an Akamai or a Yahoo (or joe-random dot com),
the former insights are equally valuable.

  -Dave

there are many sites and isps like mit and kaist. there are few
root servers. while i care about the root servers, i presume that
they are run by competent folk and certainly they are measured to
death (which is rather boring from the pov of most of us). i care
about isp and user site measurements. i think the study by the mit
crew, which i have read a number of times, was a real service to
the community.

randy

Paul Vixie wrote:

(what if a general decline in TTL's resulted from publication of That
MIT Paper?)

It's an academic paper. The best antidote would be to publish a nicely
researched reply paper.

Meanwhile, I'm probably one of those guilty of too large a reduction of
TTLs. I remember when the example file had a TTL of 999999 for NS.

What's the best practice?

Currently, we're using (dig result, criticism appreciated):

watervalley.net. 1d12h IN SOA ns2.watervalley.net. hshere.watervalley.net. (
                                        2004081002 ; serial
                                        4h33m20s ; refresh
                                        10M ; retry
                                        1d12h ; expiry
                                        1H ) ; minimum

watervalley.net. 1H IN MX 10 mail.watervalley.net.
watervalley.net. 1H IN A 12.168.164.26
watervalley.net. 1H IN NS ns3.watervalley.net.
watervalley.net. 1H IN NS ns1.ispc.org.
watervalley.net. 1H IN NS ns2.ispc.org.
watervalley.net. 1H IN NS ns2.watervalley.net.
watervalley.net. 1H IN NS ns3.ispc.org.

;; ADDITIONAL SECTION:
mail.watervalley.net. 1H IN A 12.168.164.3
ns1.ispc.org. 15h29m10s IN A 66.254.94.14
ns2.ispc.org. 15h29m10s IN A 199.125.85.129
ns2.watervalley.net. 1D IN A 12.168.164.2
ns3.ispc.org. 15h29m10s IN A 12.168.164.102
ns3.watervalley.net. 1H IN A 64.49.16.2

David,

* dga@lcs.mit.edu (David G. Andersen) [Thu 12 Aug 2004, 02:55 CEST]:

Global impact is greatest when the resulting load changes are
concentrated in one place. The most clear example of that is changes
that impact the root servers. When a 1% increase in total traffic
is instead spread among hundreds of thousands of different, relatively
unloaded DNS servers, the impact on any one DNS server is minimal.
And since we're talking about a protocol that variously occupies less than
3% of all Internet traffic, the packet count / byte count impact is
negligible (unless it's concentrated, as happens at root and
gtld servers).

This doesn't make sense to me. You're saying here that a 1% increase in
average traffic is a 1% average increase in traffic. What's your point?

if a load change is concentrated in one place how can the impact be
global?

How can a 1% load increase in one specific place have anything but
minimal impact?

At root and gTLD servers I assume DNS traffic occupies significantly
more than 3% of all traffic there. Still, a 1% increase remains 1%.

  -- Niels.

I was reminded about rfc1537.

Been a long time since I read that, so a good reminder. But it only
deals with SOA records. And it's 11 years old (closer to 12).

The topic at hand was NS records. Any other guidance?

On Thu, Aug 12, 2004 at 01:35:36PM +0200, Niels Bakker scribed:

* dga@lcs.mit.edu (David G. Andersen) [Thu 12 Aug 2004, 02:55 CEST]:
> Global impact is greatest when the resulting load changes are
> concentrated in one place. The most clear example of that is changes
> that impact the root servers. When a 1% increase in total traffic
> is instead spread among hundreds of thousands of different, relatively
> unloaded DNS servers, the impact on any one DNS server is minimal.
> And since we're talking about a protocol that variously occupies less than
> 3% of all Internet traffic, the packet count / byte count impact is
> negligible (unless it's concentrated, as happens at root and
> gtld servers).

This doesn't make sense to me. You're saying here that a 1% increase in
average traffic is a 1% average increase in traffic. What's your point?

if a load change is concentrated in one place how can the impact be
global?

  Because that point could be "critical infrastructure" (to abuse
the buzzword). If a 1% increase in DNS traffic is 100,000 requests
per second (this number is not indicative of anything, just an
illustration), that could represent an extra request per second per
nameserver -- or 7,000 more requests per second at the root.
One of these is pretty trivial, and the other could be
unpleasant.
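
  Spelled out, one way to read those invented numbers:

# illustrative only -- reusing the invented 100,000 qps figure from above
extra_qps = 100_000
recursive_servers = 100_000
root_letters = 13
print(extra_qps / recursive_servers)   # 1.0 extra query/sec per recursive server
print(extra_qps / root_letters)        # ~7,700 extra queries/sec per root letter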

At root and gTLD servers I assume DNS traffic occupies significantly
more than 3% of all traffic there. Still, a 1% increase remains 1%.

   Sure, but the ratio still plays out. If your total traffic due
to DNS is small, then even a large (percentage) increase in DNS traffic
doesn't affect your overall traffic volume, though it might hurt
your nameservers. If you're a root server, doubling the DNS traffic
nearly doubles total traffic volume, so in addition to DNS-specific
issues, you'll also start looking at full pipes.

  -Dave

> At root and gTLD servers I assume DNS traffic occupies significantly
> more than 3% of all traffic there. Still, a 1% increase remains 1%.

   Sure, but the ratio still plays out. ...

i must have misspoken. when i asked "what if 20,000 sites decreased their
cache utilization by 1% due to a general lowering of TTL's inspired by MIT's
paper" i was wondering if anyone thought that the result would be a straight
across-the-board increase in traffic at the root servers. there are theories
that say yes. other theories say it'll be higher. others, lower. any study
that fails to address these questions is worse than useless.

During a complete renumbering, I'm trying to establish a standard for
DNS records. After all, every zone file is going to have to be touched,
so we might as well update them at the same time. Various techs have
done odd things over the past decade.

Having no guidance so far from this group, despite the grumbling about
times becoming shorter and lack of analysis, I thought "Well, vixie
will know the best practice!"

I remain unenlightened. Should it be 2 days? Or 1 hour? And why the
inconsistent results? Obsolete root glue records?

A simple dig yields:

;; ANSWER SECTION:
vix.com. 2D IN NS ns-ext.vix.com.
vix.com. 2D IN NS ns1.gnac.com.

;; AUTHORITY SECTION:
vix.com. 2D IN NS ns1.gnac.com.
vix.com. 2D IN NS ns-ext.vix.com.

But a dig directly to the ns1.gnac.com or ns-ext.vix.com server yields:

;; ANSWER SECTION:
vix.com. 1H IN NS ns.lah1.vix.com.
vix.com. 1H IN NS ns.sql1.vix.com.
vix.com. 1H IN NS ns-ext.isc.org.
vix.com. 1H IN MX 10 sa.vix.com.
vix.com. 1H IN MX 20 fh.vix.com.
vix.com. 1H IN TXT "$Id: vix.com,v 1.190 2004/08/12 19:06:05 vixie Exp $"
vix.com. 1H IN A 204.152.188.231
vix.com. 1H IN SOA ns.lah1.vix.com. hostmaster.vix.com. (
                                        2004081201 ; serial
                                        1H ; refresh
                                        30M ; retry
                                        1W ; expiry
                                        1H ) ; minimum

;; AUTHORITY SECTION:
vix.com. 1H IN NS ns.lah1.vix.com.
vix.com. 1H IN NS ns.sql1.vix.com.
vix.com. 1H IN NS ns-ext.isc.org.

;; ADDITIONAL SECTION:
ns.lah1.vix.com. 1H IN A 204.152.188.234
ns.lah1.vix.com. 1H IN AAAA 2001:4f8:2::9
ns.sql1.vix.com. 1H IN A 204.152.184.135
ns.sql1.vix.com. 1H IN AAAA 2001:4f8:3::9
ns-ext.isc.org. 1H IN AAAA 2001:4f8:0:2::13
ns-ext.isc.org. 1H IN A 204.152.184.64
sa.vix.com. 1H IN A 204.152.187.1
sa.vix.com. 1H IN AAAA 2001:4f8:3:bb::1

i must have misspoken. when i asked "what if 20,000 sites decreased their
cache utilization by 1% due to a general lowering of TTL's inspired by
MIT's paper" i was wondering if anyone thought that the result would be a
straight across-the-board increase in traffic at the root servers. there
are theories that say yes. other theories say it'll be higher. others,
lower. any study that fails to address these questions is worse than
useless.

no. it may happen to be studying things other than the root servers.
and that's ok and often useful

randy

I remain unenlightened. Should it be 2 days? Or 1 hour? And why the
inconsistent results? Obsolete root glue records?

I think your first answer is from the .com gTLD servers, which use a 2-day
TTL; the second is from vix.com's own nameservers, which use a 1-hour TTL
for all records.

I don't know about best practice, but I don't see any reason why your NS
records should be any different from the rest of your zone: use a value
which suits you and your need to make changes (if these are your network
A/PTR records, something of at least 24 hours would be fine).

Having the NS records with an explicitly smaller TTL wouldn't, as I see it,
help, since any change of nameservers made with the registry would take the
registry's processing time plus the gTLD TTL to become effective.
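
Roughly, and with the delays below being guesses rather than anything
measured:

# ns_change_sketch.py -- assumed delays, only to show which term dominates
registry_processing = 24 * 3600      # registrar/registry pushes the change (guess)
gtld_ns_ttl = 2 * 24 * 3600          # 2-day TTL on the delegation at the gTLD servers
in_zone_ns_ttl = 3600                # 1-hour TTL on your own NS set (deliberately unused below)

# worst case for a resolver that cached the old delegation just before the change:
worst_case = registry_processing + gtld_ns_ttl
print(worst_case / 3600, "hours")    # 72.0 hours -- the in-zone 1-hour TTL never figures in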

Steve

1. It's a financial issue. In the event of an emergency or a server failure, how many hours can you afford to be offline? Are your customers willing to wait up to 2 days for their DNS caches to update with the new IP address?

A very busy domain might benefit from having a higher TTL value for its nameservers but a lower TTL for hosts, so that downtime is minimized in the event of a server failure. For example, when Akamai was having DNS issues, content providers with low TTLs were able to switch to secondary nameservers faster than zones using a higher TTL.

2. It's a performance issue. Zones with a lower TTL see slightly higher server usage. If you set a low TTL value, will your nameservers be able to handle the increased load?

Personally, I use a TTL of 4 hours. It's low enough so that in the event of a failure, I can easily migrate my hosts, but still high enough that there isn't a significant server load.
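
To put rough, entirely invented numbers on that trade-off:

# ttl_tradeoff.py -- invented resolver count, just to show the shape of the curve
resolvers = 50_000                    # distinct caches asking about the zone, continuously
for ttl in (300, 3600, 4 * 3600, 2 * 86400):
    qps_ceiling = resolvers / ttl     # rough upper bound on steady-state authoritative load
    print(f"TTL {ttl:6d}s: failover window up to {ttl / 3600:5.1f} h, "
          f"about {qps_ceiling:6.1f} q/s from these caches")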

-- Matthew

* mcgehrin@reverse.net (Matthew McGehrin) [Fri 13 Aug 2004, 16:46 CEST]:

1. It's a financial issue. In the event of an emergency or a server
failure, how many hours can you afford to be offline? Are your customers
willing to wait up to 2 days for their DNS caches to update with the new IP
address?

In the event of a server failure I suggest you add its IP address as an
alias to a non-deceased host. You kept backups of your master zone files
on another machine, didn't you?

A very busy domain might benefit from having a higher TTL value for its
nameservers but a lower TTL for hosts, so that downtime is minimized in the
event of a server failure. For example, when Akamai was having DNS issues,
content providers with low TTLs were able to switch to secondary
nameservers faster than zones using a higher TTL.

Assuming you're talking about a specific incident not too long ago:

To me it looked more like those who had actually given thought to what
to do in the case of a large, longer-lasting Akamai failure were less
affected when that failure occurred.

  -- Niels.

"Stephen J. Wilcox" wrote:

> I remain unenlightened. Should it be 2 days? Or 1 hour? And why the
> inconsistent results? Obsolete root glue records?

I think your first answer is from the .com gTLD servers, which use a 2-day
TTL; the second is from vix.com's own nameservers, which use a 1-hour TTL
for all records.

That's a possibility, but when I checked @a.gtld-servers.net,

;; ANSWER SECTION:
vix.net. 2D IN NS ns1.pingmagic.com.
vix.net. 2D IN NS ns2.pingmagic.com.

;; AUTHORITY SECTION:
vix.net. 2D IN NS ns1.pingmagic.com.
vix.net. 2D IN NS ns2.pingmagic.com.

;; ADDITIONAL SECTION:
ns1.pingmagic.com. 2D IN A 202.140.169.216
ns2.pingmagic.com. 2D IN A 143.89.51.48

So, A: 2 days
    ?, recursed: 2 days, 2nd set of servers
    direct: 1 hour, 3rd set of servers

I don't know about best practice, but I don't see any reason why your NS
records should be any different from the rest of your zone: use a value
which suits you and your need to make changes (if these are your network
A/PTR records, something of at least 24 hours would be fine).

But that's the "thinking locally, acting globally" we're talking about
in the earlier thread.

Having the NS records with an explicitly smaller TTL wouldn't, as I see it,
help, since any change of nameservers made with the registry would take the
registry's processing time plus the gTLD TTL to become effective.

Yes, and the registries would seem to be using 2 days. However, for our
domain(s) we get the same servers @a, just with longer NS times.

For another data point, I checked Randy's setup. After all, he was
the WG chair for quite a while, so he'll have a clear preference.

Like Paul, different servers visible from the root. Unlike Paul,
much longer TTLs.

; <<>> DiG 8.3 <<>> @a.gtld-servers.net psg.net any
; (1 server found)
;; res options: init recurs defnam dnsrch
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52052
;; flags: qr rd; QUERY: 1, ANSWER: 2, AUTHORITY: 2, ADDITIONAL: 2
;; QUERY SECTION:
;; psg.net, type = ANY, class = IN

;; ANSWER SECTION:
psg.net. 2D IN NS dns1.yoho.com.
psg.net. 2D IN NS dns2.yoho.com.

;; AUTHORITY SECTION:
psg.net. 2D IN NS dns1.yoho.com.
psg.net. 2D IN NS dns2.yoho.com.

;; ADDITIONAL SECTION:
dns1.yoho.com. 2D IN A 64.239.77.100
dns2.yoho.com. 2D IN A 64.239.77.101

; <<>> DiG 8.3 <<>> @rain.psg.com psg.com any
; (1 server found)
;; res options: init recurs defnam dnsrch
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40632
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 5
;; QUERY SECTION:
;; psg.com, type = ANY, class = IN

;; ANSWER SECTION:
psg.com. 4H IN SOA rain.psg.com. hostmaster.psg.com. (
                                        200407121 ; serial
                                        1D ; refresh
                                        1H ; retry
                                        4w2d ; expiry
                                        4H ) ; minimum

psg.com. 4H IN NS ARIZONA.EDU.
psg.com. 4H IN NS DNS.LIBRARY.UCLA.EDU.
psg.com. 4H IN NS rain.psg.com.
psg.com. 4H IN A 147.28.0.62
psg.com. 4H IN MX 0 psg.com.
psg.com. 4H IN NAPTR 10 0 "s" "SIP+D2T" "" _sip._tcp.psg.com.
psg.com. 4H IN NAPTR 20 0 "s" "SIP+D2U" "" _sip._udp.psg.com.

;; ADDITIONAL SECTION:
rain.psg.com. 4H IN A 147.28.0.34
ARIZONA.EDU. 1d23h1m15s IN A 128.196.128.233
splat.psg.com. 4H IN A 147.28.0.39
_sip._tcp.psg.com. 4H IN SRV 0 0 5060 splat.psg.com.
_sip._udp.psg.com. 4H IN SRV 0 0 5060 splat.psg.com.

Uhh... why are you looking at vix.net and psg.net from the gtld servers, but vix.com and psg.com from their servers?

psg.com has the same servers at the GTLD delegation as in-zone.

John Payne wrote:

Uhh... why are you looking at vix.net and psg.net from the gtld
servers, but vix.com and psg.com from their servers?

psg.com has the same servers at the GTLD delegation as in-zone.

Aha! My fingers betrayed me. I'm so used to typing .net for network
guys. Whereas I cut and pasted the .com references for the zones from
their email addresses. Oops.