DNS caches that support partitioning?

Are there DNS caches that allow you to partition the cache for
subtrees of DNS names? That is, you can say that all entries from,
say, in-addr.arpa are limited to 20% of the cache.

The application I have in mind is to see if it helps to keep DNSBL
traffic, which caches poorly, from pushing other stuff out of the
cache, but there are doubtless others.

R's,
John

If it's getting evicted from cache because other things are getting
used more often, why do you want to put your thumb on that scale? The
other queries are presumably benefitting just as much from the caching.

Best,

A

I think John's issue is that he's seeing those other queries *not* benefiting
from the caching because they get pushed out by DNSBL queries that will likely
not ever be used again. You don't want your cached entry for www.google.com
to get pushed out by a lookup for a dialup line somewhere in Africa.

If the DNSBL queries are not likely to be used again, why don't they
set their TTL way down?

In any case, DNSBL's use of DNS has always been a hack. If v6
causes the hack to blow up, they should create their own protocol
rather than ask how we can make the global DNS accommodate
their misuse of DNS.

Mike

Oh, yes, I see. You're right, I misread it. But the proposed
solution still seems wrong to me. If the entry for www.google.com
gets invalidated by a new cache candidate that is never going to get
used again, the cache is simply too small (or else it doesn't have
enough traffic, and you shouldn't have a cache there at all).

The cache needs to be big enough that it has a thrashy bit that is
getting changed all the time. Those are the records that go into the
cache and then die without being queried again. If the problem is
that there's some other record in there that might be queried again,
but that doesn't get queried often enough to keep it alive, then the
additional cost of the recursive lookup is just not that big a deal.

Best,

A

A big part of the problem is that TTLs reflect the server's estimate
of how long it might be until the data changes, but not the client's
likelihood of reusing the data. I publish a BL of hosts in Korea.
The data changes maybe once a month so from my point of view a large
TTL would make sense, but most of the hits are for bots spraying spam
at random, so the client's not likely to reuse the data even if it
sticks around for many hours.

A plausible alternative to a partitioned cache might be TTL capping,
some way to say to your cache that no matter what the TTL is on
data from a range of names, don't keep it for more than a few seconds.
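
Something like this toy sketch is what I have in mind; the suffixes
and cap values are made up for illustration, not from any real
resolver:

# Toy sketch of per-subtree TTL capping; the suffixes and cap values
# are invented for illustration, not from any real resolver.

TTL_CAPS = {
    "in-addr.arpa.": 5,       # reverse DNS: keep for a few seconds at most
    "bl.example.net.": 5,     # a hypothetical DNSBL zone
}

def effective_ttl(name, server_ttl):
    """Clamp the server-supplied TTL for names under a capped subtree."""
    labels = name.lower().rstrip(".").split(".")
    for i in range(len(labels)):          # walk every suffix of the name
        suffix = ".".join(labels[i:]) + "."
        if suffix in TTL_CAPS:
            return min(server_ttl, TTL_CAPS[suffix])
    return server_ttl

# effective_ttl("4.3.2.1.in-addr.arpa.", 86400) -> 5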

Another thing that would be useful to me for figuring out the BL cache
stuff would be traces of incoming IP addresses and timestamps for mail
servers larger than my tiny system, so I could try out some cache models.

R's,
John

Hi!

Just about any caching nameserver supports forwarding. So if you REALLY are afraid this will bite you, fire up another instance, forward all the reverse DNS to it, and you are set. That other instance will handle your reverse DNS with a cache that you size yourself; the other entries are not impacted at all. When you use forwarding, the forwarding server doesn't cache the entry (the 'forward only' option in BIND, for example).

So:

Caching resolver -> Caching resolver dedicated for reverse DNS.

Your first resolver does not cache reverse DNS, so you are safe there. The second one handles all the reverse DNS.
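
In BIND that could look roughly like this (untested sketch;
192.0.2.53 is a placeholder address for the dedicated instance):

// Front resolver: hand everything under in-addr.arpa to the second
// instance. 192.0.2.53 is a placeholder address.
zone "in-addr.arpa" {
    type forward;
    forward only;
    forwarders { 192.0.2.53; };
};

// Dedicated instance: give reverse DNS its own cache budget.
options {
    max-cache-size 32M;    // illustrative number, tune to taste
};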

I talked with Paul Vixie about doing this internally inside BIND, but that was not something they would be delighted to do (at least not now). If you could define how large your cache pool is for certain objects, that would fix it also.

Reverse DNS isn't the only issue here. There are many sites that give each user a subdomain, and if I look at the top talkers on some busy resolvers, I see that this accounts for about 25-30% of the lookups currently.

akamai.net, amazonaws.com and so on all make heavy use of DNS for this.
Those have literally millions of entries in DNS as well, and that's what
is currently generating the load on resolvers...

Oh, and don't forget DNSSEC. Traffic already went up by a factor of 7 due to that, and those objects are also much larger to store.

Bye,
Raymond.

> If the DNSBL queries are not likely to be used again, why don't they
> set their TTL way down?

Because the DNSBLs don't tune the TTLs for individual responses; they
likely still benefit from extended caching, high TTLs for responses
make sense for controlling load on the DNSBL servers, and your cache
efficiency is not the DNSBL operators' problem. There are /some/ IP
addresses that generate a lot of mail, and I would suggest there are
/plenty/ of low-traffic recursive DNS servers in the world that may
still make DNSBL queries.

The /real/ problem to be addressed is the poor caching discipline
used by DNS recursive servers. A recursive server with intelligent
caching would not simply allocate a chunk of heap space, hold each
record for its TTL, and evict the oldest entry when the assigned
cache space runs out. The simple victim-selection policies
(least-recently hit cache entry, LRU, least-recent RR response,
least-recently queried, or lowest remaining TTL) are all naive.

An intelligent recursive DNS caching system would leverage both RAM
and disk when appropriate, store the cache efficiently and
persistently, and keep track of the cache evictions that occurred.
The number of times a RR was requested would factor into avoiding
eviction, and for a popular RR it would also make sense for the
recursive DNS server to make a pre-emptive query to refresh a RR
that is about to expire due to TTL, so that the response latency at
whatever random time that RR is next queried doesn't go up because
the cache entry expired.

And I say that, because some very popular RRs have insanely low TTLs.

Case in point:
www.l.google.com. 300 IN A 74.125.227.148
www.l.google.com. 300 IN A 74.125.227.144
www.l.google.com. 300 IN A 74.125.227.146
www.l.google.com. 300 IN A 74.125.227.145
www.l.google.com. 300 IN A 74.125.227.147
www.l.google.com. 300 IN A 74.125.227.148
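
To sketch what I mean by pre-emptive refresh (toy Python; the
thresholds are invented, though Unbound's 'prefetch' option does
something in this spirit):

import time

# Illustrative sketch of pre-emptive refresh: when a popular RR is in
# the last slice of its TTL, re-query before it expires so no client
# ever eats the full recursive-lookup latency. Thresholds are invented.

PREFETCH_WINDOW = 0.10    # refresh once <10% of the TTL remains
MIN_HITS = 5              # only bother for RRs queried this often

class Entry:
    def __init__(self, rrset, ttl):
        self.rrset = rrset
        self.ttl = ttl
        self.expires = time.time() + ttl
        self.hits = 0

def lookup(cache, name, resolve):
    """resolve(name) -> (rrset, ttl) stands in for a real recursive query."""
    now = time.time()
    ent = cache.get(name)
    if ent is None or ent.expires <= now:
        rrset, ttl = resolve(name)        # plain cache miss
        cache[name] = ent = Entry(rrset, ttl)
    ent.hits += 1
    if ent.hits >= MIN_HITS and ent.expires - now < PREFETCH_WINDOW * ent.ttl:
        rrset, ttl = resolve(name)        # refresh before expiry
        fresh = Entry(rrset, ttl)
        fresh.hits = ent.hits             # carry popularity forward
        cache[name] = ent = fresh
    return ent.rrset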

> In any case, DNSBL's use of DNS has always been a hack. If v6
> causes the hack to blow up, they should create their own protocol
> rather than ask how we can make the global DNS accommodate
> their misuse of DNS.
>
> Mike

[snip]

Akamai has no "users", so I'm not really sure what you mean by that.

There are a /lot/ of hostnames on *.akamai.net. That may have something to do with the thousands of companies that use Akamai to deliver approximately 20% of all the traffic going down broadband modems, which fits nicely with your DNS lookup percentage.

Different people have different points of view.

IMHO, if Google loses a datacenter and all users are stuck waiting for a long TTL to run out, that is Very Bad. In fact, I would call even 2.5 minutes (the average wait with a 5 minute TTL) Very Bad. I'm impressed they are comfortable with a 300 second TTL.

You obviously feel differently. Feel free to set your TTL higher.

What Patrick said. For large sites that offer services in multiple data centers on multiple IPs that can individually fail at any time, 300 seconds is actually a bit on the long end.

-C

Which is why the DNS supports multiple address records. Clients
don't have to wait minutes to fail over to a second address. One
doesn't have to point all the addresses returned to the closest
data center. One can get sub-second failover in clients, as HE
code shows.

As for the original problem: LRU replacement will keep "hot" items in
the cache unless it is seriously undersized.
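
For concreteness, plain LRU is only a few lines (toy Python, not any
server's actual code):

from collections import OrderedDict

# Toy LRU cache: every hit moves an entry to the "recent" end; when
# space runs out, the least-recently-used entry goes, no matter what.

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)        # mark as recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False) # evict the LRU entry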

Mark

Maybe. This discussion is reminiscent of the Linux swappiness debate.

Early in the 2.x series of Linux kernels, the guy responsible for the
virtual memory manager changed it to allow the disk cache to push
program code and data out of RAM if all other disk cache was more
recently touched than the program data. Previously, the disk cache
would only consume otherwise free memory; programs would only get
pushed to swap by memory demands from other programs.

The users went ape. Suddenly if you copied a bunch of data from one
disk to another, your machine would be sluggish and choppy for minutes
or hours afterward as programs recovered swapped pages from disk and
ran just long enough to hit the next section needing to be recovered
from swap. Some folks ditched swap entirely to get around the problem.

The guy insisted the users were wrong. He had the benchmarks,
meticulously collected data and careful analysis to prove that the
machines were more efficient with pure LRU swap. The math said he was
right. 2+2=4. But it didn't.

In the very common case of copy-a-bunch-of-files, simple LRU
expiration of memory pages was the wrong answer. It caused the machine
to behave badly. More work was required until a tuned and weighted LRU
algorithm solved the problem.

Whether John's solution of limiting the cache by zone subtree is
useful or not, he started an interesting discussion. Consider, for
example, what happens when you ask for www.google.com. You get a 7-day
CNAME record for a 5-minute www.l.google.com A record, and the
resolver gets 2-day NS records pointing at ns1.google.com, 4-day A
records for ns1.google.com, 2-day NS records pointing at
a.gtld-servers.net, etc.

Those authority records don't get touched again until www.l.google.com
expires. With a hypothetical simple least-recently-used (LRU)
algorithm, the 4-minute-old A record for ns1.google.com was last
touched longer ago than the 3-minute-old A record for
5.6.7.8.rbl.antispam.com. So when the resolver needs more cache for
4.3.2.1.rbl.antispam.com, which record gets kicked?

Then, of course, when www.l.google.com expires after five minutes the
entire chain has to be refetched because ns1.google.com was already
LRU'd out of the cache. This is distinctly slower than just refetching
www.l.google.com from the already known address of ns1.google.com and
the user sees a slight pause at their web browser while it happens.

Would a smarter, weighted LRU algorithm work better here? Something
where rarely used leaf data doesn't tend to evict the also rarely used
but much more important data from the lookup chain?
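
As a strawman (the weights are invented, purely to illustrate the
shape of the idea):

import time

# Strawman weighted LRU: divide an entry's age by a weight so that
# infrastructure records (NS and nameserver glue) look "younger" than
# leaf answers of the same age. Weights are invented for illustration.

WEIGHTS = {"NS": 8.0, "GLUE_A": 8.0, "CNAME": 2.0, "A": 1.0}

def eviction_victim(cache):
    """cache maps name -> (rrtype, last_used timestamp); returns a name."""
    now = time.time()
    def effective_age(item):
        rrtype, last_used = item[1]
        return (now - last_used) / WEIGHTS.get(rrtype, 1.0)
    return max(cache.items(), key=effective_age)[0]

# With these weights, the 4-minute-old ns1.google.com glue record
# (effective age 30s) outlives the 3-minute-old DNSBL leaf answer
# (effective age 180s).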

Regards,
Bill Herrin

[snip]
Well, that's the problem. Items that are not relatively "hot" will
be purged, even though they may be very popular RRs. Cache
efficiency is not defined as "keeping the hot items".

Efficient caching is defined as maximizing the hit percentage.
The DNS cache may have a load of DNSBL queries, so all the entries
will be cold. The problem in that case is not low utilization; it's
high utilization by queries that are useless to cache, because those
questions will only be asked once, and plain LRU keeps no buffer
maintaining an eviction history.

An example alternative strategy: you have a cache of XX RR buckets,
and you keep a list of the last YY cache replacements (not necessarily
the entire RRs, just the label and a 1-byte count of evictions). The
replacement policy is then: pick an entry whose TTL has expired, or
failing that, the entry with the lowest eviction count that is least
recently used or has seen the fewest queries.
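
Roughly, in toy form (sizes and the exact victim rule are illustrative
guesses, not a tested policy):

import time
from collections import OrderedDict

# Toy version of the strategy above: XX live buckets plus a short
# history of the last YY evicted labels with a 1-byte eviction count.
# A label that keeps getting evicted and re-fetched earns protection.

class HistoryCache:
    def __init__(self, buckets, history):
        self.buckets = buckets            # XX: live RR slots
        self.history = history            # YY: remembered evictions
        self.live = OrderedDict()         # name -> (rrset, expires)
        self.evicted = OrderedDict()      # name -> eviction count (max 255)

    def _victim(self):
        now = time.time()
        for name, (_, expires) in self.live.items():
            if expires <= now:            # prefer anything already expired
                return name
        # Otherwise: fewest recorded evictions, ties broken by LRU order.
        return min(self.live, key=lambda n: self.evicted.get(n, 0))

    def get(self, name):
        hit = self.live.get(name)
        if hit and hit[1] > time.time():
            self.live.move_to_end(name)   # keep rough LRU order
            return hit[0]
        return None

    def put(self, name, rrset, ttl):
        if name not in self.live and len(self.live) >= self.buckets:
            victim = self._victim()
            del self.live[victim]
            self.evicted[victim] = min(self.evicted.get(victim, 0) + 1, 255)
            while len(self.evicted) > self.history:
                self.evicted.popitem(last=False)
        self.live[name] = (rrset, time.time() + ttl)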

Re: LRU badness

One approach is called adaptive replacement cache (ARC), which is
used by Oracle/Sun in ZFS, and was used in PostgreSQL for a time
(slightly modified, as I recall, to be more like 2Q due to concerns
over the IBM patent on the algorithm). Unfortunately, we do not have
any implementations of the OPT (aka clairvoyant) algorithm, so
something like 2Q might be an interesting approach to experiment
with.
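
For anyone who hasn't seen it, the 2Q shape is roughly this (heavily
simplified from the Johnson & Shasha paper; toy code only):

from collections import OrderedDict, deque

# Heavily simplified 2Q: first-time entries sit in a probationary FIFO
# (A1in); when pushed out, only their names are remembered (A1out). A
# name seen again while in A1out is judged genuinely popular and goes
# to the main LRU (Am). One-shot lookups live and die in A1in without
# ever disturbing Am.

class TwoQ:
    def __init__(self, a1_size, ghost_size, main_size):
        self.a1in = OrderedDict()              # probationary FIFO
        self.a1out = deque(maxlen=ghost_size)  # ghost list: names only
        self.am = OrderedDict()                # main LRU
        self.a1_size, self.main_size = a1_size, main_size

    def get(self, key):
        if key in self.am:
            self.am.move_to_end(key)           # normal LRU touch
            return self.am[key]
        return self.a1in.get(key)              # A1in hits don't promote

    def put(self, key, value):
        if key in self.am:
            self.am[key] = value
            self.am.move_to_end(key)
        elif key in self.a1out:                # seen before: promote
            self.am[key] = value
            if len(self.am) > self.main_size:
                self.am.popitem(last=False)
        else:
            self.a1in[key] = value
            if len(self.a1in) > self.a1_size:
                old, _ = self.a1in.popitem(last=False)
                self.a1out.append(old)         # remember the name only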

Gary

While I hesitate to argue DNS with Mark, I feel this needs a response.

>> What Patrick said. For large sites that offer services in multiple data
>> centers on multiple IPs that can individually fail at any time, 300
>> seconds is actually a bit on the long end.
>
> Which is why the DNS supports multiple address records. Clients
> don't have to wait minutes to fail over to a second address. One
> doesn't have to point all the addresses returned to the closest
> data center. One can get sub-second failover in clients, as HE
> code shows.

I'm afraid I am not familiar with "HE code", so perhaps I am being silly here. But I do not think returning multiple A records for multiple datacenters is as useful as lowering the TTL. Just a few reasons off the top of my head:

* How do you guarantee the user goes to the closer location if you respond
  with multiple addresses? Forcing users to go to farther away datacenters
  half the time is likely a poor trade-off for the occasional TTL problem
  when a DC goes down.

* How many applications are even aware multiple addresses were returned?

* How do you guarantee sub-second failover when most apps will wait longer
  than one second to see if an address responds?

Etc.

And that doesn't begin to touch things such as cache efficiency that affect companies like Google, CDNs, etc.

> As for the original problem: LRU replacement will keep "hot" items in
> the cache unless it is seriously undersized.

This was covered well by others.

> Some folks do this via various GSLB mechanisms which selectively respond with different records based on the assumed relative topological distance between the querying resolver and various server/service instantiations in different locations.

"Some folks" == "more than half of all traffic on broadband modems" these days.

However, I think you missed a post or two in this thread.

The original point was you need a low TTL to respond with a single A record or multiple A records which all point to the same datacenter in case that node / DC goes down. Mark replied saying you can respond with multiple A records pointing at multiple DCs, thereby allowing a much longer TTL.

My question above is asking Mark how you guarantee that the user/application selects the A record closest to them and uses the other A record only when the closer one is unavailable.

Mark is referring to "happy eyeballs":
http://www.isc.org/community/blog/201101/how-to-connect-to-a-multi-homed-server-over-tcp
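
The gist, as a toy sketch (the 250 ms stagger is illustrative; real
implementations tune it, and a real client would also close the
losing sockets):

import socket
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Toy happy-eyeballs-style connect: start an attempt to each address a
# quarter second apart and return the first socket that succeeds.

def connect_race(addresses, port, stagger=0.25, timeout=5.0):
    def attempt(addr, delay):
        time.sleep(delay)
        return socket.create_connection((addr, port), timeout=timeout)

    pool = ThreadPoolExecutor(max_workers=len(addresses))
    futures = [pool.submit(attempt, a, i * stagger)
               for i, a in enumerate(addresses)]
    winner = None
    for fut in as_completed(futures):
        try:
            winner = fut.result()
            break                       # first success wins
        except OSError:
            continue                    # that address failed; keep waiting
    pool.shutdown(wait=False)           # losing attempts finish in background
    if winner is None:
        raise OSError("no address reachable")
    return winner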

Tony.

> When you use forwarding, the forwarding server doesn't cache the entry
> (the 'forward only' option in BIND, for example).

That's incorrect. Try configuring a forwarded zone and observe the TTLs
you get in responses. The "forward only" option disables recursion but
not caching.

> I talked with Paul Vixie about doing this internally inside BIND, but that
> was not something they would be delighted to do (at least not now). If you
> could define how large your cache pool is for certain objects, that would
> fix it also.

You might be able to hack it using a combination of forwarding zones,
views, max-cache-ttl, and attach-cache.
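
Untested, but something along these lines (the addresses and the BL
zone name are placeholders):

// Untested sketch: one named, two views, two caches. Queries for the
// BL zone get forwarded to the server's second listener address, which
// matches a view with its own cache and a short max-cache-ttl.
// 192.0.2.1 / 192.0.2.2 and bl.example.net are placeholders.

view "default" {
    match-destinations { 192.0.2.1; };
    attach-cache "main";
    recursion yes;
    zone "bl.example.net" {
        type forward;
        forward only;
        forwarders { 192.0.2.2; };     // bounce BL lookups to the other view
    };
};

view "dnsbl" {
    match-destinations { 192.0.2.2; };
    attach-cache "dnsbl";              // separate cache just for BL data
    max-cache-ttl 30;                  // and don't hold it long
    recursion yes;
};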

Tony.