DNS cache validation

What are you folks doing to validate your DNS cache server configs and operation? In other words, what are you doing to make sure they are performing well, not just alive?


I wrote a script to expose stats from unbound to SNMP and built a Cacti template for that. More recently I started feeding the DNS stats into Telegraf, which pushes to an InfluxDB server, and built a dashboard in Grafana. We track DNS RTT for a few queries, number of drops, number of rejects, requests per second for various record types, etc. We also have a Nagios plugin that checks each of our caching resolvers scattered across the network to ensure they can resolve a handful of popular domains.
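A check along the lines of that Nagios plugin could look roughly like the sketch below. This is not the actual plugin, only an illustration using the Python standard library: the function names are made up, it only talks to IPv4 resolver addresses over UDP port 53, and it only looks at the response header (RCODE and answer count) while timing the round trip.

```python
import random
import socket
import struct
import time


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query: RD flag set, one question, class IN."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def parse_header(packet):
    """Return (query_id, rcode, answer_count) from a DNS response header."""
    query_id, flags, _qd, ancount, _ns, _ar = struct.unpack(
        "!HHHHHH", packet[:12]
    )
    return query_id, flags & 0x000F, ancount


def check_resolver(server, name, timeout=2.0):
    """Send one A query to `server`; return (ok, rtt_seconds, detail)."""
    query_id = random.randrange(0x10000)
    query = build_query(name, query_id=query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(query, (server, 53))
            reply, _addr = sock.recvfrom(4096)
        except OSError as exc:  # includes timeouts
            return False, None, f"no reply: {exc}"
        rtt = time.monotonic() - start
    if len(reply) < 12:
        return False, rtt, "short reply"
    reply_id, rcode, ancount = parse_header(reply)
    if reply_id != query_id:
        return False, rtt, "mismatched query ID"
    # RCODE 0 is NOERROR; require at least one answer record as well.
    return rcode == 0 and ancount > 0, rtt, f"rcode={rcode} answers={ancount}"
```

Calling check_resolver() once per resolver and per test name, and exiting with status 2 (CRITICAL in Nagios terms) if any call reports not-ok, gives both the alive/resolving check and an RTT value to graph.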

There are various things you can do. With resolvers like BIND's named,
you have a command called "rndc stats" that dumps statistics counters to
a file. The file contains a variety of counters that are worth
monitoring. I'll list some things for BIND's named, but you can probably
do something similar for other products too:

(1) Check the size of your cache and ensure that it is not too small and
that it is bounded. The max-cache-size config option limits it. In old
versions of named, there was no limit. Current versions have an
automatic limit. You want this size to be at least a few hundred MB for
a small LAN and larger if it is a widely used resolver. Check the "cache
records deleted due to memory exhaustion" counter in rndc stats output;
if it keeps climbing, the cache is probably too small.

(2) Check your cache hit rate (CHR). This is the number of queries that
were answered from cache vs. the number of overall client queries. You
should be able to compute this from rndc stats output. It can usually be
anywhere between 50% and 95% depending on the usage, but if you see it
dipping below 50% and this is an ordinary resolver, you may want to look
into why that is so. CHR is typically graphed and monitored that way.
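As a rough sketch of the CHR computation, assuming the statistics dump has lines of the form "NNNN label": the counter labels used as defaults below are assumptions and may differ between named versions, so check them against your own dump. The sketch approximates CHR as hits / (hits + misses), sums counters with the same label (for example across views), and expects the text of a single dump, since the statistics file is appended to on every "rndc stats" run.

```python
import re

# Matches counter lines of the form "<number> <label>".
COUNTER_LINE = re.compile(r"^\s*(\d+)\s+(\S.*?)\s*$")


def parse_counters(text):
    """Parse "NNNN label" lines into a dict, summing repeated labels."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            value, label = int(match.group(1)), match.group(2)
            counters[label] = counters.get(label, 0) + value
    return counters


def cache_hit_rate(
    counters,
    hits_label="cache hits (from query)",      # assumed label
    misses_label="cache misses (from query)",  # assumed label
):
    """Return hits / (hits + misses), or None if there is no data yet."""
    hits = counters.get(hits_label, 0)
    misses = counters.get(misses_label, 0)
    total = hits + misses
    return hits / total if total else None
```

Because the counters are cumulative, computing the rate from the difference between two successive dumps gives a more useful graph than the lifetime average.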

(3) Check the number of outstanding queries that the resolver is
performing. This should not be very high (the CHR influences it, but
other factors can cause this to go high too). "rndc recursing" dumps the
list of the clients that are waiting on recursion to finish (because the
cache didn't have an answer for them; note that many clients waiting for
the same question doesn't mean the resolver makes as many queries to
upstream authorities). The "recursive-clients" named.conf option is
related. The rndc recursing clients dump also contains a timestamp in
seconds since epoch, and it lists IPv4 and IPv6 clients in order of
arrival. The timestamp of the first IPv4 or IPv6 client should not be
very far from the current time. (If it is, that can be due to various
issues.)

(4) Check the resolver and socket I/O counters in the rndc stats dump.
Check the "NNNN queries caused recursion", "NNNN recursing clients",
"NNNN UDP queries in progress", "NNNN active fetches", etc. They have
other identifiers in named's XML

Check how many UDP recv and send errors are happening. Keep an eye on
the query logs, if you have them, for attack patterns (random subdomain
style attacks are common, but there are also some other attack patterns
which I won't mention). Mitigation may involve contacting your DNS
vendor. Keep an eye on response sizes and amplification attacks if
you're running a publicly reachable service.
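Counters such as the UDP send and receive errors are only useful as trends, so one common approach is to turn two successive samples into per-second rates before graphing or alerting. A minimal sketch, assuming the counters have already been parsed into dicts keyed by label (the labels in the usage comment are hypothetical) and that a counter going backwards means the server was restarted:

```python
def counter_rates(previous, current, elapsed_seconds):
    """Convert two samples of cumulative counters into per-second rates.

    Counters that are new in `current`, or that went backwards (taken
    here to mean a restart reset them), are skipped for this interval.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    rates = {}
    for label, now in current.items():
        before = previous.get(label)
        if before is None or now < before:
            continue
        rates[label] = (now - before) / elapsed_seconds
    return rates


# Hypothetical usage with two samples taken 60 seconds apart:
# counter_rates({"UDP send errors": 100}, {"UDP send errors": 400}, 60)
```

This only makes sense for cumulative counters; values like the number of recursing clients are point-in-time and should be graphed as they are.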

With very low TTLs on the answers to highly popular questions (names
not typically used by humans but by apps), some mitigations don't work
very well. If you are at this scale, you will likely be able to contact
your DNS product's support for help.

A resolver needs to maximize its CHR so it does very little work and
serves its purpose of being a cache. A resolver also has to be a good
internet citizen and not query upstream nameservers excessively, or
upstream nameservers will sometimes drop queries.

The above are examples of things to monitor that fall outside normal
operation, where normal includes things like failures to contact
nameservers, DNSSEC validation failures, etc. A resolver is a hugely
complex piece of software; software bugs are routine, and attacks
happen if it is public facing. What is documented may be different from
what actually happens. Your question is a good one, and it is good to
monitor a highly used resolver.


In current versions of BIND, my understanding is that the default value
of max-cache-size is 90% of the detected physical memory if the operator
does not set anything specific.