Everything else remaining equal...is there a standard or expectation for
DNS reliability?
98%
99%
99.5%
99.9%
99.99%
99.999%
Measured in queries completed vs. queries lost.
What's the consensus?
To me, anything below 99.99% is unacceptable.
100 failures out of 100,000 queries still seems like a lot, especially if
it's not network-related.
So I would say 99.999% is what I would look for.
Thanks
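For concreteness, a rough back-of-the-envelope sketch (Python; the 100,000-query
volume above is just a hypothetical baseline) of what each target means in lost
queries:

    # Expected lost queries per availability target,
    # assuming a hypothetical volume of 100,000 queries.
    volume = 100_000
    targets = [0.98, 0.99, 0.995, 0.999, 0.9999, 0.99999]

    for t in targets:
        lost = volume * (1 - t)
        print(f"{t:.3%} availability -> ~{lost:,.0f} lost queries per {volume:,}")

At 99.9% that is the 100 lost queries per 100,000 mentioned above; at 99.999% it
is down to 1.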
I go with 99.999%, given that you have a good number of DNS servers
(anycasted).
It's a good point about the anycast; 99.999% should be expected.
ICANN's new gTLD agreements specified 100% availability for the service,
meaning at least 2 DNS IP addresses answered 95% of requests within 500 ms
(UDP) or 1500 ms (TCP) for 51+% of the probes, or 99% availability for a
single name server, defined as 1 DNS IP address.
Rubens
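For what it's worth, a minimal sketch of how that probe-based definition might
be scored. The data shapes and function names are my own assumptions; only the
thresholds (95% answered, 500 ms UDP / 1500 ms TCP, 2 addresses, 51% of probes)
come from the figures Rubens quotes:

    # Sketch of an ICANN-style availability check; data shapes are hypothetical.
    UDP_LIMIT_MS = 500
    TCP_LIMIT_MS = 1500

    def server_up(rtts_ms, proto="udp"):
        """One DNS IP counts as 'up' for a probe if it answered >= 95% of that
        probe's queries within the protocol's time limit (None = no answer)."""
        limit = UDP_LIMIT_MS if proto == "udp" else TCP_LIMIT_MS
        answered = sum(1 for rtt in rtts_ms if rtt is not None and rtt <= limit)
        return answered / len(rtts_ms) >= 0.95

    def service_available(up_servers_per_probe):
        """The service counts as available if 51+% of probes saw at least
        2 DNS IP addresses 'up'."""
        ok = sum(1 for n in up_servers_per_probe if n >= 2)
        return ok / len(up_servers_per_probe) >= 0.51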
Remember, though, that anycast only solves for availability in one layer of
the system, and it is not difficult to create a less available anycast
presence if you do silly things with the way you manage your routes. A
system is only as available as the least available layer in that system.
For example, if you use an automated system that changes your route
advertisements, and that system encounters a defect that breaks your
announcements, then although a well-built anycast footprint might achieve
99.999%, a poorly implemented management system that is less available and
creates an outage would reduce the number.
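To put rough numbers on that point: if the anycast footprint and the management
layer fail independently, the availability of the whole is roughly the product
of the per-layer availabilities, so the weakest layer dominates. The figures
below are hypothetical:

    # End-to-end availability with independently failing layers is roughly
    # the product of the per-layer availabilities (hypothetical figures).
    anycast_layer = 0.99999   # well-built anycast footprint
    mgmt_layer = 0.999        # flaky route-management automation

    combined = anycast_layer * mgmt_layer
    print(f"combined ~= {combined:.5f}")   # ~0.99899: the automation drags
                                           # the whole thing below 99.9%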
Good reference; thank you.
Thumbs up on this one; my entire path and the chain of management of that path
need to be equally fault-tolerant. Awesome.
Everything else remaining equal...is there a standard or expectation for
DNS reliability?
...
Measured in queries completed vs. queries lost.
this is the wrong question. the protocol is designed assuming query
failures.
randy
I think it's part of the right answer. Capacity and server connectivity issues, what this metric will mostly measure, do matter.
The other part, more likely to get you on CNN and Reddit and the front pages of the NY Times and WSJ, is the area represented by MTBF / MTTR / etc.: how often is DNS for your domain DOWN, or WRONG, and how fast did you recover?
The other subthread about routability plays into that. For BIGPLACE environments, you should be considering how many AS numbers independently host DNS instances for you, in how many geographical regions, and whether you have a backup registrar available and spun up...
-george william herbert
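For the MTBF/MTTR side, the usual relation is availability = MTBF / (MTBF + MTTR).
A quick illustration with hypothetical numbers:

    # Availability from MTBF/MTTR (numbers are hypothetical).
    mtbf_hours = 8760   # say, one outage per year
    mttr_hours = 1      # one hour to notice, fix, and recover

    availability = mtbf_hours / (mtbf_hours + mttr_hours)
    print(f"{availability:.6f}")   # ~0.999886: a single slow recovery
                                   # already misses 99.99% for the year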
we're already outside our operating envelope, if these community
expectation figures are believable. a wise man once said to me that when
setting formal conformance targets it's a good idea to only set ones you can
honestly achieve, otherwise you're setting yourself up to be measured to
fail. I don't think that necessarily competes with 'aim high' ('be all you
can be') but...
we're already outside our operating envelope
not really. just some folk seem not to understand things such as udp
datagrams and the dns protocols.
randy
Statistically, UDP sometimes arrives after an internet-wide round trip. Honest!
The worry is bimodal.
Most small sites, two or three servers, stop worrying.
Most medium sites, watch your server load and run external monitoring.
Most big sites are not sufficiently paranoid / redundant here.
-george william herbert
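For the 'run external monitoring' part, even something as small as the sketch
below, run from a vantage point outside your own network, catches the
embarrassing cases. The zone and addresses are placeholders, and it assumes the
dnspython library:

    # Minimal external DNS probe (hypothetical zone/addresses; needs dnspython).
    import time
    import dns.resolver

    ZONE = "example.com"                    # zone to watch (placeholder)
    SERVERS = ["192.0.2.1", "192.0.2.53"]   # its authoritative IPs (documentation range)

    for server in SERVERS:
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [server]
        r.lifetime = 2.0                    # anything slower counts as a miss
        start = time.monotonic()
        try:
            r.resolve(ZONE, "SOA")
            print(f"{server}: ok in {time.monotonic() - start:.3f}s")
        except Exception as exc:
            print(f"{server}: FAILED ({exc})")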
you removed a clause in that sentence, randy:
"we're already outside our operating envelope, if these community
expectation figures are believable"
there is a point to that clause. it's the same as your answer in some
respects.
Remember to factor in Duane Wessels' work that showed that something
like 98% of the DNS traffic at the root servers was totally bogus?
Maybe you need to factor in "broken queries not answered, and offenders slapped
around with a large trout"? Because if it's busted requests you're sending
towards the root, they're going to count against your completed/lost ratio in a
really bad way.
Anybody know if people have cleaned up their collective acts since Duane
did that paper?
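If you do measure completed vs. lost, it probably pays to keep the junk out of
the denominator first; a hypothetical bit of bookkeeping along those lines:

    # Exclude bogus queries (the kind Duane's paper found flooding the roots)
    # before computing a completed/lost ratio. Data shapes are hypothetical.
    def success_rate(queries):
        """queries: iterable of (is_bogus, was_answered) boolean pairs."""
        legit = [(b, a) for b, a in queries if not b]
        if not legit:
            return None
        return sum(1 for _, a in legit if a) / len(legit)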
A small choice of attitude-reflecting language.
I expect 100.000%
I'll accept 99.999% or better.
here's an interesting point... if you are a BIGPLACE, do you want to
trust your fate to some third party hosting your dns for you? What
about how your internal name service stuff is managed?
say you have a practice of using rsh to effect updates across your 4
main dns nodes, adding a 5th or Nth outside where rsh is not
possible/desired .... means adding additional processes and cruft to
your update process, is this acceptable?
Take, for instance, the FBI.gov domain 3 days ago: some set of updates
happened, their ipv4 servers were answering with a consistent
response, and their ipv6 nodes were answering with a variety of
incorrect answers. In the case of the FBI.gov domain, all of it is
handled outside 'fbi.gov hands' (all servers hosted externally), but...
-chris
unless phil happens to be building out (or spec'ing out $provider's
offered sla) for one of the happy thousand or so celebrants of 2014, a
surprisingly large fraction of which are tenant plays on existing
infrastructure, the bogie above, uninterpreted, is not a controlling
authority.
additionally, was phil asking for a metric for an authoritative
server, serving a zone delegated directly from the iana root? was he
asking for a metric for a caching server?
and if the metric is "queries completed vs. queries lost", from where
to where? (that is the "uninterpreted" bit from the bogie rubens
quotes, as we did have to correct some assumptions of the requirement
author -- where is the measurement being performed?)
i'm with randy on this, dns is a service, the better question is what
fails as query response degrades, in the presence of hierarchical
caching and the protocol being used as designed under best effort of
infrastructure and application.
eric
It depends... define 'lost queries'. For example: is RRL included here
or not (sometimes you want to deliberately 'lose' queries).
I do not ever set any amount of failure as an objective. I usually have a specified tolerance for failure. If for some odd circumstance I want to discard queries, that would involve knowing exactly what happened to them--not losing them.
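As a toy illustration of that distinction: an RRL-style limiter knows exactly
which responses it chose to drop, so they can be accounted for separately from
genuinely lost queries. This is a sketch of the idea, not BIND's actual RRL
logic:

    # Toy RRL-style limiter: deliberately dropped responses are counted,
    # so they never show up as mystery losses. Not BIND's implementation.
    from collections import defaultdict

    LIMIT_PER_WINDOW = 5
    counts = defaultdict(int)     # (client_subnet, qname) -> responses this window
    deliberately_dropped = 0      # counts and window reset each second in real use

    def should_respond(client_subnet, qname):
        global deliberately_dropped
        counts[(client_subnet, qname)] += 1
        if counts[(client_subnet, qname)] > LIMIT_PER_WINDOW:
            deliberately_dropped += 1
            return False
        return True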