ultradns reachability

is anyone else seeing timeouts reaching ultradns' .org nameservers?

I'm seeing seemingly random timeout failures from both sbci and uc berkeley.

I'm seeing random DNS failures in general -- in the last half hour I had
failures on isc.org, slashdot.org, digikey.com, atmel.com

-Chris

Once upon a time, Matt Ghali <mghali@gmail.com> said:

is anyone else seeing timeouts reaching ultradns' .org nameservers?

I'm seeing seemingly random timeout failures from both sbci and uc berkeley.

One is working and one is not from here.

$ dig +norec @tld1.ultradns.net whoareyou.ultradns.net in a

; <<>> DiG 8.4 <<>> +norec @tld1.ultradns.net whoareyou.ultradns.net in a
; (1 server found)
;; res options: init defnam dnsrch
;; res_nsend: Connection timed out
$ dig +norec @tld2.ultradns.net whoareyou.ultradns.net in a

; <<>> DiG 8.4 <<>> +norec @tld2.ultradns.net whoareyou.ultradns.net in a
; (1 server found)
;; res options: init defnam dnsrch
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33271
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUERY SECTION:
;; whoareyou.ultradns.net, type = A, class = IN

;; ANSWER SECTION:
whoareyou.ultradns.net. 0S IN A 204.74.105.6

;; Total query time: 403 msec
;; FROM: ant.hiwaay.net to SERVER: 204.74.113.1
;; WHEN: Thu Jul 1 20:10:28 2004
;; MSG SIZE sent: 40 rcvd: 56
$
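
[ Aside: a quick way to tell which anycast instance you are hitting,
sketched with dig 9 syntax. whoareyou.ultradns.net is UltraDNS's
instance-identifying record; hostname.bind is the BIND-style CHAOS
identity, answered only by servers that implement it: ]

$ dig +norec +short @tld1.ultradns.net whoareyou.ultradns.net in a
$ dig +short @tld1.ultradns.net hostname.bind chaos txt

Repeating the same queries against tld2.ultradns.net from several
vantage points shows which pod each path lands on.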

Yes, it looks like it is starting to get back to normal since I got your email :)

As far as I could tell it started around 5:30 PST and ended around 6:00 PST.

Thanks,

Eric

http://www.cymru.com/DNS/gtlddns-o.html

my mrtg skillz are kind of lame, but this seems to show 2/3rds outage from
this monitoring point of view. It'd be nice if the aforementioned
'what/where/who' info was available for each monitoring point CYMRU
uses... So you could tell that from the SBC POV you were querying the XO
westcoast pod, from the APPS POV you saw the Verio CHI pod and from the
AT&T POV you saw the ATT local pod.

Anycast makes the pinpointing of problems a little challenging from the
external perspective, it seems to me.

my mrtg skillz are kind of lame, but this seems to show 2/3rds outage from
  this monitoring point of view. It'd be nice if the aforementioned
  'what/where/who' info was available for each monitoring point CYMRU
  uses... So you could tell that from the SBC POV you were querying the XO
  westcoast pod, from the APPS POV you saw the Verio CHI pod and from the
  AT&T POV you saw the ATT local pod.
  
  Anycast makes the pinpointing of problems a little challenging from the
  external perspective, it seems to me.

i am relieved it is only 'a little challenging'
because i was worried it was 'sub-possible'.
(or am i misinterpreting operational euphemisms...)

if we use the routing system to hide reality,
we give up transparency in exchange for vigor.
it's unclear to me that we even know how to quantify
much less measure that tradeoff. like so many other
complexity tradeoffs..
but then we've taken similar risks before and gotten
stuff like BGP so maybe we'll be, um, just as fond of
anycast in due time. :)

k
//
they call this war a cloud over the land. but they made the
weather and then they stand in the rain and say 'sh*t it's raining!'.
  -- renée zellweger, 'cold mountain'
//

  > http://www.cymru.com/DNS/gtlddns-o.html
  >
  Anycast makes the pinpointing of problems a little challenging from the
  external perspective, it seems to me.

i am relieved it is only 'a little challenging'
because i was worried it was 'sub-possible'.
(or am i misinterpreting operational euphemisms...)

Oops, I did it again, I forgot the ":)".

So, I thought of it like this:
1) Rodney/Centergate/UltraDNS knows where all their 35000 billion copies of
the 2 .org TLD boxes are, what network pieces they are connected to at
which bandwidths and the current utilization
2) Rodney/Centergate/UltraDNS knows which boxes in each location (there
could be multiple inside each pod, right?) are running their dns process
and answering at which rates
3) Rodney/Centergate/UltraDNS knows when processes die and locally stop
pushing requests to said system inside the pod
4) Rodney/Centergate/UltraDNS knows when a pod is completely down (no
systems responding inside the local pod) so they can stop routing the /24
from that pod's location

So, Rodney/Centergate/UltraDNS should know almost exactly when they have a
problem they can term 'critical'... I most probably left out some steps
above, like wedged processes or loss of outbound routing to prefixes
sending requests. I'm sure Paul/ISC has a fairly complete list of failure
modes for anycast DNS services.
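
[ Aside: a minimal sketch of the per-pod check implied by (3) and (4)
above -- if the local nameserver stops answering, withdraw the service
prefix so queries fail over to another pod. "routectl" is a made-up
placeholder; the real hook would be the pod's OSPF/BGP daemon: ]

#!/bin/sh
# Probe the local nameserver every 10 seconds; announce the service
# prefix only while it answers, withdraw it otherwise so anycast
# routing fails over to another pod.
PREFIX="204.74.112.0/24"               # example service prefix only
while sleep 10; do
    if dig +time=2 +tries=1 +norec @127.0.0.1 \
        whoareyou.ultradns.net a >/dev/null 2>&1; then
        routectl announce "$PREFIX"    # hypothetical routing hook
    else
        routectl withdraw "$PREFIX"    # hypothetical routing hook
    fi
done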

The problem then becomes the "Hey, .org is dead!" From where is it dead?
What pod are you seeing it dead from? Is it routing TO the pod from you?
FROM the pod to you? The pod itself? Stuck/stale routing information
somewhere on the path(s)? This is very complex, or seems to be to me :(

A good thing, oddly enough, is each of these events gives everyone more
and better information about the failure modes :)

but then we've taken similar risks before and gotten
stuff like BGP so maybe we'll be, um, just as fond of
anycast in due time. :)

I think more failure modes will be investigated before that comes :)
fortunately lots of people are already investigating these, eh?

-Chris

Date: Fri, 02 Jul 2004 04:18:07 +0000 (GMT)
From: Christopher L. Morrow

[ edited for brevity -- some punctuation/wording modified ]

So, I thought of it like this. Rodney/Centergate/UltraDNS
knows:

[ snip enumeration ]

[and] should know almost exactly when they have a problem
they can term 'critical'...

One essentially has a DNS network on top of an IP network. Looks
like O(N) with centralized monitoring, although it could approach
O(N^2) if each server/pod cross-monitored all the others.
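
[ For concreteness: with, say, N = 50 pods, centralized monitoring is
50 probe targets, while full cross-monitoring is 50 * 49 / 2 = 1,225
pairs (2,450 if each direction is probed separately) -- hence the O(N)
vs. O(N^2) distinction. ]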

The problem then becomes the "Hey, .org is dead!" From where
is it dead? What pod are you seeing it dead from? Is it
routing TO the pod from you? FROM the pod to you? The pod
itself? Stuck/stale routing information somewhere on the
path(s)? This is very complex, or seems to be to me :(

I find your perception of complexity ironic. Yes, there's a good
deal of splay. However, I suspect a network the size of UU also
has a fair amount of peering splay, with a couple downstreams
thrown in for good measure. ;)

However, I agree anycast has additional design implications:

* Should servers/pods talk among themselves using mcast along
  pairs that follow L3 topology? Should N servers/pods each
  communicate with (N / 2 + 1) others, ignoring L3 topology?
  Fast poll the former and slow poll the latter?

* If servers/pods communicate among themselves, should they use
  unicast addresses? anycast addresses? anycast addresses
  tunneled through unicast?

* Each pod a stub? Each pod interconnected with an OOB OAM
  network? All pods interconnected with sizable backbone? Does
  multicast serve a purpose?

Eddy

So, I thought of it like this:
1) Rodney/Centergate/UltraDNS knows where all their 35000 billion copies of
the 2 .org TLD boxes are, what network pieces they are connected to at
which bandwidths and the current utilization
2) Rodney/Centergate/UltraDNS knows which boxes in each location (there
could be multiple inside each pod, right?) are running their dns process
and answering at which rates
3) Rodney/Centergate/UltraDNS knows when processes die and locally stop
pushing requests to said system inside the pod
4) Rodney/Centergate/UltraDNS knows when a pod is completely down (no
systems responding inside the local pod) so they can stop routing the /24
from that pod's location

So, Rodney/Centergate/UltraDNS should know almost exactly when they have a
problem they can term 'critical'... I most probably left out some steps
above, like wedged processes or loss of outbound routing to prefixes
sending requests. I'm sure Paul/ISC has a fairly complete list of failure
modes for anycast DNS services.

All the failure modes that ISC has seen with anycast nameserver instances can be avoided (for the authoritative DNS service as a whole) by including one or more non-anycast nameservers in the NS set.

This leaves the anycast servers providing all the optimisation that they are good for (local nameserver in topologically distant networks; distributed DDoS traffic sink; reduced transaction RTT) and provides a fall-back in case of effective reachability problems for the anycast nameservers.

This is so trivial, I continue to be amazed that PIR hasn't done it.
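
[ A sketch of what such a mixed NS set could look like, in master-file
form; tld1/tld2 were the actual ORG servers at the time, and the third
name is invented for illustration: ]

org.  86400  IN  NS  tld1.ultradns.net.    ; anycast
org.  86400  IN  NS  tld2.ultradns.net.    ; anycast
org.  86400  IN  NS  ns0.unicast.example.  ; hypothetical single-site server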

The problem then becomes the "Hey, .org is dead!" From where is it dead?
What pod are you seeing it dead from? Is it routing TO the pod from you?
FROM the pod to you? The pod itself? Stuck/stale routing information
somewhere on the path(s)? This is very complex, or seems to be to me :(

With the fix above, the problem becomes "hey, *some* of the nameservers for ORG are dead! We should fix that, but since not *all* of them are dead, at least ORG still works."

I think more failure modes will be investigated before that comes :)
fortunately lots of people are already investigating these, eh?

I don't know about lots, but I know of a few. None of the people I know of are using an entire production TLD as their test-bed, however.

Joe

In a message written on Fri, Jul 02, 2004 at 10:22:09AM -0400, Joe Abley wrote:

This leaves the anycast servers providing all the optimisation that
they are good for (local nameserver in topologically distant networks;
distributed DDoS traffic sink; reduced transaction RTT) and provides a
fall-back in case of effective reachability problems for the anycast
nameservers.

This is so trivial, I continue to be amazed that PIR hasn't done it.

I talked to Rodney about this a long time ago, as well as a few
other people. What in practice seems simple is complicated by some
of the software that is out there. See:

http://www.nanog.org/mtg-0310/pdf/wessels.pdf

Note in the later pages what happens to particular servers under
packet loss. They all start to show an affinity for a subset of
the servers. It's been said that if you put some non-anycasted
servers in with the anycasted servers, then when the anycast has
issues many resolvers will "latch on" to the non-anycasted servers
and not go back even when the anycast is fixed.
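
[ One invented illustration of how such affinity can arise: suppose a
resolver keeps a smoothed RTT estimate per NS address, always prefers
the lowest, and charges each timeout as a multi-second penalty while a
successful reply only nudges the estimate a little. After an anycast
outage the anycast address carries a huge estimate and the unicast
address a normal one, so it can take a very long run of queries before
the anycast address looks attractive again -- i.e., traffic stays
"latched". Real resolver selection algorithms differ in detail. ]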

How serious this is for something like .org I have no idea, but it's
clear all the software has issues, and until they are fixed I don't
think this is just a slam dunk.

Sorry, I missed the top of this thread. I cannot mail an ORG correspondent since
ORG domain lookups fail. What is happening and is there a workaround?

Thanks for any help

Jeffrey Race

In my opinion, the primary purpose of anycast distribution of nameservers is reliability of the service as a whole, and not performance. Being able to reach a server is much more important than whether you can get a reply from a particular server in 10ms or 500ms.

So, I think the issue you mention (which is certainly mention-worthy) is a much smaller problem than the apparently observed problem of all nameservers in the NS set being unavailable.

Joe

In a message written on Fri, Jul 02, 2004 at 11:16:08AM -0400, Joe Abley wrote:

In my opinion, the primary purpose of anycast distribution of
nameservers is reliability of the service as a whole, and not
performance. Being able to reach a server is much more important than
whether you can get a reply from a particular server in 10ms or 500ms.

Well, you're right, but there's a practical matter to scaling the
deployment.

If you have 50 anycast servers, each doing 1 unit of work, and you
list the anycast address plus one non-anycasted address, there's a
real possibility that the vast majority of clients out there will
latch on to that one address, sending all 50 units of load towards
it. So the question is not so much "is 500ms towards the server
bad", it's "can I build a single server (cluster) that will take
all the load worldwide when the client software does bad things."

Of course, everyone I've ever seen talk about this is either
referencing a lab test, or theory based on how the code works. I've
seen very few real-world measurements showing how this actually plays
out in the wild. If someone has anycast + unicast for a "busy" zone
and can provide real query-distribution data (particularly before and
after an outage), that would be quite interesting.
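
[ A hedged sketch of one way to collect exactly that data on a resolver
host: tally outbound queries per authoritative address from a packet
capture. The two addresses are placeholders for the zone's anycast and
unicast NS addresses: ]

$ tcpdump -nn -c 10000 'udp and dst port 53 and (dst host 10.1.0.1 or dst host 10.2.0.1)' \
    2>/dev/null | awk '{sub(/:$/, "", $5); print $5}' | sort | uniq -c | sort -rn

A heavy, persistent skew toward one address after an outage would
support the "latch on and never go back" behaviour described upthread.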

Am I missing something?

So you say:

10.1.0.1 Anycast (x50 boxes)
10.2.0.1 Non-anycast

is somehow different from

10.1.0.1 Anycast1 (x50 boxes)
10.2.0.1 Anycast2 (x50 boxes - different to anycast1)

In each scenario two systems have to fail to take out any one customer... but
isn't the bottom one better for the usual pro-anycast reasons?

Steve

Correct, and that's what's done whenever engineering triumphs over
marketing. The problem is that there's always a temptation to put
instances of both clouds at a single physical location, but that's
sabotaging yourself, since then the attack which takes down one will take
down the other as well.

With DNS, it really makes sense to do what you're suggesting, since DNS
has its own internal load-balancing function, and having two separate
clouds just means that you're giving both the anycast and the DNS client
load-balancing algorithms a chance to work. With pretty much any other
protocol (except peer-to-peer clients, which also mostly do client-side
load balancing) there's a big temptation to have a single huge cloud that
appears in as many places as possible.
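
[ A sketch of Steve's two-cloud layout as a delegation, using his
example addresses; the names are invented: ]

example.  IN  NS  a.ns.example.  ; 10.1.0.1, anycast cloud 1 (50 instances)
example.  IN  NS  b.ns.example.  ; 10.2.0.1, anycast cloud 2 (50 disjoint instances)

Each query is then load-balanced twice: the resolver picks one NS from
the RRset, and routing picks the nearest instance within that cloud.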

                                -Bill