.ORG problems this evening

tld[12].ultradns.net, the NS for .ORG, was completely unreachable for about
an hour or two this evening, timing out on all DNS queries. Anyone else see
similar? (The hosts are unpingable and untracerouteable, so I had to use
DNS queries to determine when they were back up.)

It makes me wonder how UltraDNS got a contract to manage the domain on all
of two nameservers hosted on the same subnet, given that they were supposed
to have deployed "geographically diverse" (or something like that) servers.
But then, we know ICANN smokes the crack liberally at times....

<sigh>

tld[12].ultradns.net, the NS for .ORG, was completely unreachable for about
an hour or two this evening, timing out on all DNS queries. Anyone else see
similar? (The hosts are unpingable and untracerouteable, so I had to use
DNS queries to determine when they were back up.)

It makes me wonder how UltraDNS got a contract to manage the domain on all
of two nameservers hosted on the same subnet, given that they were supposed
to have deployed "geographically diverse" (or something like that) servers.
But then, we know ICANN smokes the crack liberally at times....

  dare i say "duh",

  but ...

  ultradns uses the power of anycast to have these ips that appear
to be on close subnets in geographically diverse locations.

  go to europe, traceroute to them, it goes to a place
in europe.

  go to asia, traceroute to them, it goes to a machine in
asia.

  in the us, it goes to one of a few geographical locations ...
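
  (a quick way to see which instance you're actually reaching -- assuming
the server answers the BIND-style "hostname.bind" identity query, which
not every operator leaves enabled -- is something like:

  dig @tld1.ultradns.net hostname.bind chaos txt +short

run from a couple of vantage points, the answers should differ.)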

  could you provide some more technical details, other than
your postulations that they have two machines on
network-wise close subnets and that is the problem?

  - jared

Date: Thu, 18 Sep 2003 00:50:28 -0400 (EDT)
From: Todd Vierling

tld[12].ultradns.net, the NS for .ORG, was completely
unreachable for about an hour or two this evening, timing out
on all DNS queries. Anyone else see similar? (The hosts are

I don't recall having troubles this evening. Perhaps there was a
DoS or something pounding the anycast node you were hitting?
With multiple sinkholes, it's no longer "all or nothing".

Anycast is good stuff, IMHO, but not impervious to flooding.

Eddy

Just because the hosts are on the same subnet and are apparently behind
the same end device for you doesn't make them non-geographically diverse
if they are really anycast pods, does it? It really just means one anycast
pod was down for a time :frowning:

It is one of the things that anycast makes difficult though :frowning:
Troubleshooting anycast from the outside is a bear.

Oh, and 'same subnet' doesn't mean 'same ethernet'. All the auth dns servers
in 198.6.1.0/24 aren't on one ethernet, though it'd sure make MY life easier
if they were :slight_smile:

Date: Thu, 18 Sep 2003 05:28:05 +0000 (GMT)
From: Christopher L. Morrow

Just because the hosts are on the same subnet and are
apparently behind the same end device for you doesn't make
them non-geographically diverse if they are really anycast
pods, does it? It really just means one anycast pod was down
for a time :frowning:

Ideally, though, an anycast node should yank the route if the
service in question dies. I say "ideally" because we still
haven't had DNS properly make friends with BGP... and such flaps
really shouldn't be seen, which means having a contiguous
internal network and [properly] decoupling IGP from EGP...

...and suddenly I'm making many assumptions. :wink:

It is one of the things that anycast makes difficult though
:frowning: Troubleshooting anycast from the outside is a bear.

It's a lot like multihoming, only with different geography.
Unicast IP addresses are analogous to world-facing router
interfaces.

Tip for anyone considering playing with anycast, particularly on
the same ethernet segment: Bind the anycast IP addresses to your
loopback interface.
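
(On a Linux box that's roughly the following -- the address is just a
placeholder here, and the equivalent ifconfig alias does the job on the BSDs:

  # anycast service address lives on loopback, not on the LAN interface
  ip addr add 204.74.112.1/32 dev lo

The LAN interface keeps its ordinary, globally unique address.)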

Eddy

Todd Vierling wrote:

tld[12].ultradns.net, the NS for .ORG, was completely unreachable for about
an hour or two this evening, timing out on all DNS queries. Anyone else see
similar? (The hosts are unpingable and untracerouteable, so I had to use
DNS queries to determine when they were back up.)

At any given moment, UltraDNS (and, I am sure, the other root and tld
servers) is under attack somewhere by someone. Additionally, the monitors
that test each of the anycast nodes reported no outages. Neither did the
useful monitors that Rob Thomas runs
(http://www.cymru.com/DNS/gtlddns-o.html), nor did the many "helpful"
customers who use UltraDNS, and who run constant tests against each
individual anycast node in search of an SLA event that may provide a
service credit. :wink:

Perhaps you had a network problem internally?

It makes me wonder how UltraDNS got a contract to manage the domain on all
of two nameservers hosted on the same subnet, given that they were supposed
to have deployed "geographically diverse" (or something like that) servers.

Fortunately ICANN and the other decision makers were actually network
clueful, and could tell that 204.74.112.1 and 204.74.113.1 are indeed on
different subnets :wink:

As an aside, using ping or traceroute at *any* time to see if dns
servers are working is not a great idea.
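
(If you want to know whether the service is up, ask it the sort of question
it exists to answer -- a sketch:

  dig @tld1.ultradns.net org. soa +norecurse
  dig @tld2.ultradns.net org. soa +norecurse

A timeout there means something; an unanswered ping to a box that filters or
rate limits ICMP means nothing.)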

um, dude, can you say ANYCAST.

Todd/Chris,

It makes me wonder how UltraDNS got a contract to manage the domain on
all of two nameservers hosted on the same subnet, given that they were
supposed to have deployed "geographically diverse" (or something like
that) servers. But then, we know ICANN smokes the crack liberally at
times....

Just because the hosts are on the same subnet and are apparently behind
the same end device for you doesn't make them non-geographically diverse
if they are really anycast pods, does it? It really just means one anycast
pod was down for a time :frowning:

Not even that:

(a) being on the same /24 doesn't make them on the same subnet. We
    have this thing called CIDR nowadays. Hell, being on the
    same /32 doesn't make them on the same (physical) subnet with anycast.

(b) they aren't on the same /24 anyway. Specs needed.
    Compare:

  amb@shed:~$ host tld1.ultradns.net
  tld1.ultradns.net has address 204.74.112.1
                                       ^===*****
  amb@shed:~$ host tld2.ultradns.net
  tld2.ultradns.net has address 204.74.113.1
                                       ^===*****

    A quick inspection from your favourite looking glass will show that
    there is a /23 and a /24 separately announced. Hence:

(c) at least from here, the routes are pretty diverse in any case
    (see below). I.e. not only is (say) a west coast host served by different
    servers than a European host, but even from a single European host the
    two addresses are served by different servers.

Alex

amb@shed:~$ traceroute tld1.ultradns.net
traceroute to tld1.ultradns.net (204.74.112.1), 30 hops max, 38 byte packets
1 195.82.114.1 (195.82.114.1) 0.917 ms 4.344 ms 6.131 ms
2 bdr1.lon-th1.mailbox.net.uk (195.82.97.226) 10.080 ms 4.050 ms 1.694 ms
3 195.82.96.70 (195.82.96.70) 1.552 ms 1.553 ms 2.138 ms
4 ge-1-3-0.r01.londen03.uk.bb.verio.net (217.79.161.10) 1.694 ms 11.200 ms 4.134 ms
5 ge-1-2.a01.londen03.uk.ra.verio.net (213.130.47.83) 150.565 ms 148.296 ms 199.156 ms
6 UltraDNS-0.a01.londen03.uk.ra.verio.net (213.130.48.38) 14.549 ms 8.962 ms 22.128 ms
7 dellfwabld.ultradns.net (204.74.106.2) 21.371 ms !H 27.196 ms !H 31.775 ms !H
amb@shed:~$ traceroute tld2.ultradns.net
traceroute to tld2.ultradns.net (204.74.113.1), 30 hops max, 38 byte packets
1 195.82.114.1 (195.82.114.1) 1.611 ms 3.728 ms 3.501 ms
2 bdr1.lon-th1.mailbox.net.uk (195.82.97.226) 8.889 ms 1.704 ms 3.333 ms
3 if-4-1-0.bb2.London.Teleglobe.net (195.219.2.1) 57.452 ms 140.352 ms 208.421 ms
4 if-6-0.core1.London.Teleglobe.net (195.219.96.82) 2.463 ms 7.007 ms 7.525 ms
5 if-1-0.core2.NewYork.Teleglobe.net (207.45.220.37) 82.125 ms 73.374 ms 92.908 ms
6 if-4-0.bb8.NewYork.Teleglobe.net (66.110.8.130) 76.796 ms 73.816 ms 77.613 ms
7 p3-3.IR1.NYC-NY.us.xo.net (206.111.13.13) 93.642 ms 83.092 ms 102.477 ms
8 p5-1-0-2.RAR2.NYC-NY.us.xo.net (65.106.3.65) 72.338 ms 72.658 ms 71.785 ms
9 p6-0-0.RAR1.Washington-DC.us.xo.net (65.106.0.2) 117.226 ms 76.936 ms 78.430 ms
10 p6-1-0.MAR1.Washington-DC.us.xo.net (65.106.3.182) 77.206 ms 84.026 ms 77.201 ms
11 p0-0.CHR1.Washington-DC.us.xo.net (207.88.87.10) 85.094 ms 77.777 ms 77.614 ms
12 64.124.112.141.ultradns.com (64.124.112.141) 85.226 ms 77.912 ms 78.143 ms
13 dellfwpxvn.ultradns.net (204.74.104.2) 77.940 ms !H 78.441 ms !H 97.738 ms !H
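
(Anyone who wants to check the separate announcements themselves can ask a
public route server -- telnet to route-views.routeviews.org, for instance,
and run:

  show ip bgp 204.74.112.1
  show ip bgp 204.74.113.1

what you see will of course depend on the vantage point.)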

tld[12].ultradns.net, the NS for .ORG, was completely unreachable for about
an hour or two this evening, timing out on all DNS queries. Anyone else see
similar? (The hosts are unpingable and untracerouteable, so I had to use
DNS queries to determine when they were back up.)

  I didn't have a problem with .org this evening, and I've asked
around and others don't seem to have noticed anything either. It would be
more helpful if you told us your source prefix, and which filter you're
hitting when you traceroute to tld[12].ultradns.net.

  As far as the hosts themselves being filtered, I don't know of
any responsible TLD or root server operator that doesn't filter and/or
rate limit certain types of traffic to their servers -- you have to
understand the incredible volume of garbage they receive from both DoS
attacks and misconfigured or merely broken resolvers out there.

It makes me wonder how UltraDNS got a contract to manage the domain on all
of two nameservers hosted on the same subnet, given that they were supposed
to have deployed "geographically diverse" (or something like that) servers.

  They're not on the same subnet:

  tld1.ultradns.net has address 204.74.112.1
  tld2.ultradns.net has address 204.74.113.1
                                         ^

  But even if they were, there is a neat trick that some people
(waves to Paul, Rodney, and others) are doing with their DNS servers:
They advertise the same prefix to multiple networks in multiple
locations, and each location (hopefully) attracts traffic from nearby
sources -- when it works, it provides faster query responses, distributes
load, and adds some redundancy. In my experience it usually works pretty well.
This is known as anycast.

  Both of these traceroutes are to 204.74.112.1:

traceroute to tld1.ultradns.net (204.74.112.1), 30 hops max, 38 byte packets
1 nnn-7202-fe-0-0-1 (204.42.254.1) 0.515 ms 0.456 ms 0.346 ms
2 d1-0-3-0-21.a00.anarmi01.us.ra.verio.net (209.69.3.33) 6.645 ms 6.678 ms 15.549 ms
3 d3-1-3-0.r01.chcgil01.us.bb.verio.net (129.250.16.22) 15.508 ms 17.321 ms 15.532 ms
4 p16-2-0-0.r01.chcgil06.us.bb.verio.net (129.250.5.70) 14.831 ms 14.712 ms 15.589 ms
5 ge-1-1.a00.chcgil07.us.ra.verio.net (129.250.25.167) 15.397 ms 17.021 ms 15.515 ms
6 fa-2-1.a00.chcgil07.us.ce.verio.net (128.242.186.134) 20.086 ms 16.286 ms 15.528 ms
7 dellfweqch.ultradns.net (204.74.102.2) 15.559 ms !H 14.908 ms !H 21.551 ms !H

Type escape sequence to abort.
Tracing the route to tld1.ultradns.net (204.74.112.1)
  1 cernh4.cern.ch (192.65.185.4) 0 msec 0 msec 0 msec
  2 ar3-chicago-stm4.cern.ch (192.65.184.25) 120 msec 120 msec 120 msec
  3 ar1-chicago-ge0.cern.ch (192.65.184.226) 120 msec 120 msec 124 msec
  4 NYC-gw14.NYC.US.net.DTAG.DE (62.156.138.190) [AS 3320] 116 msec 120 msec 116 msec
  5 LINX-gw13.LON.GB.NET.DTAG.DE (62.154.5.38) [AS 3320] 116 msec 116 msec 116 msec
  6 62.156.138.10 [AS 3320] 116 msec 116 msec 116 msec
  7 ge-1-1.a01.londen03.uk.ra.verio.net (213.130.47.67) [AS 2914] 116 msec 116 msec 116 msec
  8 UltraDNS-0.a01.londen03.uk.ra.verio.net (213.130.48.38) [AS 2914] 116 msec 116 msec 120 msec
  9 dellfwabld.ultradns.net (204.74.106.2) [AS 12008] !H !H !H

  But clearly tld1.ultradns.net, were it a single host, could
not reside in both London and Chicago. If you try your traceroutes from
several different networks around the world (try http://www.traceroute.org
for starters), it should become quite clear that there is a plethora of
tld[12].ultradns.net's out there.

  Perhaps a brief description of anycast is in order for the NANOG
FAQ? It seems to come up periodically.

  --msa

: ultradns uses the power of anycast to have these ips that appear
: to be on close subnets in geographically diverse locations.

Oh, that's brilliant. How nice of them to defeat the concept of redundancy
by limiting me to only two of their servers for a gTLD.

VeriSign might be doing some loathsome things lately, but at least my named
has several more servers than just two to choose from.

: could you provide some more technical details, other than
: your postulations that they have two machines on
: network-wise close subnets and that is the problem?

I tracerouted to both IPs from two different locations in the USA; both took
the same route before hitting !H from an ultradns.com rDNS machine. And
both servers for that route were completely unresponsive from both
locations during the outage period.

: I didn't have a problem with .org this evening, and I've asked
: around and others don't seem to have noticed anything either. It would be
: more helpful if you told us your source prefix, and which filter you're
: hitting when you traceroute to tld[12].ultradns.net.

12 dellfweqab.ultradns.net (204.74.103.2) 24.811 ms !H

Same machine for both tld1 and tld2, seen through XO last night and Verio
this morning, from source prefix 66.56.64.0/19 (as well as two others, one
on the US east coast and one in US midwest which I cannot name publicly).

So as far as my machine's source address is concerned, even if the servers
are anycast, there are still only two servers, which sit behind a single point
of failure. Anycasting doesn't help me one whit if there are only two servers
for my named to choose from and both of the ones visible from my location
are down (even though their routes are up) -- this is IMNSHO irresponsible
for a gTLD operator.

If anycast is the game, there should be many more than just two addresses to
choose from. Ideally, there should be about six, and certain servers should
deliberately *not* advertise certain anycast networks, in an overlap mesh
that allows one point to fail while others still respond. For instance:

USA server location A advertises networks 1, 3, 5;
USA server location B advertises networks 1, 3, 4;
Europe server location A advertises networks 3, 4, 6;
Asia server location A advertises networks 2, 5, 6;

or something to that effect.

: ultradns uses the power of anycast to have these ips that appear
: to be on close subnets in geographically diverse locations.

Oh, that's brilliant. How nice of them to defeat the concept of redundancy
by limiting me to only two of their servers for a gTLD.

Well, for me one goes to London and the other to Washington, so from where I'm sitting there is geographical diversity.

But having only two servers and anycasting those is nonsense. That means I have to depend on BGP to get to the closest server, which is something BGP is really bad at. DNS servers, on the other hand, track RTTs for query responses and really *know* which server is the fastest rather than guessing based on third-hand routing information.

And more importantly: if there is only a single working server, everyone in the world is able to reach it. With anycast it can easily happen that you're transported to the nearest dead server.

For the root anycasting makes some sense as it's impossible to add more real root servers because of packet size limitations (but I hope they're smart enough to keep some non-anycasted root servers around), but with only two servers listed, org really doesn't need anycasting.

the same route before hitting !H from an ultradns.com rDNS machine.

What's up with those host unreachables anyway? I wouldn't be surprised if there are IP stacks that cache these. Then if you do a ping to one of the org servers and get a host unreachable, any subsequent DNS queries will be dropped locally as well. There are other ICMP responses that make much more sense for what they're trying to do.

: ultradns uses the power of anycast to have these ips that appear
: to be on close subnets in geographically diverse locations.

Oh, that's brilliant. How nice of them to defeat the concept of redundancy
by limiting me to only two of their servers for a gTLD.

VeriSign might be doing some loathsome things lately, but at least my named
has several more servers than just two to choose from.

hmm not convinced about your argument here.. perhaps you need to read more about
how this works.

they have two distinct servers by IP, globally they have N x clusters. i'm sure
each instance is actually more than a single linux PeeCee

within the cluster there will be health monitoring; in the event of a total loss
of the named daemons on all machines at that site, that cluster will (should)
withdraw its anycast routes and thus send you to one of their many other
systems. this also applies in the event of an external problem (ddos, upstream
failure etc)

so even if what i see as tld1 now goes into failure.. for the minute or two it
takes to go offline and reconverge on another tld1 i still see tld2
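
a rough sketch of that sort of check (purely illustrative -- i've no idea what
they actually run), assuming the routing daemon on the box only originates the
anycast prefix while the address is configured on loopback:

  #!/bin/sh
  # poll the local nameserver; pull the anycast address if it stops answering
  if dig @127.0.0.1 org. soa +norecurse +time=2 +tries=1 >/dev/null 2>&1; then
      ip addr add 204.74.112.1/32 dev lo 2>/dev/null    # service up: announce
  else
      ip addr del 204.74.112.1/32 dev lo 2>/dev/null    # service dead: withdraw
  fi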

: could you provide some more technical details, other than
: your postulations that they have two machines on
: network-wise close subnets and that is the problem?

I tracerouted to both IPs from two different locations in the USA; both took
the same route before hitting !H from an ultradns.com rDNS machine. And
both servers for that route were completely unresponsive from both
locations during the outage period.

maybe it's firewalled? I see !H too but my .org is working fine for dns resolving

Steve

: BIND does it but what about Microsoft cache/forwarder? At RIPE 45 (you
: were there), a talk by people at CAIDA showed that A.root-servers.net
: received twice as much traffic as the other root name servers since it
: is just the first one listed...

There's an easy fix to that particular situation: Make the first (or first
two) listed servers anycast, and the rest unicast.

That gains the distributed nature of anycast to deal with crap like this,
while keeping the ability for DNS servers to find one that is *up*.

: they have two distinct servers by IP, globally they have N x clusters. i'm sure
: each instance is actualyl more than a single linux PeeCee

Doesn't matter if it's a cluster at each location. The fact remains that
there were only two IP addresses visible to my named, and both were
unresponsive to my machine. As far as my machine was concerned, .ORG was
down for the count, no matter how many servers that were "invisible" to me
were still working.

: so even if what i see as tld1 now goes into failure.. for the minute or two it
: takes to go offline and reconverge on antoerh tld1 i still see tld2

The routes I saw never went offline, as far as I could tell -- and from my
location tld1 and tld2 have the *same* route and end up at the same physical
connectivity location. So much for redundancy.

: maybe it's firewalled? I see !H too but my .org is working fine for dns resolving

Yes, it is firewalled. I was pointing out that the route is the same for
tld1 and tld2 for me, all the way up to the firewall.

: > There's an easy fix to that particular situation: Make the first (or first
: > two) listed servers anycast, and the rest unicast.
:
: It would require a central management (or at least a central
: oversight) of the root name servers and I do not believe there is one:
: each root name server anycasts at will, without a leader saying ("A
: and B will anycast, the others will stay unicast").

Well, that's something for the root server operators to think about and
discuss amongst themselves. I know several of them are reading this list,
and may be reading this thread. :sunglasses:

Still doesn't help .ORG, which is 100% anycast and thus has no DNS-based
redundancy (see my experience elsewhere in this thread).

: > Still doesn't help .ORG, which is 100% anycast and thus has no DNS-based
: > redundancy
:
: Wrong since there are two IP addresses. They may fail at the same time
: (which apparently happened to you) but there is a least an element of
: non-BGP redundancy (I'm not aware of any TLD running with only one
: anycasted name server, although it would still have some redundancy).

Okay, let me qualify then:

"...no DNS-based redundancy when both routes point to the same place and
that particular place goes off the air while its BGP advertisements stay
up and running..."

DNS-based redundancy typically implies going to different servers at
different locations, regardless of what BGP says. The fact that anycast
took me to the same place for both IPs, and that same place went down all at
once, means that I was effectively looking at a single point of failure with
no way for DNS to pick another place to look.

Todd Vierling wrote:

Yes, it is firewalled. I was pointing out that the route is the same for
tld1 and tld2 for me, all the way up to the firewall.

Please post traceroutes from your location, as well as from the two
locations in different parts of the USA (You said earlier: "I
tracerouted to both IPs from two different locations in the USA; both
took the same route before hitting !H from an ultradns.com rDNS machine.
")

Then please post the results of sho ip bgp 204.74.112.1 and sho ip bgp
204.74.113.1 from your location.

Thanks

: > There's an easy fix to that particular situation: Make the first (or first
: > two) listed servers anycast, and the rest unicast.
:
: It would require a central management (or at least a central
: oversight) of the root name servers and I do not believe there is one:
: each root name server anycasts at will, without a leader saying ("A
: and B will anycast, the others will stay unicast").

Well, that's something for the root server operators to think about and
discuss amongst themselves. I know several of them are reading this list,
and may be reading this thread. :sunglasses:

Plus, A is verisign so any hopes of cluefulness or working for the community are
fading fast!

Still doesn't help .ORG, which is 100% anycast and thus has no DNS-based
redundancy (see my experience elsewhere in this thread).

It does - there are two! You just mean fewer than 13, as per the root.

What is the maximum number you can fit in a single NS reply for a 3-letter tld
such as .com/.org? (Is it still 13? I'm not familiar with the DNS protocol at
that level)
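
One rough way to eyeball it, I suppose, is to just ask a server and look at
the answer size:

  dig @tld1.ultradns.net org. ns +norecurse

and check the ";; MSG SIZE rcvd:" line at the bottom -- presumably the usual
512-byte UDP limit is the real ceiling (absent EDNS0), but someone who knows
the protocol at that level can correct me.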

Steve