Problems with NS*.worldnic.com

I saw some mention of this in a previous thread. Is anyone else still experiencing problems? We're seeing general slowness, and responses coming back with the truncation (TC) bit set, forcing resolvers to retry over TCP.
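
For anyone who wants to check from their own vantage point, a rough recipe with BIND 9's dig (the server and zone names below are placeholders; substitute one of the NS*.worldnic.com hosts and a zone it actually serves):

    dig @ns1.worldnic.com example.com A        # watch for "tc" in the flags line
    dig +tcp @ns1.worldnic.com example.com A   # force TCP to test 53/tcp directly

If the UDP answer comes back truncated, dig retries over TCP on its own, so a TCP timeout shows up as a long stall on the first query.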

We have a few servers with Interland in Miami. Today, for around 1 hour and
15 minutes, DNS traffic over TCP was timing out.

HTTP was very slow for domains with backup DNS servers in Europe, but other
domains whose DNS was hosted only within Interland were not resolving at all.

I only noticed that traffic was not getting through from here in Saudi
Arabia; it appeared to resolve fine from the United States.

I do not know if this is related to the worldnic DNS problem, but I believe
Interland outsources its DNS to VeriSign.

aljuhani@riyadmail.com

something *very* strange is going on. the worldnic servers have
been giving delayed or no results for days now. and nsi is hoping
we and the wsj/nyt won't notice.

i don't think this

    roam.psg.com:/usr/home/randy> doc -p -w worldnic.net
    Doc-2.1.4: doc -p -w worldnic.net
    Doc-2.1.4: Starting test of worldnic.net. parent is net.
    Doc-2.1.4: Test date - Mon Apr 25 14:20:45 HST 2005
    ;; res_nsend: Protocol not supported
    DIGERR (UNKNOWN): dig @a.gtld-servers.net. for NS of worldnic.net. failed
    ;; res_nsend: Protocol not supported
    DIGERR (UNKNOWN): dig @b.gtld-servers.net. for NS of worldnic.net. failed

is the worldnic problem, but could be. but it is a problem. (i
generally ignore b root issues).

but it's probably time for us all to dump symptoms here and figure
it out as a community, as the dog with the bone ain't 'fessing up.

randy

a.gtld-servers.net and b.gtld-servers.net have AAAA records. Some
applications and stacks try the v6 address first if it's available and
will appear to hang if you don't have v6 connectivity. That may very
well be what's happening here.
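
One quick way to rule that in or out, assuming your copy of dig is new enough to have the -4 switch, is to pin the query to IPv4 transport and see whether the hang goes away:

    dig -4 @a.gtld-servers.net worldnic.net NS

If that returns promptly while the unpinned query times out, the v6-first behavior is the likely culprit.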

Matt

Matt Larson wrote:

> a.gtld-servers.net and b.gtld-servers.net have AAAA records. Some
> applications and stacks try the v6 address first if it's available and
> will appear to hang if you don't have v6 connectivity. That may very
> well be what's happening here.

Are the AAAA records for a & b.gtld-servers.net new?

Randy, and others with this issue...

> something *very* strange is going on. the worldnic servers have
> been giving delayed or no results for days now. and nsi is hoping
> we and the wsj/nyt won't notice.
>
> i don't think this
>
>     roam.psg.com:/usr/home/randy> doc -p -w worldnic.net
>     Doc-2.1.4: doc -p -w worldnic.net
>     Doc-2.1.4: Starting test of worldnic.net. parent is net.
>     Doc-2.1.4: Test date - Mon Apr 25 14:20:45 HST 2005
>     ;; res_nsend: Protocol not supported
>     DIGERR (UNKNOWN): dig @a.gtld-servers.net. for NS of worldnic.net. failed
>     ;; res_nsend: Protocol not supported
>     DIGERR (UNKNOWN): dig @b.gtld-servers.net. for NS of worldnic.net. failed
>
> is the worldnic problem, but could be. but it is a problem. (i
> generally ignore b root issues).
>
> but it's probably time for us all to dump symptoms here and figure
> it out as a community, as the dog with the bone ain't 'fessing up.

I spent some time two months ago chasing this down with the same two
gtld-servers.net records, on my mac.

The culprit is dig.

On any system that is both ipv4- and ipv6-enabled, *even if there is no ipv6
connectivity*, dig - which draws no distinction between ipv6 and ipv4 -
attempts to connect using the ipv6 address when both an ipv6 and an ipv4
address are available. When it fails to connect, it simply gives up. It does
not do the "better" thing, which is to try another address for the same
hostname, in this case the ipv4 address. If you point dig at the ipv4
address of a.gtld-servers.net or b.gtld-servers.net, you will get your answer.
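
In other words, fetch the v4 address through a working resolver and then query it directly. A minimal illustration (192.5.6.30 was a.gtld-servers.net's A record when I looked; verify it yourself before relying on it):

    dig a.gtld-servers.net A           # fetch the v4 address via your resolver
    dig @192.5.6.30 worldnic.net NS    # then query the server directly over v4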

I am not sure whether the correct solution is to "fix" dig so that it tries
ipv4, or to "fix" the OS on a dual-stack-capable system so that it disables
the ipv6 side when there is no ipv6 connectivity. I suspect the first is
appropriate, because there are obviously internal processes that may validly
want to use ipv6 even though there is no external ipv6 connection. I also
suspect the same thing would occur for a hostname with multiple ipv4
addresses in a round-robin configuration where the "first" address was
unreachable: the behavior would be the same as trying an ipv6 address on a
system that has no ipv6 connectivity.

But I have not taken the time to test. I did ask the maintainers of dig to
look at this, and perhaps provide a solution.
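
For what it's worth, the "better" behavior is what a getaddrinfo()-based client can do: walk the whole result list and fall back from the unreachable ipv6 address to the ipv4 one. A minimal sketch in C, assuming a POSIX dual-stack system (connect_any is an illustrative helper, not anything from dig itself):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netdb.h>
    #include <string.h>
    #include <unistd.h>

    /* Try every address for host:port; return a connected socket or -1. */
    int connect_any(const char *host, const char *port)
    {
        struct addrinfo hints, *res, *ai;
        int s = -1;

        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_UNSPEC;        /* ask for both v6 and v4 */
        hints.ai_socktype = SOCK_STREAM;

        if (getaddrinfo(host, port, &hints, &res) != 0)
            return -1;

        /* Walk the whole list: if the v6 address fails to connect,
           fall through to the v4 one instead of giving up. */
        for (ai = res; ai != NULL; ai = ai->ai_next) {
            s = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
            if (s < 0)
                continue;
            if (connect(s, ai->ai_addr, ai->ai_addrlen) == 0)
                break;                      /* connected */
            close(s);
            s = -1;
        }
        freeaddrinfo(res);
        return s;
    }

The failure mode described above amounts to stopping at the first connect() error instead of continuing the loop.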

Rodney Joffe
CenterGate Research Group, LLC
http://www.centergate.com
"Technology so advanced, even WE don't understand it"(R)

I'd say fix the resolver to not try v6 where there is no v6 connectivity

-srs

The problem is that you *could* have local/campuswide ipv6 connectivity,
but not have an IPv6 connection to the outside world. So my system comes
up, it sees a Router Advertisement, it can get to other IPv6 systems that
are 3-4 hops away.

So how is it supposed to "know" that it doesn't have an ipv6 connection?

Presumably the same way it "knows" it doesn't have an ipv4 connection when
your OC-moby to the outside world falls over, but it's still perfectly able to
talk to the entire rest of your corporate network....

> So how is it supposed to "know" that it doesn't have an ipv6 connection?

in my case, because
  o no interfaces have v6 addresses
  o v6 stack is not present
  o ...

it should also not use smoke signals, analog voice phone, ...

the chances of a box having a v6 connection to *anything* today
is low, and should not be a reason to *break* v4 services.

randy

Perhaps a solution is a switch that specifically marks ipv6 dns resolution as preferable to ipv4, or the other way around. This could
live in resolv.conf or nsswitch.conf. Something like:

/etc/nsswitch.conf
...
hosts: files dns
dns-resolver: AAAA [NOTFOUND=return] A6 A

Note: here NOTFOUND is true only for NXDOMAIN, not for NODATA

OR

/etc/resolv.conf
search example.com
protocol ipv6 ipv4

> something *very* strange is going on. the worldnic servers have
> been giving delayed or no results for days now. and nsi is hoping
> we and the wsj/nyt won't notice.

I agree 100%.

> but it's probably time for us all to dump symptoms here and figure
> it out as a community, as the dog with the bone ain't 'fessing up.
>
> randy

I'll bite.

I couldn't resolve the ns*.worldnic.com names until I finally bit the bullet and unblocked port 53 TCP from my DNS server. Then it worked fine (after a few tries). I'm using BIND 9.2.4 without the eye-pee-vee-six stuff compiled in. And because I don't want to start something: no discussion about me blocking port 53, ok? I got tired of gobs of log files of script kiddies trying to download my domains 5 years ago... I actually READ my logs... Besides, I had to keep the linux boxes safe from the tyranny of BIND 8 until they got upgraded. :-)

-Jerry

At least on my system, there's an 'options inet6' line that makes it look
for AAAA records, mapping ipv4 addresses into ipv6 if only an A record
is found.
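
On glibc-based systems that is a one-line resolv.conf knob (shown purely as an illustration; the nameserver address is a placeholder):

    # /etc/resolv.conf
    nameserver 192.0.2.53
    options inet6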

Also note that it doesn't fix the problem that's being seen - I might
be able to contact the nameservers listed in resolv.conf via both IPv4
and IPv6 - the fun starts when my nameserver gets an NS entry that contains
an AAAA record, and the nameserver has enough IPv6 connectivity to think
it's worth a try, but you can't get there from here...

Suresh Ramasubramanian wrote:

> I'd say fix the resolver to not try v6 where there is no v6
> connectivity

I'd say fix the broken v6 connectivity.

- Kevin

Ahh, dig. What version? You have to be running the latest at all times these days...so many changes...

In my experience with v6, the problems I have seen come down to three:

1) Broken testing tools. (See change 1610 in the BIND CHANGES file for one.)

2) Broken route policy. (Dastardly ISPs!)

3) Broken OS APIs. (Have we learned nothing since or from Berkeley Sockets?)

#1 - I've had to reevaluate everything I know about debugging since I met IPv6. Now there's an entire alternate universe of failures to consider.

One day I was sitting in RIPE NCC's offices and couldn't 'dig @ns.ripe.net'. So I walked to the ops room and asked, "umm, is your big machine down?" After a good laugh, we figured out that my Mac was trying v6 where v6 wasn't *really* live.

#2 - When I first got real live IPv6 service from a provider, I tried tracerouting to all the machines I knew about - the roots as listed on root-servers.org, the RIPE machines. I'd get about halfway there and fail. I asked for reverse traces from the other side and saw failures at about the same place.

We had to work with ISPs to loosen route policies.

#3 - I have seen all sorts of mistakes involving OS's, OS APIs, and app software APIs. Mapped addresses are mishandled, and having more than one address to try is something apps don't deal with. (Like they've been force-fed one kind of food their entire life, and now have to choose from a menu.)

At NANOG last year I related my problems with ssh (choosing v6 over v4 - and me assigning the same domain name to two machines, one on a v4 net and one on a v6 net). Stupid me...

The biggest problem was that one type of machine kept dropping its statically configured default v6 route. Packets would get in, but they didn't know where to go next. The machine logged all activity as good though...it didn't know.

I posted to NANOG:

> fine (after a few tries). I'm using BIND 9.2.4 without the
> eye-pee-vee-six stuff compiled in. And because I don't want to start
> something: no discussion about me blocking port 53, ok? I got tired of
> gobs of log files of script kiddies trying to download my domains 5
> years ago...

Steve Sobol replied with:

> I'm not going to enter into a long discussion with you. :-)
>
> I'm just curious why you didn't restrict AXFR to certain IPs instead.

And I'm posting back to NANOG:

I did.

And I had router ACLs doing the same thing: allow it to the hosts that needed it, deny it for everyone else. And I did this to ALL my DNS servers.
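
For the curious, the BIND side of that looks roughly like the following; the addresses are placeholders, not my real secondaries:

    // named.conf -- permit AXFR only from the listed secondaries
    acl "xfer-ok" { 192.0.2.1; 192.0.2.2; };
    options {
        allow-transfer { "xfer-ok"; };
    };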

I was getting DoSed one day, somewhere around 2001, and put in the ACLs, immediately expecting them to break things (truncated responses needing TCP, and/or other things that I didn't foresee). Much to my dismay, they broke nothing. Despite me looking for problems, and asking and pleading with my techies to find trouble tickets related to this issue, it didn't happen. I revisited the issue periodically. Every time there was an unexplained DNS issue, I would think "it must be the port 53 block!" but alas, I was disappointed each and every time. I've removed and re-added the ACLs countless times over the years while troubleshooting various DNS issues, but this is the first time that removing them actually solved anything.

See, I *WANTED* there to be a problem with blocking port 53, I *BELIEVED* all the talk that it would cause problems, but that problem never showed up. Over the years, I slowly arrived at the conclusion that all the talk was from people who talked, not from people who were brave enough to try it in a production environment.

Four years later, I was proved "inconclusive": blocking port 53 does break things, for servers that are already (apparently?) broken.

-Jerry

Jerry Pasker wrote:

> Steve Sobol replied with:
>
> I'm not going to enter into a long discussion with you. :-)
>
> I'm just curious why you didn't restrict AXFR to certain IPs instead.
>
> And I'm posting back to NANOG:
>
> I did.
>
> And I had router ACLs doing the same thing: allow it to the hosts that needed it, deny it for everyone else. And I did this to ALL my DNS servers.

What were the router ACLs doing that the DNS server ACLs weren't/couldn't?

This, it seems, was an unfortunate side effect (as I pointed out earlier)
of legacy software and legacy config... if I had to guess.

Steve Sobol allegedly replied to my reply with:

> What were the router ACLs doing that the DNS server ACLs weren't/couldn't?

The ACLs were doing it for the entire server network. Since I prefer my job as a router-rat over everything else I do, I find it easiest to use the biggest hammer available to me when dealing with DoS attacks. One router ACL vs. 10 server ACLs? When I'm under attack I'll take the one router ACL. Then, per their request, I added it to the networks that my colocation clients were on. They were getting 0wn3d regularly, and it really simplified my life in a time when new BIND 8 exploits were coming out every 4 minutes. The router ACLs made my life easier, not harder. Besides, it's my ASN, and I can do what I want. ;-)
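
In rough Cisco IOS terms, the kind of ACL I mean looks like this (addresses, ACL number, and interface are all placeholders):

    access-list 110 remark allow zone transfers only from the known secondary
    access-list 110 permit tcp host 192.0.2.1 198.51.100.0 0.0.0.255 eq domain
    access-list 110 deny tcp any 198.51.100.0 0.0.0.255 eq domain
    access-list 110 permit ip any any
    !
    interface Serial0/0
     ip access-group 110 in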

Christopher L. Morrow allegedly wrote:

> This, it seems, was an unfortunate side effect (as I pointed out earlier)
> of legacy software and legacy config... if I had to guess.

You guess wrong. See the above. And don't pass judgement. (am I being cited for lack of clue? It kind of feels like it) It wasn't a *BAD* thing, it was a *GOOD* thing. It made things better, not worse. I still may go back and re-implement port 53 blocks in the future if I find a good reason to. I know now that it doesn't really cause operational problems. At least not in a smaller ISP environment. Would I want a transit network to block TCP 53? Of course not. But my end customers request those types of services regularly, so I try to provide what they want.

And don't think I'm coming off as all ticked off and defensive. I'm not ticked off, I'm actually enjoying this. As for being defensive? Maybe. I'm trying hard not to be, though. I really can't help myself........I have this lurking fear that I'm being tossed into the "clueless block TCP 53 with an outsourced firewall, and don't know what I'm doing beyond that" group that I so despise. ;-) Especially on this list, full of people that I have so much respect for.

I knew I was opening myself up a little when I decided to "help out" by sharing my worldnic.com experiences, but figured it was for the good of the group, and therefore, worth it. And I still think that.

-Jerry

> Christopher L. Morrow allegedly wrote:
>
> > This, it seems, was an unfortunate side effect (as I pointed out earlier)
> > of legacy software and legacy config... if I had to guess.
>
> You guess wrong. See the above. And don't pass judgement. (am I
> being cited for lack of clue? It kind of feels like it) It wasn't a

no lack of clue meant, just pointing out one possible cause of the acl
usage. I don't think I saw the original reasoning in the original email.

> *BAD* thing, it was a *GOOD* thing. It made things better, not
> worse. I still may go back and re-implement port 53 blocks in the
> future if I find a good reason to. I know now that it doesn't really
> cause operational problems. At least not in a smaller ISP
> environment. Would I want a transit network to block TCP 53? Of
> course not. But my end customers request those types of services
> regularly, so I try to provide what they want.

Sure, this is a form of 'managed security services' and the customer (and
you) agree to that policy change.

> And don't think I'm coming off as all ticked off and defensive. I'm
> not ticked off, I'm actually enjoying this. As for being defensive?
> Maybe. I'm trying hard not to be, though. I really can't help
> myself........I have this lurking fear that I'm being tossed into
> the "clueless block TCP 53 with an outsourced firewall, and don't
> know what I'm doing beyond that" group that I so despise. ;-)
> Especially on this list, full of people that I have so much respect
> for.

either way, it was just one possibility of many for the acl to be there,
nothing more :-)

> good of the group, and therefore, worth it. And I still think that.

excellent, it probably helps Patrick, the world-nic folks and others as
well :-)