[NANOG] Did Youtube not pay their domain bill?

Did Youtube not pay their domain bill?

% dig @a.gtld-servers.net. ns yotube.com

yotube.com. 2D IN NS ns1.parked.com.
yotube.com. 2D IN NS ns2.parked.com.

Steinar Haug, Nethelp consulting, sthaug@nethelp.no

* sthaug@nethelp.no [Sat 03 May 2008, 15:28 CEST]:

Did Youtube not pay their domain bill?

       ^^

% dig @a.gtld-servers.net. ns yotube.com

                                 ^
Still early, Steinar?

  -- Niels.

yotube.com != youtube.com

sthaug@nethelp.no wrote:

>Did Youtube not pay their domain bill?
       ^^
>
>% dig @a.gtld-servers.net. ns yotube.com
                                 ^
Still early, Steinar?

You're right, clearly insufficient amounts of coffee here...

Steinar Haug, Nethelp consulting, sthaug@nethelp.no

Still down either way...

dns1.sjl.youtube.com. 172800 IN A 208.65.152.201
dns2.sjl.youtube.com. 172800 IN A 208.65.152.137

2182 lesson again, probably. after all, microsoft/hotmail/... being
borked for a day can't happen to me!

randy

Maybe that block is anycasted?

Never mind. I'll go back to bed now.

If they were anycasted, shouldn't they be reachable from _somewhere_ ? Those servers are dead from the 4 corners of the US that I have resources to use for testing.

Brant I. Stevens wrote:

Depends - It doesn't help if the DNS server is dead, but the front-end is still advertising the routes.

It came back to life for me a few moments ago (via Cogent) and it looks like the routing did not change (there is a bunch of 10/8 stuff in the traceroute).

Eric Spaeth wrote:

I received a report from a user at 9:46 EDT that they couldn't access youtube, so at least some users
were affected.

Regards
Marshall

> Depends - It doesn't help if the DNS server is dead, but the front-end
> is still advertising the routes.
>
> It came back to life for me a few moments ago (via Cogent) and it looks
> like the routing did not change (there is a bunch of 10/8 stuff in the
> traceroute).

Looks like it's back here. However, they clearly have more problems.
At the moment, the two name servers that the youtube.com domain is
delegated to,

dns1.sjl.youtube.com. 1H IN A 208.65.152.201
dns2.sjl.youtube.com. 1H IN A 208.65.152.137

reply ok. However, they reply that the youtube.com domain is served by

youtube.com. 1H IN NS dns1.sjl.youtube.com.
youtube.com. 1H IN NS dns2.sjl.youtube.com.
youtube.com. 1H IN NS dns3.sjl.youtube.com.
youtube.com. 1H IN NS sjl-ins2.sjl.youtube.com.

but reply with NXDOMAIN when asked for the A of sjl-ins2.sjl.youtube.com.
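
Easy enough to verify with a query along these lines against one of the
addresses above:

% dig @208.65.152.201 a sjl-ins2.sjl.youtube.com.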

Steinar Haug, Nethelp consulting, sthaug@nethelp.no

David Coulson wrote:

> Depends - It doesn't help if the DNS server is dead, but the front-end is still advertising the routes.

Possibly a good argument for allowing the DNS servers to originate the routes for them...? I've seen configurations where the routes were injected based on link state via a crossover cable, so at least if the whole machine pukes, the route is dropped. But if the resolver or OS itself is just hung, then yeah...

Mike

We did that with our internally anycasted recursors at my former
network. A script withdraws the routes if bind isn't answering. Works
great.

Running Quagga or something similar on the anycasted server to announce the routes is the standard way of setting up anycast. That way, if the server fails completely, the route goes away.

A common improvement on that is to run a script on the server that checks to make sure the name server process is running and responding correctly, and kills BGP if it isn't. That covers cases where named has problems that don't take down the whole server.
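
As a rough illustration, a check script along these lines would do it, assuming the route is announced by Quagga and vtysh is available on the server (the prefix, AS number, and probe name below are all made up for the example):

#!/usr/bin/env python
# Rough sketch only: withdraw the anycast route when the local name server
# stops answering, and re-announce it when it recovers.  Assumes Quagga's
# vtysh is on the path and bgpd is configured with "router bgp 64512".
# The prefix, ASN, and probe name are made up for illustration.

import subprocess
import sys

SERVICE_ROUTE = "192.0.2.53/32"    # hypothetical anycast service prefix
LOCAL_AS = "64512"                 # hypothetical private ASN
PROBE_NAME = "probe.example.com"   # a name the local server must answer for

def named_is_healthy():
    """One dig query against the local name server; healthy only on NOERROR."""
    try:
        out = subprocess.run(
            ["dig", "+time=2", "+tries=1", "@127.0.0.1", PROBE_NAME, "A"],
            capture_output=True, text=True, timeout=5)
    except subprocess.TimeoutExpired:
        return False
    return out.returncode == 0 and "status: NOERROR" in out.stdout

def set_route(announce):
    """Add or remove the bgpd network statement via vtysh (idempotent)."""
    # Depending on bgpd's "network import-check" setting, the prefix may
    # also need to be present in the kernel RIB (e.g. on a loopback).
    stmt = ("" if announce else "no ") + "network " + SERVICE_ROUTE
    subprocess.run(["vtysh", "-c", "configure terminal",
                    "-c", "router bgp " + LOCAL_AS,
                    "-c", stmt], check=False)

if __name__ == "__main__":
    healthy = named_is_healthy()
    set_route(healthy)
    sys.exit(0 if healthy else 1)

Run it from cron every minute or two; multiple probes, retries, or flap damping can be layered on top of the same skeleton.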

The first approach depends heavily on the server failing the right way, if it's going to fail at all. If it loses power or stops all processes, the route goes away and traffic gets redirected elsewhere. If named dies or stops responding, you're stuck. The second method solves a lot of that sort of problem, but there are still conceivable situations where the server could get into a state of partial failure and be unable to withdraw the route. Still, that's probably the best option. Another approach would be to run the monitoring script and BGP process on a separate host that would presumably be healthy even when the name server host is in trouble. That approach has issues too, in that you lose the guarantee that the route will go away if the name server box is turned off.

The right solution is to design the anycast servers to be as sure as possible that the route will go away when you want it gone, but to have multiple non-interdependent anycast clouds in the NS records for each zone. If the local node in one cloud does fail improperly, something will still be responding on the other cloud's IP address.
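
To make that concrete, the delegation might end up looking something like this, with each name backed by its own anycast cloud in a separately routed prefix (the names and addresses here are purely illustrative):

example.com.                  1H IN NS ns-a.cloud-one.example.net.
example.com.                  1H IN NS ns-b.cloud-two.example.org.

ns-a.cloud-one.example.net.   1H IN A  192.0.2.53      ; anycast cloud one
ns-b.cloud-two.example.org.   1H IN A  198.51.100.53   ; anycast cloud two

If a node in cloud one fails without withdrawing its route, resolvers still have a working address to fall back on in cloud two.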

Note that any of these failure scenarios is still preferable to what you get with unicast servers. With unicast, if the server has trouble, the route always stays up, and the traffic always ends up in a black hole.

-Steve

Eric Spaeth wrote:

> If they were anycasted, shouldn't they be reachable from _somewhere_

not if there's a routing problem with the prefix. anycasted prefixes have
a problem analogous to that described in rfc 2182. you need at least two
separately routed prefixes, or you have a single point of failure.

randy

scg@gibbard.org (Steve Gibbard) writes:

> David Coulson wrote:
>> Depends - It doesn't help if the DNS server is dead, but the front-end
>> is still advertising the routes.
>
> Possibly a good argument for allowing the DNS servers to originate the
> routes for them...? I've seen configuration where the routes were

> Running Quagga or something similar on the anycasted server to announce
> the routes is the standard way of setting up anycast. That way, if the
> server fails completely, the route goes away.

that's what joe said to do in <http://www.isc.org/pubs/tn/isc-tn-2004-1.txt>.

> A common improvement on that is to run a script on the server that checks
> to make sure the name server process is running and responding correctly,
> and kills BGP if it isn't. That covers cases where named has problems
> that don't take down the whole server.

in ISC-TN-2004-1 [ibid], appendix D, joe suggests bringing up and down the
interface BIND listens on (which presumes that it's a dedicated loopback
like lo1 whose address is covered by a quagga route advertisement rule).

note that joe's example brings up the interface before starting the name
server program, and brings it down if the name server program exits.
this presumes that the name server will start very quickly, and that while
running, it is healthy. since i've seen name server programs be unhealthy
while running, and/or take a long time to start, i'm now considering an
outboard shell script that runs some kind of DNS query and decides, based
on the result, whether to bring the dedicated loopback interface up or down.
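
a rough sketch of what i have in mind (python here, probe name made up;
assumes the service address lives on lo1 and that quagga advertises
whatever is up on it):

#!/usr/bin/env python
# outboard check: run a dns query against the local name server and bring
# the dedicated loopback (lo1 here, as in ISC-TN-2004-1 appendix D) up or
# down based on the answer.  the probe name is made up for illustration.

import subprocess
import sys

# probe via 127.0.0.1 rather than the lo1 service address, so that a
# healthy server can still be detected (and readmitted) while lo1 is down
PROBE = ["dig", "+time=2", "+tries=1", "@127.0.0.1", "probe.example.com", "A"]

def server_answers():
    try:
        out = subprocess.run(PROBE, capture_output=True, text=True, timeout=5)
    except subprocess.TimeoutExpired:
        return False
    return out.returncode == 0 and "status: NOERROR" in out.stdout

def set_loopback(up):
    # bsd-style ifconfig; on linux substitute "ip link set <ifname> up/down"
    subprocess.run(["ifconfig", "lo1", "up" if up else "down"], check=False)

if __name__ == "__main__":
    ok = server_answers()
    set_loopback(ok)
    sys.exit(0 if ok else 1)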

...
> The right solution is to design the anycast servers to be as sure as
> possible that the route will go away when you want it gone, but to have
> multiple non-interdependent anycast clouds in the NS records for each
> zone. If the local node in one cloud does fail improperly, something will
> still be responding on the other cloud's IP address.

the need for multiple independent anycast clouds is an RFC 2182 topic, but
joe's innovation both in ISC-TN-2004-1 and in his earlier ISC-TN-2003-1 (see
<http://www.isc.org/pubs/tn/isc-tn-2003-1.txt>) is that if each anycast cluster
is really several servers, each using OSPF ECMP, then you can lose a server
and still have that cluster advertising the route upstream, and only when you
lose all servers in a cluster will that route be withdrawn.

> Note that any of these failure scenarios is still preferable to what you
> get with unicast servers. With unicast, if the server has trouble, the
> route always stays up, and the traffic always ends up in a black hole.

here, the real problem is the route staying up, which also blackholes anycast.
the only things DNS anycast universally buys you are DDoS resilience and
hot swap. anything else anycast can do (high availability, low avg. RTT, etc)
can also be engineered using a unicast design, though probably at higher TCO.

> note that joe's example brings up the interface before starting the name
> server program, and brings it down if the name server program exits.
> this presumes that the name server will start very quickly, and that while
> running, it is healthy. since i've seen name server programs be unhealthy
> while running, and/or take a long time to start, i'm now considering an
> outboard shell script that runs some kind of DNS query and decides, based
> on the result, whether to bring the dedicated loopback interface up or down.

All deference to this model; we've all seen these kinds of problems with name servers. We *can* be certain that bringing a loopback interface up or down takes almost no time (with the implied effect on a speaker like Quagga). With a sufficiently deep pool of name servers (it depends on your load), there is *no* reason your monitoring script should *need* to hurry to test this condition: check every 5-10 or even 15 minutes to see if it's eligible to bring up, and more frequently to see if it's eligible to take down. This also reduces oscillation.

This means: bring up or kill off your name server in one cron job (automatically taking the interface down at the end or after a kill), and monitor/talk to the interface in another (the up function, and sometimes the down).
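
As a sketch of the split (the script name, probe name and interface below are placeholders; the point is only the asymmetric timing):

#!/usr/bin/env python
# ns-watch.py -- driven by two cron entries at different rates, e.g.
#   every 15 minutes:  ns-watch.py up     (slow check: eligible to bring up?)
#   every minute:      ns-watch.py down   (fast check: eligible to take down?)
# so a sick name server is pulled quickly but readmitted slowly, which
# damps oscillation.

import subprocess
import sys

def named_answers():
    try:
        out = subprocess.run(["dig", "+time=2", "+tries=1", "@127.0.0.1",
                              "probe.example.com", "A"],
                             capture_output=True, text=True, timeout=5)
        return out.returncode == 0 and "status: NOERROR" in out.stdout
    except subprocess.TimeoutExpired:
        return False

mode = sys.argv[1] if len(sys.argv) > 1 else "down"
if mode == "up" and named_answers():
    subprocess.run(["ifconfig", "lo1", "up"], check=False)    # readmit slowly
elif mode == "down" and not named_answers():
    subprocess.run(["ifconfig", "lo1", "down"], check=False)  # withdraw quickly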

You'll be much happier.

Deepak Jain
AiNET

This is getting into minutia, but using multipath BGP will also accomplish this without having to get the route from OSPF to BGP. This simplifies things a bit, and makes it safer to have the servers and routers under independent control.

But yes, Joe's ISC TechNote is an excellent document, and was a big help in figuring out how to set this up a few years ago.

-Steve

scg@gibbard.org (Steve Gibbard) writes:

> ... if each anycast cluster is really several servers, each using OSPF
> ECMP, then you can lose a server and still have that cluster advertising
> the route upstream, and only when you lose all servers in a cluster will
> that route be withdrawn.

> This is getting into minutia, but using multipath BGP will also accomplish
> this without having to get the route from OSPF to BGP. This simplifies
> things a bit, and makes it safer to have the servers and routers under
> independent control.

i think the minutia is good, especially after a long weekend of layer 9
threads. my limited understanding of multipath bgp is that it's a global
config knob for routers, not a per-peer knob, and that it has disastrous
consequences if the router is also carrying a full table and has many peers.
also, in OSPF, ECMP is not optional, even though most BSD-based software
routers don't implement it yet (since multipath routing is very new). so
we have been using OSPF for this; it just works out better. i dearly do
wish that something like a "service advertisement protocol" existed, that
did what OSPF ECMP does, without a router operator effectively giving every
customer the ability to inject other customers' routes, or default routes.
in that sense, i agree with your "safer... independent control" assertion.

> But yes, Joe's ISC TechNote is an excellent document, and was a big help
> in figuring out how to set this up a few years ago.

and now for something completely different -- where in the interpipes could
a document like that have been published, vs. ISC's web site? the amount
of red tape and delay involved in Usenix or IETF or IEEE or ACM is vastly
more than most smart ops people are willing to put in. where is the light /
middle weight class, or is every organization or person who wants to publish
this kind of thing going to continue to have the exclusive and bad choice of
"blog it, or write an article for ;login:/ACM-Queue/Circle-ID, or write an
academic paper and wait ten months"? isn't this a job for... NANOG?