Cisco CRS-1 vs Juniper 1600 vs Huawei NE5000E

How do you engineer around enterprise and ISP recursors that don't honor TTL, instead caching DNS records for a week or more?

A friend of mine was working for a place that performed some service on data (it's not important what; you sent them some data through this really ugly client app that they wrote in-house, and they sent you back something...).

Anyway, for various reasons they needed to move out of their current data-center to a new provider. They had this truly monumental plan for doing this that they had been working on for months -- MS Project printouts that covered entire walls in this huge rainbow of colors, 400 or so pages of plans, etc etc etc -- and it all boiled down to Friday: as soon as the TTL expired, everything would start working in the new place and it would all be transparent to the end users...

Anyway, my friend calls me at like 3 in the morning on Saturday -- they have updated DNS and none of their clients are connecting to the new place... It seems that they have burnt some bridges with the old provider and will be shut off on Saturday evening -- he's really desperate, so I agree to wander over and take a look...

I arrive to find utter confusion -- the CEO is screaming at the CTO, who appears to have decided that the best way to fix things is by getting drunk, random other people are screaming (apparently just for fun), etc.... I manage to get someone to calm down for long enough to explain the summary of the plan to me and run nslookup... Sure enough, the TTL is really low and the new IP is being handed out, etc.

I ask how long it took for the client to fail over during their tests -- "Oh, no, we didn't test like that, we didn't want to impact the current service, so we tested with a different domain and checked how long it took for IE to pick up the change... It was less than 10 minutes..."

We track down one of the developers and talk to him. He explains this long and involved system with the client performing health-checks on the server and reconnecting with exponential back-off, etc etc etc. It's all great -- apart from the fact that he calls gethostbyname() during startup, and then never again....
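To make that concrete, here's a minimal C sketch of the broken pattern and the obvious fix -- the hostnames, ports, and helper names here are hypothetical, not from their actual client:

    #include <netdb.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>

    /* Broken pattern (what their client effectively did): resolve once
     * at startup and cache the address forever. A DNS change is then
     * invisible to the process, no matter how clever the reconnect
     * logic is. */
    static struct sockaddr_storage cached_addr;
    static socklen_t cached_len;

    static int resolve_once(const char *host, const char *port)
    {
        struct addrinfo hints = { 0 }, *res;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host, port, &hints, &res) != 0)
            return -1;
        memcpy(&cached_addr, res->ai_addr, res->ai_addrlen);
        cached_len = res->ai_addrlen;
        freeaddrinfo(res);
        return 0;
    }

    /* The fix: re-resolve on every (re)connect, so the client picks up
     * the new address as soon as its resolver does. */
    static int connect_fresh(const char *host, const char *port)
    {
        struct addrinfo hints = { 0 }, *res, *ai;
        int fd = -1;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host, port, &hints, &res) != 0)
            return -1;
        for (ai = res; ai != NULL; ai = ai->ai_next) {
            fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
            if (fd < 0)
                continue;
            if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
                break;
            close(fd);
            fd = -1;
        }
        freeaddrinfo(res);
        return fd;
    }

    int main(void)
    {
        /* "service.example.net" is a stand-in for their real endpoint. */
        int fd = connect_fresh("service.example.net", "443");
        return fd < 0;
    }

The whole point is that getaddrinfo() sits inside the reconnect path, so a DNS change actually takes effect once the TTL expires.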

This is a *really* common issue....

W


Very interesting. We've all heard and probably all passed along that
little bromide at one time or another. Is it possible that at one time it
was true (even possibly for AOL), but with the rise of CDNs, policies of
not honoring TTLs have fallen by the wayside?

I think you'll still see it in spam zombies, some of which have the DNS
info pre-loaded into them in order to avoid split-horizon anti-spam
techniques.

Not much we can do about that until we get sufficient backbone to deal
with the zombie problem and its software enablers.

Actually, I think the fact Zombies do not honor TTLs is a feature. :-)

Fast-flux my MX records to avoid spam? Throw the spammers'
technology back at 'em!

I changed some MX records in mid-July for a domain. Spam was
still flowing into the old MX hosts until I closed the firewall
25/tcp holes just today. Now just logging those zombies still
banging on the gates.

So, I'd also ask this: do you know it's the recursive server, or is the
behavior that you see related more to the application caching and not
respecting the TTL? (IE, for instance, and its default 30-minute, I
think, TTL.)

How does a CDN tell that the recursive server is doing this, versus the
client app?
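For what it's worth, one rough way to check from the outside is to ask the suspect recursive server for the same record twice, a few seconds apart, and watch whether the returned TTL counts down (and never exceeds the authoritative value). A minimal sketch using libresolv (link with -lresolv; the test name is hypothetical):

    #include <stdio.h>
    #include <unistd.h>
    #include <resolv.h>
    #include <arpa/nameser.h>
    #include <netinet/in.h>

    /* Return the TTL of the first A record in the answer, or -1. */
    static long query_ttl(const char *name)
    {
        unsigned char answer[4096];
        ns_msg msg;
        ns_rr rr;
        int len = res_query(name, ns_c_in, ns_t_a, answer, sizeof answer);
        if (len < 0)
            return -1;
        if (ns_initparse(answer, len, &msg) < 0)
            return -1;
        if (ns_parserr(&msg, ns_s_an, 0, &rr) < 0)
            return -1;
        return (long)ns_rr_ttl(rr);
    }

    int main(void)
    {
        const char *name = "www.example.com"; /* hypothetical test record */
        long t1, t2;
        res_init();
        t1 = query_ttl(name);
        sleep(10);
        t2 = query_ttl(name);
        /* A well-behaved cache hands back t2 roughly 10 lower than t1;
         * a TTL-rewriting cache pins it or inflates it. */
        printf("TTL was %ld, 10s later %ld\n", t1, t2);
        return 0;
    }

This only tells you about the resolver you can query directly, of course; a broken client-side cache behind it is invisible to this kind of probe.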

Date: Fri, 03 Aug 2007 19:47:44 -0300
From: "Giuliano (UOL)" <giulianocm@uol.com.br>
Subject: Re: Cisco CRS-1 vs Juniper 1600 vs Huawei NE5000E

You can use the Foundry XMR box.
It has excellent performance under MPLS, BGP and multicast networks.
But ... I never saw it under extreme conditions with IPv6 ...

If you need sane MTU controls on both L2 and L3, stay *very* *far* away
from Foundry gear... Despite years and years of telling them they need
to allow different MTU settings both at the VLAN and at the VE level,
they still Don't Get It (TM)... :-( And I definitely know I'm not the
only one who's repeatedly asked them about it. :-(

Apart from that, if you need basic IPv4 stuff, i.e. nothing too fancy or
terribly new, they have a very decent platform, with far lower port
costs than C or J. And performance is also very good, especially since
you get L2 and L3, whereas with J you'd need to go with the (very new)
MX960, whose L2 featureset still eludes me, or the proven 6509s (with
beefy sups) from C...

Kind regards,
JP Velders

Of course the CDN wouldn't know or care; it would, however, possibly
lead to that user experiencing negative performance or availability
outside the realm of the CDN's control. I know where we are, we move
things via DNS at least 2xTTL early, usually more, but emergency
situations (e.g. the Taiwan earthquake, fiber cuts, OC192 outages, etc.)
would have serious negative implications for users or servers ignoring
the TTL -- it is sometimes set for a legitimate reason.

I'm sorry, I think you or I miscommunicated... What I was asking was:
how does the CDN or CDN operator know that the recursive server is
ignoring the TTL returned with the RR, as opposed to client-side issues?
Your response didn't really answer that part. I ask because you seem to
have data proving (to some extent) that 'many nameservers ignore TTLs';
I was curious how you'd gathered that, and how you'd know that the
nameservers (aside from querying them directly) were ignoring TTLs on RRs.

Thanks.

* Rodney Joffe:

Do you have any real examples of significant recursive servers doing
this?

nscd in GNU libc has issues related to cache expiry. I'm not sure if
it is general brokenness, or some TTL-related issue. Its use is not
terribly widespread, and it's a host-specific cache only, but there's
a certain installation base.

Thanks Florian. So this looks like a code "feature", not stupid behavior by deployers. I'll make a note of it for when we fingerprint misbehaving systems in the future.

/rlj

nscd does this on many platforms (Solaris, for instance); there's a config
bit in nscd.conf:

positive-time-to-live hosts 3600

that sets a lower bar on TTL in the nscd cache.

(from the manpage for nscd.conf)

     positive-time-to-live cachename value
           Sets the time-to-live for positive entries (successful
           queries) in the specified cache. value is in integer
           seconds. Larger values increase cache hit rates and
           reduce mean response times, but increase problems with
           cache coherence. Note that sites that push (update)
           NIS maps nightly can set the value to be the
           equivalent of 12 hours or more with very good
           performance implications.

This is still a client issue as, hopefully, the cache-resolvers don't
funnel their business through nscd save when applications on them need
lookups... (things like ping/telnet/traceroute/blah)

-Chris
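For reference, a hedged nscd.conf fragment along those lines -- the values are illustrative, not a recommendation:

    # Option 1: don't cache host lookups at all; every lookup
    # goes to the real resolver.
    enable-cache            hosts   no

    # Option 2: keep the cache, but cap positive entries at 60s
    # so a short DNS TTL isn't pinned for an hour.
    positive-time-to-live   hosts   60
    negative-time-to-live   hosts   5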

"Chris L. Morrow" <christopher.morrow@verizonbusiness.com> writes:

> that sets a lower bar on TTL in the nscd cache.
>
> (from the manpage for nscd.conf)
>
>      positive-time-to-live cachename value
>            Sets the time-to-live for positive entries (successful
>            queries) in the specified cache. value is in integer
>            seconds. Larger values increase cache hit rates and
>            reduce mean response times, but increase problems with
>            cache coherence. Note that sites that push (update)
>            NIS maps nightly can set the value to be the
>            equivalent of 12 hours or more with very good
>            performance implications.
>
> This is still a client issue as, hopefully, the cache-resolvers don't
> funnel their business through nscd save when applications on them need
> lookups... (things like ping/telnet/traceroute/blah)

nscd may represent a problem if the application in question is an
http-proxy without its own resolver. There are also a number of
more-or-less broken http-proxies doing their own resolver caching
regardless of the actual TTL.

Such applications represent a problem wrt any DNS-based load balancing,
including CDNs, since they can serve a large number of end-users,
redirecting them to the "wrong" address long after the TTL should have
expired.

Bjørn

"Chris L. Morrow" <christopher.morrow@verizonbusiness.com> writes:

> This is still a client issue as, hopefully, the cache-resolvers don't
> funnel their business through nscd save when applications on them need
> lookups... (things like ping/telnet/traceroute/blah)

nscd may represent a problem if the application in question is a
http-proxy without it's own resolver. There's also a number of
more-or-less broken http-proxies doing their own resolver caching
regardless of actual TTL.

that's fine, that's still a client problem, not a cache-resolver
problem... These devices look 'upstream' for a cache-resolver to do their
dirty work; they just add an extra layer of indirection for the CDN to
figure out (my client is in SFO, my proxy is in IAD, my cache-resolver is
in CHI).

> Such applications represent a problem wrt any DNS-based load balancing,
> including CDNs, since they can serve a large number of end-users,
> redirecting them to the "wrong" address long after the TTL should have
> expired.

Yup, people should be aware of what the systems in their path are doing,
or as was mentioned earlier, have lots of exceptions on the CDN side.