How do you engineer around enterprise and ISP recursors that don't honor TTL, instead caching DNS records for a week or more?
A friend of mine was working for a place that performed some service on data (not important what; you send them some data through this really ugly client app that they wrote in-house, and they send you back something...).
Anyway, for various reasons they needed to move out of their current data-center to a new provider. They had this truly monumental plan for doing this that they had been working on for months -- MS Project printouts that covered entire walls in this huge rainbow of colors, 400 or so pages of plans, etc etc etc -- and it all boiled down to updating DNS on Friday. As soon as the TTL expired everything would start working in the new place and it would all be transparent to the end users...
Anyway, my friend calls me at like 3 in the morning on Saturday -- they have updated DNS and none of their clients are connecting to the new place... It seems that they have burnt some bridges with the old provider and will be shut off on Saturday evening -- he's really desperate, so I agree to wander over and take a look...
I arrive to find utter confusion -- the CEO is screaming at the CTO, who appears to have decided that the best way to fix things is by getting drunk, random other people are screaming (apparently just for fun), etc.... I manage to get someone to calm down for long enough to explain the summary of the plan to me and run nslookup.. Sure enough the TTL is really low and the new IP is being handed out, etc.
I ask how long it took for the client to fail over during their tests -- "Oh, no, we didn't test like that, we didn't want to impact the current service, so we tested with a different domain and checked how long it took for IE to pick up the change... It was less than 10 minutes..."
We track down one of the developers and talk to him. He explains this long and involved system with the client performing health-checks on the server and reconnecting with exponential back-off, etc etc etc. It's all great -- apart from the fact that he calls gethostbyname() during startup, and then never again....
This is a *really* common issue....
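For what it's worth, the fix on the client side is usually small: do the lookup inside the reconnect loop rather than once at startup. Here is a rough sketch in Python of what that might look like. It is not the code from the story; the function name, the retry count, and the delays are all made up for illustration, and the resolve/connect/sleep parameters exist only so the thing can be exercised without a network.

```python
import socket
import time


def connect_with_reresolve(host, port, attempts=5, base_delay=1.0,
                           resolve=socket.getaddrinfo,
                           connect=socket.create_connection,
                           sleep=time.sleep):
    """Connect to host:port, looking the name up again before every attempt.

    Hypothetical helper: the name and the defaults are illustrative only.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            # The lookup lives inside the loop, so a changed DNS record is
            # picked up on the next retry instead of never.
            infos = resolve(host, port, type=socket.SOCK_STREAM)
            for _family, _socktype, _proto, _canonname, sockaddr in infos:
                try:
                    # sockaddr[:2] is (address, port) for IPv4 and IPv6 alike.
                    return connect(sockaddr[:2], timeout=5)
                except OSError as exc:
                    last_error = exc
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            last_error = exc
        if attempt < attempts - 1:
            # Exponential back-off between attempts: 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** attempt)
    raise ConnectionError(f"could not reach {host}:{port}") from last_error
```

Something like `sock = connect_with_reresolve("service.example", 443)` would then be called every time the health-check decides the server is gone. Note that this only helps with the client's own stale copy of the address; any caching done by the OS or by a recursor upstream is a separate problem.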
W