NSI bulletin 097-004 | Root Server Problems

Randy Bush writes:

> Despite alarms raised by Network Solutions' quality assurance schemes, at
> approximately 2:30 a.m. (Eastern Time), a system administrator released
> the zone file without regenerating the file and verifying its integrity.

You allow mere humans to affect the process on which the whole net relies?
Oh my gawd! I demand my money back! :-)

Actually, I *do* think this is silly, Randy.

At several of my clients, I have the release process for firm internal
DNS completely automated. The DNS in these firms is generated from
relational databases (in one case it is even an Ingres database!),
which is a nearly identical problem to the one NSI describes.

These are securities trading companies, and without their DNS service
they'd have a complete equipment shutdown, leaving the firms unable to
work. The result would likely be that I'd never get a job again in my
life, so I take considerable care here. We have LOTS of backup servers
and lots of backup network connections for all machines, so believe
me, this isn't a question of worrying hard about something that isn't
a weak link.

In my automated release system, differences are examined between
yesterday's and today's files, and if they are too large (a settable
constant), the files are not released. About a dozen sanity checks are
run as well. Old copies of the database are preserved in case a manual
backout is needed.
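
To make the idea concrete, here is a rough sketch of that kind of
release gate. It is not the actual system described above: the names
(MAX_CHANGED_FRACTION, check_release), the 10% threshold, and the two
sanity checks shown are assumptions chosen purely for illustration.

```python
import difflib

# Hypothetical "settable constant": refuse the release if more than
# this fraction of the previous file's lines changed.
MAX_CHANGED_FRACTION = 0.10


def changed_fraction(old_lines, new_lines):
    """Fraction of lines that differ, relative to the old file's size."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changed = sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )
    return changed / max(len(old_lines), 1)


def sanity_errors(new_lines):
    """Two illustrative sanity checks; a real system would run many more."""
    errors = []
    if not new_lines:
        errors.append("zone file is empty")
    if not any("SOA" in line.split() for line in new_lines):
        errors.append("no SOA record found")
    return errors


def check_release(old_lines, new_lines, max_fraction=MAX_CHANGED_FRACTION):
    """Return (ok, reasons). The file is released only when ok is True."""
    reasons = sanity_errors(new_lines)
    fraction = changed_fraction(old_lines, new_lines)
    if fraction > max_fraction:
        reasons.append(
            f"{fraction:.0%} of lines changed, limit is {max_fraction:.0%}"
        )
    return (not reasons, reasons)
```

When check_release refuses a file, yesterday's file simply stays in
service and a human looks at the listed reasons. A truncated or
half-generated file fails the size-of-diff test even if every
individual record in it is well formed.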

(By the way, all the tests run fast enough that I'd suspect that a
similar system built for the root zones would be more than practical
even given that they are several thousand times larger.)

There has, in a number of years, never been a catastrophic failure --
the sanity checks have always stopped buggy DNS data from being put
out to the clients. We've never needed to back out, and I've never
worried a single night that I'd wake up and find my career was over.

I admit that the problem at NSI is larger by three orders of
magnitude, but essentially the same sort of scripts could be run. If
such scripts were in place at NSI, such failures, which have occurred
multiple times, would never have happened.

Humans CANNOT be trusted with this sort of thing. Humans are
fallible. You can't have humans involved in this sort of release.

It is my professional opinion as an engineer who has built systems
almost exactly like the one described that NSI's excuses about its
multiple failures in running the root zones are a reflection of poor
design and management.

If anyone wants to claim that I don't know what I'm talking about they
are welcome to, but in this instance I know very well what I'm talking
about down to the last detail. As I've said, I'VE BUILT THESE THINGS.


Gosh, Perry. I wish I could be and make things that perfect.