eric.CArroll@acm.ORG (Eric M. Carroll) writes:
> Do we actually need the cooperation of the organizations in question to
Yes and no. It would be fairly 'easy' to become an editor, start Donelan's
Journal Of Internet Disasters, and get a number of noted experts to contribute
articles analyzing failures without the cooperation of the organizations. But
I can predict what the organizations in question would say about such an
effort:
1) Donelan is engaging in FUD to sell his journal.
2) They are making rash assumptions without knowing all the facts.
3) You know the Internet, you can't please everyone. It's just a
small group of people with an axe to grind.
4) It didn't happen. If it did happen, it was minor. If it wasn't
minor, not many people were affected. If many people were
affected, it wasn't as bad as they said. If it was that bad,
we would have known about it. Besides we fixed it, and it
isn't a problem (anymore).
Sure, sometimes a problem breaks through to the public even when the
company tries all those things. Just ask Intel's PR department about
their handling of the Pentium math bug. But that is relatively rare,
and not really the most efficient way to handle problems.
> For large enough failures, the results are obvious and the data is fairly
> clear. Perhaps a first stage of a Disruption Analysis Working Group would
> simply be for a coordinated group to gather the facts, sort through the
> impact, analyze the failure and report recommendations in a public forum.
I'm going to get pedantic. The results may be obvious, but the cause isn't.
I would assert there are a number of large failures where the initial obvious
cause has turned out to be wrong (or only a contributing factor). Was the
triggering fault for the western power grid failure last year caused by a
terrorist bombing or by a tree growing too close to a high-tension line? From
just the results you can't tell the cause. It may have been possible for
an outside group, with no cooperation from the power companies, to have
discovered the blackened tree on the utility right of way. But without the
utility's logs and access to their data, I think it would have been very
difficult for an outside group to analyze the failure. In particular I
think it would have been close to impossible for an outside group to
find the other contributing factors.
This should go on the name-droppers list, but here goes....
What do we know about the events with the name servers:
- f.root-servers.net was not able to transfer a copy of some of
the zone files from a.root-servers.net
- f.root-servers.net became lame for some zones
- tcpdump showed odd AXFR from a.root-servers.net
- [fjk].gtld-servers.net have been reported answering NXDOMAIN for
some valid domains; NSI denies any problem
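One way an outside observer could check the NXDOMAIN reports is to query
the servers directly and compare the response codes. A minimal sketch in
Python (a modern convenience; the wire format is from RFC 1035, and the
function names here are mine, not any standard tool's):

```python
import socket
import struct

def build_query(name, qid=0x1234):
    # DNS header: ID, flags (RD set), QDCOUNT=1, other counts 0
    packet = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # Question section: length-prefixed labels, then QTYPE=A, QCLASS=IN
    for label in name.rstrip(".").split("."):
        packet += bytes([len(label)]) + label.encode("ascii")
    return packet + b"\x00" + struct.pack(">HH", 1, 1)

def rcode(response):
    # Low 4 bits of the flags word: 0 = NOERROR, 3 = NXDOMAIN
    return struct.unpack(">H", response[2:4])[0] & 0x000F

def ask(server, name):
    # Send the query over UDP port 53 and return the response's RCODE
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(5)
    try:
        s.sendto(build_query(name), (server, 53))
        data, _ = s.recvfrom(512)
    finally:
        s.close()
    return rcode(data)
```

Asking each of the gtld servers about a domain known to be registered, and
comparing RCODEs across servers, would show whether some of them really were
answering NXDOMAIN while others answered correctly.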
Other events which may or may not have been related:
- BGP routing bug disrupted connectivity for some backbones in the
- Last month the .GOV domain was missing on a.root-servers.net due
to a 'known bug' affecting zone transfers from GOV-NIC
- Someone has been probing DNS ports for an unknown reason
Things I don't know:
- f.root-servers.net and NSI's servers reacted differently. What
are the differences between them (BIND versions, in-house source
code changes, operating systems/run-time libraries/compilers)?
- how long were the servers unable to transfer the zone? The SOA says
a zone is good for 7 days. Why did they expire/corrupt the old zone
before getting a new copy?
- Routing between ISC and NSI during the period before the
problem was discovered
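The expire behavior in question can be stated precisely: a secondary that
cannot refresh is supposed to keep serving its cached copy until the SOA
expire interval (7 days here) runs out, and only then stop answering
authoritatively. A sketch of that rule (the names are mine, not BIND's):

```python
SECONDS_PER_DAY = 24 * 3600

def zone_servable(now, last_good_refresh, expire=7 * SECONDS_PER_DAY):
    # A secondary may answer from its cached copy of the zone until
    # 'expire' seconds have passed since the last successful transfer.
    return (now - last_good_refresh) < expire

# Transfers failing for, say, two days should therefore be harmless:
# the old data stays servable, and nothing should go lame early.
```

If the servers went lame or served bad data well inside that window, either
the transfers had been failing far longer than anyone noticed, or something
other than the expire timer corrupted the zone.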
Possible explanations:
- Network connectivity was insufficient between NSI and ISC for long
enough the zones timed out (why were other servers affected?)
- Bug in BIND (or an in-house modified version) (why did Vixie's and
NSI's servers return different responses?)
- Bug in a support system (O/S, RTL, Compiler, etc) or its installation
- Operator error (erroneous reports of failure)
- Other malicious activity?