Journal of Internet Disasters

eric.CArroll@acm.ORG (Eric M. Carroll) writes:

Do we actually need the cooperation of the organizations in question to
effect this?

Yes and no. It would be fairly 'easy' to become an editor, start Donelan's
Journal Of Internet Disasters, and get a number of noted experts to
contribute articles analyzing failures without the cooperation of the
organizations. But I can predict what the organizations in question would
say about such an endeavor:

   1) Donelan is engaging in FUD to sell his journal.
   2) They are making rash assumptions without knowing all the facts.
   3) You know the Internet, you can't please everyone. It's just a
      small group of people with an axe to grind.
   4) It didn't happen. If it did happen, it was minor. If it wasn't
      minor, not many people were affected. If many people were
      affected, it wasn't as bad as they said. If it was that bad,
      we would have known about it. Besides, we fixed it, and it
      isn't a problem (anymore).

Sure, sometimes a problem breaks through to the public even when the
company tries all those things. Just ask Intel's PR department about
their handling of the Pentium math bug. But that is relatively rare,
and not really the most efficient way to handle problems.

For large enough failures, the results are obvious and the data is fairly
clear. Perhaps a first stage of a Disruption Analysis Working Group would
simply be for a coordinated group to gather the facts, sort through the
impact, analyze the failure and report recommendations in a public forum.

I'm going to get pedantic. The results may be obvious, but the cause isn't.
I would assert there are a number of large failures where the initial obvious
cause has turned out to be wrong (or only a contributing factor). Was the
triggering fault for the western power grid failure last year caused by a
terrorist bombing, or by a tree growing too close to a high-tension line?
From the results alone you can't tell the cause. It may have been possible for
an outside group, with no cooperation from the power companies, to have
discovered the blackened tree on the utility right of way. But without the
utility's logs and access to their data, I think it would have been very
difficult for an outside group to analyze the failure. In particular I
think it would have been close to impossible for an outside group to
find the other contributing factors.

This should go on the name-droppers list, but here goes....

What do we know about the events with the name servers

   - f.root-servers.net was not able to transfer a copy of some of
  the zone files from a.root-servers.net
   - f.root-servers.net became lame for some zones
   - tcpdump showed odd AXFR from a.root-servers.net
   - [fjk].gtld-servers.net have been reported answering NXDOMAIN to
  some valid domains; NSI denies any problem

Other events which may or may not have been related
    - BGP routing bug disrupted connectivity for some backbones in the
  preceding days
    - Last month the .GOV domain was missing on a.root-servers.net due
  to a 'known bug' affecting zone transfers from GOV-NIC
    - Someone has been probing DNS ports for an unknown reason

Things I don't know
    - f.root-servers.net and NSI's servers reacted differently. What
  are the differences between them (BIND versions, in-house source
  code changes, operating systems/run-time libraries/compilers)
    - how long were servers unable to transfer the zone? The SOA says
  a zone is good for 7 days. Why did they expire/corrupt the old zone
  before getting a new copy?
    - Routing between ISC and NSI in the period before the
  problem was discovered

Theories
    - Network connectivity was insufficient between NSI and ISC for long
  enough that the zones timed out (why were other servers affected?)
    - Bug in BIND (or an in-house modified version) (why did Vixie's and
  NSI's servers return different responses?)
    - Bug in a support system (O/S, RTL, Compiler, etc) or its installation
    - Operator error (erroneous reports of failure)
    - Other malicious activity?

This should go on the name-droppers list, but here goes....

these days it's not clear whether namedroppers is an operations list
or a protocol list or still both. i think nanog is a fine forum for this:

What do we know about the events with the name servers

   - f.root-servers.net was not able to transfer a copy of some of
  the zone files from a.root-servers.net
   - f.root-servers.net became lame for some zones

just COM.

   - tcpdump showed odd AXFR from a.root-servers.net

just a lot of missed/retransmitted ACKs.

   - [fjk].gtld-servers.net have been reported answering NXDOMAIN to
  some valid domains; NSI denies any problem

the nanog archives include some dig results that are hard for NSI to deny.
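
For concreteness, that kind of check can be scripted as well as run with
dig by hand. The sketch below is only an illustration and assumes the
dnspython library; the server addresses and the test domain are
placeholders, not the actual hosts or queries from the archives.

    import dns.message
    import dns.query
    import dns.rcode
    import dns.rdatatype

    # Placeholder addresses -- substitute the real [fjk].gtld-servers.net IPs.
    GTLD_SERVERS = {
        "f.gtld-servers.net": "192.0.2.1",
        "j.gtld-servers.net": "192.0.2.2",
        "k.gtld-servers.net": "192.0.2.3",
    }
    # A name known to be validly registered under COM (placeholder).
    KNOWN_GOOD_NAME = "example.com."

    def check_servers():
        """Ask each server directly; flag any that claims the name doesn't exist."""
        for name, address in GTLD_SERVERS.items():
            query = dns.message.make_query(KNOWN_GOOD_NAME, dns.rdatatype.NS)
            try:
                response = dns.query.udp(query, address, timeout=5)
            except Exception as exc:
                print(f"{name}: no answer ({exc})")
                continue
            rcode = response.rcode()
            if rcode == dns.rcode.NXDOMAIN:
                print(f"{name}: NXDOMAIN for known-good name {KNOWN_GOOD_NAME}")
            else:
                print(f"{name}: rcode {dns.rcode.to_text(rcode)}")

    if __name__ == "__main__":
        check_servers()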

Other events which may or may not have been related
    - BGP routing bug disrupted connectivity for some backbones in the
  preceding days

this turned up a performance problem in BIND's retry code, btw, but was
not otherwise related to the COM lossage of yesterday (as far as i know).

    - Last month the .GOV domain was missing on a.root-servers.net due
  to a 'known bug' affecting zone transfers from GOV-NIC

different bug. that one causes truncated zone transfers; the secondary
zone files on [fjk].gtld-servers.net yesterday were not truncated and it
just took a restart to make them stop behaving badly.

    - Someone has been probing DNS ports for an unknown reason

Things I don't know
    - f.root-servers.net and NSI's servers reacted differently. What
  are the differences between them (BIND versions, in-house source
  code changes, operating systems/run-time libraries/compilers)

they are completely different systems (solaris vs. digital unix) running
the same (unmodified) bind 8.1.2 sources, which had completely different
failure modes for completely different reasons.

    - how long were servers unable to transfer the zone? The SOA says
  a zone is good for 7 days. Why did they expire/corrupt the old zone
  before getting a new copy?

damn good question. i'll look into that. shouldn't've happened.
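
For reference, the expire timer in question is carried in the zone's SOA
record, and a secondary that still holds an unexpired copy should keep
answering from it. A minimal sketch of reading those timers, assuming a
recent dnspython (the zone name and default resolver are only
illustrative):

    import dns.resolver

    def soa_expire(zone: str = "com.") -> None:
        """Print the SOA timers for a zone, in particular the expire interval."""
        answer = dns.resolver.resolve(zone, "SOA")
        soa = answer[0]
        print(f"{zone} refresh={soa.refresh}s retry={soa.retry}s "
              f"expire={soa.expire}s (~{soa.expire / 86400:.1f} days)")

    if __name__ == "__main__":
        soa_expire()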

    - Routing between ISC and NSI in the period before the
  problem was discovered

there was asymmetry (they reached me via bbnplanet, i reached them via
alternet). they are now preferring alternet to reach me, so we have
better path symmetry now. but their first mile is still congested and
i am still retransmitting a lot of ACKs.

Theories
    - Network connectivity was insufficient between NSI and ISC for long
  enough that the zones timed out (why were other servers affected?)

other servers are more conservative, and had switched to manual daily FTP
of the COM zone longer ago than F has done. (with manual daily FTP you
get the advantages of gzip, and of the pretense of "zone master" status
while you manually retry after timeouts. AXFR needs those properties.)
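
As a rough illustration, the kind of daily fetch described could be
automated along these lines; this is only a sketch, with a hypothetical
FTP host and path standing in for wherever a gzipped copy of the zone
might be published, not anyone's actual procedure:

    import ftplib
    import gzip
    import os
    import shutil
    import time

    FTP_HOST = "ftp.example.net"       # hypothetical publisher of the zone
    REMOTE_PATH = "zones/com.zone.gz"  # hypothetical path to the gzipped zone
    LOCAL_ZONE = "com.zone"

    def fetch_zone(retries: int = 5, delay: int = 300) -> bool:
        """Fetch and unpack the zone, retrying on failure; never leave a partial file."""
        for attempt in range(1, retries + 1):
            try:
                with ftplib.FTP(FTP_HOST, timeout=60) as ftp:
                    ftp.login()  # anonymous login
                    with open(LOCAL_ZONE + ".gz.tmp", "wb") as out:
                        ftp.retrbinary(f"RETR {REMOTE_PATH}", out.write)
                # Decompress to a temporary name, then rename atomically so the
                # nameserver never loads a truncated zone file.
                with gzip.open(LOCAL_ZONE + ".gz.tmp", "rb") as src, \
                        open(LOCAL_ZONE + ".new", "wb") as dst:
                    shutil.copyfileobj(src, dst)
                os.replace(LOCAL_ZONE + ".new", LOCAL_ZONE)
                return True
            except ftplib.all_errors as exc:
                print(f"attempt {attempt} failed: {exc}; retrying in {delay}s")
                time.sleep(delay)
        return False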

    - Bug in BIND (or an in-house modified version) (why did Vixie's and
  NSI's servers return different responses?)

there's definitely a bug in BIND if [fjk].gtld-servers.net were able to
return different answers after restarts with no new zone transfers. (i'm
sitting here wishing i had core dumps.)

    - Bug in a support system (O/S, RTL, Compiler, etc) or its installation
    - Operator error (erroneous reports of failure)
    - Other malicious activity?

i think there were a goodly number of procedural errors.

I think this is an operationally relevant thread, so let me continue to tilt
at windmills here. I like your ideas (as usual) and I think there is an
executable idea here. I firmly believe something in this area is much, much
better than nothing, which is what we have now.

So, here are four communal options:

- constitute a mailing list for failure analysis, everyone pitches in with
or without assistance. The simple act of analyzing the options and possible
failure modes is of value (note the reaction from Paul to your mail
message - thus value is demonstrated!)
- constitute a closed mailing list, by invitation only. Ask vendors for
cooperation, and publish the results with the names removed to protect the
guilty and ensure their cooperation. Publish their names if cooperation is
refused.
- create a moderated digest list, IFAIL-D, and take input from anywhere,
but vet it through a panel of experts for analysis and publication. That's
basically your newsletter.
- create a real working group that meets and travels, and visits the vendors
in person. Perhaps they get badges eventually, or cool NTSB-like jackets ;-)

So, I will jump into the pool if you will. Let's pick a model and try... The
point is, there is a lot of expertise available. I think starting small,
involving experts, being professional, using volunteers and growing as
required is a model that has worked many times in Internet Land for some big
pieces of infrastructure. In other words, we need to prove the value before
people will pay for it. Have we acquired so much operational grey hair we
have forgotten our roots? (sorry for the pun).

Regards,

Eric Carroll

Yes and no. It would be fairly 'easy' to become an editor, start Donelan's
Journal Of Internet Disasters, and get a number of noted experts to
contribute articles analyzing failures without the cooperation of the
organizations. But I can predict what the organizations in question would
say about such an endeavor:

Your predictions are wrong; however, they would be true if this journal
were edited by someone other than yourself. You have a significant amount
of credibility in the industry, and if you did edit such a journal, it
would be taken seriously.

I'm going to get pedantic. The results may be obvious, but the cause isn't.
I would assert there are a number of large failures where the initial obvious
cause has turned out to be wrong (or only a contributing factor).

This is a prime example of why your credibility in regard to disaster and
disruption analysis is so high. You not only have the background knowledge
to understand it and the willingness to research the things you don't
know, but you also have the right sceptical attitude that does not stop
questioning the situation just because a nice answer has arrived.

difficult for an outside group to analyze the failure. In particular I
think it would have been close to impossible for an outside group to
find the other contributing factors.

As an editor of a network outages journal, you wouldn't be expected to do
all the investigative legwork yourself. But I think that your evenhanded
treatment of the events would tend to draw out the internal investigation
reports of the companies involved. I think that you could run such a
journal in a way that would largely evade the negative effects that people
fear from disclosure, because of your ability to draw parallels with
disaster situations in other industries.

    - Last month the .GOV domain was missing on a.root-servers.net due
  to a 'known bug' affecting zone transfers from GOV-NIC
    - Someone has been probing DNS ports for an unknown reason

- it is known that various individuals flood the Internic with packets
related to attempts to suck down the whois database, one item at a time,
and/or to detect when a specific domain name goes off hold and becomes
available for re-registration

- pathshow indicated that the Internic circuit over which AXFR was being
attempted was congested.

    - f.root-servers.net and NSI's servers reacted differently. What
  are the differences between them (BIND versions, in-house source
  code changes, operating systems/run-time libraries/compilers)

Whatever was causing the Internic link to be congested could have
disrupted NSI's server. Wasn't Vixie's server acting properly by answering
lame for the zones it could not retrieve? It seems like all the problems
revolve around NSI's server and network. Vixie's problems were merely a
symptom. On the other hand, I would classify the inability of AXFR to
transfer the zone as a weakness in BIND that could be addressed.
Additionally, since it is known that zone transfers require a certain
amount of bandwidth, Vixie could improve his operations by implementing a
system that monitors the bandwidth with pathshow prior to initiating AXFR.
He could also monitor the progress of the AXFR and alarm if it was
taking too long. This would have allowed a fallback to FTP sooner, and
operationally such a fallback might even be something that could be
automated. Of course, none of this means Vixie was at fault, and I'd argue
that NSI is at fault for not being able to detect the problem
sooner and not being able to swap in a backup server sooner. Vixie knows
that he runs one of the 13 root nameservers. But NSI knows that they run
the one and only master root nameserver, which puts more responsibility
on them.
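
The alarm idea is straightforward to sketch. Assuming the dnspython
library, and with a placeholder master address and threshold rather than
ISC's or NSI's actual configuration, a watchdog around the transfer might
look something like this:

    import time

    import dns.query
    import dns.zone

    MASTER = "192.0.2.10"   # placeholder for the zone master's address
    ZONE = "com."
    MAX_SECONDS = 1800      # alarm if the transfer takes more than 30 minutes

    def timed_axfr() -> None:
        """Attempt the AXFR with a hard time limit; alarm if it doesn't finish."""
        start = time.monotonic()
        try:
            zone = dns.zone.from_xfr(
                dns.query.xfr(MASTER, ZONE, timeout=60, lifetime=MAX_SECONDS))
        except Exception as exc:
            elapsed = time.monotonic() - start
            # This is where a page to an operator, or an automated fallback to
            # the FTP copy, would be triggered.
            print(f"ALARM: AXFR of {ZONE} failed after {elapsed:.0f}s: {exc}")
            return
        elapsed = time.monotonic() - start
        print(f"AXFR of {ZONE} completed in {elapsed:.0f}s ({len(zone.nodes)} names)")

    if __name__ == "__main__":
        timed_axfr()

Timing the transfer itself catches the same congestion a bandwidth probe
would, just later, and gives a natural hook for the automated fallback.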

There have been no even remotely logical claims that f.root-servers.net
caused any problems at all. If Paul's server had been working correctly
and had transferred the zone properly, the impact of NSI's screwups would
have been almost exactly the same.

What you are discussing is a problem, but not "the" problem and not a
problem that causes a significant impact over the short term.

It is important to keep that clear in messages; NSI has already spread
enough lies, so any confusion about the issue isn't wise.

In fact, the fact that at least three of NSI's servers were giving false
NXDOMAINs isn't really the issue either, from nanog's perspective. It
needs to be figured out, and it is a major problem in BIND, etc., but it
isn't necessarily something they could have or should have been able to
prevent before it happened: that is very difficult to figure out from the
outside, and I can certainly imagine situations where, despite the best
operations anywhere, they could not predict such things.

The big issue that needs to be addressed is why the heck it took NSI over
two hours after they were notified to fix it, especially in the middle of
the day, and why they didn't have any automated system that detected it and
notified them in minutes. Whatever the exact problem was is important and
needs to be addressed, but addressing each instance is pointless without
knowing why NSI's operations procedures are so flawed. In fact, they are
so flawed that the VP of engineering either had no idea what was going on
or chose to lie.
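
To make "notified them in minutes" concrete, a monitor along these lines
would have done it. This is only a sketch, assuming the dnspython library;
the addresses, canary names, and mail settings are placeholders, and none
of it reflects whatever NSI actually runs:

    import smtplib
    import time
    from email.message import EmailMessage

    import dns.message
    import dns.query
    import dns.rcode
    import dns.rdatatype

    SERVERS = {"f.gtld-servers.net": "192.0.2.1"}    # placeholder address
    CANARY_NAMES = ["example.com.", "example.net."]  # names that must resolve
    ALERT_TO = "noc@example.net"                     # placeholder contact

    def alert(text: str) -> None:
        """Mail an alert through a local MTA (placeholder addresses)."""
        msg = EmailMessage()
        msg["Subject"] = "gtld-server check failed"
        msg["From"] = "dns-monitor@example.net"
        msg["To"] = ALERT_TO
        msg.set_content(text)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    def run_monitor(interval: int = 60) -> None:
        """Probe every server for every canary name once a minute, forever."""
        while True:
            for server, address in SERVERS.items():
                for name in CANARY_NAMES:
                    query = dns.message.make_query(name, dns.rdatatype.NS)
                    try:
                        response = dns.query.udp(query, address, timeout=5)
                    except Exception as exc:
                        alert(f"{server}: no answer for {name}: {exc}")
                        continue
                    if response.rcode() == dns.rcode.NXDOMAIN:
                        alert(f"{server}: NXDOMAIN for known-good name {name}")
            time.sleep(interval)

    if __name__ == "__main__":
        run_monitor()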

The problem is that NSI currently has no accountability (not even to their
customers), and doesn't even make a token effort to follow up on their
screwups.

The organization that controls the root nameservers should have one of the
best operations departments, not one of the worst.

What you are discussing is a problem, but not "the" problem and not a
problem that causes a significant impact over the short term.

What I'm getting at is that on a network you cannot simply point the
finger at the bad guys, NSI, and say that since they screwed up, everything
is their fault. Everyone who interacts with NSI's servers also has a
responsibility to arrange their operations so that an NSI problem cannot
cause cascading failures. Especially so since NSI is known to regularly
screw up like this.

That means that the other root nameserver operators have a responsibility
to limit the damage that NSI can do to them. You will also note that some
ISPs attempt to mitigate the damage by running their own root zones, which
allows them to fix things without waiting for the NSI bureaucracy to get
around to fixing their servers.

It is important to keep that clear in messages; NSI has already spread
enough lies, so any confusion about the issue isn't wise.

Nevertheless, there are other lessons to be learned from the incident
besides the fact that NSI's internal operations are a mess.

The big issue that needs to be addressed is why the heck it took NSI over
two hours after they were notified to fix it,

Precisely! Part of NSI's problem is that they simply do not have the
skilled professionals available to build a proper robust architecture.
This is evident not only in their nameserver operations but also in the
domain name registry. But NSI also suffers from the bureaucratic
disease that does not give front-line people the authority and the
responsibility to fix things fast.

The organization that controls the root nameservers should have one of the
best operations departments, not one of the worst.

The solution to this problem is to take this operational responsibility
away from NSI, and then to run it totally transparently so that if a
problem like this occurred there would be no veil of secrecy. In such an
important infrastructure operation, every detail of the event logs,
complete with names and dates and times and the content of internal email
messages, should all be open to the public. This would be a very positive
outcome of the new ICANN and would, in fact, be a resurrection of the way
things used to be done on the net, where everyone shared their data openly
and jointly figured out how to do things better.

This thread has mostly looked at the details of the recent problem, and
hasn't responded much to Sean's original points. A very notable exception
is Eric's thoughtful consideration of the approaches that might be taken
for a discussion forum.

The note about Sean's credibility obviously is also relevant, but I'll note
that the recent DNS controversy has made it clear that no amount of
personal credibility is enough to withstand a sustained and forceful attack
by a diligent and well-funded opponent. Hence, the effort under
discussion here needs a group behind it, not just an individual. Which
is not to say that having it led by a highly credible individual isn't
extremely helpful.

In considering the possible modes that Eric outlines, the two questions I
found myself asking were about openness and control. Is it important that
the general public be kept out of the analysis and reporting process, as is
done for CERT, or is it important (or at least acceptable) that the public
be present? With respect to control, should the discussion be subject to
control by an authority or should it be free-form?

- constitute a mailing list for failure analysis, everyone pitches in with
or without assistance. The simple act of analyzing the options and possible
failure modes is of value (note the reaction from Paul to your mail
message - thus value is demonstrated!)

This is the open/no-control model. It is the best for encouraging a broad
range of opinion. It is the worst for permitting ad hominems, spin control
efforts, etc.

- constitute a closed mailing list, by invitation only. Ask vendors for
cooperation, and publish the results with the names removed to protect the
guilty and ensure their cooperation. Publish their names if cooperation is
refused.

This is probably the best for thoughtful analysis and the worst for
information gathering.

- create a moderated digest list, IFAIL-D, and take input from anywhere,
but vet it through a panel of experts for analysis and publication. That's
basically your newsletter.

Open participation means broad input. Moderation means control over the
emotional and other distractions. It also might be quite a bit of effort for
the moderator...

- create a real working group that meets and travels, and visits the vendors
in person. Perhaps they get badges eventually, or cool NTSB-like jackets ;-)

The most fun for the participants, expensive, and probably not (yet) necessary.

I've biased the analysis, to show which one I personally prefer, but it's
predicated on having a moderator with the time and skill to do the job. On
the other hand, if we take the event detail analysis that has been mostly
going on for this thread, we find that contributions have been thoughtful
and constructive, so that the job of the moderator would have been minimal.

In essence, the moderator introduces a small amount of delay but adds a
safety mechanism in case the tone would otherwise start getting out of hand.

And now that I've said that, there is a question about timeliness. Does
the analysis need to be able to occur in emergency mode, to get things
fixed, or will these only be post hoc efforts?

d/

Sure, we have a responsibility to mitigate problems from up the pipe.

But we as an industry, and as consumers, need to start demanding high
standards of quality from the InterNic and the other organizations that the
Internet depends on - by writing to our congresspeople (they will read
letters, not email, and they will listen if enough people contact them) and
by complaining to the FCC and other involved groups. We are essentially
captive to their screwups, NO MATTER HOW WELL we prepare.

Even the FAA recognizes that no matter how good a flight crew is, if they
get incorrect information from the Tower, any problems that occur are the
Tower's fault!

-Deb

Ahem, I suspect that you're not a pilot. As of the last time I went through
ground school, the pilot has ultimate responsibility for the lives and
welfare of all aboard. This includes the responsibility to tell the tower
to buzz off when it's obvious that they're wrong. Even in controlled
airspace, where control is heavily biased towards the tower, the pilot has
*final* responsibility. If someone dies, the pilot had better have been
there first, or have a good reason to explain why they're still around. If
you're wrong, you had better be dead wrong.

Actually, I find your analogy very apropos. To our customers, their
internal systems are most important; after that come their inter-facility
connections; *then* they consider their connections to everyone else. Some
of us handle their internal and inter-facility connections as well as
their external connections. When the InterNIC fargs up, who do they call?
Are they going to understand? The bottom line is that they don't care.
Failure costs them $cash$.
It is our job to shield our customers from this as much as possible. Those
of us who do will have satisfied customers, versus those of us who rely on
excuses.

Kind of like an OEM for the Internet? 8) (office of emergency management)
I actually had an idea like this some time ago and went ahead and
registered oem-i.org; maybe it needs to be reinstated?

The Internet is global, rather than subject strictly to US controls.

Much more important is that the Internet has done quite well, over the
last 10 years, with LESS government involvement, not more.

That does not mean no oversight. It merely means finding non-governmental
methods of achieving the oversight. I suggest, for example, that a
competent and careful effort of the type Sean is suggesting would go a
long, long way towards helping things, by providing public and clear
explanations of problems. Yes, it is possible that some ISPs would choose
to ignore the public disclosure, but let's worry about that problem after
we give simple, public discourse a try. Such an approach has a good track
record on the Internet.

d/

Furthermore...I'd be willing to bet that if the FCC or whoever
  got a lot of complaints, they'd form an oversight committee.

  Why not just form one ourselves, making sure that it's answerable
  to the needs of the Internet community?

Let me begin this by saying emphatically that I am _NOT_ proposing to put
any Internet operational procedures under the Official Standards
Infrastructure. (When I did OSI at COS, I always told my boss that I
understood what the OSI infrastructure was, but that I was unclear about
the superstructure. Ian said that he would be as good an example as any.)

In non-IP networking, there are certainly things that go beyond national
boundaries and are operationally critical. RF spectrum allocations, for
example, are a different matter than ISO or ITU protocol development. True,
the UN has no internal means of sending a missile at a transmitter on an
illegal frequency.

To me, there's a reasonably close analogy between the Internet routing
space and the global RF spectrum. There are collisions if people use the
same frequency/prefix (I am _not_ going to get into line-of-sight issues).
Is someone familiar enough with the procedures for dealing with
inappropriate spectrum use to see if there might be any parallels for
Internet operations?

Howard,

who remembers a presentation at AFCEA where an Israeli Air Force general
was asked about the best electronic countermeasures they had found to use
against Soviet-bloc radar. He suggested that very few jammers, range gate
stealers, etc., really compared to a laser-guided bomb down the feed horn
of the antenna.

Yes, but that's done pending review and investigation. If you can document
it properly, you will be exonerated. I happen to know of one instance where
this actually happened, and both the pilot and the ATC guy were suspended
pending review. Both were exonerated because the problem was equipment
failure. Otherwise one of those suspensions might have become permanent.

In the airspace, the FAA tends to err on the side of safety rather than
justice. If there is a doubt, everyone involved gets grounded until the
doubt is removed. It sounds harsh, but that is a main reason that the
airspace has the safety record that it has.

An aircraft's main function is to get you high enough off the ground such
that the fall will kill you.
    - Pilot's credo.