MS explains

In message <Pine.LNX.4.30.0101242038380.5951-100000@anime.net>, Dan Hollis writes:

At 6:30 p.m. Tuesday (PST), a Microsoft technician made a configuration
change to the routers on the edge of Microsoft's Domain Name Server
network.
At approximately 5 p.m. Wednesday (PST), Microsoft removed the changes to
the router configuration and immediately saw a massive improvement in the
DNS network.

So basically, it took Microsoft 23 hours to fix a router configuration.

There's a story (possibly apocryphal) about the time that Steinmetz was
called in as a consultant to repair problems with some massive piece of
electrical machinery. After poking around for a while, some staring,
and much thinking, he adjusted one screw, and solved the problem. He
then proceeded to write out a bill for $1000.

The company was outraged. "$1000 for adjusting one screw? You're
crazy!"

Steinmetz agreed, took back the bill, and tore it up. He then wrote
out a new bill:

  Adjust one screw $1
  Knowing which screw to adjust $999

Remember the other half of Jim Duncan's post from last night:

  There were clearly some mistakes made, but it is
  also the case that there were a _lot_ of different
  things going on that contributed to the problem or
  complicated its resolution.

He *worked* this problem; this is a first-hand statement, not
conjecture by those who weren't there.

Let me put forth a blatant generalization of my own: *all* major
failures are due to complex causes. The proof is simple: if you're
small and hence presumably clueless (the "mom and pop" ISPs another
poster sneeringly referred to), your problems don't cause major failures
for the rest of the net. If you're big (and hence presumably clueful),
you solve the simple problems quickly and they don't become major
failures. Finding and fixing *the* root cause is hard, when you're in
the midst of a swamp full of other alligators, and you don't know which
one is (currently) biting you in the rear.

I'd love to see a detailed description of what went wrong, and I hope
that those in the know will be allowed to post it or present it in
Atlanta. But I'm willing to wager that it wasn't just (a) a single
router configuration change, (b) brain-damage in Microsoft's DNS code,
(c) malicious activity aimed at Microsoft, (d) RAMEN-induced
misbehavior, or (e) any other single cause.

    --Steve Bellovin

I'd love to see a detailed description of what went wrong, and I hope
that those in the know will be allowed to post it or present it in
Atlanta.

was it not don knuth's turing award lecture which consisted of a detailed
analysis of some recent bugs in his code?

[ On Thursday, January 25, 2001 at 09:30:19 (-0500), Steven M. Bellovin wrote: ]

Subject: Re: MS explains

I'd love to see a detailed description of what went wrong, and I hope
that those in the know will be allowed to post it or present it in
Atlanta. But I'm willing to wager that it wasn't just (a) a single
router configuration change, (b) brain-damage in Microsoft's DNS code,
(c) malicious activity aimed at Microsoft, (d) RAMEN-induced
misbehavior, or (e) any other single cause.

But there was a single root cause that made the problem possible in the
first place.... Analysis of the actual event will no doubt be
interesting to some, but for all their users out there on the Internet
all that matters is that proper deployment of their servers would never
have allowed the situation to occur in the first place.