History: lengthy outages

Sean_Donelan · January 25, 2001, 7:45am

That's a bit unfair.

There have been a number of lengthy outages.

AS7007 router configuration problem: April 25 1997 lasted 2 hours
AOL (ANS router configuration problem): Aug 7 1996 lasted 19 hours
AT&T frame-relay switch errors: April 13-14 1998 lasted 26 hours
BBN Stanford power failure: October 11, 1996 lasted about 12 hours (off and on)
NETCOM router configuration error: June 20 1996 lasted 13 hours
Sprint database problems: September 3 1996 lasted 5 hours
NSI root server corruption (operational error): July 16 1997 lasted 4 hours
PacBell configuration problems: January 30-31 1997 lasted 48 hours
UUNET frame-relay problems: July 1 1997 lasted over 24 hours
UUNET cisco/bay router problems: November 7 1997 lasted 5 hours
Worldcom frame-relay switch errors: August 1999 lasted 9-10 days

If I skipped your favorite provider, have no fear. More than likely
they've also had (or will have) a lengthy outage at one time or another.
One interesting thing I found is that providers seem to have substantial
outages shortly after announcing "100% Uptime."

But somehow the net continues to stumble forward.

Vijay_Gill1 · January 25, 2001, 8:07am

Nov 7/8 1998.

/vijay "remembering entirely too well" gill

Clay_Fiske · January 25, 2001, 8:39am

Your point isn't lost on me, but I think there are a couple of
distinctions to make here. I do freely admit, however, that I'm
not familiar with all of the outages you listed.

1. I think it might be prudent to weed out the 2-5 hour outages here.
While that's still an excessively long time to recover from a change that
should have been monitored and tested properly in the first place
(and still probably cause for firing in some shops), I can at least
conceive of it taking this amount of time. Too long, yes, but not
quite in the jaw-dropping category.

2. Several of the remaining group that I'm familiar with (namely the
AT&T and Netcom outages) involved problems which cascaded out to the
entire network, and therefore could not be simply undone by backing
out the change on that router/switch/etc. I've been through an outage
or two which didn't gain such notoriety but still took several hours
after identifying the problem just to go out and reboot enough boxes
to settle the network down. In the MS DNS case, I don't feel the same
point applies. By most accounts, it appears the issue was a change to
a single router affecting one or two subnets, and fixing the issue
was simply a matter of backing out said change.

Again, don't get me wrong. Your point is taken, and I've certainly
caused and felt my share of pain with (sometimes unnecessarily)
lengthy outages. I agree with Randy about cutting the folks involved
some slack, since we've all been and will be there at some point. I
do see it the other way as well. This amount of time to resolve a
local issue caused by a procedurally implemented change (I'll give
them the benefit of the doubt) on a critical network device is
surely due some scrutiny.

-c

Bandy_Rush1 · January 25, 2001, 10:20am

> UUNET cisco/bay router problems: November 7 1997 lasted 5 hours

> /vijay "remembering entirely too well" gill

if this is 129/8, then i believe that major providers were down for almost
two days.

randy