Netcom Outage (Was: My InfoWorld Column About NANOG)

The news reports about the outage said that somehow numerous external
routes leaked into Netcom's internal backbone routing, and the extra load
set off a chain reaction that took everything down.
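
For illustration only (a generic Python sketch of a safeguard, not
anything Netcom actually ran; the session class and prefix ceiling here
are made up), a per-peer maximum-prefix guard is the kind of mechanism
that keeps a flood of unexpected routes from cascading through an
internal backbone:

    class PeerSession:
        """Hypothetical stand-in for a routing session with one peer."""
        def __init__(self, name):
            self.name = name
            self.up = True

        def shutdown(self, reason):
            self.up = False
            print(f"session to {self.name} shut down: {reason}")

    MAX_PREFIXES = 5000  # assumed per-peer ceiling, purely illustrative

    def accept_routes(session, announced_prefixes):
        """Accept prefixes from a peer until the ceiling is hit, then drop
        the session rather than flood the backbone with extra routes."""
        accepted = []
        for prefix in announced_prefixes:
            if len(accepted) >= MAX_PREFIXES:
                session.shutdown("maximum prefix limit exceeded")
                break
            accepted.append(prefix)
        return accepted

    # A peer that suddenly announces far more routes than expected.
    peer = PeerSession("external-peer-1")
    routes = [f"10.{i // 256}.{i % 256}.0/24" for i in range(20000)]
    kept = accept_routes(peer, routes)
    print(len(kept), "routes accepted; session up:", peer.up)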

Apparently it was mainly confined to Netcom's network. Whether by design
or by dumb luck, I don't know. We currently hang off of AGIS in San
Jose, and for about four hours after Netcom came back, we were up and
down. Couldn't tell from here if it was AGIS, MAE West, or what, or if
Netcom coming back had anything to do with it.

I watched this outage from the periphery, and was completely blown away by
the non-reaction to it. Official statements from Netcom (essentially
confirming Bob's numbers above) were quoted on the Reuters newswire, and on
the front page of the San Jose Mercury News Business section the next day
(although the editor played down the impact of it a little, and mixed a
one-hour AOL email outage into the same story and turned it into "outages
affect online services").

On the other hand, Netcom has said essentially nothing to its subscriber
base about the outage. I've seen only a little mention of it around the
net. Am I looking in the wrong places -- or is there no good way to
communicate about these sorts of things yet? (I've signed up to the outage
discussion list, as Sean suggested.)

My impression is starting to be that most Netcom subscribers didn't really
notice the difference between normal Internet operations and the 13-20 hour
outage, and/or didn't have the diagnostic capability to tell. There were
technically oriented folks who could see that something was going on, but
even for them it was hard to tell what.

I'm wearing two hats for the next set of questions -- the first as
a technical manager for an ISP growing an international backbone, and
the second as someone who's concerned about marketing the Internet
(and my company) to the public.

Can other big parts of the backbone fall down and take 13 (or more) hours
to get back up? Or is the rest of the net engineered more redundantly than
Netcom? Should I build two backbones, each with separate technologies?
Was this a foreshock of the coming Metcalfean Big One, or just lousy
procedures at one of the bigger ISPs?

Inquiring minds want to know. Right now, it appears to be just a few of us
(thankfully?). And now is the time to develop communications and publicity
strategies for this sort of thing -- along with the engineering to
hopefully prevent them.

> Can other big parts of the backbone fall down and take 13 (or more) hours
> to get back up? Or is the rest of the net engineered more redundantly than
> Netcom? Should I build two backbones, each with separate technologies?

Ask NASA how they do it: three redundant systems using two separate
technologies. But then look at NASA's downside and compare it to yours. If
Netcom's customers hardly noticed this, maybe the dialup market doesn't
care. The leased-line market is a whole other story, though; those
customers have the technical expertise to understand your backbone
engineering and may well pay a higher fee to have that redundancy. This
question really tangles marketing and engineering concerns together.

> Was this a foreshock of the coming Metcalfean Big One, or just lousy
> procedures at one of the bigger ISPs?

The bigger they are, the harder they fall. It seems to me that as ISPs and
NSPs get larger, failures will be more spectacular. However, the Big One
depends on the ability of failures to propagate from one ISP/NSP to
another, and I don't think that is very likely, partly because of the
different engineering styles and partly because of the diversity of
technology deployed. You have frame relay backbones, ATM fabrics, DS3
meshes with Cisco nodes, and DS3 meshes with Bay nodes.

Up until Netcom, the most spectacular failures I recall seeing over the
past two years were caused either by NAP congestion or by backhoes. NAP
congestion is partially a management failure to deploy bigger pipes and
routers and to increase the number of NAPs in time to meet the growth in
traffic flow. But it is also self-correcting, as some customers migrate to
NSPs with less congestion and management injects capital into their
infrastructure. It seems to be a well-understood problem.

But to me, backhoes are the most interesting failure mode. For one, I
don't think that backhoe problems can be eliminated, and I think that as
the physical mesh of fibre becomes more finely divided over the geography,
these incidents will increase. I also don't know of anyone taking action
to protect against these events by building geographic redundancy into
their backbones. This may be partly because NSPs often don't have any
idea where the fibres lie, and partly because they want to use a specific
infrastructure, like SPRINT and its railway rights of way. The incident
in the Northeast where a backhoe cut a Wiltel(?) fibre bundle that was
carrying critical DS3s leased by all the NSPs in the region shows how
catastrophic this can be.
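
To make the geographic-redundancy point concrete, here is a small Python
sketch (circuit names and conduit segments are entirely hypothetical) of
the bookkeeping an NSP could do if it knew where its fibres lay: record
the physical conduit segments each circuit rides, and flag primary/backup
pairs that share one, since a single backhoe cut takes out both.

    # Map each logical circuit to the physical conduit segments it rides.
    # Circuit names and conduit segments are entirely hypothetical.
    circuit_paths = {
        "dc-to-nyc-primary": {"conduit-I95-north", "conduit-hudson-crossing"},
        "dc-to-nyc-backup":  {"conduit-rail-west", "conduit-hudson-crossing"},
    }

    def shared_risk(circuit_a, circuit_b):
        """Return the physical segments two circuits have in common; a
        non-empty result means one backhoe cut can take out both."""
        return circuit_paths[circuit_a] & circuit_paths[circuit_b]

    overlap = shared_risk("dc-to-nyc-primary", "dc-to-nyc-backup")
    if overlap:
        print("WARNING: primary and backup share physical segments:", overlap)
    else:
        print("Paths are physically diverse.")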

Michael Dillon                          ISP & Internet Consulting
Memra Software Inc.                     Fax: +1-604-546-3049
http://www.memra.com                    E-mail: michael@memra.com

Having a fully meshed, redundant network should be the goal of any serious
ISP. The only one that claims it with any substance, IMO, is UUNET. We are
trying to build one, and it's not easy. Having redundant links in place
does not guarantee instant failover of traffic. Static routes, IGRP, iBGP,
bridging, RIP-1 vs. RIP-2, etc. are some of the issues we are running
into, as well as an interface that is down but still looks up to the
router. It can be done, but there are so many possible points of failure
and unforeseen scenarios that it is very difficult to construct, and it
certainly takes time to develop.
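
As a rough illustration of the "interface looks up but is really down"
problem (a generic Python sketch, not a description of any particular
network; the next-hop addresses are placeholders from documentation
ranges), actively probing the next hop is one way to trigger a failover
that link status alone would miss:

    import subprocess

    # Hypothetical next hops for a primary and a backup path.
    PRIMARY_NEXT_HOP = "192.0.2.1"
    BACKUP_NEXT_HOP = "198.51.100.1"

    def reachable(host, count=3):
        """Probe a next hop with ping. Interface status alone can lie (a
        link can look 'up' to the router while forwarding is broken), so
        test actual reachability instead. Ping timeout flags vary by
        platform, so only the portable count option is used here."""
        result = subprocess.run(["ping", "-c", str(count), host],
                                capture_output=True)
        return result.returncode == 0

    def choose_next_hop():
        """Prefer the primary path; fall back to the backup if probes fail."""
        if reachable(PRIMARY_NEXT_HOP):
            return PRIMARY_NEXT_HOP
        # In a real deployment, this is where the static route toward the
        # primary would be withdrawn and one toward the backup installed.
        return BACKUP_NEXT_HOP

    if __name__ == "__main__":
        print("active next hop:", choose_next_hop())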

/stb

Michael,

...

> But to me, backhoes are the most interesting failure mode. For one, I
> don't think that backhoe problems can be eliminated

I don't know about total elimination, but we're working on it. Sprint is
currently deploying 4-fiber bi-directional SONET rings that will make
fiber cuts go virtually unnoticed: circuits are switched to a protect
channel within about 50 msec of a failure on the primary path.
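
As a toy model of the ring-protection idea (a simplified Python sketch,
not Sprint's actual equipment behaviour; the node names are hypothetical),
the point is that a circuit whose working span is cut gets switched onto
the protect fibres running the other way around the ring:

    # Nodes on a SONET ring; traffic normally takes the short way around.
    RING = ["A", "B", "C", "D", "E", "F"]

    def spans(path):
        """Adjacent node pairs making up a path."""
        return list(zip(path, path[1:]))

    def working_path(src, dst):
        """Clockwise path from src to dst around the ring."""
        i, j = RING.index(src), RING.index(dst)
        return RING[i:j + 1] if i <= j else RING[i:] + RING[:j + 1]

    def protect_path(src, dst):
        """The other way around the ring (counter-clockwise)."""
        return list(reversed(working_path(dst, src)))

    def route(src, dst, cut_span):
        """Use the working path unless the cut span lies on it."""
        path = working_path(src, dst)
        if cut_span in spans(path) or cut_span[::-1] in spans(path):
            return protect_path(src, dst)   # protection switch
        return path

    # A backhoe cuts the fibre between B and C; the A-to-D circuit rides
    # the protect fibres the long way around instead.
    print(route("A", "D", cut_span=("B", "C")))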

> and I think that as the physical mesh of fibre becomes more finely
> divided over the geography, these incidents will increase. I also don't
> know of anyone taking action to protect against these events by building
> geographic redundancy into their backbones. This may be partly because
> NSPs often don't have any idea where the fibres lie, and partly because
> they want to use a specific infrastructure, like SPRINT and its railway
> rights of way. The incident in the Northeast where a backhoe cut a
> Wiltel(?) fibre bundle that was carrying critical DS3s leased by all the
> NSPs in the region shows how catastrophic this can be.

  Again, the problem may not be totally eliminated. However, we are working
to provide as much physical path diversity as possible.

          Jim Steinhardt
      SprintLink Engineering