Netcom Outage (Was: My InfoWorld Column About NANOG)

Having a fully meshed/redundant network should be the goal of any serious
ISP. The only one that claims it with any substance IMO is UUNET.

The "full mesh" in this case is a figment of imagination. Level-2
mapping of circuits over the same (non-meshed) physical wires
does not make the network a tiny weeny bit more reliable, and the added
complexity brought by such mapping actually makes the system less

Other ISPs who happen to be in a position to control the physical
routing of circuits use IP-level rerouting to attack the problem.
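
To put rough numbers on the difference between logical and physical
redundancy, here is a back-of-the-envelope sketch. The availability
figures (WIRE, CIRCUIT) are made up for illustration, and failures are
assumed to be independent, which real outages often are not.

```python
def series(*avail):
    """Availability of components that must all work."""
    result = 1.0
    for a in avail:
        result *= a
    return result


def parallel(*avail):
    """Availability of redundant components where any one suffices."""
    all_fail = 1.0
    for a in avail:
        all_fail *= (1.0 - a)
    return 1.0 - all_fail


# Hypothetical numbers, chosen only for illustration.
WIRE = 0.999      # one physical span
CIRCUIT = 0.9999  # one logical circuit's own equipment

# Two logical circuits mapped over the same physical wire:
# the wire is a single point of failure in series with both.
shared = series(WIRE, parallel(CIRCUIT, CIRCUIT))

# Two circuits, each riding its own physically separate wire.
diverse = parallel(series(WIRE, CIRCUIT), series(WIRE, CIRCUIT))

print(f"shared wire:  {shared:.6f}")
print(f"diverse wire: {diverse:.6f}")
```

Under these assumptions the shared-wire case can never do better than
the wire itself, no matter how many circuits are mapped onto it.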

IGP rerouting with modern link-state IGPs is sub-second,
so redundancy of interior paths is easy to achieve (particularly
if you use tricks like BGP confederations, which eliminate the
need to recompute iBGP routing in case of IGP changes).
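
For readers who have not looked inside a link-state IGP: after a link
fails, each router reruns a shortest-path-first (Dijkstra) computation
over the updated topology. A minimal sketch, assuming a hypothetical
four-router topology and metrics (the names spf and without_link are
mine, not from any router implementation):

```python
import heapq
from itertools import count


def spf(graph, source):
    """Dijkstra shortest-path-first: return {node: (cost, first_hop)}."""
    best = {}
    tie = count()  # tie-breaker so the heap never compares first hops
    heap = [(0, next(tie), source, None)]
    while heap:
        cost, _, node, first_hop = heapq.heappop(heap)
        if node in best:
            continue  # already settled at an equal or lower cost
        best[node] = (cost, first_hop)
        for neighbor, metric in graph.get(node, {}).items():
            if neighbor not in best:
                hop = neighbor if node == source else first_hop
                heapq.heappush(heap, (cost + metric, next(tie), neighbor, hop))
    return best


def without_link(graph, a, b):
    """Copy of the topology with the a-b link removed in both directions."""
    return {n: {m: c for m, c in nbrs.items() if {n, m} != {a, b}}
            for n, nbrs in graph.items()}


# Hypothetical four-router topology with symmetric link metrics.
topology = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "D": 1},
    "C": {"A": 5, "D": 1},
    "D": {"B": 1, "C": 1},
}

before = spf(topology, "A")
after = spf(without_link(topology, "B", "D"), "A")
print("A->D before failure:", before["D"])
print("A->D after failure: ", after["D"])
```

The computation touches only the interior topology, which is small
compared with a full BGP table.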

The hard part is exterior routing where topology changes require
massive crunching of BGP tables. Multiplying paths actually
makes the problem worse.
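
A toy illustration of why more paths mean more crunching. This is a
sketch only: the table layout, peer names, and the "shortest AS path
wins" rule are simplifying assumptions, not the real BGP decision
process.

```python
def reselect(rib, failed_peer):
    """Drop every path learned from failed_peer and pick new best paths.

    rib maps prefix -> {peer: as_path_length}. Returns (best, examined),
    where examined counts the candidate paths that had to be looked at.
    """
    best, examined = {}, 0
    for prefix, paths in rib.items():
        remaining = {p: n for p, n in paths.items() if p != failed_peer}
        examined += len(remaining)
        if remaining:
            # Shortest AS path wins; peer name breaks ties (assumption).
            best[prefix] = min(remaining, key=lambda p: (remaining[p], p))
    return best, examined


def make_rib(num_prefixes, peers):
    """Hypothetical table: every peer offers every prefix, same length."""
    return {f"prefix-{i}": {peer: 3 for peer in peers}
            for i in range(num_prefixes)}


for peers in (["p1", "p2"], ["p1", "p2", "p3", "p4"]):
    _, examined = reselect(make_rib(1000, peers), "p1")
    print(len(peers), "peers ->", examined, "paths examined")
```

In this toy model the work after one peer failure grows with
prefixes times remaining paths, so adding peers adds work per event.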

We are trying to build one and it's not easy. Having redundant links
in place does not guarantee instant failover of traffic. Static routes, IGRP,
iBGP, bridging, RIPv1 vs. RIPv2, etc. are some of the issues we are running

The Golden Rule of engineering (often forgotten in the US) -- the
simpler the system is, the better it works. The root of many
Internet woes is overly complicated router software; that
complexity appears to be running out of control.

An ISP engineer's nightmare is a Byzantine-mode failure -- when
redundancy does not help because a problem in one place triggers
failures in a lot of other places. Any RISKS reader knows that
software problems in distributed systems often have that
nature. Internet configurations are particularly prone to that,
especially considering that they're in a state of constant flux.
That's why draconian revision controls and a highly skilled backbone
engineering staff (one actually able to understand the global
consequences of all its actions) are vital to the operations of any
serious ISP. Netcom seems to be in the stage of learning that hard

As well as when an interface is down, but actually looks up to the
router, can be done, but there are so many possible points of
failure and unforeseen scenarios, it is very difficult to construct and
certainly takes time to develop.

That example is a perfect illustration of why many backbone engineers
are sceptical about "advanced" level-2 technologies. Even good ol'
Ethernet may have numerous quite interesting ways to fail, if not
built properly (remember ol' MAE-E? :)