Cascading Failures Could Crash the Global Internet

I think the key is that the failures described in the paper
are caused by overload rather than other things -
too much demand for power blows out the generator,
and without it, the grid tries to get the power from the next
nearest generators, which overload and fail, and try to pull an
even larger amount from the _next_ nearest, etc.
So the bit about heterogeneity is probably referring to
the fact that some nodes are bigger or better-connected than others,
and are more likely to blow out a bunch of their neighbors when
they fail and shed a big load.
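
In other words (a toy sketch with made-up capacities and loads, not
numbers from the paper), the mechanism looks something like this: each
failed node sheds its load onto its surviving neighbors, and any
neighbor pushed past capacity fails in turn:

# Toy load-redistribution cascade, assuming each failed node sheds its
# load evenly onto surviving neighbors. Capacities, loads, and the graph
# below are made up purely for illustration.

def cascade(capacity, load, neighbors, first_failure):
    failed = {first_failure}
    frontier = [first_failure]
    while frontier:
        node = frontier.pop()
        alive = [n for n in neighbors[node] if n not in failed]
        if not alive:
            continue
        share = load[node] / len(alive)   # shed load onto survivors
        for n in alive:
            load[n] += share
            if load[n] > capacity[n]:     # survivor is now overloaded too
                failed.add(n)
                frontier.append(n)
    return failed

# A big, well-connected node failing takes more of the graph with it.
capacity = {"A": 10, "B": 6, "C": 6, "D": 6}
load     = {"A": 9,  "B": 5, "C": 5, "D": 5}
neighbors = {"A": ["B", "C", "D"], "B": ["A", "C"],
             "C": ["A", "B", "D"], "D": ["A", "C"]}
print(cascade(capacity, load, neighbors, "A"))   # every node ends up failed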

That's not really how Internet systems usually fail.
Overload can cause problems, and we've seen congestion collapse
in the past, but TCP is usually tuned to discourage it;
when a system is overloaded, well-behaved applications
(which is most of them) back off, gradually or rapidly,
but unless the load is weird enough to blow out
router CPUs or crowd out BGP and OSPF packets,
the network itself usually stays up and running.
If what's failing is an overload of BGP routes or something,
that's different - and sometimes the load on the system shrinks
as components fail, but sometimes that just makes everything
flap all at once, increasing load and delaying convergence.
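
To make the contrast concrete, here's a simplified AIMD-style sketch
(not real TCP code; the capacity and rates are arbitrary) of senders
cutting their rate when they see congestion, so aggregate load falls
back under capacity instead of snowballing onto the next hop:

# Simplified AIMD (additive-increase, multiplicative-decrease) sketch.
# Real TCP is far more involved; the numbers here are illustrative only.

LINK_CAPACITY = 100.0   # arbitrary units

def step(rates):
    """One round: every sender sees loss iff total demand exceeds capacity."""
    congested = sum(rates) > LINK_CAPACITY
    new_rates = []
    for r in rates:
        if congested:
            new_rates.append(r / 2.0)   # multiplicative decrease on loss
        else:
            new_rates.append(r + 1.0)   # additive increase when clear
    return new_rates

rates = [60.0, 55.0, 40.0]              # offered load well over capacity
for round_no in range(10):
    rates = step(rates)
    print(round_no, [round(r, 1) for r in rates],
          "total", round(sum(rates), 1))

The totals oscillate around the link capacity instead of growing and
dragging neighboring links down with them.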

--- "Stewart, William C (Bill), SALES"

If what's failing is an overload of BGP routes or something,
that's different - and sometimes the load on the system shrinks
as components fail, but sometimes that just makes everything
flap all at once, increasing load and delaying convergence.

I seem to recall a massive routing failure in October
which was caused by BGP getting imported into a major
ISP's IGP...

The core ${VENDOR 1} routers were able to handle the
influx of routes, but the edge ${VENDOR 2} routers
could not - so the failure didn't exactly cascade,
but was more of a ripple. However, the reloading of
all of the edge devices increased the BGP instability.

-David Barak
-fully RFC 1925 compliant-

I think the key is that the failures described in the paper
are caused by overload rather than other things -
too much demand for power blows out the generator,
and without it, the grid tries to get the power from the next
nearest generators, which overload and fail, and try to pull an
even larger amount from the _next_ nearest, etc.
So the bit about heterogeneity is probably referring to
the fact that some nodes are bigger or better-connected than others,
and are more likely to blow out a bunch of their neighbors when
they fail and shed a big load.

That's not really how Internet systems usually fail.

A prime example of this theory was the large network I was using back when
IE5 first came out. They had one bad circuit, which overloaded an ATM circuit
at another NAP and caused it to generate bit errors. Shutting down the second
circuit overloaded both MAE circuits, effectively shutting down the network.
However, it required manual intervention to create a full failure; otherwise
TCP would back off to the point of being useless, effectively killing all
connections going that path, but not causing an issue with other paths until
the circuit was manually shut down.

While in theory it was still a cascade failure, it also reflected poor
planning/policy on the part of the network, which was unable to compensate
in case of failure. The information provided may be partially inaccurate; it
is only hearsay concerning the actual outages and the effects of the various
interventions that were tried, with no hard facts. It could be taken as
solely my conjecture and not actual fact.

-Jack

Hello;

A packet-switched network can be engineered against cascading failures
in a way that's hard for a circuit-switched network. Every time you see a
random wait in a protocol, it's a good bet that the protocol writers were
trying to protect against the tight coupling that leads to cascading failures.
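
For instance, a generic retry loop with exponential backoff plus random
jitter (hypothetical code, not taken from any particular protocol) keeps
clients that failed together from retrying in lockstep and re-overloading
whatever just recovered:

import random
import time

# Hypothetical retry loop showing the "random wait" idea: exponential
# backoff with full jitter, so clients that failed at the same instant
# don't all come back at the same instant.

def retry_with_jitter(attempt_fn, max_attempts=5, base_delay=1.0, cap=30.0):
    for attempt in range(max_attempts):
        if attempt_fn():
            return True
        # Wait a random amount up to the exponential bound, decorrelating
        # clients instead of synchronizing them.
        delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
        time.sleep(delay)
    return False

# e.g. retry_with_jitter(lambda: random.random() < 0.3)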

Regards
Marshall Eubanks