Anyone got good data comparing the effects on the Net (BGP reachability,
etc.) of this weekend's NYC power outage with the effects of the power outage
late on September 11th?
I'm on a National Academy of Sciences committee looking at how the Internet
fared on 9/11 and we're always in search of good comparative data.
Hello;
To be honest, I did not see any BGP or other routing effects from the
NYC fire (there were problems on Abilene this weekend, but they were due to
a bad router update). My data are presented on
and are fairly coarse-grained, having a 6 hour update cycle.
The 9/11 problems actually came starting on 9/13 and 14 when the battery /
generator power started running out at 25 Broadway. My understanding is that
the biggest problem was the inability to access the facility to refuel.
BGP stability was normal on 9/11. As we know, it was
the telephone network that suffered more, whereas the Internet
remained stable. There might have been some problems
in access because of the flash crowd problem.
Just observe closely the slide in the above link.
It covers the period from 8/1 to 9/26, and
there was a variation of 40-60 prefixes (between August and
September), except on 9/11 (when there were 100 changes).
Only 0.1% of the route table was lost.
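For anyone who wants to sanity-check those figures, here is a rough back-of-the-envelope sketch. It assumes the 100 changes and the 0.1% figure describe the same set of prefixes, which the slide may or may not intend; the function names are made up for illustration, and the implied table size is an inference from the two numbers above, not a measured value.

```python
def implied_table_size(changed_prefixes: int, fraction_lost: float) -> float:
    """Table size implied by a prefix count and the fraction it represents."""
    return changed_prefixes / fraction_lost


def spike_ratio(event_changes: int, baseline_changes: int) -> float:
    """How many times larger the event-day change count is than a baseline day."""
    return event_changes / baseline_changes


# Figures quoted in the post above.
changes_on_911 = 100           # prefix changes on 9/11
fraction_lost = 0.001          # 0.1% of the route table
baseline_low, baseline_high = 40, 60   # normal day-to-day variation

# Assumption: both figures refer to the same prefixes.
print("implied table size:", round(implied_table_size(changes_on_911, fraction_lost)))
print("spike vs. quiet baseline:", spike_ratio(changes_on_911, baseline_low))
print("spike vs. busy baseline:", round(spike_ratio(changes_on_911, baseline_high), 2))
```

In other words, under that assumption the 9/11 figure is only about two times the normal daily variation, against a table on the order of a hundred thousand prefixes.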
BGP was more unstable during Code Red
propagation (http://www.renesys.com/projects/bgp_instability/).
A quick peek at both graphs will make one thing
clear: *BGP is robust enough to withstand any extreme
congestion.*
But the question is: what can be an effective
solution for access congestion on days like 9/11?
I am told they had the fuel, but the "Local 3" union worker who was
watching the gauges on the generator misread the dials, and a human
error caused the generator to run bone dry.
The main generator for the 5th floor apparently ran for a while, but the
radiator became clogged with garbage floating around in the air, and
therefore couldn't cool itself, and overheated. They shut it down to
prevent it from hurting itself.
I have never seen the final root cause (actually direct cause, we know
what the root cause was) report from Telehouse. Although I can understand
why Telehouse wouldn't want to say what happened.
Between replacing water pumps, reports of contamination inside and
outside the cooling system, fuel delivery delays, etc., I'm not certain
there was a single cause. From the outside there seemed to be multiple
events, each with different direct causes.