BBN outage

Most of the SF Bay Area portion of BBN Planet's network has been offline
since late last night. I've finally heard, via a friend who is a major
customer of theirs, that they are having power problems in Palo Alto.

Now, I haven't heard a peep about this on any of the mailing lists I'm on...
and since knowing about BIG outages like this would make it easier for
me to answer my customers' questions... am I missing something?

Is there a mailing list that I should be on? Or is this just a private
BBN issue that I'd need to call their ops center people about (people who
I'm sure are very, very busy without small network operators calling them)?

-matthew kaufman
matthew@scruz.net

I just spoke to their NOC and was told that a power switch that is
supposed to be able to switch them between 3 different power utilities
failed at 12:30 this morning (Friday). They bought and installed 2
large diesel generators today and are hoping to be back up using the
generators within 30 minutes.

Rob

Perhaps there is a lesson in redundancy here. Based on this account, it
would appear that although they had arranged for 3 power sources, two
mistakes were made. One was that switching between sources relied on a
single piece of hardware, thus negating the redundancy of the 3 sources.
The other was that they did not have alternate types of power source,
i.e. they did not have generators on site but expected that 3 different
utility feeds would be adequate redundancy.
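
To put rough numbers on that point, here is a back-of-the-envelope
availability sketch in Python. The figures are hypothetical, chosen only to
illustrate the argument, not measurements from BBN or Stanford:

# Hypothetical availability figures -- for illustration only.
feed_availability = 0.999      # each utility feed on its own
switch_availability = 0.9995   # the single transfer switch

# Chance that at least one of the three independent feeds is up.
feeds_up = 1 - (1 - feed_availability) ** 3    # ~0.999999999

# Every feed passes through the one switch, so the switch caps the
# availability of the whole arrangement.
overall = switch_availability * feeds_up       # ~0.9995

print("three feeds alone:  %.9f" % feeds_up)
print("with single switch: %.9f" % overall)

The three feeds together look good for roughly nine nines on paper, but the
single switch in front of them pulls the whole site back down to the
switch's own figure.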

Some NOC engineers (or perhaps NOC managers) aren't paranoid enough.

Michael Dillon - ISP & Internet Consulting
Memra Software Inc. - Fax: +1-604-546-3049
http://www.memra.com - E-mail: michael@memra.com

Yep, this happens all the time: the transfer switch dies and then you are
screwed because you don't switch to backup power. Your UPS system then runs
out of power and you are dead. That is why we are building a manual
maintenance wraparound around the UPS AND the transfer switch, so that if
the switch does die you can have someone bypass it manually.
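
As a rough illustration of how little time that leaves, here is a simple
runtime estimate in Python; the battery capacity, efficiency, and load
numbers are made up, not a description of any particular plant:

# Rough UPS runtime estimate -- all numbers here are made up.
battery_capacity_wh = 20000    # usable battery energy, watt-hours
inverter_efficiency = 0.90     # typical-ish inverter efficiency
load_w = 8000                  # equipment load, watts

runtime_hours = battery_capacity_wh * inverter_efficiency / load_w
print("Estimated runtime: %.1f hours (%.0f minutes)"
      % (runtime_hours, runtime_hours * 60))

# With the transfer switch dead, that is roughly how long there is to
# get a manual bypass in place before the room goes dark.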

Nathan Stratton CEO, NetRail, Inc. Tracking the future today!

> I just spoke to their NOC and was told that a power switch that is
> supposed to be able to switch them between 3 different power utilities
> failed at 12:30 this morning (Friday). They bought and installed 2
> large diesel generators today and are hoping to be back up using the
> generators within 30 minutes.

Greetings,

I don't understand how they could have 3 different power systems fail
without a serious operations procedural error.

No one buys more than one generator, and it appears that they didn't
have one at all, so that's ruled out. They have one utility company
and, let's say, two UPS's to feed the dual-power-bus routers. At best
they have a DC plant.

Overloaded UPS's are easy to spot. Weak batteries are found during
routine in-service tests.

So... It looks like they abhorrently ignored their power situation. This is
an "act of stupidity", not an "act of God" as one other person mentioned.

> Yep, this happens all the time: the transfer switch dies and then you are
> screwed because you don't switch to backup power.

In 18 years of telecom management I have never seen a "Transfer Switch" as
a component failure. I have seen quite a few overloaded battery strings,
and UPS's backed by rusty generators.

> Your UPS system then runs
> out of power and you are dead. That is why we are building a manual
> maintenance wraparound around the UPS AND the transfer switch, so that if
> the switch does die you can have someone bypass it manually.

All quality Transfer Switches should have manual activation as a root
function. Even a relatively small 15 kW transfer switch's automatic
functions work by moving a manual switch.

As a rule, power system maintenance for critical equipment should comprise
the following:

UPS- Never exceed 80% load. Replace batteries (good or not) per the
manufacturer's guidelines. Initiate load transfer tests once per quarter.

Batteries- Perform cell maintenance quarterly, to include individual cell
voltage and gravity tests and the surface cleaning of all terminal
hardware.

Generators- Find a good generator maintenance contractor for routine
maintenance needs. Exercise the unit under load each month.
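
To make those intervals concrete, here is a small tracking sketch in Python.
The 80% ceiling and the quarterly/monthly periods come from the checklist
above, but the function names and example figures are purely illustrative:

from datetime import date, timedelta

MAX_UPS_LOAD = 0.80   # never exceed 80% of rated UPS capacity

# Maintenance intervals taken from the checklist above.
INTERVALS = {
    "UPS load-transfer test": timedelta(days=91),         # quarterly
    "Battery cell maintenance": timedelta(days=91),        # quarterly
    "Generator exercise under load": timedelta(days=30),   # monthly
}

def ups_overloaded(load_kw, capacity_kw):
    """Flag a UPS carrying more than 80% of its rated capacity."""
    return load_kw / capacity_kw > MAX_UPS_LOAD

def overdue(task, last_done, today):
    """Return True if a task has gone past its interval."""
    return today - last_done > INTERVALS[task]

# Example: a UPS rated for 10 kW carrying 8.5 kW, and a generator last
# exercised six weeks before the check date.
print(ups_overloaded(8.5, 10.0))                             # True
print(overdue("Generator exercise under load",
              date(1997, 1, 1), date(1997, 2, 15)))          # True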

Regards

Patrick J. Chicas
Email: pjc@unix.off-road.com
URL: http://www.Off-Road.com

Greetings,

> I don't understand how they could have 3 different power systems fail
> without a serious operations procedural error.

They did not, the transfer switch did.

> Yep, this happens all the time: the transfer switch dies and then you are
> screwed because you don't switch to backup power.

> In 18 years of telecom management I have never seen a "Transfer Switch" as
> a component failure. I have seen quite a few overloaded battery strings,
> and UPS's backed by rusty generators.

Well, this is one, and I have only been on this planet 20 years and have
seen two of them be the failure.

> Your UPS system then runs
> out of power and you are dead. That is why we are building a manual
> maintenance wraparound around the UPS AND the transfer switch, so that if
> the switch does die you can have someone bypass it manually.

> All quality Transfer Switches should have manual activation as a root
> function. Even a relatively small 15 kW transfer switch's automatic
> functions work by moving a manual switch.

True, all things can break.

> As a rule, power system maintenance for critical equipment should comprise
> the following:
>
> UPS- Never exceed 80% load. Replace batteries (good or not) per the
> manufacturer's guidelines. Initiate load transfer tests once per quarter.
>
> Batteries- Perform cell maintenance quarterly, to include individual cell
> voltage and gravity tests and the surface cleaning of all terminal
> hardware.

Also, keep them at or around 75 degrees.

> Generators- Find a good generator maintenance contractor for routine
> maintenance needs. Exercise the unit under load each month.

Nathan Stratton CEO, NetRail, Inc. Tracking the future today!

Well, the original outage was caused by a rat gnawing a cable...
There are always layers of complexity, layers of causation, and layers of
blame in a situation like this. You might end up finding that 20 people
were each 5% to blame, but it just so happened that all 20 made the wrong
mistake at the wrong time, causing 100% failure.

But you do have some good advice here regarding power. Thanks.

Michael Dillon - ISP & Internet Consulting
Memra Software Inc. - Fax: +1-604-546-3049
http://www.memra.com - E-mail: michael@memra.com

Greetings

> I don't understand how they could have 3 different power systems fail
> without a serious operations procedural error.

> They did not, the transfer switch did.

I think you have missed my point. Judging by the RISKS message, it appears
that BBN relied blindly on whatever power solution Stanford had in place.
This is an operational failure.

> In 18 years of telecom management I have never seen a "Transfer Switch" as
> a component failure. I have seen quite a few overloaded battery strings,
> and UPS's backed by rusty generators.

> Well, this is one, and I have only been on this planet 20 years and have
> seen two of them be the failure.

I have seen automatic transfer mechanisms fail, but never the manual
portion of the switch.

Regards

Patrick J. Chicas
Email: pjc@unix.off-road.com
URL: http://www.Off-Road.com

Greetings

> Well, the original outage was caused by a rat gnawing a cable...
> There are always layers of complexity, layers of causation, and layers of
> blame in a situation like this. You might end up finding that 20 people
> were each 5% to blame, but it just so happened that all 20 made the wrong
> mistake at the wrong time, causing 100% failure.

And the generator was rusting away next door.

> But you do have some good advice here regarding power. Thanks.

My mind was molded early in my career by the Bell System practice.
Pity me...

Seriously, working in the wireless business, with lots of money, and in the
ISP business, with just enough money to keep pace with growth, I have seen
both sides of the coin. I completely understand the economic constraints of
small to medium ISP's and their struggle to make ends meet. On the other
hand, I believe the large IAP's (almost all publicly traded companies) owe
their subscribers only the best level of performance.

Patrick J. Chicas
Email: pjc@unix.off-road.com
URL: http://www.Off-Road.com

Someone mentioned that they had seen Stanford's generator plant and it was
all dusty. Is it possible that no one knew how the transfer switch worked,
or possibly that no one knew it had a manual override? Is there anyone
close enough to Stanford to check up on this stuff?

Michael Dillon - ISP & Internet Consulting
Memra Software Inc. - Fax: +1-604-546-3049
http://www.memra.com - E-mail: michael@memra.com

The message led me to believe that the generator was not attached to the
transfer switch. Also, almost all transfer switches have a manual handle
on the front that levers the contacts inside the switch box. The automatic
portion is commonly a very, very large relay-type winding that moves the
same contacts.

It would be hard to imagine that the rodent did its thing on the transfer
switch itself, which should be upstream of the building's main AC
switchgear. Referencing the messages so far, I am led to believe that the
main switchgear smoked and the generator was not attached to any piece of
transfer switchgear.

It's also tough to comprehend why BBN didn't upgrade the hub after
purchase. It's just not that much of a capital expense for them to do so.

I'll bet they make changes now, on the order of a dedicated generator and a
UPS or battery plant for their servers, routers, and terminal equipment.

The stuff just isn't that expensive. As an example, I just had
our third Lorain DC-AC 10 kVA inverter installed. The total price, turnkey,
was under $13,000 with AC runs into the equipment areas.

BIG APC UPS's with loads of batteries are not much more, and you can plug
them into the wall.

Regards

Patrick J. Chicas
Email: pjc@unix.off-road.com
URL: http://www.Off-Road.com