San Francisco Power Outage

Just a heads up to anyone on list that PG&E has just sustained a large
outage in San Francisco that has caused a few hiccups (network,
electrical, infrastructural, etc.) around the city.

I've confirmed that customers in 365 Main and parts of telecom 1 have
both sustained brief blackouts. No word yet from 200 Paul.

Anyone in the area that could use a hand with anything, I'll probably
be wrapping up fixes for my stuff soon, and would be glad to help
however I can.

Cheers,
jonathan

Jonathan Lassoff wrote:

Just a heads up to anyone on list that PG&E has just sustained a large
outage in San Francisco that has caused a few hiccups (network,
electrical, infrastructural, etc.) around the city.

I've confirmed that customers in 365 Main and parts of telecom 1 have
both sustained brief blackouts. No word yet from 200 Paul.

Anyone in the area that could use a hand with anything, I'll probably
be wrapping up fixes for my stuff soon, and would be glad to help
however I can.

I have a question: does anyone seriously accept "oh, power trouble" as a reason your servers went offline? Where's the generators? UPS? Testing said combination of UPS and generators? What if it was important? I honestly find it hard to believe anyone runs a facility like that and people actually *pay* for it.

If you do accept this is a good reason for failure, why?

~Seth

Didn't you read? He paid extra for super-reliable power from his
electricity provider...

Adrian

Sad that the little Telcove DC here in Lancaster, PA, that Level3 bought a few months ago, has weekly full-on generator tests where 100% of the load is transferred to the generator, while apparently large DCs that are charging premium rates do not.

Cordially

Patrick Giagnocavo
patrick@zill.net

I'm unable to find a link at the moment, but many moons ago power was lost at the 350 E Cermak Equinix facility in Chicago. At the time, we didn't have production equipment there (only a firewall in a shared colo cage/cabinet). This occurred on a Friday evening and lasted for quite some time into Saturday morning because their generators would start up but would refuse to continue running. I believe the root cause was a problem related to insulation on the power cables somewhere. I understand testing is done frequently, but I'm also aware that if I want full redundancy, I'm going to have two physically separate locations. There are some events you can't plan for, as well as failure modes that aren't easily/quickly resolved.

-brandon

sethm@rollernet.us (Seth Mattinen) writes:

I have a question: does anyone seriously accept "oh, power trouble" as a
reason your servers went offline? Where's the generators? UPS? Testing
said combination of UPS and generators? What if it was important? I
honestly find it hard to believe anyone runs a facility like that and
people actually *pay* for it.

If you do accept this is a good reason for failure, why?

sometimes the problem is in the redundancy gear itself. PAIX lost power
twice during its first five years of operation, and both times it was due
to faulty GFI in the UPS+redundancy gear. which had passed testing during
construction and subsequently, but eventually some component just wore out.

They should have generators running... I can't imagine any good
datacenter not having multiple generators, along with UPS, to keep their
customers' servers online.

-Ray

Perhaps they do. Wouldn't have mattered in this case if the
big-red-button rumor is real. ;)

-Jim P.

We also have weekly tests where 100% of the load for our entire
company is put on the three generators. Everything inside the building
is put onto generator power; this way we can test for faulty UPSes,
ensure the generators are working, and so on. I find it hard to believe
that they don't have a similar setup.

Ray Corbin

rcorbin@hostmysite.com

I am not familiar with the operational details of 365 Main, but I suspect that
they, like most datacenters, probably do have weekly generator and transfer
test procedures.

However, there are lots of things that can go wrong that are not covered by
generators and transfer tests:

It is possible to cascade-fail a power distribution system in a number of
ways. It is possible for someone to connect things out of phase during a
maintenance procedure in such a way that everything is fine until a
transfer occurs, then all hell breaks loose (ever seen what happens
when a large CRAC unit starts trying to run backwards because the
3-phase rotation is out of order?).

There are also things that can go wrong in the transfer process (like
putting the UPS and generators on the bus together some degrees
out of phase).

Most of these things become far more likely and far harder to avoid as
the amount of power and the number of units in the system increases.
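
To put a rough number on the out-of-phase case (illustrative only; the
480 V bus and the phase errors below are numbers I picked for the sake
of example, not anything specific to 365 Main):

import math

# Rough sketch: peak voltage difference between two equal-magnitude AC
# sources that get paralleled with a phase error. All figures assumed.
V_LL = 480.0                      # assumed line-to-line RMS bus voltage
V_PEAK = V_LL * math.sqrt(2)      # peak of the corresponding sine wave

def worst_case_delta_v(phase_error_deg):
    """Peak difference between two sources of equal magnitude paralleled
    phase_error_deg apart: |V1 - V2| = 2 * Vpeak * sin(error / 2)."""
    error = math.radians(phase_error_deg)
    return 2.0 * V_PEAK * math.sin(error / 2.0)

for err in (5, 30, 120):
    print("%3d degrees out -> up to %4.0f V across the tie at closure"
          % (err, worst_case_delta_v(err)))

Even a few degrees of error puts a substantial voltage step (and the
corresponding circulating current) across breakers and windings the
instant the tie closes, which is why the transfer itself is where these
systems tend to bite.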

I'm not defending the situation at 365 Main. I don't have any firsthand
knowledge. I'm just saying that the mere fact that they are dark for
several hours today does not necessarily mean that they don't do
weekly full-on generator tests.

I have no idea what the root cause of today's outage is. I will be
interested in hearing from any credible source as to any actual details,
but I'm betting that right now, any such credible source is a bit busy.

Owen

I have a question: does anyone seriously accept "oh, power trouble"
as a reason your servers went offline? Where's the generators? UPS?
Testing said combination of UPS and generators? What if it was
important? I honestly find it hard to believe anyone runs a
facility like that and people actually *pay* for it.

Sad that the little Telcove DC here in Lancaster, PA, that Level3 bought
a few months ago, has weekly full-on generator tests where 100% of the
load is transferred to the generator, while apparently large DCs that are
charging premium rates do not.

There's graceful startup testing, then there's dark start testing. During a recent dark start test one of the other customers in the facility I'm in found out their Juniper was not even plugged into their batteries.

Well, the fact still remains that operating a datacenter smack-dab in
the center of some of the most inflated real estate in recent history
is quite a costly endeavor.
I really wouldn't be all that surprised if 365 Main cut some corners
here and there behind the scenes to save costs while saving face.

As it is, they don't have remotely enough power to fill that facility
to capacity, and they've suffered some pretty nasty outages in the
recent past. I'm strongly considering the possibility of completely
moving out of there.

--j

365 I believe has flywheels... from what I'm gathering it wasn't a
full-building outage. Static switch issues again, anyone? Either
way, I'm happy I moved out of there. It was overpriced even when it was
working.

I hear they had a scheduled power outage for maintenance this coming
weekend. I'll give them the benefit of the doubt and assume it was for
something else, not that they knew they had an issue and had their
fingers crossed. [1]

On a related note - one of my clients came to within 5 minutes of
the DC UPSs running out today before power came back. The generator
truck was still en route, but hey, power's back! So they cancelled it.
*sigh*

John
1: ...but not crossed tight enough.

But as George mentions... Sh*t happens. There are things you can't
foresee, or that would take way too much engineering to overcome for that
1-in-a-million "oops". I've been at Telehouse 25B a few times when
the "I never expected something like that would happen" happened.
(I remember two guys with VERY LONG screwdrivers poking a live transfer
switch to get it to reset properly, and I was told to step back 20 feet, as
that's how far they expected to get thrown if they did something wrong.)
(I also remember them resetting the switch, then TRIPPING it again just
to make sure it could be reset again!)

      Tuc/TBOH

And I could tell you about large DCs that are charging premium
rates and had (admittedly) quarterly generator tests that ended up failing
and causing downtime MULTIPLE TIMES too. Meanwhile, the generator at my
parents' house that I had installed has weekly tests and runs fine, but I'm
waiting for that unbelievably cold, unbelievably harsh winter's day when
the power goes out and the generator fails... Because it's a machine. It
has wear, it breaks. I don't know that I'd be comfortable with a full
load every time. I'd rather it be load banks...

      Tuc/TBOH

> Sad that the little Telcove DC here in Lancaster, PA, that Level3
> bought a few months ago, has weekly full-on generator tests where
> 100% of the load is transferred to the generator, while apparently
> large DCs that are charging premium rates do not.

Perhaps they do. Wouldn't have mattered in this case if the
big-red-button rumor is real. ;)

Also, doing a "full-on generator test where 100% of the load is transferred"
is not always the best option.

Unneeded use of VRLAs only shortens their lives.

It appears that 365 is using the Hitec Continuous Power System [http://hitec.pageprocessor.nl/p3.php?RubriekID=2016], which is a motor, generator, flywheel, clutch, and diesel engine all on the same shaft. They don't use batteries.

If the flywheels spend their energy before the generators come online, they don't have the ability to start the generators without utility power (unless they purchased the Dark Start option, which is simply extra batteries).
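
For a rough sense of how little time a flywheel buys (back-of-the-envelope
only; the rotor inertia, speed window, and load below are assumptions I
made up for illustration, not Hitec's or 365 Main's numbers):

import math

# Back-of-the-envelope flywheel ride-through estimate. Every figure here
# is an assumption for illustration, not a Hitec specification.
INERTIA = 2500.0                       # kg*m^2, assumed rotor moment of inertia
RPM_FULL, RPM_MIN = 1800.0, 1500.0     # assumed usable speed window
LOAD_KW = 750.0                        # assumed load carried by one module

def kinetic_energy_j(rpm, inertia=INERTIA):
    """Rotational kinetic energy: E = 1/2 * I * omega^2 (omega in rad/s)."""
    omega = rpm * 2.0 * math.pi / 60.0
    return 0.5 * inertia * omega ** 2

usable_j = kinetic_energy_j(RPM_FULL) - kinetic_energy_j(RPM_MIN)
ride_through_s = usable_j / (LOAD_KW * 1000.0)
print("~%.0f seconds of ride-through at %.0f kW" % (ride_through_s, LOAD_KW))

With numbers in that ballpark you get on the order of tens of seconds, not
minutes: if the diesel doesn't start and pick up the load inside that
window, the bus goes dark regardless of how well the rest of the plant is
maintained.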

-brandon

It is not as exciting as Valleywag suggests.

--- cut here --

Hello,

The Internap NOC has confirmed with 365 Main that, at approximately 13:50 PDT, they experienced a loss of utility power to their San Francisco facility. The facility's backup generators did not automatically react and fail over upon the loss of utility power, resulting in loss of power to numerous customer cabinets.

At 14:24 PDT our logs show customers' circuits were restored as the 365 Main facility was able to bring their backup generators online. 365 Main will continue to run on backup generators until they are confident it is safe to run on utility power again. Internap will continue to follow up with 365 Main throughout the evening for updates concerning this event. As we receive additional updates we will be sure to relay them to your team.

Internap will continue to track this event under ticket 243443. Again, we apologize for any inconvenience this may have caused your team.

If you have any questions or concerns please contact us at noc@internap.com or call 877-843-4662.

Thank you.

Ahhh, a trip down memory lane :)

The ISP I used to work at had a small ping-and-power colo space, and we also housed a large dial/DSL POP in the same building. A customer went in to do hardware maintenance on one of their colo boxes. Two important notes here:

1. The machine was still plugged in to the power outlet when they decided to do this work.
2. They decided to stick a screwdriver into the power supply WHILE said machine was plugged into said power outlet. I guess those "no user serviceable parts inside" warning labels are just friendly recommendations and nothing more...

While the machine was fed from a circuit that other colo customers were on, the breaker apparently didn't trip quickly enough to keep the resulting short from sending the 20 kVA Liebert UPS at the back of the room into a fit. It alarmed, then shut down within 1-2 seconds of this customer doing the trick with the screwdriver. This UPS also fed said large dial and DSL POP. Nothing quite like the sound of a whole machine room spinning down at the same time. It gives you that lovely "oh shit" feeling in the pit of your stomach.

I do remember fighting back the urge to stab said customer with that screwdriver...

jms

It appears that 365 is using the Hitec Continuous Power System
[http://hitec.pageprocessor.nl/p3.php?RubriekID=2016], which is a motor,
generator, flywheel, clutch, and diesel engine all on the same shaft. They
don't use batteries.

If the flywheels spend their energy before the generators come online, they
don't have the ability to start the generators without utility power
(unless they purchased the Dark Start option, which is simply extra
batteries).

-brandon

They claim in the video tour that they do not have any battery systems on
the site. They rely solely on the flywheels.

Randy