Why do we use facilities with EPOs?

I was complaining to some of the power designers during the building
of a major facility that the EPO button represented a single point
of failure, and effectively made all of the redundancy built into
the power system useless. After all, what's the point of having
two (or more) of anything, if there's one button somewhere that
turns it all off?

What I found interesting is that a single EPO is not a hard and
fast rule. They walked me through a twisty maze of the national
electric code, the national fire code, and local regulations.
Through that journey, they left me with a rather interesting tidbit.

The more "urban" an area, the more likely it is to have strict fire
codes. Typically these codes require a single EPO for the entire
structure; there's no way to compartmentalize to rooms or subsystems.
However, in more rural areas this is often not so, and they had in
fact built data centers to code WITHOUT a single building EPO in
several locations. That's not to say there was no EPO, but that it may
only affect a single room, or even a single device.

If they can be avoided, why do we put up with them? Do we really
want our colo in downtown San Francisco bad enough to take the risk
of having a single point of failure? How can we, as engineers, ask
questions about how many generators, how much fuel, and yet take
for granted that there is one button on the wall that makes it all
turn off? Is it simply that having colo in the middle of the city
is so convenient that it overrides the increased cost and the reduced
redundancy that are necessitated by that location?

Funny story about that and the EPO we have here...

We have chilled water cooling in our server rooms. A couple of years
ago we told the facilities guys there was sand in the lines. They
didn't believe us. This went back and forth for a few months until
the lines finally ground to a halt. They admitted sand was in the
lines.

They bring out an HVAC guy... he closes the valve, opens the pipe,
nothing comes out. He **opens** the valve, nothing comes out. He
whacks on the pipes with a wrench, all the sand and lots of water come
out very fast. By the time I got down there, the ceiling tiles were
drenched and looked more like sponges. Half the room was soaked.

That would be a good reason to have an EPO right there. :)

j

If they can be avoided, why do we put up with them? Do we really
want our colo in downtown San Francisco bad enough to take the risk
of having a single point of failure? How can we, as engineers, ask
questions about how many generators, how much fuel, and yet take
for granted that there is one button on the wall that makes it all
turn off? Is it simply that having colo in the middle of the city
is so convenient that it overrides the increased cost and the reduced
redundancy that are necessitated by that location?

  You forgot the default "Single Point of Failure" in anything..

      HUMANS.

        Tuc/TBOH

Well.....

An Emergency Power Off button is an NEC (the electrical code adopted in most of the USA) trade-off for allowing more flexible wiring practices in computer rooms. If you don't use any of those alternative wiring practices, you aren't required to install an EPO (modulo the rare local code variation). The problem is that some people want to get rid of the EPO, but also want to keep using alternative wiring practices.

If you've looked in many computer rooms, you'll see some creative wiring practices in use.

You'll notice Telco central offices don't have building EPOs. Likewise Equinix data centers don't have EPOs. But they have limits on what wiring practices can be used in their facilities compared to other
data centers.

The more “urban” an area the more likely it is to have strict fire
codes.

If they can be avoided, why do we put up with them? Do we really
want our colo in downtown San Francisco bad enough to take the risk
of having a single point of failure? How can we, as engineers, ask
questions about how many generators, how much fuel, and yet take
for granted that there is one button on the wall that makes it all
turn off? Is it simply that having colo in the middle of the city
is so convenient that it overrides the increased cost and the reduced
redundancy that are necessitated by that location?

We put up with EPOs for the same reason we put up with water-based fire suppression systems. Safety. When my firefighter buddy needs to cut through a wall in a data center, she better damn well be sure she can kill power to the entire area before she potentially cuts through 400-500V feeds in the walls. Hence, the EPO.

Now, I know you referred to a single EPO for the entire facility, and how that’s akin to doing brain surgery with a sledgehammer. Perhaps someday someone will come up with a more surgical method for ensuring power is removed from the areas it needs to, while other unaffected areas can continue functioning. Until that point though, I prefer we err on the side of caution and value someone’s life over business continuity.

Almost forgot. You mentioned more “urban” areas require the master EPO being easily accessible in the facility. I don’t know what gets more urban than the San Fran area.

-brandon

For high-availability sites (Tier III, Tier IV per UpTime Institute), EPOs are
one of the most common reasons for outage. I'd highly recommend APC's
paper <http://www.apcmedia.com/salestools/ASTE-5T3TTT_R2_EN.pdf>
on the topic.

Short version is that it's a safety issue for room occupants and responders.
More mature codes tend to require such, particularly in the presence of
UPS gear which can be energized unbeknownst to fire fighting personnel.
If you don't have water-based fire suppression, have normally unoccupied
spaces, and are continuously manned, it's sometimes possible to pass on
having an EPO. YMMV by inspector.

/John

Leo Bicknell wrote:

I was complaining to some of the power designers during the building
of a major facility that the EPO button represented a single point
of failure, and effectively made all of the redundancy built into
the power system useless. After all, what's the point of having
two (or more) of anything, if there's one button somewhere that
turns it all off?
  

Seems like the EPO should be a logical AND with the fire alarm system - it only works AFTER you have an existing fire alarm in the building.
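
For what it's worth, the AND idea above is easy to write down. A minimal sketch in Python, where the function name and both inputs are made up for illustration and no real EPO controller is implied:

```python
from itertools import product


def epo_should_trip(button_pressed: bool, fire_alarm_active: bool) -> bool:
    """Hypothetical interlock: the button only cuts power during an active alarm."""
    return button_pressed and fire_alarm_active


# Print the full truth table for the two inputs.
for button, alarm in product([False, True], repeat=2):
    print(f"button={button!s:5} alarm={alarm!s:5} -> trip={epo_should_trip(button, alarm)}")
```

The catch shows up in the table: a button press with no alarm does nothing, so this only guards against accidental trips if every real emergency also raises the alarm.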

John C. A. Bambenek wrote:

Funny story about that and the EPO we have here...
...

Story #1

Many years ago, the safety department for my employer made a big stink
over the fact that the EPO hadn't been tested in a couple of years. We
scheduled an outage window, shut everything down. The facilities guy
pressed the magic big RED button and NOTHING! Tracing the problem back,
there was a blown fuse in the EPO circuit because a wire had shorted. A
real safe design!

Story #2

Every few years the EPO buttons would change. First they were the ones
with the metal ring around the button that protects against accidental
pushing. Then we would get the mushroom button because it was "safer".
Invariably someone would trip it and they would change them back. I
think some guy made some money submitting suggestions to change the
button every few years.

I've always wondered who died or was injured and caused the EPO to come into existence. There have been lots of "EPO caused downtime" stories, but does anyone on the NANOG list even have one single "Thank God for the EPO" story? I'll feel better about the general state of the world if I know that the EPO actually has a real valid use that has been ACTUALLY PROVEN IN PRACTICE rather than just in someone's mind.

-Jerry <----Is so anti EPO, he has no remote EPO buttons, and even has the irrational fear about the jumper on the "EPO terminal strip" inside his UPSes coming undone.

In fact, an EPO system is a single point of failure...

And, whether or not you need an EPO in your center is wholly up to you,
and how you design your center.

As mentioned at a recent seminar I went to:

"If you do not need to install non-plenum rated cable below a floor, and
you require boxes under the floor to be secured, and you do not state
NFPA 75 as your standard, then you do not need an EPO as defined by NEC
645."

Only if you want exceptions granted in 645 ("Information Technology
Equipment"), should you have to install an EPO.

EPO = SPOF = bad. We all know this.

If they can be avoided, why do we put up with them? Do we really
want our colo in downtown San Francisco bad enough to take the risk
of having a single point of failure? How can we, as engineers, ask
questions about how many generators, how much fuel, and yet take
for granted that there is one button on the wall that makes it all
turn off? Is it simply that having colo in the middle of the city
is so convenient that it overrides the increased cost and the reduced
redundancy that are necessitated by that location?

  You forgot the default "Single Point of Failure" in anything..

      HUMANS.

The earth is a SPoF. Let's put DCs on the moon.

Besides, safety always overrides convenience. And I don't think that is a bad trade off.

Me neither...

Having multiple redundant sites (and a well designed network between them) is almost always going to be better than a single, wildly redundant site. No matter how much redundancy you build into a single site, you cannot (realistically) engineer away things like floods, etc. Planning your redundancy and testing it though is very important...

Random anecdote (from a friend, I don't know if it is true or not):
Back in the day (before cheap international circuits), a very large financial in New York needed connectivity to some branches in Europe, so they bought some capacity on a satellite transponder and built their own ground-station (not cheap) fairly close to NY. They then realized that they needed a redundant ground station in case the first one failed or something similar, so they built a second ground-station, just outside Jersey City....

One of the satellite connectivity failure modes is... rain fade.....

W

This is an interesting question.

The National Electrical Code (NEC) requires an EPO. Sort of. Articles 645 and 685
deal with it.

While NEC is not binding on every jurisdiction, almost every US
jurisdiction bases its code on NEC with additions/subtractions. I don't
know offhand if the local changes deal with EPO much, however, here's some
food for thought regarding EPO and NEC.

With regard to "putting up with them" - EPOs are designed to protect life,
not property or uptime. If there's a short causing an electrical fire because
a breaker did not open, a firefighter had better be sure he can cut the power
*before* stepping next to it.

Here's how NEC works:

1) If a room is designed to comply with Article 645, it must have an EPO,
*except* if it qualifies under Article 685.

Being under Article 645 gives a couple of things that are generally not
permitted otherwise, as follows:

645.4 D) permits underfloor wiring for power, receptacles and
crossconnects.

645.4 E) "Power cables; communications cables; connecting cables;
interconnecting cables; and associated boxes, connectors, plugs and
receptacles that are listed as part of, or for, information technology
equipment shall not be required to be secured in place".

In other words, you can have crossconnects that are lying on the floor
(or under raised floor but not otherwise secured), and that is OK;
normally they'd need to be secured every X feet.

645.17) (too lazy to retype NEC language) You can have PDUs with "multiple
panelboards within a single cabinet" - it's not all that clear what exactly
it permits (PDUs with multiple breaker panels, essentially).

My understanding is that if you are willing to forego the things that
Article 645 permits, you do not have to install an EPO. Frankly, I don't see
all that much logic in the 645 requirements and linking them to the EPO (except,
possibly, to make operating datacenters not in compliance with 645
annoying enough that everyone would opt to comply with the EPO requirement).

The Article 685 exception from EPO applies if "An orderly shutdown is
required to minimize personnel hazard and equipment damage". It is really
intended for industrial systems (like chemical plant control) where an EPO
shutoff can cause damage to life/property. I doubt this applies to
a datacenter.
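
The reading above boils down to a small decision rule. A rough sketch in Python, encoding only this armchair interpretation and not the NEC text itself; the class and field names are invented for illustration, and an inspector may well see it differently:

```python
from dataclasses import dataclass


@dataclass
class RoomDesign:
    """Hypothetical summary of a room's design choices (names are illustrative)."""
    underfloor_wiring: bool          # allowance described above under 645.4 D
    unsecured_cables: bool           # allowance described above under 645.4 E
    multi_panel_pdus: bool           # allowance described above under 645.17
    orderly_shutdown_required: bool  # the Article 685 exception as described above


def relies_on_645(room: RoomDesign) -> bool:
    """True if the design uses any of the Article 645 allowances listed above."""
    return room.underfloor_wiring or room.unsecured_cables or room.multi_panel_pdus


def epo_required(room: RoomDesign) -> bool:
    """EPO needed only when 645 allowances are used and 685 does not apply."""
    return relies_on_645(room) and not room.orderly_shutdown_required


typical_colo = RoomDesign(True, True, False, False)
plain_wiring = RoomDesign(False, False, False, False)
print(epo_required(typical_colo))  # uses 645 allowances, no 685 exception
print(epo_required(plain_wiring))  # gives up every 645 allowance
```

The second room is the trade described above: give up the flexible wiring and, on this reading, the EPO requirement goes with it.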

Above is an armchair engineer's understanding. To be sure, you should
consult a real engineer who can stamp and seal your plans!

-alex

If you don't have water-based fire suppression, have normally unoccupied
spaces, and are continuously manned, it's sometimes possible to pass on
having an EPO. YMMV by inspector.

That is indeed true, as we were able to have ours disconnected, and were able to expand our facility without adding EPOs in the new datacenter rooms.

We were told that since we met a specific list of criteria, which included things like FM200/Ecaro25 fire suppression, solid (not raised) floors, electrical wiring done some particular way, and large signage indicating where the safety hazards lie for our friends in the Tukwila & Seattle Fire Departments... we got a pass on the Big Red Button. I imagine those criteria change from place to place, and time to time... perhaps even inspector to inspector.

I sleep better at night. Well... A little bit better.

--chuck

Story #3

So about 4-5 years ago, we were in the middle of a major renovation of our
server room. Moving machines all over the place, trying to clear about
6K contiguous square feet of floor space to drop a top-5 supercomputer in.
Upgrading the power, bringing in another 1.5MW feed, cooling to get the
resulting BTUs *out*, etc. And we decide it's time to put in a new 600kW
diesel backup generator to replace the old one that was way too small, for
all the non-supercomputer systems in the room.

So we take a multi-hour outage one Saturday for a full powerdown so we can wire
all the new UPS gear in. And one of our scarier moments is rebooting the Sun
E10K, because it was a bit long in the tooth, and had 400 disk drives, and
hadn't been powered off in so long we weren't sure if it *would* power up again
without field engineering assistance. And it *had* to come back up, because
it had all the Oracle databases that had all our business records, HR,
student records, everything. There's a few tense moments - we lose about a
dozen drives, but fortunately they're all in RAID sets and no more than one
drive per set died. We also notice that we dodged a bullet - the main boot
drive was supposed to be mirrored, but due to a config error, wasn't.

Tuesday, that boot drive is moved, it's now mirrored on 2 drives.

Friday, some construction guys come in to move the main entrance door into the
room - it has to move about 20 feet to the right so you can go *around* the
supercomputer, rather than walk straight into it. And as per plan, one of them
starts moving the kind of odd light switch junction box next to the door, to
its new location next to the new door. Unfortunately, as *not* per plan, he
fails to double-check with our Facilities team that it's been disarmed first...

5 seconds later, it's very quiet and foggy in the room, as the Halon has dumped
and the interlock with the EPO has killed the power.

Several hours later, we finally get to start powering up the Sun E10K.

The good news: We only lost 2 drives out of 400 this time, rather than a dozen.

The bad news: Guess which 2 failed.....

Story #4

I'm still working at the place mentioned in a previous post -- I was only there for 3 months (actually one day less than 3 months; I know this because the recruiter only got his commission when I was there for at least three months, and if I'd known this I would have stuck it out for another few days), but have more "funny" stories from this place than any other. Anyway, onto the story:

One of the server rooms becomes unusable and needs to be rebuilt[0], so everything needs to be migrated out of the existing room and into new space -- this includes a large APC Symmetra UPS. We shut down the UPS and pull all of the batteries out of both it and the expansion shelves so that we can move it with a pallet lift. We move everything into the new space and it's time to put the UPS back together. I quickly decide that lifting large numbers of heavy batteries into the shelves is not fun, so I show the random helper dude what to do... "You pick up this big, heavy thing and put it into this cubbyhole type spot, then you connect this large connector and slide the battery back, lather, rinse, repeat...".

I watch him do the first one and he seems to have it figured out... I wander off to go hook up some fiber or something and peer down the corridor every now and then to make sure he still has this under control. Surprisingly enough he is managing ok and hasn't wandered off to take a nap or anything. He gets down to the last few batteries and seems to be having some issues, but I figure he'll work it out, so I carry on with what I am doing... I peer down the corridor again and he is sitting on the floor with his back braced against something, pushing the battery into place with his feet... "Whoa, this can't be good", I think, just as there is a LARGE bang, a big flash and much smoke and fire....

Turns out that for the last battery he managed to get the cables caught between the side of the battery and the side of the (sheet-metal) case. When it didn't just slide easily back, he pushed it really hard and the edge of the case chomped through the cable, creating a dead short -- this literally vaporized a crescent of metal from the case around 5 inches in radius, flung bits of molten case and battery leads all over the place and ignited the cardboard that we put on the pallet to soften it...

Much hilarity ensues...

Sometime I really need to write down all of the funny things that have happened over the years... Actually, if anyone has other, random funny (?!) stories, pass them along and I'll make a compilation....

W

[0]: Have you ever noticed that places that use gas fire suppression systems either have doors that open outwards and / or big dampers (like http://www.c-sgroup.com/product_home.php?section=explovent&page=3) ? Ever wonder why? :)

Sometime I really need to write down all of the funny things that
have happened over the years... Actually, if anyone has other, random
funny (?!) stories, pass them along and I'll make a compilation....

[Howard C. Berkowitz]

While working at a distinguished university with a religious affiliation, I
learned, as did one of the priest-biologists, not to refer to a piece of
instrumentation as possessed. While one of the priest-theologians meant
well, we learned what happened when holy water is sprinkled into the high
voltage supply of a gas chromatograph. Beckman Instruments was so amused
they didn't charge for equipment abuse not under maintenance contract.

I do. Hurricane Wilma blew the roof off our building, with water pouring in and pooling under the floor and onto the PDUs and UPS (800 amps of 480V). We wanted to save the data on the servers, and had to hit the EPO to enter the room (anyone have an idea of how far that much power would arc?). It was STILL quite scary since the batteries were still charged; I actually flipped the breaker on the UPS. Not fun to be around that much power when there is a lot of water. Only time I've ever seen an EPO hit in person.

Jerry Pasker wrote:

Alex Rubenstein wrote:

EPO = SPOF = bad. We all know this.

I fail to see why one couldn't have TWO buttons of the same
type that needed to be pressed after one another to close
it down. It is unlikely that someone would trip and touch
two separated buttons (although put close to one another).

Probably some logic behind why we don't have this. Or not?

I fail to see why one couldn’t have TWO buttons of the same type

This is done on quite a few lumps of industrial machinery.
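
As a rough illustration of the two-button idea as proposed (press one, then the other), here is a minimal Python sketch. The button names "A" and "B", the 5 second window, and the class itself are all invented for this example; real industrial two-button controls may work differently.

```python
from typing import Optional


class TwoButtonEPO:
    """Hypothetical EPO that trips only if B is pressed shortly after A."""

    def __init__(self, window_s: float = 5.0) -> None:
        self.window_s = window_s
        self._armed_at: Optional[float] = None  # time of the last A press

    def press(self, button: str, now: float) -> bool:
        """Register a press at time `now` (seconds); return True if power is cut."""
        if button == "A":
            self._armed_at = now  # arm (or re-arm) the sequence
            return False
        if button == "B" and self._armed_at is not None:
            tripped = (now - self._armed_at) <= self.window_s
            self._armed_at = None  # any B press consumes the arming
            return tripped
        return False  # B without a prior A, or an unknown button


epo = TwoButtonEPO()
print(epo.press("B", 0.0))  # stray bump on B alone
print(epo.press("A", 1.0))  # arming press
print(epo.press("B", 3.0))  # second press inside the window
```

A single stray bump on either button does nothing here; only the deliberate A-then-B sequence inside the window cuts power.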

While one of the priest-theologians meant
well, we learned what happened when holy water is sprinkled into the high
voltage supply of a gas chromatograph

That’s a literal example of what happens when faith and science collide.

More broadly, a quote of note from a Royal Marine officer after recent floods in the UK - they were shoring up the walls of a major power-grid switching station, with water inside the facility and much more outside. “I remembered electricity and water don’t mix, but it wasn’t a good moment to think that…” With 600,000 customers hanging off it, needs must when the devil drives.