UPS failure modes (was: fire at NAC)

I had a little 2000VA rackmount Liebert UPS catch fire in 1997 and another new and improved Liebert model almost catch fire about a year later. Both were operating well within specified input, load, and temperature parameters. I haven't really trusted them since. I bought dual MGE UPSes for our datacenter in 2002. I figured if E****s can flip them on and off randomly and massively overload them all in an environment which is 95 degrees F, then they should hold up nicely for us when lightly loaded at 65 degrees F. :-)

The reason for this rambling post is to ask if others have had similar problems with other UPS brands. I think they should have enough fail-safes built-in that they are never the CAUSE of an outage, much less a fire! Based on my experience and NAC's incident today, is that an unreasonable expectation? I don't think manufacturers specify MTBF (mean time before fire) figures for UPS units. What have others experienced as the failure mode(s) for their UPS(s)? The static transfer switch should drop the load onto line/bypass power immediately and shut down the inverter while tripping the battery disconnect at the first sign of trouble - does this work as designed and advertised most of the time or just some of the time? Of those with UPS failure histories, what has happened in your situation?

Robert

Tellurian Networks - The Ultimate Internet Connection
http://www.tellurian.com | 888-TELLURIAN | 973-300-9211
"Good will, like a good name, is got by many actions, and lost by one." - Francis Jeffrey

    > I had a little 2000VA rackmount Liebert UPS catch fire in 1997 and another
    > new and improved Liebert model almost catch fire about a year later.
    > What have others experienced as the failure mode(s) for their
    > UPS(s)?

We had a two-hour grid power outage here in Berkeley yesterday, during
which time our APC Symmetra 16kva fried two of its four batteries, and
went into bypass mode, which meant that the transition back from generator
to grid caused everything to reboot. :-/

I've seen two previous APCs (both Matrixes) fry batteries... The
batteries balloon up, and get really hot, and are too big to extract from
the chassis. APC's solution to this is to have us take the entire UPS
offline for several days to completely dissipate the heat, and then try to
force the batteries out. Since this seems to be an endemic problem, you'd
think they'd just design a chassis with somewhat more clearance around the
batteries so that failed ones could still be physically extracted.

                                -Bill

I am personally of the opposite opinion: we have never had any issues
with our Liebert UPS'es, but we have had a few MGE's blow up. I
can't comment on their small UPS models though; I think the smallest MGE
or Liebert we have is 10KVA.

The worst of the cases was an installation where we had dual 40KVA MGE
UPS'es installed, both of which failed critically within 48 hours of each
other. Despite all the fail-safe circuitry they were bought with, they
failed HARD (and yes, they had sparks and smoke coming out of them), and
not even the bypass features worked. Since the second failed before the
first one had been completely restored (it was being investigated to
find the root cause of this critical failure), things went very black,
and the electricians had a pretty hectic time as they had to manually
bypass the UPS'es completely and feed grid power directly to the facility.

Unfortunately these two were bought at the same time (and came from the
same production batch), so they had the same fault - which was
apparently a bad shipment of capacitors that started to leak fluid
after a period of time. Due to some unfortunate design choices in the
MGE's, these capacitors happened to be placed directly above the main
controller circuitry, and the leaky capacitors eventually caused the
whole thing to short, in a rather spectacular way I might add.

And yes, the bypass failed as well. The engineers from MGE explained the
reason for this to us, although I can't say I remember the details
(electricity really isn't my field :). That being said, after the
replacement of the fried components, the engineers from MGE came
on-site some time later and rebuilt the entire bypass system in these
two boxes, at no charge of course - and we have not had any problems
with them in the two years they have been in operation since this
incident.

/leg

I've seen two previous APCs (both Matrixes) fry batteries... The
batteries balloon up, and get really hot, and are too big to extract from
the chassis. APC's solution to this is to have us take the entire UPS
offline for several days to completely dissipate the heat, and then try to
force the batteries out.

I've got a few APC SmartUPS's in the shop at home that I disassembled last
week. Two SU1400's and one much older SU 2000. All of them had
overheated and exploded cells. The 1400's all leaked and were /very/
wedged in. I ended up removing the outer shell, then the circuit board.
Once these were out of the way the frame spread enough to allow the
swollen casings to slide out. Both of these had acid all over the
bottoms.

One of these had been leaking so long that it ate a ~1/4 inch hole through
the bottom.

The 2000 had the worst batteries, but once I got the plastic battery
casing opened (I CAN see why they stopped using this design...) and got
all of the connections apart, the cells just dumped right out.

Maybe I'll take some pictures of the batteries... it was sort of fun to see
how they had expanded and wrapped around each other. :-)

I've only heard one report of an APC catching fire... and even then it
was just the carpet below that caught fire.

...david

I've also personally witnessed an APC do this. I'm not a fan of APC.
They sent us a replacement APC but I still prefer the rack mount Tripp
Lites we used at the last company I worked for.

Gerald

Anyone had experience with Belkin UPSes? They're much cheaper than APC,
and seem to have a longer runtime. I wonder about the long-term
reliability though. Data points would be helpful.

-Dan

<running to move my APC UPS...>

Justin

We were all set to buy a dozen APC 750s, a half dozen or so APC 1400s, and
another half dozen APC 2200s a year or so ago. We had approval for the
purchase from the director and provided the exact model numbers and
quantities to our purchasing officer. Our purchasing officer came back
sometime later and suggested we purchase Minuteman UPSs because they were
on sale at the time. We never did get to purchase any UPSs. grr...

That said, Minuteman's Enterprise UPS series looks to be pretty good.
Does anyone have any experience with them? We were planning on buying
large 2x00VA UPSs to handle Enterasys 6000 chassis (pl).

Justin

(header trimmed)

Hello,

First off, we're all still alive here. The underlying root cause was a
failure of a capacitor in the rectifier section. We're not sure what
actually caused the failure of the capacitor, but it resulted in the
internals of the capacitor being ejected from the UPS at such high speed
that it dented the front door of the UPS itself and caused the door to
jump the lock and swing open.

I, personally, have never been a fan of Liebert UPS's. The electrical
engineer that we use seems to share my assessment that Lieberts, at least
the Series 300's, are not built as well as they could be.

I have no direct experience with MGE, but I recall several multi-hour
outages in Jersey City Exodus, that I think had something to do with MGE
systems. I don't recall if that was human error, or not. Another negative
there, for me, is that they are French.

We own about 8 or so Matrix 5000's; the out-of-box failure rate is hovering
at about 50%, and the failure rate within the first month of operation is
about 75%. However, once they pass that barrier, they tend to work. Don't
overload them; they tend to get cranky. We had one shoot flames once, but
that wasn't associated with an overload.

My personal favorite: Exide/Powerware/Invensys 9315's. They just work. I
have two of them, an 80 and a 500. The 80 has been installed for nearly 4
years, and has never, ever dropped the critical load unless instructed to.
The 500 is a recent install, but seems to be doing just fine as well.

From folks I've talked to (engineers and industry people), Powerware seems
to be known as the UPS that just works. I've yet to talk to one person who
had a powerware die on them. Myself included.

The reason for this rambling post is to ask if others have had similar
problems with other UPS brands. I think they should have enough fail-safes
built-in that they are never the CAUSE of an outage, much less a fire! Based
on my experience and NAC's incident today, is that an unreasonable
expectation? I don't think manufacturers specify MTBF (mean time before
fire) figures for UPS units.

A quote from the facility manager of a large ISP with multiple data
centers in Northern Virginia, each using a different brand of UPS and
backup systems (batteries, flywheels, generators, etc).
Q: Which UPS brand do you think is best?
A: They all suck.

UPSes (and UPS batteries) do fail, sometimes in catastrophic ways. I
would not design any critical system on the assumption that any particular
component won't fail. High availability is about designing for failure.
Sometimes there is a long time between failures, other times they occur
early and often. The most annoying thing about UPSes is they fail at
exactly the time they are needed most.
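
As a rough sketch of what designing for failure buys you, here is a
back-of-envelope availability calculation in Python; the MTBF and MTTR
figures are invented for illustration, not vendor data.

    # Back-of-envelope availability math for one UPS vs. two independent
    # UPS feeds (A/B power to dual-corded loads). MTBF and MTTR figures
    # here are invented for illustration, not vendor data.

    MTBF_HOURS = 100_000      # assumed mean time between failures
    MTTR_HOURS = 24           # assumed mean time to repair/replace

    # Steady-state availability of a single unit
    a_single = MTBF_HOURS / (MTBF_HOURS + MTTR_HOURS)

    # Two independent feeds: the load drops only if both are down at once.
    # (Shared bypass, paralleling gear, or an EPO breaks this independence,
    # which is exactly the coupling discussed later in the thread.)
    a_dual = 1 - (1 - a_single) ** 2

    HOURS_PER_YEAR = 8766
    for label, a in (("single UPS", a_single), ("dual independent", a_dual)):
        downtime_min = (1 - a) * HOURS_PER_YEAR * 60
        print(f"{label:>16}: availability {a:.8f}, "
              f"~{downtime_min:.2f} min/year expected downtime")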

The FCC NRIC Focus Group 2 is now accepting "voluntary" outage reports
from Internet Service Providers. See http://www.nric.org/ for details.

I've seen two previous APCs (both Matrixes) fry batteries... The
batteries balloon up, and get really hot, and are too big to extract from
the chassis. APC's solution to this is to have us take the entire UPS
offline for several days to completely dissipate the heat, and then try to
force the batteries out. Since this seems to be an endemic problem, you'd
think they'd just design a chassis with somewhat more clearance around the
batteries so that failed ones could still be physically extracted.

Gel cells suffer from an electron mobility problem relative to
traditional lead acid batteries. If you pull too much current off a stack
of them you can boil the electrolyte off in a very big hurry, but because
they're sealed they'll distend before they explode instead of just venting.

I have run a Matrix 5000XR with 4 battery enclosures down to zero
under 65% load (220 minutes) without any untoward effects. In 100% load
or overload conditions without forced air cooling (we lose ours in power
outages) things could get uncomfortably warm.
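
For a sense of the discharge rates involved, here is a rough sketch in
Python; the load, efficiency, string voltage, and amp-hour figures are
all assumptions for illustration, not specs for any unit mentioned above.

    # Rough estimate of the discharge rate a sealed battery string sees
    # at load. Every figure here (load, efficiency, string voltage, Ah)
    # is an assumption for illustration, not a spec for any unit above.

    load_va = 5000              # assumed load on a Matrix 5000-class unit
    power_factor = 0.9
    inverter_efficiency = 0.85
    string_voltage = 48.0       # assumed nominal DC bus voltage
    string_capacity_ah = 18.0   # assumed amp-hours for the string

    load_watts = load_va * power_factor
    dc_watts = load_watts / inverter_efficiency
    dc_amps = dc_watts / string_voltage
    c_rate = dc_amps / string_capacity_ah   # multiples of the 1-hour rate

    print(f"DC draw from the string: {dc_amps:.0f} A")
    print(f"Discharge rate: about {c_rate:.1f}C")
    # A sealed (gel/AGM) cell pushed at several times its one-hour rate
    # heats quickly; since it cannot vent freely, the case swells, which
    # is the ballooning described earlier in the thread.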

UPSes (and UPS batteries) do fail, sometimes in catastrophic ways. I
would not design any critical system on the assumption that any particular
component won't fail. High availability is about designing for failure.
Sometimes there is a long time between failures, other times they occur
early and often. The most annoying thing about UPSes is they fail at
exactly the time they are needed most.

Except that:

Even in instances where 'high availability' is designed in, if one of the
units has a failure that causes a fire and an FM200 dump, either the FM200
will still trigger an EPO, or the fire department will.

So the second 'highly available' unit will generally not prevent you from
dropping the critical load; instead, it will help you get back online
more quickly.

A much cheaper and easier-to-implement external make-before-break
maintenance bypass will accomplish the same thing.

I've heard many a story of the paralleling gear causing the problem in the
first place, as well...

-- Alex Rubenstein, AR97, K2AHR, alex@nac.net, latency, Al Reuben --
-- Net Access Corporation, 800-NET-ME-36, http://www.nac.net --

Or we could all take a page from the book of telecom, and run with DC systems.

No inverters involved, lots of parallel rectifiers and battery power just
sitting there.

If only the equipment manufacturers would stop gouging on price for
DC equipment/power supplies.

Dan.
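
As a crude illustration of why fewer conversion stages appeal, a quick
comparison in Python; the per-stage efficiencies are assumed round
numbers, not measurements of any particular gear.

    # Crude conversion-stage comparison: a double-conversion AC UPS feeding
    # an AC server power supply versus a -48V DC plant feeding DC-input
    # gear. All per-stage efficiencies are assumed round numbers chosen
    # only to show the arithmetic; real figures vary widely by equipment.

    rectifier = 0.92   # AC -> DC (charger/rectifier), common to both paths
    inverter = 0.90    # DC -> AC stage in the double-conversion UPS
    ac_psu = 0.80      # AC-input server power supply
    dc_psu = 0.90      # DC-input power supply / DC-DC converter

    ac_ups_path = rectifier * inverter * ac_psu
    dc_plant_path = rectifier * dc_psu

    print(f"AC UPS path : {ac_ups_path:.1%} of input power reaches the load")
    print(f"DC plant    : {dc_plant_path:.1%} of input power reaches the load")
    # Fewer conversion stages also means fewer components (no inverter,
    # no static switch) sitting between the batteries and the load.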

We are a Powerware house. We had a large number of 3kVA and 6kVA units in
our previous data centre (no-one would stump up the cash for a large unit
so we had to buy them as we needed them). After about 5 years (very rough
figure), we've now had 3 or 4 units fail, sometimes in the UPS, sometimes
in the bypass unit. They all seem to be component failures (in the case
of the bypass unit, a leg broke off a small capacitor). I don't think we've
replaced any of the batteries in that time and they're still all holding charge
well, even at full load.

At our new site we have some 50kVA and 80kVA units that we inherited from the
previous owners. One is already exhibiting the signs of a failing fan, but
we have no idea what their history is before we moved in.

Simon

> From folks I've talked to (engineers and industry people), Powerware seems
> to be known as the UPS that just works. I've yet to talk to one person who
> had a powerware die on them. Myself included.

We are a Powerware house. We had a large number of 3kVA and 6kVA units in
our previous data centre (no-one would stump up the cash for a large unit
so we had to buy them as we needed them). After about 5 years (very rough
figure), we've now had 3 or 4 units fail, sometimes in the UPS, sometimes
in the bypass unit. They all seem to be component failures (in the case
of the bypass unit, a leg broke off a small capacitor). I don't think we've
replaced any of the batteries in that time and they're still all holding charge
well, even at full load.

Perhaps I should have been clear. In my entire post,
  sed s/powerware/powerware 9315/g

At our new site we have some 50kVA and 80kVA units that we inherited from the
previous owners. One is already exhibiting the signs of a failing fan, but
we have no idea what their history is before we moved in.

Well, fans would fall under maintenance, no? Perhaps, also, someone is not
changing filters, or whatnot.

-- Alex Rubenstein, AR97, K2AHR, alex@nac.net, latency, Al Reuben --
-- Net Access Corporation, 800-NET-ME-36, http://www.nac.net --

Even in instances where 'high availability' is designed in, if one of the
units has a failure that causes a fire and an FM200 dump, either the FM200
will still trigger an EPO, or the fire department will.

Why do you think most telephone central offices don't have EPO's? It is
possible to meet code without an EPO, if you have a smart PE on the
project.

So the second 'highly available' unit will generally not prevent you from
dropping the critical load; instead, it will help you get back online
more quickly.

That's why you have geographic diversity: if one node goes down, the other
location may be unaffected.

A much cheaper and easier-to-implement external make-before-break
maintenance bypass will accomplish the same thing.

Pick two out of three. The "Internet philosophy" has tended to be lots of
cheap equipment connected by diverse paths. Designing for failure also
means defining "failure" in terms of the service, not particular pieces of
equipment. I don't care how many 9's your switch is, I just care if my
packets get through.

I've heard many a story of the paralleling gear causing the problem in the
first place, as well...

Yep, tying together "redundant" systems with paralleling gear turns two
independent systems into one "co-dependent" system. In a failure
situation, you want to compartmentalize the failure. Losing half your
systems may be better than losing all your systems.
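
To put rough numbers on the compartmentalization point, a small Python
sketch; the unavailability figures are invented purely for illustration.

    # Putting rough numbers on compartmentalized vs. co-dependent failure.
    # p is an assumed probability that one power system is down at any
    # given moment; all values are invented for illustration.

    p = 0.001            # assumed unavailability of each independent system
    p_shared = 0.0005    # assumed unavailability of shared paralleling gear

    # Two truly independent halves: everything is dark only if both fail.
    all_dark_independent = p ** 2

    # Tie them together with shared gear and that gear alone can take
    # down everything (approximation: the tiny cross terms are ignored).
    all_dark_coupled = p ** 2 + p_shared

    print(f"independent halves, all dark : {all_dark_independent:.2e}")
    print(f"coupled via shared gear      : {all_dark_coupled:.2e}")
    # Losing half the systems (roughly probability p) may be survivable if
    # the service is spread across both halves; the shared gear re-creates
    # a single event that loses everything.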

They suck too, just in different ways.

We are opening a new facility in SF and are seriously considering the idea
of bypassing a large UPS (150-225KVA) altogether and relying on a 400-450KW
generator with small UPSes on each rack; this way a UPS failure would be
limited to a single rack.

My personal experience with ATSes is limited, and I would appreciate any
feedback.

Our new facility is dual-fed from two different power grids, and we could
provide two independent power feeds to each rack, one backed by a generator
and the other standard house power. Customers are welcome to bring their own
UPS. (We know it takes away rack space.)

FYI: our experience has shown that all UPSes we have used (APC, Best,
Liebert and Minuteman) have failed within a 5-year period. We have
over 30 APC UPSes and a ~20% failure rate.

thanks
arman
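
Taking the ~20%-in-five-years figure above at face value, here is a quick
expected-incidents estimate for the per-rack approach in Python; the rack
count and the assumption of independent, constant-rate failures are not
from the post above.

    # Expected rack-level UPS incidents per year for the per-rack design,
    # taking the ~20%-failed-in-about-5-years figure above at face value.
    # The rack count and the assumption of independent, constant-rate
    # failures are illustrative, not from the post.

    failed_fraction = 0.20   # ~20% of units failed ...
    period_years = 5         # ... within roughly 5 years
    racks = 100              # assumed number of racks, each with its own UPS

    annual_failure_rate = 1 - (1 - failed_fraction) ** (1 / period_years)
    expected_incidents = annual_failure_rate * racks

    print(f"annualized failure rate : {annual_failure_rate:.1%} per unit")
    print(f"expected incidents/year : {expected_incidents:.1f}, "
          f"each limited to a single rack")
    # The trade-off versus one big UPS: more frequent but much smaller
    # outages, instead of a rare event that can darken the whole facility.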

Joel Jaeggli wrote:

Arman wrote:

We are opening a new facility in SF and are seriously considering the idea
of bypassing a large UPS (150-225KVA) altogether and relying on a 400-450KW
generator with small UPSes on each rack; this way a UPS failure would be
limited to a single rack.

We run a configuration similar to this, except we contain failures per row,
with one APC Symmetra supporting between 3 and 6 cabinets, depending on
the projected load. In the past 2.5(?) years, we've had one controller
failure that did not cause an outage. All the batteries are fine,
though we have not run them to zero. It turned out to be a very clean
install as far as conduit and cabling goes; we're very happy with it.

I have no direct experience with MGE, but I recall several multi-hour
outages in Jersey City Exodus, that I think had something to do with MGE

Correct. There were, if I recall correctly, a total of five in a short
period, including two in one night. To be fair, the second of those was
human error, the result of aggressive management pushing hard against
engineering recommendations at three in the morning when folks were
already hyped up on adrenaline and lack of sleep from the first.

That was when Exodus lost my business forever.

systems. I don't recall if that was human error, or not. Another negative
there, for me, is that they are French.

*snicker*

-Pete