What to expect after a cooling failure

As some may know, yesterday 151 Front St suffered a cooling failure after Enwave's facilities were flooded.

One of the suites that we're in recovered quickly, but the other took much longer and some of our gear shut down automatically due to overheating. We remotely shut down many redundant and non-essential systems in the hotter suite and remotely transferred some others to the cooler suite, to ensure that we kept at least a minimal set of core systems running in the hotter suite. We waited until the temperatures returned to normal, then brought everything back online. The entire event lasted from approx 18:45 until 01:15. Apparently the ambient temperature was above 43 degrees Celsius at one point on the cool side of cabinets in the hotter suite.

For those who have gone through such events in the past, what can one expect in terms of long-term impact? Should we expect some premature component failures? Does anyone have any stats to share?

Thanks

Hello,

In my experience with overheating events, the only components that really
degrade quickly are hard drives. If you had them spun down, they should be
fine.

CPU / Memory / Motherboards will be fine.

The only other components I can think of that might have issues are PSUs, but
if they were powered off they should be fine as well. Melted wires are possible
in theory, but I don't think it got hot enough for that.

Thanks

Thanks. I should also mention that most of the gear was still on, but we had turned off many VMs on physical servers within the first 2.5 hours, so the CPU and hard drive/I/O load was near zero on those servers. Most of the servers in the hotter suite had fans running at over 75%, vs. about 35% in the cooler suite, and the ambient temperature was down to 32 degrees Celsius within four hours.

If the HDDs were spinning while above their rated maximum ambient intake temp,
*especially* if they're not *right out front in the intake path* (is
anything not built that way anymore? Yeah; the back side of 45-drive
Supermicro racks, among other things), you should probably plan on doing
a preemptive replacement cycle, or at the very least, pay *very* close
attention to smartd/smartctl, and keep a good stock of pre-trayed replacements.
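
A minimal sketch of that kind of check in Python (assuming Linux, smartmontools
on the PATH, and plain /dev/sd* devices -- drives behind hardware RAID
controllers need extra "-d" options this doesn't cover):

#!/usr/bin/env python3
# Sweep SMART data after the event and flag drives whose error-related
# attributes have non-zero raw values.
import glob
import subprocess

# Attributes whose raw value climbing after a heat event is a bad sign.
WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
         "Offline_Uncorrectable", "Reported_Uncorrect")

for dev in sorted(glob.glob("/dev/sd?")):
    out = subprocess.run(["smartctl", "-H", "-A", dev],
                         capture_output=True, text=True).stdout
    health = "PASSED" if "PASSED" in out else "CHECK"
    flagged = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in WATCH and fields[9] != "0":
            flagged.append(f"{fields[1]}={fields[9]}")
    print(dev, health, ", ".join(flagged) or "clean")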

Remember that you may fall into the RAID hole if you wait for failures,
and hence lose any data that isn't backed up elsewhere -- if more drives in a
RAID group fail *during rebuilds*, you're essentially screwed.

If your RAID groups were properly dispersed across drive build dates, then
this will probably be *slightly* less dangerous, but still.
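
To put rough numbers on the rebuild risk -- illustrative only, since the annual
failure rates and rebuild window below are assumptions, not measurements:

# Chance of losing a second drive while an 11-survivor RAID-5 group rebuilds,
# treating failures as independent within the rebuild window.
def p_second_failure(survivors, afr, rebuild_hours):
    p_one = afr * rebuild_hours / (365 * 24)   # per-drive chance during the window
    return 1 - (1 - p_one) ** survivors

for afr in (0.03, 0.10, 0.30):   # assumed baseline vs. heat-stressed annual failure rates
    print(f"AFR {afr:.0%}: {p_second_failure(11, afr, 24):.3%}")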

Also keep an eye on fan bearings.

Cheers,
-- jra

> For those who have gone through such events in the past, what can one expect
> in terms of long-term impact? Should we expect some premature component
> failures? Does anyone have any stats to share?

Realistically... you had a single short-lived stress event. There
are likely to be some number of random component failures in the
future, but it is unlikely that you will be able to attribute those
failures to such a short-lived stress event of this magnitude --
there might, on average, be a small increase over normal failure rates.

The bigger concern may be that /a lot of different components/
could have been subjected to the same kind of abuse at the same time,
including sets of components that are supposed to form a redundant
pair and not fail simultaneously.

I wouldn't necessarily be so concerned about premature failures ---
I would be more concerned that you may have redundant components
that were exposed to the same stress event at the same time; the
assumption that their chances of failure are independent now
becomes more questionable --- the chance of a correlated failure in
the future may be greatly increased, reducing the level of
effective redundancy/risk reduction you have today.

That would apply mainly to mechanical devices such as HDDs.
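
As a back-of-the-envelope illustration of that independence point (both the
per-unit failure probability and the correlation figure below are made-up
assumptions):

# Why shared stress matters for a redundant pair (e.g. two PSUs).
p = 0.05      # assumed chance each unit fails within the next year
rho = 0.5     # assumed degree of "common fate" introduced by the shared heat

independent_pair = p * p
# Crude mixture model: with probability rho the pair shares a common fate,
# with probability (1 - rho) the two units fail independently.
correlated_pair = rho * p + (1 - rho) * p * p

print(f"both fail, independent: {independent_pair:.4f}")
print(f"both fail, correlated:  {correlated_pair:.4f}")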

Honestly, I think your hardware will be fine. Like everyone else said,
keep an eye on your hard drives; they are by far the most sensitive.
For anything not mechanical, if it didn't melt, you're good.

One data center we had equipment in sat at 153F (about 67C) for about a week,
and all we saw were drive failures, and even those were fairly sparse --
maybe 1 out of 10.

Thanks

While others have already talked about what to look out for in terms of systems and drives, I haven't seen anyone mention things like your UPS batteries. Were they also heat-soaked? At one place I worked, we lost a whole bank of batteries in the UPS room when it overheated. I think that was somewhere around a $95,000 replacement and required rush delivery of a lot of SLA (sealed lead-acid) batteries from all over the place.

I have experience with a different kind of event that might be of interest to a wider audience.

When the fire suppression system went off at one site, we had a lot of instant hard drive failures. I don't have exact numbers, but let's say 5-10% of all HDDs in the room died more or less instantly. Supposedly this was because of the air-pressure shock when the inert fire suppression gas was released and the vents weren't big enough to release the overpressurised air outside.

I did some research, and there are forum posts, etc., about these kinds of events happening at other places.

So the takeaway from this was that RAID is an uptime tool, not a substitute for backups -- and also, get a qualified ventilation/fire suppression systems engineer to inspect your sites with this in mind.

I have seen DDR2 RAM give random errors from inadequate cooling. The cabinets were stacked to the max with servers, but the doors were not meshed. DDR2 runs fairly hot, especially when all the banks are filled.
Tri Tran

* Erik Levinson <erik.levinson@uberflip.com>:

> [cooling failure]
>
> For those who have gone through such events in the past, what can
> one expect in terms of long-term impact? Should we expect some
> premature component failures? Does anyone have any stats to share?

We had a similar event (temperatures were a bit higher at 49°C,
duration was a bit shorter, 10am to 3pm) this January. In the two days
after the event, two of our HP servers had drives go from "OK" to
"Predictive Failure", which is the SmartArray controller's way of
flagging elevated error rates. Two weeks after that, we had a single DIMM
with an uncorrectable ECC error, causing a server reboot. Three weeks
after, a single PSU failed.

In our opinion, the disk problems were caused by the cooling failure,
while the ECC error and the faulted PSU were probably not related.

I believe that your hardware will be fine, but it probably wouldn't be
a bad idea to check if you have current maintenance contracts/warranty
for your servers, or any other way of obtaining replacement drives in
a reasonably short time.

Cheers
Stefan

Numbers from memory and filed off a bit for anonymity, but....

A site I was consulting with had statistically large numbers of x86 servers (say, 3000), SPARC enterprise gear (100), NetApp units (60), and NetApp drives (5000+) go through a roughly 42C excursion. It was much hotter at ceiling level, but fortunately the ceilings were high (20 feet). That put it within about 1C of the head fuse temperature of the (wet-pipe) sprinkler system... (shudder)

Both the NetApp and x86 server PSUs had significantly increased failure rates for the next year -- in rough numbers, say 10% failed within the year, and about 2% were instant failures.

Hard drives had a significantly higher fail rate for the next year, also in the 10% range.

No change in rate of motherboard or CPU or RAM failures was noted that I recall.
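
For anyone turning numbers like that into a spares plan, a rough sketch in
Python -- the fleet sizes come from the post above, while the baseline failure
rate is an assumption added for comparison, and PSU counts per server will vary:

# Rough spares math from the rates quoted above (~10% over the following year).
fleets = {"x86 servers (PSUs, roughly)": 3000, "NetApp drives": 5000}
post_event_rate, baseline_rate = 0.10, 0.03   # baseline is an assumed typical AFR

for name, count in fleets.items():
    expected = count * post_event_rate
    excess = count * (post_event_rate - baseline_rate)
    print(f"{name}: ~{expected:.0f} failures over the year, "
          f"~{excess:.0f} above an assumed {baseline_rate:.0%} baseline")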

George William Herbert

Ugly.

If the batteries in the facility's power distribution system were affected by
the heat, then their life is likely significantly shortened -- both in terms
of their capacity to supply power in the event of an outage and in terms of
shelf life.
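
A common rule of thumb for VRLA/SLA batteries is that service life roughly
halves for every 8-10 C they spend above about 25 C. That rule is meant for
sustained temperatures, so for a short excursion treat a calculation like the
following as a rough upper bound (the design life and halving step are assumed
figures):

# Rule-of-thumb temperature derating for VRLA/SLA battery service life.
def derated_life(design_life_years, avg_temp_c, ref_temp_c=25.0, halving_step_c=9.0):
    return design_life_years * 0.5 ** ((avg_temp_c - ref_temp_c) / halving_step_c)

for temp in (25, 35, 43):
    print(f"{temp} C sustained: ~{derated_life(5.0, temp):.1f} years of a 5-year design life")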

Lorell

> As some may know, yesterday 151 Front St suffered a cooling failure after
> Enwave's facilities were flooded.
>
> One of the suites that we're in recovered quickly but the other took much
> longer and some of our gear shut down automatically due to overheating. We
> remotely shut down many redundant and non-essential systems in the hotter
> suite, and remotely transferred some others to the cooler suite, to ensure
> that we had a minimum of all core systems running in the hotter suite. We
> waited until the temperatures returned to normal, and brought everything
> back online. The entire event lasted from approx 18:45 until 01:15.
> Apparently ambient temperature was above 43 degrees Celsius at one point on
> the cool side of cabinets in the hotter suite.
>
> For those who have gone through such events in the past, what can one
> expect in terms of long-term impact? Should we expect some premature
> component failures? Does anyone have any stats to share?

Another failure I've seen connected to overheating events is AC power supply failure.

This has been a very interesting thread.

Google pointed me to this Dell document, which specs some of their servers as having an expanded operating temperature range
*** based on the amount of time spent at the elevated temperature, as a percentage of annual operating hours. ***

ftp://ftp.dell.com/Manuals/all-products/esuprt_ser_stor_net/esuprt_poweredge/poweredge-r710_User's%20Guide4_en-us.pdf

I mention that because the "1% of annual operating hours" allowance at 45 C is two degrees above the 43 C reported as reached in the original email.

It would seem that Dell recognizes that there might be situations, such as this, where the "continuous operation" range (35 C) is briefly exceeded.
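
For comparison against that 1%-of-annual-operating-hours allowance, a quick
back-of-the-envelope check using the times from the original post (approx
18:45 to 01:15, about 6.5 hours):

# Check this event against a "1% of annual operating hours" style allowance.
annual_hours = 365 * 24                  # 8760
allowance_hours = 0.01 * annual_hours    # ~87.6 h/year permitted at the expanded range
event_hours = 6.5                        # approx 18:45 to 01:15

print(f"allowance: {allowance_hours:.1f} h/year; "
      f"event: {event_hours} h = {event_hours / annual_hours:.2%} of the year")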

Tony Patti
CIO
S. Walter Packaging Corp.