"Hypothetical" Datacenter Overheating

Let’s say that, hypothetically, a datacenter you’re in had a cooling failure, and temperatures escalated to an average of 120 degrees before mitigations started having an effect. What are normal QA procedures on your end? What is the facility likely to be doing? What should be expected in the aftermath?

One would hope they would have had disaster recovery plans to bring in outside cold air, and have executed on it quickly, rather than hoping the chillers got repaired.

All our owned facilities have large outside-air intakes, automatic dampers, and air-mixing chambers in case of mechanical cooling failure, because cooling systems are often not designed to run well in extreme cold. All of these can be manually run in case of controls failure, but people tell me I'm a little obsessive about backup plans for backup plans.

You will start to see premature failure of equipment over the coming weeks/months/years.

Coincidentally, we have some gear in a data centre in the Chicago area that is experiencing that sort of issue right now... :frowning:

Coincidence indeed… :wink:

Hi Mike,

A decade or so ago I maintained a computer room with a single air
conditioner because the boss wouldn't go for n+1. It failed in exactly
this manner several times. After the overheat was detected by the
monitoring system, it would be brought under control with a
combination of spot cooler and powering down to a minimal
configuration. But of course it takes time to get people there and set
up the mitigations, during which the heat continues to rise.

The main thing I noticed was a modest uptick in spinning drive
failures for the couple months that followed. If there was any other
consequence it was at a rate where I'd have had to be carefully
measuring before and after to detect it.

Regards,
Bill Herrin

I think we're beyond "hypothetical" at this point, Mike ... :wink:

Our Zayo circuit just came up 30 minutes ago and it routes through 350 E Cermak. Chillers were all messed up. No hypothetical there. :slight_smile: It was down for over 16 hours!

I’m more interested in how you lose six chillers all at once.

Shane

And neither of the other two facilities they operate in that same building had any failures.

Easy. Climate change. Lol!

-mel

Exactly. Perhaps they weren’t all online to begin with…

Easy. Climate change. Lol!

It was -8°F in Chicago yesterday.

Yes, but... it has been -8 in Chicago plenty of times before this. Very interested in the root cause...

Absolutely. My point was that claiming "Global warming" isn't going to fly as an excuse.

My sarcasm generator is clearly set incorrectly :slight_smile:

-mel

+1

Is their design N+1?

https://www.equinix.com/data-centers/americas-colocation/united-states-colocation/chicago-data-centers/ch1

We’re not smashing temp records in Chicago. At least it doesn’t seem so when you look across historical data:

https://www.weather.gov/lot/Chicago_Temperature_Records

HTH,

-M<

Extreme cold. If the transfer temperature is too low, they can reach a
state where the refrigerant liquefies too soon, damaging the
compressor.

Regards,
Bill Herrin


Our 70-ton Tranes here have kicked out on ‘freeze warning’ before; there’s a strainer in the water loop at the evaporator that can clog, restricting flow enough to allow freezing to occur if the chiller is actively cooling. It’s so strange to have an overheating data center in subzero (F) temps. The flow sensor in the water loop can sometimes get too cold and not register the flow as well.

Major double-take there for this non-US reader, until I realised you
just had to mean Fahrenheit.

Regards, K.

Someone I talked to while on scene today said their area got to 130 and cooked two core routers.

Something worth a thought: as much as devices don't like being too
hot, they also don't like having their temperature change too quickly.
Parts can expand and shrink at different rates depending on their
composition.

A rule of thumb is a few degrees per hour of change, but YMMV; it
depends on the equipment. Sometimes the manufacturer's specs include
this.

Throwing open the windows on a winter day to try to rapidly bring the
room down to a "normal" temperature may do more harm than good.

It might be worthwhile figuring out what is reasonable in advance with
buy-in rather than in a panic because, from personal experience,
someone will be screaming in your ear JUST OPEN ALL THE WINDOWS
WHADDYA STUPID?
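For concreteness, the rule of thumb above turns into a quick back-of-the-envelope calculation. This is a minimal sketch; the 5°F/hour limit and the 68°F setpoint are assumed example values, not figures from any equipment spec:

```python
def cooldown_hours(start_f: float, target_f: float, max_rate_f_per_hr: float) -> float:
    """Minimum hours to bring a room from start_f to target_f without
    exceeding max_rate_f_per_hr of temperature change."""
    if max_rate_f_per_hr <= 0:
        raise ValueError("rate must be positive")
    return abs(start_f - target_f) / max_rate_f_per_hr

# A 120 F room brought back to a 68 F setpoint at an assumed 5 F/hour
# limit takes about ten hours -- not "open all the windows at once".
print(cooldown_hours(120, 68, 5))  # -> 10.4
```

Having a number like this agreed on in advance is exactly the kind of buy-in that keeps the window-opening argument from happening mid-incident.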

Is this common sense, or do you have a reference for this, such as a
paper showing what damage occurs at what rate of temperature change?

I regularly carry fine electronics, say an iPhone, through significant
temperature gradients, as do most people who live in places where
inside and outside temperatures can be wildly different, with no
particular observable effect. The iPhone does go into 'thermometer'
mode when it overheats, though.

Manufacturers such as Juniper and Cisco describe humidity, storage,
and operating temperature ranges, but do not define a rate of
temperature change. Does NEBS have an opinion on this, or is this just
a common practice of yours?