"Hypothetical" Datacenter Overheating

Let’s say that, hypothetically, a datacenter you’re in had a cooling failure, and temperatures escalated to an average of 120 degrees before mitigations started having an effect. What are normal QA procedures on your end? What is the facility likely to be doing? What should be expected in the aftermath?

One would hope they would have had disaster recovery plans to bring in outside cold air, and have executed on it quickly, rather than hoping the chillers got repaired.

All our owned facilities have large outside-air intakes, automatic dampers, and air-mixing chambers in case of mechanical cooling failure, because cooling systems are often not designed to run well in extreme cold. All of these can be manually run in case of controls failure, but people tell me I'm a little obsessive about backup plans for backup plans.

You will start to see premature failure of equipment over the coming weeks/months/years.

Coincidentally, we have some gear in a data centre in the Chicago area that is experiencing that sort of issue right now... :frowning:

Coincidence indeed… :wink:

Hi Mike,

A decade or so ago I maintained a computer room with a single air
conditioner because the boss wouldn't go for n+1. It failed in exactly
this manner several times. After the overheat was detected by the
monitoring system, it would be brought under control with a
combination of spot cooler and powering down to a minimal
configuration. But of course it takes time to get people there and set
up the mitigations, during which the heat continues to rise.

The main thing I noticed was a modest uptick in spinning drive
failures for the couple months that followed. If there was any other
consequence it was at a rate where I'd have had to be carefully
measuring before and after to detect it.

Regards,
Bill Herrin

I think we're beyond "hypothetical" at this point, Mike ... :wink:

Our Zayo circuit just came up 30 minutes ago and it routes through 350 E Cermak. Chillers were all messed up. No hypothetical there. :slight_smile: It was down for over 16 hours!

I’m more interested in how you lose six chillers all at once.

Shane

And neither of the other two facilities they operate in that same building had any failures.

Easy. Climate change. Lol!

-mel

Exactly. Perhaps they weren’t all online to begin with…

Easy. Climate change. Lol!

It was -8°F in Chicago yesterday.

Yes, but... it has been -8 in Chicago plenty of times before this. Very interested in the root cause...

Absolutely. My point was that claiming "Global warming" isn't going to fly as an excuse.

My sarcasm generator is clearly set incorrectly :slight_smile:

-mel

+1

Is their design N+1?

https://www.equinix.com/data-centers/americas-colocation/united-states-colocation/chicago-data-centers/ch1

We’re not smashing temp records in Chicago. At least it doesn’t seem so when you look across historical data:

https://www.weather.gov/lot/Chicago_Temperature_Records

HTH,

-M<

Extreme cold. If the transfer temperature is too low, they can reach a
state where the refrigerant liquefies too soon, damaging the
compressor.

Regards,
Bill Herrin


Our 70-ton Tranes here have kicked out on ‘freeze warning’ before; there’s a strainer in the water loop at the evaporator that can clog, restricting flow enough to allow freezing to occur if the chiller is actively cooling. It’s so strange to have an overheating data center in subzero (F) temps. The flow sensor in the water loop can sometimes get too cold and not register the flow as well.

Major double-take there for this non-US reader, until I realised you
just had to mean Fahrenheit.

Regards, K.

Someone I talked to while on scene today said their area got to 130 and cooked two core routers.

Something worth a thought: as much as devices don't like being too
hot, they also don't like having their temperature change too quickly.
Parts can expand and shrink at different rates depending on their
composition.

A rule of thumb is a few degrees per hour of change, but YMMV; it
depends on the equipment. Sometimes the manufacturer's specs include
this.

Throwing open the windows on a winter day to try to rapidly bring the
room down to a "normal" temperature may do more harm than good.

It might be worthwhile figuring out what is reasonable in advance with
buy-in rather than in a panic because, from personal experience,
someone will be screaming in your ear JUST OPEN ALL THE WINDOWS
WHADDYA STUPID?
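For concreteness, the rule of thumb above turns into a quick back-of-the-envelope calculation. This is a minimal sketch; the 5°F/hour limit and the 68°F setpoint are assumed example values, not figures from any equipment spec:

```python
def cooldown_hours(start_f: float, target_f: float, max_rate_f_per_hr: float) -> float:
    """Minimum hours to bring a room from start_f to target_f without
    exceeding max_rate_f_per_hr of temperature change."""
    if max_rate_f_per_hr <= 0:
        raise ValueError("rate must be positive")
    return abs(start_f - target_f) / max_rate_f_per_hr

# A 120 F room brought back to a 68 F setpoint at an assumed 5 F/hour
# limit takes about ten hours -- not "open all the windows at once".
print(cooldown_hours(120, 68, 5))  # -> 10.4
```

Having a number like this agreed on in advance is exactly the kind of buy-in that keeps the window-opening argument from happening mid-incident.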

Is this common sense, or do you have a reference for this, such as a
paper showing what damage occurs at what rate of temperature change?

I regularly carry fine electronics, say an iPhone, through significant
temperature gradients, as do most people who live in places where
inside and outside temperatures can be wildly different, with no
particular observable effect. The iPhone does go into 'thermometer'
mode when it overheats, though.

Manufacturers such as Juniper and Cisco describe humidity, storage,
and operating temperature ranges, but do not define a rate of
temperature change. Does NEBS have an opinion on this, or is this just
a common practice of yours?