Tornados in Ashburn

http://www.washingtonpost.com/wp-dyn/articles/A29911-2004Sep17.html

John Starta <john@starta.org> writes:

http://www.washingtonpost.com/wp-dyn/articles/A29911-2004Sep17.html

Printer-friendly version for your signin-bypassing pleasure:
http://www.washingtonpost.com/ac2/wp-dyn/A29911-2004Sep17?language=printer

I was a little closer to the Ashburn one than I really wanted to be -
was able to see it in the distance to the north (heading south to
north as would be expected given it was from the eastern edge of a
northbound heading hurricane remnant) as I drove along the Greenway.

Notwithstanding an incident report sent out by Equinix at 2012 stating
"The Chiller plant is fully functional", our temperature graphs
indicate that there was a cooling issue at Equinix Ashburn F from 1815
until 1855; the start time corresponds with the
Chantilly/Dulles/Ashburn tornado being in the area of Equinix.
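
(The 1815-to-1855 window is just what falls out of thresholding the sensor
data. For the curious, a minimal Python sketch of that sort of check; the
sample readings and the 80 F threshold are invented for illustration, not
taken from our actual graphs:)

    from datetime import datetime

    # Made-up inlet temperature samples (time, degrees F) and an arbitrary
    # threshold; purely illustrative, not the real graph data.
    samples = [
        ("2004-09-17 18:00", 72.0),
        ("2004-09-17 18:15", 81.5),   # spike begins
        ("2004-09-17 18:35", 84.0),
        ("2004-09-17 18:55", 78.0),   # back under threshold
        ("2004-09-17 19:15", 72.5),
    ]
    THRESHOLD_F = 80.0

    def spike_windows(samples, threshold):
        """Yield (start, end) pairs bracketing runs of above-threshold readings."""
        start = None
        last = None
        for stamp, temp in samples:
            last = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
            if temp > threshold and start is None:
                start = last
            elif temp <= threshold and start is not None:
                yield start, last
                start = None
        if start is not None:
            yield start, last

    for begin, end in spike_windows(samples, THRESHOLD_F):
        print("cooling anomaly from %s until %s"
              % (begin.strftime("%H%M"), end.strftime("%H%M")))
    # prints: cooling anomaly from 1815 until 1855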

Another tenant at Ashburn F states that there were AC power
disturbances. I can not speak to that; as far as I can tell (no
special instrumentation in my installations) my power was fine.

The reason that I bring this up is that I believe a report which
is posted two hours after the event and glosses over potentially
serious operational anomalies by stating that everything is cool (in
the present tense) does not serve anyone's best interests. I
understand and accept the two hour delay from the start of the
incident, but I expect scrupulous honesty in after-action assessments,
not a marketing-driven assertion that everything is Just Fine.

I encourage the powers that be at Equinix to make public (or at least
send to its customers) a revised statement that truthfully reflects
what happened Friday night.

                                        ---Rob (KE4DJT, spotter FXN16)

I was in the building last night when the weather went bad here :(

It was definitely scramble mode.

When the power went out to the welcome area, it obviously got silent and
I could hear the sound of the magnetic doors releasing. (I do not like
that sound.) I saw a loss of HVAC but not a loss of power to the floor
(by floor I mean customer machines).

I did, however, talk to a friend of mine last night who administers VoIP
stuff here, and he said that he lost power to a few devices but not all.

To my understanding, none of our devices lost power, just HVAC.

I expect a full report of the events will be released in a few days, but
not today (well, not in detail anyway).

To fully understand what happened, and in what order, someone is going to
have a lot of digging to do through a lot of data to get the real story.

As of right now this area is still without power, but Equinix has 40,000
gallons of fuel, a 500-gallon-an-hour burn rate, trucks coming with fuel,
and an ETA of midnight for the transformer being fixed in the area.
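
(Quick arithmetic on those figures, using nothing but the numbers quoted
above; actual runtime obviously depends on load and on how many generators
are running:)

    # Back-of-the-envelope generator runtime from the figures quoted above.
    fuel_on_hand_gal = 40000
    burn_rate_gal_per_hr = 500.0   # rough figure; real burn rate depends on load

    runtime_hr = fuel_on_hand_gal / burn_rate_gal_per_hr
    print("~%.0f hours (~%.1f days) before refueling is needed"
          % (runtime_hr, runtime_hr / 24))
    # prints: ~80 hours (~3.3 days) before refueling is needed
    # and that is before counting the trucks already on the way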

Dre G.

Robert E. Seastrom wrote:

John Starta <john@starta.org> writes:

http://www.washingtonpost.com/wp-dyn/articles/A29911-2004Sep17.html

Printer-friendly version for your signin-bypassing pleasure:
http://www.washingtonpost.com/ac2/wp-dyn/A29911-2004Sep17?language=printer

I was a little closer to the Ashburn one than I really wanted to be -
was able to see it in the distance to the north (heading south to
north as would be expected given it was from the eastern edge of a
northbound heading hurricane remnant) as I drove along the Greenway.

Notwithstanding an incident report sent out by Equinix at 2012 stating
"The Chiller plant is fully functional", our temperature graphs
indicate that there was a cooling issue at Equinix Ashburn F from 1815
until 1855; the start time corresponds with the
Chantilly/Dulles/Ashburn tornado being in the area of Equinix.

Some additional graphs from F building for the last 24 hours:

http://www.deliver3.com/ash/

Another tenant at Ashburn F states that there were AC power
disturbances. I can not speak to that; as far as I can tell (no
special instrumentation in my installations) my power was fine.

The reason that I bring this up is that I believe a report which
is posted two hours after the event and glosses over potentially
serious operational anomalies by stating that everything is cool (in
the present tense) does not serve anyone's best interests. I
understand and accept the two hour delay from the start of the
incident, but I expect scrupulous honesty in after-action assessments,
not a marketing-driven assertion that everything is Just Fine.

Or even acknowledgement that the incident existed; when we called in our temperature spike, we were told "hrm, that's odd, we'll send somebody over to look".

From the NWS:

A tornadic thunderstorm moved into eastern Loudoun County from
western Fairfax County in the vicinity of the Washington Dulles
International Airport. This tornado passed within one half mile of
the National Weather Service forecast office in Sterling. This
prompted the weather forecast office staff on duty to seek shelter
in the safe room constructed in the office. The tornado traveled
north from Dulles Airport... just west of Route 28 into portions of
Ashburn. The tornado produced some damage on the America Online
campus off of Waxpool Rd and more extensive damage to the north in
the Beaumeade Corporate Park. Many trees were snapped and uprooted
along the path of the tornado in the corporate park. Additionally...
three roofs were blown off of buildings and one wall collapsed on
one building. The tornado also tumbled two automobiles into the side
of a building and turned over a tractor trailer. Based on the damage
produced in the corporate park... the tornado reached a maximum
intensity of F2 on the Fujita scale.

I have no inside information; I haven't worked for Equinix in over three
years.

Regardless of the company, these things are always written by the
marketing/legal departments in the end. In a sole proprietorship, one
person may do it all. You have to learn how to read the reports. The
fact they sent out a report is a good indication there were problems.
The fact they mentioned cooling is a good indication there were cooling
problems. The fact they didn't mention other things (e.g., no earthquakes,
no tsunami, no volcano) is a good indication those other things weren't
an issue. It's just how marketing/legal departments think.

Despite marketing departments, engineers know there will be failures.
An N+1 design means two faults will result in an interruption. An N+2
design means three faults will result in an interruption. And so on.
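
Spelled out as a trivial sketch (Python, illustrative only; it assumes
whole-unit failures are the only failure mode, with no shared single
points of failure):

    def faults_to_interruption(spares):
        """In an N+spares design, the (spares+1)-th concurrent fault
        is the one that interrupts service."""
        return spares + 1

    for x in (1, 2, 3):
        print("N+%d design: interruption on fault number %d"
              % (x, faults_to_interruption(x)))
    # N+1 design: interruption on fault number 2
    # N+2 design: interruption on fault number 3
    # N+3 design: interruption on fault number 4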

I agree it's frustrating when companies won't tell their paying customers
what's happening. I'm not sure it's always dishonesty; a lot of the time
the company doesn't know what's happening either. Most companies are
honest in their reporting, as far as what they say. But there is a lot
of "spin."

Despite marketing departments, engineers know there will be failures.
An N+1 design means two faults will result in an interruption. An N+2
design means three faults will result in an interruption. And so on.

Only caveat here (that I want to add) is this:

1) No matter what the company, no matter what the design, N+x doesn't necessarily mean >x failures have to occur at all, or even simultaneously.

2) Just because a design is believed to be N+x or yN doesn't mean all single points of failure are really eliminated. N+x or yN implies that the failures they planned for have to be >(y-1)N or >x. It doesn't mean that they have planned for every possible failure mode. For example, static transfer switches can and do fail. Even when they are in pairs, the coupling mechanisms and paralleling mechanisms often don't work and aren't easy to repair/bypass in an emergency.

3) Many new systems [say datacenters built/upgraded in the last 5 years] haven't been around long enough to really test 99.999% and above levels of availability... many new systems won't start showing problems for 5-10 years.

Specifically in Equinix's case:

1) Good that they [seemed] to have maintained partial power.

2) Good that they restored cooling [power to the blowers?] relatively quickly. By the graph someone posted and their message, it looks like their chillers were on an unaffected system, but their blowers weren't [as in, were affected].

3) Good that they seemed to be able to bring together enough knowledgeable folks quickly to resolve the problems that did occur relatively quickly.

4) SLA credits. Depending on your contract, even possible breach unless they can prove >x or >(y-1)N failures had occurred in their physical plant. The latter is only useful if you want to get out of Equinix/Ash or reduce your commits to it.

Deepak Jain
AiNET

Deepak Jain wrote:

Specifically in Equinix's case:

1) Good that they [seemed] to have maintained partial power.

2) Good that they restored cooling [power to the blowers?] relatively
quickly. By the graph someone posted and their message, it looks like
their chillers were on an unaffected system, but their blowers weren't
[as in, were affected].

3) Good that they seemed to be able to bring together enough
knowledgeable folks quickly to resolve the problems that did occur
relatively quickly.

I would have to agree. We have a setup in this facility and even with the
quick temperature spike, we didn't skip a beat.

Can't ask for much more than that. It seems to me like things worked
nearly as they should have, and if they didn't, the contingency plans were
effective.

-david

3) Many new systems [say datacenters built/upgraded in the last 5 years]
haven't been around long enough to really test 99.999% and above levels
of availability... many new systems won't start showing problems for
5-10 years.

Past performance is not a guarantee of future results.

Sometimes you get lucky. My residence with no UPS, no backup generator,
no surge protection hasn't lost power in almost 5 years even during
the California rolling blackouts. Nevertheless I wouldn't recommend using
my residence as co-location.

The 5 9s is a bit of a myth and causes some creative statistics. There are
datacenters over 5 years old which have met 100% scheduled availability.
They are rare and probably exceeded their design expectations. All of
them I know about are private data centers, not co-location, and all the
owners have backup data centers because they know one day they will have a
problem. On the other hand, there are many private data centers worse
than professionally operated co-location facilities.
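
For scale, the downtime budget behind "the nines" is straight arithmetic
(no claims here about any particular facility):

    # Minutes of allowable downtime per year at each availability level.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for label, availability in [("three nines", 0.999),
                                ("four nines", 0.9999),
                                ("five nines", 0.99999)]:
        downtime_min = MINUTES_PER_YEAR * (1 - availability)
        print("%s (%.3f%%): about %.1f minutes of downtime per year"
              % (label, availability * 100, downtime_min))
    # three nines: ~526 minutes/year (almost nine hours)
    # four nines:  ~53 minutes/year
    # five nines:  ~5.3 minutes/year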

1) Good that they [seemed] to have maintained partial power.

It would be interesting to find out what happened to the two UPSes that
apparently failed. Was it something that exceeded the design, e.g. a
lightning strike greater than X joules? Or something else? Equinix
tests the heck out of their systems, but there is always the potential
for a problem.

2) Good that they restored cooling [power to the blowers?] relatively
quickly. By the graph someone posted and their message, it looks like
their chillers were on an unaffected system, but their blowers weren't
[as in, were affected].

The initial spike looks normal, although a bit bigger than is comfortable.
Chiller plants and compressors take several minutes to reset and restart
when the backup generators come online. The storm may have had some
impact on the recovery because the temperature appears to take a long time
to stabilize.
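
A toy first-order model of that behavior (every constant below is invented
for illustration; the point is just that a few minutes with the compressors
offline shows up as a spike that takes far longer to flatten out):

    # Temperature climbs while cooling is offline during the restart,
    # then decays slowly back toward the setpoint once cooling returns.
    SETPOINT_F = 70.0
    RISE_PER_MIN = 1.5        # assumed rise with cooling offline, F per minute
    RECOVERY_TAU_MIN = 15.0   # assumed recovery time constant, minutes
    RESTART_MIN = 6           # assumed minutes until chillers/compressors restart

    temp = SETPOINT_F
    for minute in range(41):
        if minute % 5 == 0:
            print("t+%2d min: %.1f F" % (minute, temp))
        if minute < RESTART_MIN:
            temp += RISE_PER_MIN                              # cooling offline
        else:
            temp += (SETPOINT_F - temp) / RECOVERY_TAU_MIN    # slow recovery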

3) Good that they seemed to be able to bring together enough
knowledgeable folks quickly to resolve the problems that did occur
relatively quickly.

Yep, whatever the problem, restoration that quickly tends to indicate
their team was on the ball. Stuff will always fail. The real test is
how quickly it is fixed.

Sean Donelan <sean@donelan.com> writes:

1) Good that they [seemed] to have maintained partial power.

It would be interesting to find out what happened to the two UPSes that
apparently failed. Was it something that exceeded the design, e.g. a
lightning strike greater than X joules? Or something else? Equinix
tests the heck out of their systems, but there is always the potential
for a problem.

Where did you hear this? If it was posted to NANOG, I missed it.

2) Good that they restored cooling [power to the blowers?] relatively
quickly. By the graph someone posted and their message, it looks like
their chillers were on an unaffected system, but their blowers weren't
[as in, were affected].

The initial spike looks normal, although a bit bigger than is comfortable.
Chiller plants and compressors take several minutes to reset and restart
when the backup generators come online. The storm may have had some
impact on the recovery because the temperature appears to take a long time
to stabilize.

If this is to be expected and normal, then a statement to that effect
("Some customers may note a transient temperature spike of as much as
10 degrees C on their equipment due to designed-in characteristics of
an unplanned transfer of the chiller plant to backup power") in the
customer announcement would have gone a long way towards allaying
fears and creating positive spin. A statement that the "chillers are
OK", when your inlet temperature has just spiked 9 degrees and is
currently sitting six degrees high, is simply disingenuous.

Anyway, based on my information (including a couple of phone calls at
the time), suggesting that everything was nominal would be an overly
charitable assessment of the situation.

3) Good that they seemed to be able to bring together enough
knowledgeable folks quickly to resolve the problems that did occur
relatively quickly.

Yep, whatever the problem, restoration that quickly tends to indicate
their team was on the ball. Stuff will always fail. The real test is
how quickly it is fixed.

Absolutely. In case it was not clear in my original message, let me
state for the record:

1) I don't have a problem with facilities being screwed up due to Acts
of God that are outside of the design parameters of the facility. If
an Airbus on short final to Runway 19R at Dulles magically fell out of
the sky on top of Equinix, that would just be spectacularly bad luck,
not Equinix's fault.

1a) In the words of a friend of mine who grew up in Texas, regarding
tornadoes: "The odds of being in the path are actually quite low; the
consequences of being in the path are extremely high". An F2 tornado,
while perhaps not impressive to our friends from the Great Plains,
is capable of causing substantial damage.

1b) No substitute for site diversity if your project is important
enough to justify the cost.

2) Under the circumstances, I think the Equinix staff did an excellent
job of bringing things under control quickly. I'm sure glad this
happened during the day and not at night or on a weekend when, due to
cost-cutting measures, they have maybe one tech, two max, on duty.

3) I believe that the statements made by Equinix to its customers so
far are outside the acceptable and expectable envelope of positive
spin to which Sean alluded in a previous message. We're paying
customers, and when things go south we deserve frankness and full
disclosure, not a pep talk.

                                        ---Rob

I was at Dulles airport at the time, and the result was chaos.
Everyone had to go into the basement of the terminal building,
and many people experienced flight delays (mine was about 5 hours).

Regards
Marshall Eubanks

And even when you have site diversity, Murphy and Mother Nature can
still get you.

The federal National Finance Center in New Orleans, LA shut down due to
Hurricane Ivan; their backup call center is in Cumberland, MD. Ivan
swept through both of them.

http://www.federalnewsradio.com/index.php?nid=22&sid=136393