HE.net, Fremont-2 outage?

Joe Greco wrote:
>
> With power:
>
> N+1 is usually better than N
> Best to assume full load when doing math
> Things will go wrong, predict common failures
> The best plans are still prone to failure
> Safety margins can save your rear
> etc

I find that electrical panelboards, busways, transfer switches, etc. are
often put in the category of things that don't need maintenance or
routine inspections. Big deal if you can start your fancy generator once
a month (I prefer on-load weekly) but the in-between stuff is in
disrepair or full of mice. Even a simple dusty transfer switch could
arc-weld itself to one side of the contacts.

Yup. Related: "100% availability" is a marketing person's dream; it
sounds good in theory but is unattainable in practice, and is a reliable
sign of non-100%-reliability.

The most common way to gain "100% availability" is to avoid testing
under load. This surely protects the equipment against a whole slew of
failures in the less-used portions of your power systems, but also
protects you from detecting them outside your Hour(s) Of Greatest Need.

And even for those who follow best practices... You can inspect and
maintain things until you're blue in the face. One day a contractor
will drop a wrench into a PDU or UPS or whatever and spectacular things
will happen. Or a battery develops a strange fault.

You do live load testing, you'll lose now and then. It's best to simply
assume no single circuit is 100% reliable. You should be able to get
two circuits from separate power systems and the combination of the two
should really closely approximate 100%, but even there... it isn't.
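
A rough sketch of the arithmetic behind the two-circuit point, in Python.
The 99.9% per-circuit figure and the function names are made up for
illustration, and the math assumes the two power systems fail
independently, which is exactly the assumption a shared event (one wrench,
one fire, one EPO button) breaks.

```python
HOURS_PER_YEAR = 8760  # non-leap year


def combined_availability(*feeds: float) -> float:
    """Availability of a load fed by several feeds in parallel.

    The load is down only when every feed is down at the same time, so
    the unavailabilities multiply. This holds only if the feeds fail
    independently (an assumption, not a given).
    """
    if not feeds:
        raise ValueError("need at least one feed")
    all_down = 1.0
    for a in feeds:
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"availability must be in [0, 1], got {a}")
        all_down *= 1.0 - a
    return 1.0 - all_down


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of outage per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.999  # hypothetical per-circuit availability
    dual = combined_availability(single, single)
    print(f"one circuit : {single:.6f}, {downtime_hours_per_year(single):.2f} h/yr down")
    print(f"two circuits: {dual:.6f}, {downtime_hours_per_year(dual) * 3600:.1f} s/yr down")
```

With those invented numbers, one circuit is down about 8.76 hours a year
and the pair about half a minute a year. Close to 100%, but still not 100%,
and only as good as the independence assumption.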

... JG

> Yup. Related: "100% availability" is a marketing person's dream; it
> sounds good in theory but is unattainable in practice, and is a
> reliable sign of non-100%-reliability.

You are confusing two different things.

Availability != Reliability.

For instance, an airplane is designed to be 100% reliable, but much less available. To keep a 747 from crashing (100% reliability) it needs significant downtime (not 100% available).
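
To put rough numbers on the distinction, here is a small Python sketch.
It uses the common textbook definitions: availability as the fraction of
time a system is in service, and reliability as the probability of getting
through a given mission without a failure, here under an assumed constant
failure rate. The function names and every figure in it are hypothetical,
not data about any real aircraft or facility.

```python
import math


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of total time the system is in service.

    Downtime here includes planned maintenance, not just failures.
    """
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total time must be positive")
    return uptime_hours / total


def reliability(mission_hours: float, mtbf_hours: float) -> float:
    """Probability of completing a mission with no failure.

    Assumes a constant failure rate (exponential model), which is a
    simplification.
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return math.exp(-mission_hours / mtbf_hours)


if __name__ == "__main__":
    # Hypothetical heavily maintained system: in service half the time,
    # but very unlikely to fail during any one 12-hour mission.
    print(f"availability: {availability(12, 12):.2f}")
    print(f"reliability : {reliability(12, 1_000_000):.6f}")

    # Hypothetical never-maintained system: almost always "up",
    # but far more likely to fail when a 12-hour mission is asked of it.
    print(f"availability: {availability(8750, 10):.4f}")
    print(f"reliability : {reliability(12, 200):.4f}")
```

The first case trades availability for reliability; the second does the
opposite, which is the situation a never-load-tested power system is in.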

> And even for those who follow best practices... You can inspect and
> maintain things until you're blue in the face. One day a contractor
> will drop a wrench into a PDU or UPS or whatever and spectacular things
> will happen.

That's where policies, procedures and methods come in (read: SAS70).

> Or a battery develops a strange fault.

Get more than one string, more than one UPS, with monitoring. Batteries are NOT the Achilles heel everyone wants to make you believe they are.

"Question everything, assume nothing, discuss all, and resolve quickly."

-- Alex Rubenstein, AR97, K2AHR, alex@nac.net, latency, Al Reuben --
-- Net Access Corporation, 800-NET-ME-36, http://www.nac.net --

Joe Greco wrote:

> Yup. Related: "100% availability" is a marketing person's dream; it
> sounds good in theory but is unattainable in practice, and is a reliable
> sign of non-100%-reliability.
>
> The most common way to gain "100% availability" is to avoid testing
> under load. This surely protects the equipment against a whole slew of
> failures in the less-used portions of your power systems, but also
> protects you from detecting them outside your Hour(s) Of Greatest Need.

Not testing under load is silly, IMHO. Does it work? Maybe. If it does
something strange during testing it's attended, expected, and utility is
available to fall back on. Starting your generator only means it'll turn
over and idle, not that it'll provide power under load all the way to
the racks.

Some people may prefer a colo that never risks it and therefore never
does more than idle the genset to claim 100% uptime. Others may prefer
one that won't promise 100% everything but does load tests. I'd rather
have a test go wrong while utility is available rather than a failed
backup with no utility hoping the power comes back before the UPS dies
or the room cooks itself. Both extremes are available to choose from if
you do your research before picking a colo.

> And even for those who follow best practices... You can inspect and
> maintain things until you're blue in the face. One day a contractor
> will drop a wrench into a PDU or UPS or whatever and spectacular things
> will happen. Or a battery develops a strange fault.
>
> You do live load testing, you'll lose now and then. It's best to simply
> assume no single circuit is 100% reliable. You should be able to get
> two circuits from separate power systems and the combination of the two
> should really closely approximate 100%, but even there... it isn't.

Separate power systems are overrated, especially if the fire department
ends up being involved for some reason. (Re: the infamous gas leak
story.) And of course with increased complexity comes increased risk of
failure and longer downtime to diagnose and repair. There is no perfect
balance.

~Seth

Alex Rubenstein wrote:

>> Yup. Related: "100% availability" is a marketing person's dream; it
>> sounds good in theory but is unattainable in practice, and is a
>> reliable sign of non-100%-reliability.
>
> You are confusing two different things.
>
> Availability != Reliability.

Pardon the interruption...

In the aforementioned statement, there appears to be a flagrant
separation of terms without sufficient explanation. Note that in being
available, a criterion for ensuring reliability is met. If one has the
desire to delve into some of the nuanced operational perspectives, see:
http://ow.ly/zmQg (pdf) or http://ow.ly/zmTB (web friendly). The article
is also available through the IEEE Portal at Services Update (in case
either of the other links appears to be unavailable).

> For instance, an airplane is designed to be 100% reliable, but much less available. To keep a 747 from crashing (100% reliability) it needs significant downtime (not 100% available).

This explanation, aside from being unsatisfactory, is misleading.
Operating times and maintenance times are very much separate quantities.

>> And even for those who follow best practices... You can inspect and
>> maintain things until you're blue in the face. One day a contractor
>> will drop a wrench into a PDU or UPS or whatever and spectacular things
>> will happen.
>
> That's where policies, procedures and methods come in (read: SAS70)

For the operationally minded -- on one hand, there is an assumption here
that 'accidents' are not preventable; on the other hand, there is at
least an assumption being made here that SAS 70 is the curative for
'accidents.' To be brief, accounting for human behavior as an
underlying contributor to accidents can be a backbreaking and immensely
messy endeavor. In this respect, SAS 70 can only be assistive.

All the best,
Robert Mathews.