Operate until failure

Is there any consistency among network operators in how they operate
their networks when they know a possibility of imminent failure
exists?

1. Do you attempt to preserve service as long as possible, including
running equipment to the point of destruction?

2. Do you attempt to minimize recovery time by shutting down equipment
to a "safe" condition before failure?

If you are running a database/transaction oriented system, I would expect
you want to put the database into a stable condition. On the other hand,
if you are operating mostly communication equipment, you would want to
leave it operating as long as possible.

I'm aware of a variety of proprietary software shutdown programs associated
with UPS vendors. But I'm wondering whether any "open standards" exist for
initiating soft shutdowns.

> Is there any consistency among network operators how they operate
> their networks when they know a possibility of imminent failure
> exists?
>
> 1. Do you attempt to preserve service as long as possible, including
> running equipment to the point of destruction?

To extend as long as you can:

1) Power down as much hardware as possible
2) Pull all redundant cards
3) Pull fan trays
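
As a rough illustration of why shedding load buys time, here is a minimal
sketch of the arithmetic. It assumes an idealized battery where run time is
simply capacity divided by load (real batteries do worse at high discharge
rates), and every name and number in it is hypothetical.

```python
def runtime_hours(capacity_ah, loads_amps):
    """Idealized run time: amp-hours divided by total amps drawn."""
    total = sum(loads_amps.values())
    return float("inf") if total == 0 else capacity_ah / total


def shed_until(capacity_ah, loads_amps, shed_order, target_hours):
    """Drop loads in shed_order until the estimated run time meets the target.

    Returns the list of loads that were shed and the resulting estimate.
    """
    remaining = dict(loads_amps)
    shed = []
    for name in shed_order:
        if runtime_hours(capacity_ah, remaining) >= target_hours:
            break
        if name in remaining:
            del remaining[name]
            shed.append(name)
    return shed, runtime_hours(capacity_ah, remaining)


if __name__ == "__main__":
    # Hypothetical plant: 200 Ah of battery, three loads in amps.
    loads = {"core": 20, "redundant_cards": 10, "lab": 10}
    shed, hours = shed_until(200, loads, ["lab", "redundant_cards"], 8)
    print("shed:", shed, "estimated hours:", hours)
```

The shed order is the part worth deciding ahead of time; the estimate itself
is only a planning number.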

> 2. Do you attempt to minimize recovery time by shutting down equipment
> to a "safe" condition before failure?

Depends on the outage. If you think you can make it, then you don't. Things
like pulling fan trays can give you a lot more run time, but may damage
hardware, so you need to watch it. If it looks like you may make it, you may
want to override your low voltage disconnects on your DC systems. It may
toast your batteries, but if it will get you through an outage it may be
worth it.

> If you are running a database/transaction oriented system, I would expect
> you want to put the database into a stable condition. On the other hand,
> if you are operating mostly communication equipment, you would want to
> leave it operating as long as possible.

What I like to do is shut down the redundant database so you know you have
something to fall back on. You then run the other database into the ground.

> I'm aware of a variety of proprietary software shutdown programs associated
> with UPS vendors. But I'm wondering do any "open standards" exist for
> initiating soft shutdowns?

It very much depends on what you are doing. I like having control
over what I kill in my network. Of course the best plan is to never
let the above happen, but I don't care how redundant your system is; if you
have been in this business a long time you will reach a crash event.
Knowing how to deal with it can extend the event a long time.


Nathan Stratton CTO, Exario Networks, Inc.
nathan@robotics.net nathan@exario.net
http://www.robotics.net http://www.exario.net

1. When I had a power supply fail in a fileserver about a year ago, I
limped it along until my next maintenance window (which happened to be in 24
hours, thank goodness) and replaced it then. It was only a 10 minute
downtime for my users, who were very happy because there was no downtime
during business hours. Usually this is what I will do. The less
downtime I can have outside my maintenance windows, the better.

2. Depends. If there is a chance I'll break something if I don't shut it
all down, I will. If there is not a likely chance I'll break it, then
great, I'll keep working.

If I have to shut down my database server, I'll switch over to the backup
and keep working and then do the repairs and bring my backup online.

We've had issues here with power outages and usually the UPSes will hold.
The one time they didn't, we went and brought all the machines down
gracefully by hand, as we didn't have the auto-shutdown installed on the
systems.
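
For what an auto-shutdown does in principle, here is a minimal sketch of the
decision it has to make. The thresholds and function names are hypothetical
and not taken from any vendor's software; the only real command referenced is
the standard `shutdown -h now`, which needs root and is never called in the
example run.

```python
import subprocess


def decide(on_battery, low_battery, minutes_on_battery, grace_minutes=10):
    """Return "run", "wait", or "shutdown" for the current UPS state.

    grace_minutes is an assumed policy knob: how long to ride the
    battery before giving up on utility power coming back.
    """
    if not on_battery:
        return "run"
    if low_battery or minutes_on_battery >= grace_minutes:
        return "shutdown"
    return "wait"


def graceful_shutdown():
    """Halt the machine cleanly. Requires root; not invoked in this sketch."""
    subprocess.run(["shutdown", "-h", "now"], check=True)


if __name__ == "__main__":
    for state in [(False, False, 0), (True, False, 3), (True, True, 3)]:
        print(state, "->", decide(*state))
```

A real daemon would poll the UPS, feed the readings to something like
`decide`, and call the shutdown only on the "shutdown" result.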

While I do realize this is describing the "perfect" problem, there will be
times when a NIC will fail or someone will cut the fiber, and then you
just have to handle it the best way you know how to get the issue
resolved, then take a blunt object (like the clue phone) to the person who
cut the fiber. ;)

-Eric

Almost all UPSes on the market twiddle with the DTR and RTS signals on the
serial port when a power failure or an imminent battery failure occurs. You
can generally twiddle a pin on the serial cable to shut the UPS off as well.
For APC smart UPS devices, people have reverse engineered the protocol to
communicate in smart mode and get battery voltages and the like. Do a
search for "linux UPS daemons APC" and you should find something of use
(since you will probably have to read the source code to figure out the
protocol, it helps if you know C).
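
To give a feel for the simple-signalling case, here is a minimal sketch that
reads the serial port's modem-control lines on a Unix system and decodes
them. Which line means "on battery" and which means "low battery" depends on
the UPS model and cable, so the mapping below is an assumption, as is the
device path. This does not attempt the APC smart-mode protocol.

```python
import fcntl
import os
import struct
import termios

# Assumed wiring (hypothetical; check your UPS and cable documentation).
ON_BATTERY_BIT = termios.TIOCM_CTS
LOW_BATTERY_BIT = termios.TIOCM_CD


def decode_status(bits):
    """Turn a TIOCMGET bitmask into a small status dictionary."""
    return {
        "on_battery": bool(bits & ON_BATTERY_BIT),
        "low_battery": bool(bits & LOW_BATTERY_BIT),
    }


def read_modem_bits(path):
    """Read the modem-control line states of a serial device."""
    fd = os.open(path, os.O_RDONLY | os.O_NOCTTY | os.O_NONBLOCK)
    try:
        raw = fcntl.ioctl(fd, termios.TIOCMGET, struct.pack("I", 0))
        return struct.unpack("I", raw)[0]
    finally:
        os.close(fd)


if __name__ == "__main__":
    # "/dev/ttyS0" is a placeholder for whatever port the UPS is on.
    print(decode_status(read_modem_bits("/dev/ttyS0")))
```

Keeping the decoding separate from the port access makes it easy to swap the
bit assignments once you know how your particular cable is wired.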
-Paul

By popular request my signature has moved to
<http://198.87.147.226/paulsig.txt>

Paul Timmins
paul@timmins.net
http://www.timmins.net/
"By definition, if you don't stand up for anything, you stand for
nothing."
     ---Paul Timmins

We don't shut anything down with a management call, unless it's going to
fail and break something in the next 15 minutes.

We have a generator, but we have had two amazing coincidences cause it
to fail. The first time, the generator was fine, but the switch didn't
switch. The person who was signing off (erroneously) that he was checking
that switch monthly lost his job shortly before we stopped using his
company entirely. We discovered the problem when the batteries reached
the point where it was supposed to cut over, and the entire data center
went dark. That was a very, very bad day.

The second time, an o-ring blew out, and we dumped so much oil on the
ground, we were told that if it'd been a tiny bit more we'd have had to
call the EPA. This one gave us enough warning to shut things down, but
we had to hustle and a few things were triaged as "let it die, we don't have
time."

In general, however, we start planning for a controlled shutdown the minute
we know there's a problem, and we attempt to schedule that shutdown for
our scheduled weekly outage window if possible. If not, we try to make it
after peak processing time for the affected components.

I recall that ominous depressing feeling sitting by myself in a dark data
center at 3 in the morning, with nothing but the lights of my equipment,
listening to the rectifiers beeping and, as the batteries went dead,
hearing machines drop off one by one until it was completely dark...

andy

s/with/without/

I was on the phone with our Managing Director (that'd be three levels
above me) assuring him that everything was reading fine on the monitors,
and the generator was running, so everything would cut over any...

...and then three voices all said "oh shit" at the same time.

And if you are running a late-model Linux (preferably RedHat), you can
download APC's own "award-winning PowerChute Plus" software for Linux from
their website. It seems to be identical to PowerChute Plus running on
any other platform, except that the interface is through X86.