Operate until failure

And what if you are not using APCs?

One issue with highly redundant data centers is that the failure modes are
"interesting." You don't want to shut down due to a single UPS failure, so
you don't use something simple like PowerChute Plus. You most likely don't
want to shut down based on any automatic signal. However, you do want a way
for an operator to gracefully shut down a lot of equipment quickly when
the decision is made.

For a server farm, with potentially thousands of individual systems, is
there any standard piece of software you can install on all of the systems
to act as a receiver of a signal to begin a graceful shutdown that does
not depend on a vendor's proprietary interface? Preferably one which
does not involve running a lot of additional wires.

I know, everyone says their systems will never fail. Think of this
as the "else" statement for the condition which will never happen.

Again this is only needed if people want a graceful shutdown. If
you can live with a hard shutdown, you wouldn't require this. If you
use ctrl-alt-del as a normal management practice, I suspect you don't
really require a graceful shutdown.

On Mon, Jan 08, 2001 at 02:35:49PM -0800, Sean Donelan put this into my mailbox:

One issue with highly redundant data centers is that the failure modes are
"interesting." You don't want to shut down due to a single UPS failure, so
you don't use something simple like PowerChute Plus. You most likely don't
want to shut down based on any automatic signal. However, you do want a way
for an operator to gracefully shut down a lot of equipment quickly when
the decision is made.

It should be technically fairly easy to set up an
'Emergency Graceful Shutdown' button to live next to the EPO button; this
controls a line that runs through the data center that activates either
one relay per system or one optoisolator per system, depending on your
fancy, that can raise or lower a particular serial line (DSR, CTS,
whatever). You then install a daemon (again, fairly simple) that
listens to this serial line; when it detects a change, it executes a
graceful shutdown on that system.
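
A minimal sketch of such a daemon, assuming a Linux-style serial port and
Python: it polls the modem-control bits with the TIOCMGET ioctl and runs a
shutdown command once the watched line changes from its starting state. The
device path, the choice of DSR, the poll interval, and the shutdown command
are all assumptions for illustration, not anything specified above.

```python
import fcntl
import os
import struct
import subprocess
import termios
import time


def line_asserted(status_bits, mask=termios.TIOCM_DSR):
    """Return True if the chosen modem-control bit is set in status_bits."""
    return bool(status_bits & mask)


def read_modem_bits(fd):
    """Ask the kernel for the current modem-control line states of a tty."""
    buf = fcntl.ioctl(fd, termios.TIOCMGET, struct.pack("I", 0))
    return struct.unpack("I", buf)[0]


def watch(device="/dev/ttyS0", poll_seconds=0.5,
          shutdown_cmd=("shutdown", "-h", "now")):
    """Poll the line; run shutdown_cmd once it differs from its initial state."""
    # O_NONBLOCK keeps open() from waiting for carrier on the port.
    fd = os.open(device, os.O_RDONLY | os.O_NOCTTY | os.O_NONBLOCK)
    try:
        initial = line_asserted(read_modem_bits(fd))
        while line_asserted(read_modem_bits(fd)) == initial:
            time.sleep(poll_seconds)
    finally:
        os.close(fd)
    subprocess.run(list(shutdown_cmd), check=False)
```

Calling watch() as root from an init script would be one way to run it;
polling is used here only because it is the simplest thing to show.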

If you wanted to get fancy, pushing the "EGS" button could send a series
of pulses that the daemon would have to interpret; this way you guard
against odd line noise or loose connections triggering the shutdown. You
could even set up a 'Cancel EGS' signal, as well.
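
One way the pulse idea could look, as a rough sketch: the daemon samples the
line over a fixed window, counts low-to-high transitions, and acts only on an
exact pulse count. The counts chosen for "shutdown" and "cancel" (3 and 5)
are made-up values, and the sampling itself is left out.

```python
EGS_PULSES = 3      # assumed pulse count meaning "begin graceful shutdown"
CANCEL_PULSES = 5   # assumed pulse count meaning "cancel EGS"


def count_pulses(samples):
    """Count low-to-high transitions in a sequence of line samples."""
    pulses = 0
    previous = False
    for level in samples:
        if level and not previous:
            pulses += 1
        previous = bool(level)
    return pulses


def interpret(samples):
    """Map one sampling window to 'shutdown', 'cancel', or 'ignore'."""
    n = count_pulses(samples)
    if n == EGS_PULSES:
        return "shutdown"
    if n == CANCEL_PULSES:
        return "cancel"
    return "ignore"
```

A line stuck high, or a single glitch, yields one pulse and is ignored, which
is the point of requiring a pattern rather than a level.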

You could then interface this to more stuff, like the "You'll be out of
diesel for your generator in 120 seconds" alert, etc. etc.

Then again, my sprinklers water my lawn via a cron job, so I might just
be Different.

-dalvenjah

One issue with highly redundant data centers is that the failure modes
are "interesting." You don't want to shut down due to a single UPS
failure, so you don't use something simple like PowerChute Plus. You
most likely don't want to shut down based on any automatic signal.
However, you do want a way for an operator to gracefully shut down a
lot of equipment quickly when the decision is made.

The old Deltec stuff was good about this. They had it so that a server
daemon would notify different groups at different stages.

  Power lost->notify group A (Printers, PCs)
  Low battery->notify group B (Secondary servers)
  Dead battery->notify group C (Primary servers, comms)

They also had different outlets on different "groups", so if a device
wasn't able to understand the network alert (the routers and firewalls
don't have agents), they could be terminated as a part of a group.
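
To make the staging concrete, here is a small sketch of the event-to-group
mapping described above. It is not Deltec's software; the stage names and
the hostnames in each group are invented for illustration.

```python
# Stage -> group, following the three stages listed above.
STAGE_TO_GROUP = {
    "power_lost": "A",
    "low_battery": "B",
    "dead_battery": "C",
}

# Hypothetical group membership; real lists would come from site config.
GROUP_MEMBERS = {
    "A": ["printer01", "pc01"],
    "B": ["secondary01"],
    "C": ["primary01", "router01"],
}


def hosts_to_notify(stage):
    """Return the hosts in the group tied to this stage, or [] if unknown."""
    group = STAGE_TO_GROUP.get(stage)
    return GROUP_MEMBERS.get(group, [])
```

Devices without agents (such as router01 here) would be handled by switching
their outlet group rather than by a network alert.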

Deltec got bought by somebody and I'm sure a lot of this stuff has changed
since I last looked at it, but it was a good design.

* Sean Donelan <sean@donelan.com> [20010108 15:05]:

And what if you are not using APCs?

But still stand alone UPSes? Don't most data centers have larger UPS(es) or
battery plants (say, two) feeding the entire facility? The ones I've worked
in have (well, not *all* of them, but those exceptions had much bigger
issues than worrying about how they were going to shutdown all of the boxes
at once..) And if you aren't using standalone UPSes what do you care what
the interface is to the BigUPS(tm) as long as you can get one of your network
monitoring servers to talk to it (and reliably)? None of your servers in the
server farm are going to be talking to your BigUPS(tm) directly anyway..

One issue with highly redundant data centers is that the failure modes are
"interesting." You don't want to shut down due to a single UPS failure, so
you don't use something simple like PowerChute Plus. You most likely don't
want to shut down based on any automatic signal. However, you do want a way
for an operator to gracefully shut down a lot of equipment quickly when
the decision is made.

Agreed. And in this case, the UPS has no involvement. If the operator
wants the servers shutdown, the operator shuts servers down. No UPS
involved (OK, well not literally). I realize this doesn't address your
entire point...one sec I'll get to that.

For a server farm, with potentially thousands of individual systems, is
there any standard piece of software you can install on all of the systems
to act as a receiver of a signal to begin a graceful shutdown that does
not depend on a vendor's proprietary interface? Preferably one which
does not involve running a lot of additional wires.

Sure, ssh/rsh[1]. :-) What vendor's proprietary interface -- the OS vendor of
the servers? The UPSes don't have anything to do with the shutdown process
if the operator is the one making the call. To accomplish that it's a simple
matter of scripting a bunch of:

    ssh webserver01 'shutdown -h now Power-Go-Bye-Bye'

Of course, if you have unmanaged boxes (e.g. customer boxes you do not have
root access to) within the same data center, and you want to do the same for
those, that's a whole other story...

Oh, hmm, and Windows. Well, remote command execution is possible there too.

At that point, once all servers are gracefully shut down, you can just shut the
UPS(es) off if your intent is to eventually cut any and all power to the
facility.

Or did I completely miss your point?

Again this is only needed if people want a graceful shutdown. If
you can live with a hard shutdown, you wouldn't require this. If you
use ctrl-alt-del as a normal management practice, I suspect you don't
really require a graceful shutdown.

I'm being anal but even ctrl-alt-del is graceful on most modern OSes. The
power or reset button though on the other hand... :-)

[1] rsh only mentioned for historical reasons, please don't use it to manage
the remote power capability of your mission-critical server farm located
in your highly redundant data center unless you understand why you might
consider not doing so. :-)

-jr

2001-01-08-17:35:49 Sean Donelan:

[...] You most likely don't want to shutdown based on any
automatic signal. However, you do want a way for an operator to
gracefully shutdown a lot of equipment quickly when the decision
is made.

For a server farm, with potentially thousands of individual
systems, is there any standard piece of software you can install
on all of the systems to act as a receiver of a signal to begin a
graceful shutdown that does not depend on a vendor's proprietary
interface? Preferably one which does not involve running a lot
of additional wires.

I've got my own preference; when running even mere dozens of
machines in a tightly coordinated farm, I want the ability to manage
them all very quickly and easily, so I use a script I wrote for
parallel execution of a command. It takes a command-line and
operates on it with controlled parallelism; it's available at
<URL:...>.

I'll install that on an admin server, which will be a very very
tightly secured machine indeed, since the account which I use on
that machine will have an ssh key, with no passphrase, that's
accepted for running root commands on every machine in the farm.
Given that setup, the answer to your question requires only a
pre-built list of the hostnames or ip addrs of the machines to halt,
at which point it's something along the rough lines of

  multicmd ssh \$1 'sh -c "sleep 10;halt" >&- 2>&- <&- &' \
    <hostlist

or thereabouts; after it's tested I'd save this, and any other
useful invocations, in scripts so I don't have to remember 'em.

For thousands of machines, the default options (10-at-a-time
parallel, 1-second delay between launches) wouldn't be quick;
for an emergency halt program, I'd probably up the parallelism
to whatever my local system could handle well, and drop the
inter-cmd delay to maybe 0.01 sec. And of course for platforms with
software-controllable power switches, the "halt" could be replaced
with an invocation that would power the boxes down.
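
For readers without that script, the same idea can be sketched in Python:
run one command per host with a cap on how many run at once and a pause
between launches. This is not multicmd itself; the function names and the
runner hook are invented here, and only the 10-at-a-time and 1-second
defaults and the remote command string are taken from the text above.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def halt_command(host):
    """Build the ssh argv for one host, mirroring the invocation above."""
    return ["ssh", host, 'sh -c "sleep 10;halt" >&- 2>&- <&- &']


def run_on_hosts(hosts, build_cmd=halt_command, parallelism=10,
                 launch_delay=1.0, runner=subprocess.call):
    """Run build_cmd(host) for every host; return {host: exit_status}.

    At most `parallelism` commands run at once, and the loop pauses
    `launch_delay` seconds between submissions to the worker pool.
    """
    futures = {}
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for host in hosts:
            futures[host] = pool.submit(runner, build_cmd(host))
            time.sleep(launch_delay)
    return {host: future.result() for host, future in futures.items()}
```

For an emergency halt one might call something like
run_on_hosts(hosts, parallelism=200, launch_delay=0.01), with hosts read from
the pre-built list; a non-zero status in the result marks a host to chase by
hand.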

-Bennett

First thing that comes to mind is a perl script that, given the correct
password/passphrase, can `ssh -l [user] [machine] shutdown -h now`. Seems
pretty simple to me, assuming you keep a list of all the servers current
with a common RSA auth key or whatnot.

      Matthew S. Hallacy
      XtraTyme Technologies