RE: Quick question.

Michel Py wrote:
Terminators are a thing of the past; as a matter of
fact, in California and especially in Sacramento
they're called governators now.

Erik Bais wrote:
You mean, they'll be back?

:smiley:

Only once, and this model is obsolete and can't be upgraded to
presidentor.

Paul Jakma wrote:
Intel don't do fault-tolerant SMP. Running SMP will
lower your MTBF just by the fact that you now have 2 CPUs.

True; this is like RAID-0 arrays: the more disks, the greater the
chance of failure. However, MTBF is not the name of the game here;
the availability ratio is, which depends on both the failure rate and
the time to repair.

In other words, I don't really care if the second processor reduces the
MTBF from 200k hours to 60k hours, but I do care if the second processor
reduces the time to restore service from 24 hours to 20 minutes (7.5
minutes for SNMP to fail the query twice, 1.5 minutes for the tech to
find out that either it's frozen or there's a BSOD, 6 minutes to have
someone go there and reset, 5 minutes to reboot).

The dead processor still has to be replaced, but this is scheduled
maintenance, not an outage. A little extra ammo when you have to hunt
five or six nines.
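For the curious, here is the arithmetic behind this as a quick C
sketch. The steady-state formula availability = MTBF / (MTBF + MTTR)
is the standard one; the numbers are the ones above, nothing else is:

```c
#include <stdio.h>

/* Steady-state availability: fraction of time the box is in service. */
static double availability(double mtbf_hours, double mttr_hours)
{
    return mtbf_hours / (mtbf_hours + mttr_hours);
}

int main(void)
{
    /* One CPU: 200k hours MTBF, 24 hours to restore service. */
    double uni = availability(200000.0, 24.0);
    /* Two CPUs: failure rates add (the RAID-0 effect), so MTBF drops
     * to 60k hours, but restore time drops to 20 minutes. */
    double dual = availability(60000.0, 20.0 / 60.0);

    printf("uniprocessor: %f\n", uni);  /* ~0.999880 - not quite four nines */
    printf("dual:         %f\n", dual); /* ~0.999994 - five nines           */
    return 0;
}
```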

Then there's the fact that you're far more likely to
hit bugs in the OS with SMP than uniprocessor.

Insignificant in my experience, and it does not outweigh what Alexei
mentioned yesterday: a duallie will keep the system up when a faulty
process hogs 100% CPU, because the second one is still available. That
also increases the availability ratio.

Michel.

In other words, I don't really care if the second processor reduces the
MTBF from 200k hours to 60k hours, but I do care if the second processor
reduces the time to restore service from 24 hours to 20 minutes (7.5
minutes for SNMP to fail the query twice, 1.5 minutes for the tech to
find out that either it's frozen or there's a BSOD, 6 minutes to have
someone go there and reset, 5 minutes to reboot).

With the right form factor (nice easy-to-open rackmount unit) it will take
just as little time to swap in an on-site cold-spare. That way you get the
nice MTBF and the short restore time. Also, if you have multiple similar
machines, you drastically reduce your spares inventory.

Insignificant in my experience, and it does not outweigh what Alexei
mentioned yesterday: a duallie will keep the system up when a faulty
process hogs 100% CPU, because the second one is still available. That
also increases the availability ratio.

These days you can achieve the same thing using hyper-threading, for
example, and keep the long MTBF :slight_smile:

True; this is like RAID-0 arrays: the more disks, the greater the chance of failure.

This holds true for most RAID-x levels.

In other words, I don't really care if the second processor reduces the MTBF from 200k hours to 60k hours, but I do care if the second processor reduces the time to restore service from 24 hours to 20 minutes (7.5 minutes for SNMP to fail the query twice, 1.5 minutes for the tech to find out that either it's frozen or there's a BSOD, 6 minutes to have someone go there and reset, 5 minutes to reboot).

If a CPU dies, it's unlikely to come back up without removing the bad CPU, especially if the CPU has become unreliable rather than dying completely. Even if CPU 0 is good and the BIOS has no problems booting the OS, the SMP-aware OS will quite probably hit problems with the bad CPU.

If you really want to guard against CPU failures, you need a machine designed for fault-tolerance, not a "cheap" SMP box; those are just *less* reliable.[1]

The dead processor still has to be replaced, but this is scheduled maintenance, not an outage. A little extra ammo when you have to hunt five or six nines.

Just tape a spare CPU to the inside of the box if time-to-repair is important. Even better, just have a second system on standby.

Insignificant in my experience, and it does not outweigh what Alexei mentioned yesterday:

Alexei is talking about something else.

a duallie will keep the system up when a faulty process hogs 100% CPU, because the second one is still available. That also increases the availability ratio.

This is a resource problem, not an availability problem. A spinning application is not going to take down the machine on any modern OS[2], and it can anyway be dealt with via resource limits, SMP or not, presuming your OS supports resource limits.
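A minimal sketch of what I mean by resource limits, using POSIX
setrlimit() (the 5/10 second numbers are arbitrary, just for the demo):

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* Cap this process at 5 CPU-seconds (soft limit -> SIGXCPU,
     * default action terminates) with a 10-second hard stop. */
    struct rlimit rl = { .rlim_cur = 5, .rlim_max = 10 };
    if (setrlimit(RLIMIT_CPU, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    /* Stand-in for the faulty process hogging 100% CPU: the kernel
     * kills it after 5 seconds instead of letting it spin forever. */
    for (;;)
        ;
}
```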

The real problem with SMP is kernel complexity. Drivers that are rock solid on a uniprocessor can have bugs that are only triggered under SMP. Threaded applications can also become unreliable on SMP systems.
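To illustrate the kind of bug I mean (a contrived example, not from any
particular driver): an unlocked read-modify-write that a uniprocessor
only rarely catches, but that SMP hits constantly:

```c
#include <pthread.h>
#include <stdio.h>

static long counter; /* shared, deliberately unlocked: the bug */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++; /* load/add/store: another CPU can interleave here */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    /* On a uniprocessor the race window only opens on a preemption,
     * so this usually prints 2000000; on SMP it almost never does. */
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}
```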

The extra power of an SMP system might be a bonus, but trying to argue its benefits on the basis of reliability is misguided.

Michel.

1. Now, they may still be very reliable, and more than reliable enough for your needs, but they are still not as reliable as the exact same machine with terminators in all CPU sockets/slots bar one :wink: The fault-tolerant systems are outrageously expensive.

2. Unless you're running MacOS 9 or Windows 3.11 on your server.. don't think either supports SMP though ;).

regards,

If a CPU dies, it's unlikely to come back up without removing the bad
CPU, especially if the CPU has become unreliable rather than dying
completely. Even if CPU 0 is good and the BIOS has no problems
booting the OS, the SMP-aware OS will quite probably hit problems
with the bad CPU.

Not necessarily. There have been a number of innovations in recent years in
the area of integrated fault tolerance, including BIOS-level controls over
component monitoring / management. Some of the more upscale Compaq G3
servers, for instance, can remove a processor from operation if it exceeds
a threshold of critical errors (this is also true for memory).
Alphas can boot even if the bootstrap processor fails at system start, and
simply select the next available processor; they also have hot-swap
processor capabilities (again, for the time being, upscale). Add to this
features like hot-swap 'RAID memory' and PCI, redundant power supplies,
fans, and drives, and systems can be made to withstand many common
component failures with little or no interruption in service.
With the advent of technologies like hyper-threading, manufacturers are
being driven by market demands to create more reliable SMP drivers, and I
think it is likely that simultaneous multi-threading will eventually become
the standard.

> a duallie will keep the system up when a faulty process hogs 100%
> CPU, because the second one is still available. That also increases
> the availability ratio.

Well, it depends.. The real differentiation is whether the system is truly
'symmetric', that is: dual processor, I/O, and memory buses. If both
processors share the same resources, competition between processors for
regions of memory and for locks on the PCI bus severely constrains the
resources available to each processor. So if a process runs amok on a
single-bus architecture, the second processor will not have the resources
it needs to run effectively..

"Michel Py" <michel@arneill-py.sacramento.ca.us> writes:

The dead processor still has to be replaced, but this is scheduled
maintenance, not an outage. A little extra ammo when you have to hunt
five or six nines.

MTTR on a single box is irrelevant when you are off playing Ponce de
Leon, hunting the Fountain of Five or Six Nines. Even when your
architecture doesn't depend on any one particular machine (or even whole
big sets of machines) being available, you don't get to "five or six
nines"... just ask Google, Akamai, or Microsoft - there are other
things beyond your control that spoil the picnic first.

As has been observed time and time again, the tried and true way to
make five or six nines of reliability in a system of more than trivial
complexity is to take a lesson from the telcos (the progenitors of the
"five nines" lie) and build a framework and evaluation methodology
that excludes broad classes of unavailability-causing events or
prorates them in such a way as to make them non-reportable. Add to
that list incrementally, until the remaining time listed shows your
target number of nines of reliability. Presto, five nines.

                                        ---Rob

Not necessarily. There have been a number of innovations in recent years in
the area of integrated fault tolerance, including BIOS-level controls over
component monitoring / management. Some of the more upscale Compaq G3
servers, for instance, can remove a processor from operation if it exceeds
a threshold of critical errors (this is also true for memory).

Interesting to know. Those usually are ECC errors in CPU caches, often due to overheating. The CPU is still functional to a degree though; a marginal failure as opposed to a catastrophic one.

But what of electrical failures? Even P4-class machines still share a host bus amongst CPUs, no?

Anyway, CPUs (if kept sufficiently cool) tend to be one of the more reliable components in a system, if they are good to begin with.

Alphas can boot even if the bootstrap processor fails at system start, and simply select the next available processor..

Alphas are quite nice; they have support for lockstep operation too. Tandem were supposed to have been moving to Alpha for their Himalaya F-T servers when Compaq bought them. Also, the 21164 and up (not sure about the 21064) AXPs used a point-to-point bus for SMP[1]; the CPUs were all electrically isolated from each other - at least, a failure of one CPU couldn't affect the other CPUs.

So if a process runs amok on a single-bus architecture, the second processor will not have the resources it needs to run effectively..

Processes running amok still only have access to those resources granted to them. Processes generally do not have access to bare I/O. What the OS giveth, it can take away (or constrain).
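E.g., a sketch of the "constrain" part on Linux (sched_setaffinity() is
Linux-specific, and the choice of CPU 0 is arbitrary): pin the runaway
process onto one CPU from the outside, and the other CPU stays free:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    /* Confine the runaway process to CPU 0; CPU 1 stays free for
     * everything else. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);

    if (sched_setaffinity(pid, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pid %d pinned to CPU 0\n", (int)pid);
    return 0;
}
```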

1. Still alive and well in a sense, but now developed into a general-purpose PtP local CPU/IO interconnect: AMD's HyperTransport as used in the K8.

regards,

No need.

Remove disk. Insert disk into spare. Start spare server. Allow techs to
analyze the broken server the next day.

1 minute. But in reality, 2-CPU servers are redundant to most CPU failures
(I've had a few cases). Anyway, CPU failure is not a major reason for
server failures (and never was).

Alexei is talking about something else.

> a duallie will keep the system up when a faulty process hogs 100%
> CPU, because the second one is still available. That also increases
> the availability ratio.

This is a resource problem, not an availability problem. A spinning
application is not going to take down the machine on any modern OS[2],
and it can anyway be dealt with via resource limits, SMP or not,
presuming your OS supports resource limits.

In theory, yes. In practice, 2 CPUs improve behavior dramatically. 4 CPUs
make the system too complex (as you wrote below).

The new P-IV with multi-threading may be a good selection - it behaves as
a 2-CPU system but is not as complicated as SMP.

The real problem with SMP is kernel complexity. Drivers that are rock

s/is/was/ (5 years ago). Now most kernels are SMP-capable. I agree that SMP
kernels are much more complicated, but we _already_ paid this price.

In reality, applications are less reliable on 2-CPU systems (if they have
the kinds of bugs that only manifest on SMP), so I agree with you in some
cases.

In theory, yes. In practice, 2 CPUs improve behavior dramatically.

That is not about reliability. That's to do with software performance.

I was purely picking an admittedly pedantic nit with the notion that SMP == more reliable. I'm not trying to argue that SMP does not have other benefits (e.g. performance).

4 CPUs make the system too complex (as you wrote below).

Nah, the big jump in complexity appears to be from no concurrency to concurrency. After that initial hurdle, going from 2 to 4 to 8 CPUs isn't as big a deal (making it scale is, though).

The new P-IV with multi-threading may be a good selection - it behaves as a 2-CPU system but is not as complicated as SMP.

From the OS POV, the complication is the same. And yes, even
single processors are today capable of presenting multiple execution
contexts to software, and it seems to be a trend we'll see more and
more of.

In reality, applications are less reliable on 2-CPU systems (if they have the kinds of bugs that only manifest on SMP), so I agree with you in some cases.

Right..

regards,