RE: Quick question.

Alexei Roudnev wrote:
We had a 6509 which failed because the backplane failed (it cannot
happen :slight_smile: but it happened) - of course, no 'dual CPU, dual
power' could prevent it... Imagine a broken line card - it can crash the
whole box no matter how many 'dual' things you have. The same with a
software error (I crashed one 6509 just by running 'snmpwalk' on it).

I lost a 7507 with dual power and dual RSPs earlier this year: one of the
cards died, something in the power circuitry. It put the entire router
into short-circuit; both power supplies went south and would not power
back up until the faulty card was physically removed. Once the card was
out, everything worked fine again. It does not happen often, but it does
happen.

Redundancy is not a slam dunk with IOS, though; as with dCEF, don't
expect RPR-compatible images to run every config you'll bump into, YMMV.
There is an annoying number of things that either do not work on RPR
images or fall back to route cache instead of the distributed cache.

So, I always prefer to have 2 boxes and application-level
reliability instead of playing with 'dual everything'
solutions (the latest example: 2 days ago one of our dual-power
Intel servers failed because of a single power supply failure -
it did not break outright, but it did something wrong, and the system
crashed...).

Actually, what I try to do for routers is have a "dual everything" for
production and an "el-cheapo eBay special" sitting in the same rack for
backup. The reason I still do dual power and dual CPU is that over the
last 20 years I have seen very few failures of redundant systems
(although I have seen some), whereas a dual-something has saved my bottom
several times. That part of my body is priceless :smiley:

For PCs I install dual Xeons on every production machine, for example,
even though some of them need no more CPU power than a 486; Intel
processors do die like anything else. A processor dying will typically
lead to a system crash, but the box does reboot in single-processor mode
when the graveyard dude pushes the reset button. I also try to have
RAID-10 arrays span two RAID cards; as with CPUs, a RAID card that dies
will likely crash the system, but it will reboot in degraded mode.

Michel.

Eh really? Whenever I've lost one of the two CPUs (primary or secondary), the machine was a brick until the dead CPU was gutted and, for PIII slotted systems, a terminator board was installed in its slot.

What motherboard(s) are you using that hold up to failures like this?

My experience has shown PSU and motherboard failures are faaaaar more common than CPU failures.

We are doing the same, including spare-staging hardware from eBay and 2 CPU
servers for everything (but we still like old and cheap 2x1GHz servers,
able to do 99% of all tasks).

PS. I like eBay; the latest example - our colleagues spent long negotiations
settling on a price for Cisco switches with a good discount; a 10-second eBay
search revealed exactly the same systems in original boxes (unopened), 10%
cheaper :slight_smile:


2 CPUs are not for redundancy, but they protect the system from a crazy
process eating 100% of one CPU (and the system still has 50% of its capacity).

> For PCs I install dual Xeons on every production machine, for example,
> even though some of them need no more CPU power than a 486; Intel
> processors do die like anything else. A processor dying will typically
> lead to a system crash, but the box does reboot in single-processor mode
> when the graveyard dude pushes the reset button. I also try to have
> RAID-10 arrays span two RAID cards; as with CPUs, a RAID card that dies
> will likely crash the system, but it will reboot in degraded mode.

> Eh really? Whenever I've lost one of the two CPUs (primary or secondary),
> the machine was a brick until the dead CPU was gutted and, for PIII
> slotted systems, a terminator board was installed in its slot.

> For PCs I install dual Xeons on every production machine, for example, even though some of them need no more CPU power than a 486;

What a mad idea.

Intel don't do fault-tolerant SMP.

Running SMP will lower your MTBF just by the fact that you now have 2 CPUs. Then there's the fact that you're far more likely to hit bugs in the OS with SMP than uniprocessor.

SMP for reliability? Unless it's a multi-million-dollar zSeries, what a mad idea...


regards,

It is not a mad idea - 2 CPU servers are not significantly more expensive than
1 CPU ones (and notice, we count a P-IV with Hyper-Threading as 2 CPUs), but
they increase the system's resilience to runaway processes. Of course, it is
not hardware redundancy, but it REALLY works.

> It is not a mad idea - 2 CPU servers are not significantly more expensive than 1 CPU ones (and notice, we count a P-IV with Hyper-Threading as 2 CPUs)

Well, you have to compare like for like, so a system with multiple CPUs versus the exact same system without. No difference in cost, other than for the CPUs.

And if you want reliability, you're not going to be buying your machines from the nearest Lidl (unless your application is engineered to take advantage of dozens of cheap throwaway PCs).

> but they increase the system's resilience to runaway processes. Of course, it is not hardware redundancy, but it REALLY works.

Not really... this is a resource exhaustion problem, and you cannot cure this, given buggy apps, by throwing more CPUs at it.

Let's say you have some multi-process or multi-threaded application which regularly spawns/forks new processes/threads, but it is buggy and prone to having individual processes/threads spin.

So one spins, but you still have plenty of CPU time left because you have two CPUs. Another spins, and the machine starts to crawl. So you solve this problem by upgrading to a quad-SMP machine. And guess what happens? :slight_smile:

Sure, there are some application bugs you can mask a wee bit with SMP, but it's not much cop, it's not a solution, and you would need an infinite-SMP machine to guarantee that a bad application can never hog all CPU time.

What you really want is a good OS with:

- a good scheduler (to prevent spinning tasks from starving other tasks)

- ability to set resource limits, i.e. per-task and/or per-user (if your apps run under dedicated user accounts) limits on CPU time, resident memory, etc. (see the sketch below)

Both of these will allow you to constrain the impact bad tasks can have on the system, whether your machine has 1, 2, ... or n CPUs.
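
To make the resource-limit point concrete, here is a minimal sketch in C (Linux/POSIX; the 2-second/5-second numbers are purely illustrative assumptions, not recommendations) of capping a task's CPU time with setrlimit(), so the kernel terminates a spinning process instead of letting it eat a CPU until an operator notices:

```c
/* Minimal sketch: cap a child's CPU time with setrlimit() so the kernel
 * kills a spinning task. The 2s/5s limits are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* child: at 2s of CPU it gets SIGXCPU (default action kills it);
         * at the 5s hard limit it would get an unblockable SIGKILL */
        struct rlimit rl = { .rlim_cur = 2, .rlim_max = 5 };
        if (setrlimit(RLIMIT_CPU, &rl) != 0) {
            perror("setrlimit");
            exit(1);
        }
        for (;;)    /* stand-in for the buggy app: spins forever */
            ;
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        printf("runaway child killed by signal %d\n", WTERMSIG(status));
    return 0;
}
```

The same cap can be set per-user from the shell with 'ulimit -t', or, on Linux boxes using PAM, in /etc/security/limits.conf.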

The real solution though is to fix the buggy application.

regards,

--- snip ---

> Not really... this is a resource exhaustion problem, and you cannot
> cure this, given buggy apps, by throwing more CPUs at it.
>
> Let's say you have some multi-process or multi-threaded application
> which regularly spawns/forks new processes/threads, but it is buggy
> and prone to having individual processes/threads spin.
>
> So one spins, but you still have plenty of CPU time left because you
> have two CPUs. Another spins, and the machine starts to crawl. So you
> solve this problem by upgrading to a quad-SMP machine. And guess what
> happens? :slight_smile:

the second cpu buys you time - it is unlikely you're going to be able to
react in time on a busy single cpu box with a runaway process (it launches
into a death spiral almost immediately), but you would usually have 10-15
mins on a dual cpu box at a minimum, or maybe infinity if you enforce cpu
affinity for apps that tend to misbehave.
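
For what it's worth, enforcing that affinity might look like the following; a minimal sketch assuming Linux's sched_setaffinity() (the choice of CPU 0 is an arbitrary assumption): a small wrapper pins itself to one CPU and then execs the flaky program, so even when it spins, the other CPU stays free.

```c
/* Minimal sketch (Linux-specific): pin ourselves to CPU 0, then exec the
 * given command; the affinity mask is inherited across exec, so the
 * misbehaving program can never occupy more than that one CPU. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                    /* allow CPU 0 only */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    execvp(argv[1], &argv[1]);           /* run the flaky program, pinned */
    perror("execvp");                    /* only reached if exec failed */
    return 1;
}
```

On a reasonably recent Linux the same thing can be done without touching any code, via taskset(1) from util-linux.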

paul

> the second cpu buys you time - it is unlikely you're going to be able to react in time on a busy single cpu box with a runaway process (it launches into a death spiral almost immediately), but you would usually have 10-15 mins on a dual cpu box at a minimum, or maybe infinity if you enforce cpu affinity for apps that tend to misbehave.

Why do you have 10-15 mins? If the application is multi-threaded and has a reasonable workload, there are plenty of types of bugs that will result in one spinning thread after another - you'd need far more than just 2 CPUs! Or maybe your application vendor has "at least 10 minutes between hitting bugs!" on its feature list? :wink:

Really, what you need to do (in the face of such buggy apps) is set per-task CPU time resource limits appropriate to how much CPU time a task needs and how much you can afford - be it a 1, 2 or n CPU system.


regards,

> the second cpu buys you time - it is unlikely you're going to be
> able to react in time on a busy single cpu box with a runaway
> process (it launches into a death spiral almost immediately), but
> you would usually have 10-15 mins on a dual cpu box at a minimum, or
> maybe infinity if you enforce cpu affinity for apps that tend to
> misbehave.

> Why do you have 10-15 mins? If the application is multi-threaded and
> has a reasonable workload, there are plenty of types of bugs that
> will result in one spinning thread after another - you'd need far
> more than just 2 CPUs! Or maybe your application vendor has "at least
> 10 minutes between hitting bugs!" on its feature list? :wink:

these are observations, pertaining to software products we use a lot -
apache, mysql, apache/suexec, various MTAs etc. your point is well taken in
general, but at least When Done Here(tm), dual cpu helps significantly,
empirically speaking.

> Really, what you need to do (in the face of such buggy apps) is set
> per-task CPU time resource limits appropriate to how much CPU time a
> task needs and how much you can afford - be it a 1, 2 or n CPU
> system.

agreed. however, this degrades performance in certain situations, is not
practical in others, and introduces additional complexity (always a bad
thing). the tradeoff is significantly in favor of reactive measures (be they
automatic or human intervention), at least in most of our installations.

paul

I said - it WORKS. One spin - a warning - someone opens the system and kills
the runaway process... I never saw 2 spins (because the first one was killed
before the second appeared). Btw, such systems (2 CPU) are even more stable
in the case of runaway device drivers.

I saw:
- a runaway tomcat server
- a runaway CA agent (!@#$)
- a runaway ssh daemon
- a runaway sendmail
All of them regularly, at various periods of time. And all were handled
without any system degradation, thanks to having a few CPUs. The same
runaways on 1 CPU systems caused visible degradation.

It is all a matter of trade-offs - if I must choose between a 1-thread or
2-thread P-IV, I'll select the 2-thread one; if I must choose between a $900
1 CPU and a $1100 2 CPU server, I select the 2 CPU one.

I call crapola. Modern, _modern_ systems may have _some_ of their device
drivers running on separate CPUs, but they're still running in kernel mode.

A runaway device driver means you're toast.

Now, a very, very busy device, that's a separate story. Having one CPU
handle all of your disk/network IO and the second CPU handle all of your
processes may alleviate some of the pain. May. There's more to it than
just offloading stuff. If your processes are all _depending_ on IO to
occur then you may end up with random, crappy starvation situations.

This has nothing to do with NANOG. Let's talk about dCEF bugs or something.

Adrian

I am sorry, but I am not making a theory - I just report practical results.
2 CPU systems are much more stable than 1 CPU systems, in my experience. You
are free to find an explanation, if you want :slight_smile:

Just to repeat: I do not try to explain, I report observations :slight_smile:

> > I said - it WORKS. One spin - a warning - someone opens the system and
> > kills the runaway process... I never saw 2 spins (because the first one
> > was killed before the second appeared). Btw, such systems (2 CPU) are
> > even more stable in the case of runaway device drivers.
>
> I call crapola. Modern, _modern_ systems may have _some_ of their device
> drivers running on separate CPUs, but they're still running in kernel mode.

The theory suggests your experience is unusual, or that you're overemphasising one positive contributor to system reliability while discounting the negative impacts of complexity.

Again, I'm not arguing that the more complex system (eg SMP) must always be more unreliable; a well-engineered complex system will be more reliable than a simple but badly-engineered system. I know of an SMP PC server that hit at least 4 years of uptime (never rebooted while I was in the employ of that company, anyway :wink: ); however, it would have been just as reliable with just one CPU. And for a large sample of those machines, identical other than single versus dual CPU, the set of single CPU machines will be statistically more reliable. Further, for a diverse sample of hardware of varying quality, you will see far more problems with SMP systems - primarily due to software (eg drivers with subtle locking bugs).

Nor am I arguing that the tradeoff of reliability for better performance is unwise, particularly since in this case (SMP systems) CPU failures tend to be rare (unless secondary to some other failure, eg cooling).

anyway, I'm repeating myself, so I'll stop before Susan LARTs me, and let the list get back to its favoured topic of discussing analogies. :wink:

regards,

Practice suggests that there may well be a good reason for this. Mainboards that are set up for 2 CPUs are likely to
be engineered to a much higher standard than your normal chop-shop cheapie special. An interesting experiment
would be to run a 1 CPU system based on the exact same 2 CPU mainboard and see whether the level of reliability
would be significantly different.

Tony