Raspberry Pi - high density

To some extent people are comparing apples (not TM) and oranges.

Are you trying to maximize the total number of cores or the total
amount of compute? They're not the same.

It depends on the job mix you expect.

For example, a map-reduce kind of problem, such as searching a massive
database, probably improves with lots of cores even if each individual
core isn't that fast. In short, you partition the database across
thousands of cores, broadcast "who has XYZ?", and wait for an answer.
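
A minimal sketch of that partition-and-broadcast pattern, in Python,
with local processes standing in for cluster nodes; the shard layout
and record format are made up purely for illustration:

    # Toy sketch of the partition-and-broadcast search described above.
    # Each worker owns one shard and answers only for its own slice.
    from multiprocessing import Pool

    SHARDS = [
        {"alice": 1, "bob": 2},    # shard 0
        {"carol": 3, "dave": 4},   # shard 1
        {"erin": 5, "xyz": 42},    # shard 2
    ]

    def search_shard(args):
        shard_id, key = args
        return shard_id, SHARDS[shard_id].get(key)

    if __name__ == "__main__":
        key = "xyz"
        with Pool(len(SHARDS)) as pool:
            # "Broadcast" the query to every shard, then wait for answers.
            answers = pool.map(search_shard,
                               [(i, key) for i in range(len(SHARDS))])
        print([(i, v) for i, v in answers if v is not None])   # [(2, 42)]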

There are a lot of problems like that, and a lot of problems which
cannot be improved by lots of cores: for example, when you have to wait
for one answer before you can compute the next (matrix inversion is
notorious for this property, and very important). You just can't keep
the "pipeline" filled.

And then there are the relatively inexpensive GPUs, which can do many
floating-point ops in parallel and are good at certain jobs like, um,
graphics! Rendering, ray-tracing, etc. But as a general rule they're
not very good at general-purpose integer ops like string searching, or
at problems which can't be decomposed to take advantage of the
parallelism.
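
Roughly the contrast, with NumPy's array operations standing in for
data-parallel (GPU-style) execution; NumPy itself runs on the CPU, the
point is only which shape of work decomposes into independent
per-element pieces:

    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    # Decomposes perfectly: every element is independent, so a GPU
    # could hand one element to each thread.
    c = a * b + 0.5

    # Branchy, data-dependent work with an early exit; it can be
    # parallelized, but it doesn't map onto wide floating-point units
    # the way the elementwise math above does.
    def find_substring(haystack, needle):
        for i in range(len(haystack) - len(needle) + 1):
            if haystack[i:i + len(needle)] == needle:
                return i
        return -1

    print(c[:3], find_substring("raspberry pi cluster", "pi"))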

You've got your work cut out for you analyzing these things!

This thread brings me back to 1985, what with talk of full-immersion cooling (Fluorinert, anyone?) and hundreds of amps at 5 VDC... reminds me of the Cray-2, which dropped 150-200 kW in 6 rack location units of space: 2 for the CPU itself, 2 for space, and 2 for the cooling waterfall [ https://en.wikipedia.org/wiki/File:Cray2.jpeg by referencing floor tile space occupied and taking 16 sq ft (four tiles) as one RLU ]. Each 'stack' of the CPU pulled 2,200 A at 5 V [source: https://en.wikipedia.org/wiki/Cray-2#History ]. At those currents you use busbar, not wire. Our low-voltage (120/208 V three-phase) switchgear here uses 6,000 A rated busbar, so it's readily available, if expensive.

Greetings,

  Do we really need them to be swappable at that point? The reason we swap HDDs (if we do) is because they are rotational, and mechanical things break. Do we swap CPUs and memory hot? Do we even replace memory on a server that's gone bad, or just pull the whole thing during the periodic "dead body collection" and replace it? Might it not be more efficient (and space-saving) to just add 20% more storage to a server than the design goal, and let the software use the extra space to keep running when an SSD fails? When the overall storage falls below tolerance, the unit is dead. I think we will soon need to (if we aren't already) stop thinking about individual components as FRUs. The server (or rack, or container) is the FRU.

  Christopher
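
A minimal sketch of Christopher's over-provisioning idea; the capacity
figures and names below are illustrative only, not from any particular
product:

    # Node ships with more raw SSD capacity than the design goal and
    # keeps serving until failures eat through that margin.
    DESIGN_GOAL_TB = 100.0             # capacity the workload needs
    OVERPROVISION = 0.20               # extra capacity built in
    SSD_SIZE_TB = 10.0
    n_ssds = int(DESIGN_GOAL_TB * (1 + OVERPROVISION) / SSD_SIZE_TB)  # 12

    def node_is_alive(failed_ssds):
        remaining = (n_ssds - failed_ssds) * SSD_SIZE_TB
        return remaining >= DESIGN_GOAL_TB

    print(node_is_alive(1))  # True: one dead SSD, still within tolerance
    print(node_is_alive(3))  # False: below the design goal, unit is dead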

Greetings,

> Do we really need them to be swappable at that point? The reason we
> swap HDDs (if we do) is because they are rotational, and mechanical
> things break.

Right.

> Do we swap CPUs and memory hot?

Nope, usually we just toss the whole thing. Well, I keep spare RAM around because it's so cheap, but if a CPU goes, the box gets chucked onto the e-waste pile in the back.

> Do we even replace memory on a server that's gone bad, or just pull
> the whole thing during the periodic "dead body collection" and
> replace it?

Usually we swap memory. But yeah, oftentimes the hardware ops folks just cull old boxes on a quarterly basis and backfill with the latest batch of inbound kit. At large scale (which many on this list operate at), you have pallets of gear sitting in the to-deploy queue, and another couple of pallets' worth racked up but not even imaged yet.

(This is all supposition, of course; I'm used to working with $HUNDREDS of racks' worth of gear.) Containers, Moonshot-type things, etc. are certainly on the radar.

> Might it not be more efficient (and space-saving) to just add 20%
> more storage to a server than the design goal, and let the software
> use the extra space to keep running when an SSD fails?

Yes. Also, a few months ago I read an article about several SSD brands having $MANY terabytes written to them; I can't find it just now, but they seem to take quite a long time (in terms of data written / number of writes) to fail.

> When the overall storage falls below tolerance, the unit is dead. I
> think we will soon need to (if we aren't already) stop thinking about
> individual components as FRUs. The server (or rack, or container) is
> the FRU.

> Christopher

Yes. Agree.

Most of the very large-scale shops (the ones I've worked at) are massively horizontally scaled, cookie-cutter: many boxes replicating/extending/expanding a set of well-defined workloads.