Most energy efficient (home) setup

I like the Juniper EX2200C switches. They are only 12-port, but have 2 SFPs. They are very low power, and have no fans.

However, I am still waiting (it has been several months) for them to send me the correct rack mount brackets (which are a separate purchase).

-Randy

Leo Bicknell wrote:

But what's really missing is storage management. RAID5 (and similar)
require all drives to be online all the time. I'd love an intelligent
file system that could spin down drives when not in use, and even for
many workloads spin up only a portion of the drives. It's easy to
imagine a system with a small SSD and a pair of disks. Reads spin one
disk. Writes go to that disk and the SSD until there are enough, at which
point the second drive spins up and they are written out as a proper mirror.
In a home file server, drive motors, by the time you have 4-6 drives, eat
most of the power. CPUs speed-step down nicely; drives don't.

Late reply by me, but excellent points.

A combination of mdadm and hdparm on Linux should suffice for a RAID that spins down the disks when not in use. For years I used a G4 system with an mdadm RAID1 (and a separate boot disk) and hdparm configured to spin the RAID disks down after 10 minutes, and it worked great.
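
For reference, a minimal sketch of that kind of setup -- the device names,
RAID level and 10-minute timeout below are just examples, and where hdparm
settings persist differs per distribution:

  # assemble a two-disk mirror from whole disks
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

  # spin the member disks down after 10 minutes idle (-S 120 = 120 * 5 s = 600 s)
  hdparm -S 120 /dev/sdb /dev/sdc

  # on Debian-ish systems this can be made persistent in /etc/hdparm.conf:
  #   /dev/sdb { spindown_time = 120 }

The usual caveat is that anything touching the filesystem (atime updates,
journal flushes, monitoring) will keep waking the array, so mount options
like noatime help the disks actually stay down.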

I think in a RAID10 this would spin up only the disk pair that has the data you need and leave the rest asleep, but I haven't tried that yet.

What I'd like is a small disk enclosure that includes a whole (low-power) computer capable of having Linux installed on some flash memory. Say you have an enclosure with space for four 2.5-inch disks: install Linux, set it up as a RAID10, and connect it through USB to your computer for backup purposes.

Greetings,
Jeroen

It exists. Google for "unRAID". It uses something like RAID4 for parity
data, but stores entire files on single spindles. It's designed for home
media server type environments. This way, when you watch a video, only the
drive you are using needs to power up. It also lets you add/remove
mismatched disks with no rebuild needed.

http://lime-technology.com/technology

* Better power management: not all hard drives are required to be spinning
in order to access data normally; hard drives not in use may be spun down.

However, modern "green" drives don't take that much power.

PC wrote:

It exists. Google for "unRAID" It uses something like Raid4 for Parity
data, but stores entire files on single spindles. It's designed for home
media server type environments. This way, when you watch a video, only the

There may be a performance penalty using RAID4, because it uses one parity disk. Although that system looks like it can be useful for some purposes, it seems less ideal for home use. Also, I don't see how it would allow you to install your own OS.

Regards,
Jeroen

Once upon a time, Jeroen van Aart <jeroen@mompl.net> said:

There may be a performance penalty using raid4, because it uses one
parity disk. Although that system looks like it can be useful for some
purposes it looks less ideal for home use. Also I don't see how it would
allow you to install your own OS.

For read-mostly storage, there's no penalty as long as there's no disk
failure. The parity drive wouldn't even spin up for reads.

The current Mac mini "Server" model sports an i7 2.0GHz quad-core CPU
and up to 16GB RAM (see OWC for that, IIRC). Two drives, up to 750GB
each, or SSDs if you prefer.

The Mac mini server is quite intriguing with that low power
requirement. Unfortunately... 16 GB of _non-ECC_ memory. I sure
would not want to run a NAS VM on a server with non-ECC memory that
cannot correct single-bit errors, at least not with any data I cared
much about.

When you have such a large quantity of RAM, single-bit/fade errors
caused by background radiation happen regularly, even though the
per-bit rate is fairly low. Usually on a workstation it's not an issue,
because there is not a massive quantity of idle memory.

If you're running this 24x7 with VMs and non-ECC memory, it's only a
question of time before silent memory corruption occurs in one of the VMs.

And silent memory corruption can make its way to the filesystem, or
applications' internal saved data structures (such as the contents
of a VM's registry database).

True, it can be partially mitigated with backups, but the idea of VMs
blue-screening or ESXi crashing with a purple screen every 3 or 4
months sounds annoying.

While I don't disagree with the general thought, one could also say
it's just a matter of time before your server's power supply fails, or
a fan fails, or a hard drive fails.

Since we don't hear about Mac mini server users screaming about how
their servers are constantly crashing, the severity and frequency of
memory corruption events may not be anywhere near what you suggest.

... JG

Since we don't hear about Mac mini server users screaming about how
their servers are constantly crashing, the severity and frequency of

Googling for 'mac mini server crash' gets about 11.6M hits. I gave up after
10 pages of results, but up till that point most did in fact seem to be about
crashes on Mac mini servers (the mail you replied to was on page 8 at
the time).

memory corruption events may not be anywhere near what you suggest.

"the severity and frequency of *noticed* memory corruption events".

FTFY.

(Keep in mind that if the box doesn't have ECC or at least parity, you *won't
know* you had a bit flip until you dereference that memory location. At which
point, if you're *lucky*, you'll get a random crash that forces you to reboot right
away. If you're unlucky, you won't notice till you try to re-mount the disks after
a reboot 2-3 months later....)

With RAID 4, the parity disk IOPS on write will rate-limit the whole LUN...

No big deal on a 4-drive LUN; terror on a 15-drive LUN...

George William Herbert

And silent memory corruption can make its way to the filesystem, or
applications' internal saved data structures (such as the contents
of a VM's registry database).

Since we don't hear about Mac mini server users screaming about how
their servers are constantly crashing, the severity and frequency of
memory corruption events may not be anywhere near what you suggest.

ECC is an absolute MUST. Case closed-
unless you like corrupt encryption keys that blow away an entire volume.

Do you hear of lots of Mac mini server users loading up 16GB of RAM?

Hi,

I've been operating 4 desktop PCs, each with the following configuration:
16 GB of RAM (4x4GB Kingston), running Linux with about 15 VMs (KVM) on
DRBD disks and using more than 10 GB of RAM, for nearly a year now in a
room without cooling. Over the year I've had one dead HDD and one dead SSD
(both replaced), but no data corruption and no host or VM crash.

Do you have references to recent papers with experimental data about
non-ECC memory errors? Such a test should be fairly easy to do (write
memory, then read-scan it in a loop), and given your computations you
should get bit errors in less than a day.
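
A crude way to run that experiment, for what it's worth (just a sketch:
it assumes swap is off, or at least that the tmpfs file stays resident in
RAM, and a mismatch could also come from a disk or CPU problem rather
than RAM):

  # fill ~8 GB of RAM (tmpfs) with random data and record a reference checksum
  dd if=/dev/urandom of=/dev/shm/pattern bs=1M count=8192
  sha256sum /dev/shm/pattern > /root/pattern.sha256

  # let it sit idle and re-verify once an hour
  while sleep 3600; do
      sha256sum -c --quiet /root/pattern.sha256 || echo "mismatch at $(date) -- possible bit flip"
  done

The userspace memtester tool does something similar, but it actively
rewrites test patterns rather than letting the data sit idle, which is
closer to what memtest86+ does.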

I remember this paper in 2003 but this was using abnormal heat:
http://www.cs.princeton.edu/~sudhakar/papers/memerr-slashdot-commentary.html

Thanks in advance,

Sincerely,

Laurent

I think the simple test for this problem is to take a non-ECC machine, boot
from a CD/USB Key/etc with memtest or memtest86+ on it, and see if you get
errors over the course of a few days.

Getting errors will certainly prove that this problem exists (or that you
have bad RAM).

It's not like ECC memory requires a lot of power or a full-blown ATX
board or something; there is, for example, the Intel S1200KP Mini-ITX board.

See,
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.5936&rep=rep1&type=pdf

But the exact rate of single-bit errors in non-ECC memory today is not
necessarily predictable based on past studies from the 90s, and it
depends on the environment as well -- local lightning, solar activity
(which is increasing lately), how much extra shielding you have in
place (server placed inside a Faraday cage/lead box?), etc. You'd
need measurements for your specific hardware; there are likely
dependencies on the size of the memory cells, the vertical cross
section, and other components in the system.

I think the simple test for this problem is to take a non-ECC machine, boot
from a CD/USB Key/etc with memtest or memtest86+ on it, and see if you get
errors over the course of a few days.

Memtest86+ contains a series of tests that help uncover specific
kinds of common memory faults; at any particular point in time during
a memtest, only a confined range of physical memory addresses is under
test, so a bit flip anywhere else won't be detected.

Which means that Memtest is not likely to detect the error.

Test #11 (Bit-Fade) with modifications could have some promise: you
need a 24-hour delay instead of a 5-minute delay, you need to have
close to the entire physical address space under test, and you need
truly random bit values stored to some "reliable" medium, instead of
the shortcut of storing known bit patterns.

*Memtest86+ itself and the system BIOS have to be stored in memory or
CPU cache somewhere.
But then again, a random bit flip in non-ECC CPU L2 cache is also a
possibility; still, software like memtest, if suitably modified, could
be made to detect a 1-bit error that showed up in the majority of the
memory addresses.

I think that is an overestimate, at least if single-bit (corrected)
ECC errors are as common as flipped bits on non-ECC RAM.

First, count me in the "ECC is a must, full stop" crowd. I insist
on ECC even for my customers' dedicated servers, even though most of
the customers don't care that much. "It's not for you, it's for me."
With ECC, if you have EDAC/bluesmoke set up correctly on a supported
motherboard, you get console spew whenever you have a single-bit error.

This means I can do a very simple grep on the boxes' conserver logs
and find all the failing RAM modules I am responsible for.
Without ECC, I have no real way of telling the difference between broken
software and broken RAM.
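
Roughly what that looks like in practice (the log path here is a
placeholder, and the exact EDAC message text varies by kernel version
and memory-controller driver):

  # count corrected-error reports per console log
  grep -c 'EDAC MC[0-9].*CE' /var/log/conserver/*.log

  # on an individual box, the kernel also exposes running counters via sysfs
  dmesg | grep -i edac
  grep . /sys/devices/system/edac/mc/mc*/ce_count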

That said, I still think the 120 bits a month estimate is large; I
believe that ECC RAM should report correctable errors (assuming a
correctly configured EDAC/bluesmoke module and supported chipset)
about as often as non-ECC RAM would get a bit flip.

In a past role, I did spend the time grepping through such a properly
configured cluster, with tens of thousands of nodes, looking for failing
hardware. I should have done a proper paper with statistics, but
I did not. The vast majority of servers had zero correctable ecc errors,
while a few had a lot, which is consistent with the theory that ECC errors
are more often caused by bad RAM.

(Of course, all these servers were in proper cases in a proper data center,
which probably gives you a fair bit of shielding.)

On my current fleet (well under 100 servers) single bit errors are so rare
that if I get one, I schedule that machine for removal from production.

>> And silent memory corruption can make its way to the filesystem, or
>> applications' internal saved data structures (such as the contents
>> of a VM's registry database).

> Since we don't hear about Mac mini server users screaming about how
> their servers are constantly crashing, the severity and frequency of
> memory corruption events may not be anywhere near what you suggest.

ECC is an absolute MUST. Case closed-
unless you like corrupt encryption keys that blow away an entire volume.

You might want to go tell that to all those Mac users who have full
disk encryption...

... JG

In a past role, I did spend the time grepping through such a properly
configured cluster, with tens of thousands of nodes, looking for failing
hardware. I should have done a proper paper with statistics, but
I did not. The vast majority of servers had zero correctable ecc errors,
while a few had a lot, which is consistent with the theory that ECC errors
are more often caused by bad ram.

I'd have to say that that's been the experience here as well. ECC is
great, yes, but it just doesn't seem to be something that is "absolutely
vital" on an ongoing basis, as some of the other posters here have
implied, to correct the constant bit errors that are(n't) showing up.

Maybe I'll get bored one of these days and find some devtools to stick
on one of the Macs.

... JG

From: Joe Greco [mailto:jgreco@ns.sol.net]

I'd have to say that that's been the experience here as well, ECC is
great, yes, but it just doesn't seem to be something that is "absolutely
vital" on an ongoing basis, as some of the other posters here have
implied, to correct the constant bit errors that are(n't) showing up.

Maybe I'll get bored one of these days and find some devtools to stick
on one of the Macs.

In all the years I've been playing with high end hardware, the best sample machine I have is an SGI Origin 200 that I had in production for over ten years, with the only downtime during that time being once to add more memory, once to replace a failed drive, once to move the rack, and the occasional OS upgrade (I tended to skip a 6.5.x release or two between updates, and after 6.5.30 there were of course no more). That machine was down less than 24 hours cumulative for that entire period. In that ten-year span, I saw TWO ECC parity errors (both single-bit correctable). On any machine that saw regular ECC errors, it was a sign of failing hardware (usually, but not necessarily, the memory; there are other parts in there that have to carry that data too).

As much as I prefer ECC, it's not a show stopper for me if it's not there.

Jamie

In a message written on Sun, Apr 15, 2012 at 09:54:14PM -0400, Luke S. Crawford wrote:

On my current fleet (well under 100 servers) single bit errors are so rare
that if I get one, I schedule that machine for removal from production.

In a previous life, in a previous time, I worked at a place that
had a bunch of Ciscos with parity RAM. For the time, these boxes
had a lot of RAM, as they had distributed line cards, each with its
own processor and memory.

Cisco was rather famous for these parity errors, mostly because of
their stock answer: sunspots. The answer was in fact largely
correct, but it's just not a great response from a vendor. They
had a bunch of statistics though, collected from many of these
deployed boxes.

We ran the statistics, and given hundreds of routers, each with
many line cards, the math told us we should have approximately one
router every 9-10 months get one parity error from sunspots and
other random activity (i.e. not a failing RAM module with hundreds
of repeatable errors). This was, in fact, close to what we observed.

This experience gave me two takeaways. First, single bit flips are
rare, but when you have enough boxes, rare shows up often. It's
very similar for anyone with petabytes of storage: disks fail every
couple of days because you have so many of them. At the same time,
a home user might not see a failure (of disk or memory) in their
lifetime.

Second, though, if you're running a business, ECC is a must because
the alternative message is so bad. "This was caused by sunspots" is
not a customer-inspiring response, no matter how correct. "We could
have prevented this by spending an extra $50 on proper RAM for your
$1M box" is even worse.

Some quick looking at Newegg: 4GB DDR3 1333 ECC DIMM, $33.99; 4GB
DDR3 1333 non-ECC DIMM, $21.99. Savings: $12. (Yes, I realize the
motherboard also needs some extra circuitry; I expect it's less than $1
in quantity though.)

Pretty much everyone I know values their data at more than $12 if it
is lost.

Some quick looking at Newegg, 4GB DDR3 1333 ECC DIMM, $33.99. 4GB
DDR3 1333 Non-ECC DIMM, $21.99. Savings, $12. (Yes, I realize the
Motherboard also needs some extra circuitry, I expect it's less than $1
in quantity though).

Pretty much everyone I know values their data at more than $12 if it
is lost.

The problem is that if you want to move past the 4GB modules, things
can get expensive. Bearing in mind the subject line, consider for
example the completely awesome Intel Sandy Bridge E3-1230 with a
board like the Supermicro X9SCL+-F, which can be built into a low
power system that idles around 45W if you're careful.

Problem is, the 8GB modules tend to cost an arm and a leg;

http://www.google.com/products/catalog?q=MEM-DR380L-CL01-EU13&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&um=1&hl=en&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&biw=1043&bih=976&ie=UTF-8&tbm=shop&cid=8556948603121267780&sa=X&ei=HxmMT5btB8_PgAfLs5TvCQ&ved=0CD8Q8wIwAA

to outfit a machine with 32GB several months ago cost around *$400*
per module, or $1600 for the machine, whereas the average cost for
a 4GB module was only around $30.

So then you start looking at the less expensive options. When the
average going price for 8GB non-ECC modules is between $50 and $100,
then you're "only" looking at a cost premium of $1200 for ECC.

For $1200, I'm willing to at least consider non-ECC. You can infer
from this message that I'm actually waiting for more reasonable ECC
prices to show up; we're finally seeing somewhat more reasonable prices,
but by that I mean "only" around $130/8GB.

... JG