Got a call at 4am - RAID Gurus Please Read

Server down..... Got to the colo at 4:39 and an old IBM X346 node with a
ServeRAID-7k has failed. Opened it up to find a swollen cache battery that
had bent the card along three different axes. Separated the battery. (i)
Inspected the card and plugged it back in, (ii) rebooted, and got (code 2807)
Not functioning....
Returned to (i) three times and got the same result. Dusted her off and let
her sit for a while plugged in, rebooted to see if I could get her into
write-through mode, and the disks started spinning. Hooray.

Plan of action, (and the reason for my post):

* Can I change from an active (i.e., disks with data) RAID 5 to RAID 10?
There are 4 drives in the unit, and I have two on the shelf that I can
plug in.
* If so, will I have less of a performance impact with RAID 10 + write-through
than with RAID 5 + write-through?
* When the new RAID card comes in, can I just plug it in without losing my
data? I would:

i) RAID 10
ii) Write-thru
iii) Replace card

The new card is probably coming with a bad battery, which would put us kind
of back at square one. New batteries are 200+ if I can even find them. Best-case
scenario is to move over to RAID 10 + write-through and feel less of the
performance pinch.

Assuming I can move from RAID 5 to RAID 10 without losing data, how much
downtime should I anticipate for the process? Is there heavy sector
re-arranging happening here? And the same for write-through: is it quick?

I'm going to go lie down just for a little while.

Thanks in Advance,

Nick from Toronto.

If the ServeRAID-7k cards are LSI based and not Adaptec based (I think they are), you should just be able to plug in a new adapter and import the foreign configuration.

You do have a good backup, yes?

Switching to write-through has already happened (unless you specified WriteBackModeEvenWithNoBBU, which is not the default) - these (LSI) cards by default only use write-back when it is "safe".

If WT, RAID 10 gives much better performance. BUT you just can't migrate from R5 to R10 non-destructively.
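
If the replacement does turn out to be a MegaRAID-style LSI card, the usual MegaCLI incantations for this would be roughly the following (quoting from memory, so check the syntax against your card's docs; the adapter and LD selectors here are just the catch-all ones):

  MegaCli -CfgForeign -Scan -aALL          # list any foreign configs the new card sees
  MegaCli -CfgForeign -Import -aALL        # import them
  MegaCli -LDSetProp WT -Lall -aALL        # force write-through on all logical drives
  MegaCli -AdpBbuCmd -GetBbuStatus -aALL   # check the state of the new card's BBU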

- Michael from Kitchener

+1 on the most important statement below, from my point of view: RAID 5 and RAID 10 are totally separate animals. While you can set up a separate RAID 10 array and migrate your data to it (as soon as possible!!!), you cannot migrate from 5 to 10 in place, absent some utter magic that I am unaware of.

10 requires more raw drive space but offers significant write performance advantages when correctly configured (which isn't really too difficult). 5 is fine for protection against losing one drive, but it requires much more internal processing of writable data before it begins the writes and, not too long ago, was considered completely inappropriate for applications with high numbers of writes, such as a transactional database.
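
To put rough rule-of-thumb numbers on that (nothing specific to Nick's box): a small random write on RAID 10 costs about 2 back-end I/Os (one per mirror side), while on RAID 5 it typically costs 4 (read old data, read old parity, write new data, write new parity). So 1,000 front-end writes/sec becomes roughly 2,000 back-end IOPS on RAID 10 versus roughly 4,000 on RAID 5, which is exactly the gap a write-back cache normally hides and why losing the battery hurts RAID 5 more.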

Still, 5 is often used for database systems in casual installations just because it's easy, cheap (relatively) and modern fast boxes are fast enough.

Ok, getting down off my RAID soapbox - good luck.

..Allen

symack wrote on 9-12-2014 22:03:

* Can I change from an active (i.e., disks with data) RAID 5 to RAID 10?
There are 4 drives in the unit, and I have two on the shelf that I can plug in.

Dump and restore. I've used Acronis successfully in the past and today;
they have a bootable ISO. Also, if you have the option, they have
Universal Restore, so you can restore Windows on another piece of
hardware (you provide the drivers).

* If so, will I have less of a performance impact with RAID 10 + write-through
than with RAID 5 + write-through?

RAID 10 is the only valid RAID format these days. With disks as big as
they are now, silent corruption is a real possibility.

And with 4TB+ disks that is a real thing. RAID 6 is OK, if you accept
rebuilds that literally take a week. Although the rebuild rate on our
11-disk RAID 6 SSD array (2TB) is less than a day.

If it accepts SATA drives, consider just using SSDs instead. They're
just 600 euros for an 800GB drive (Intel S3500).

Assuming I can move from RAID 5 to RAID 10 without losing data, how much
downtime should I anticipate for the process? Is there heavy sector
re-arranging happening here? And the same for write-through: is it quick?

Heavy sector re-arranging, yes, so just dump and restore; it's faster
and more reliable. Also, you then have a working bare-metal restore backup.

Regards,

Seth

Even if the hw/firmware supports it, raid level migration is risky enough
at the best of times, and totally insane on a known-bad controller.

The subject is drifting a bit but I'm going with the flow here:

Seth Mos <seth.mos@dds.nl> writes:

RAID 10 is the only valid RAID format these days. With disks as big as
they are now, silent corruption is a real possibility.

How do you detect it? A man with two watches is never sure what time it is.

Unless you have a filesystem that detects and corrects silent
corruption, you're still hosed, you just don't know it yet. RAID10
between the disks in and of itself doesn't help.
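
ZFS is the obvious example of a filesystem that does detect and correct it: every block carries a checksum, and a scrub walks the pool, verifies every copy, and repairs from redundancy where it can. A minimal sketch (the pool name "tank" is just a placeholder):

  zpool scrub tank
  zpool status -v tank

The status output shows per-device checksum error counts and lists any files it could not repair.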

And with 4TB+ disks that is a real thing. RAID 6 is OK, if you accept
rebuilds that literally take a week. Although the rebuild rate on our
11-disk RAID 6 SSD array (2TB) is less than a day.

I did a rebuild on a RAIDZ2 vdev recently (made out of 4TB WD Reds).
It took nowhere near a day, let alone a week. Theoretically it takes 8-11
hours if the vdev is completely full, proportionately less if it's
not, and I was at about 2/3 in use.

-r

I'm just going to chime in here since I recently had to deal with bit-rot
affecting a 6TB Linux RAID 5 setup using mdadm (6x 1TB disks).

We couldn't rebuild because of 5 URE sectors on one of the other disks in
the array after a power/UPS issue rebooted our storage box.

We are now using ZFS RAIDZ and the question I ask myself is, why wasn't I
using ZFS years ago?

+1 for ZFS and RAIDZ

I hope you are NOT using RAIDZ. The chances of an error showing up
during a resilver are uncomfortably high, and there are no automatic
tools to fix pool corruption with ZFS. Ideally use RAIDZ2 or RAIDZ3
to provide more appropriate levels of protection. Errors introduced
into a pool can cause substantial unrecoverable damage to the pool,
so you really want the bitrot detection and correction mechanisms to
be working "as designed."
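
There is no in-place reshape from RAIDZ to RAIDZ2 either, so in practice that means building a new vdev and sending the data over. A minimal sketch of the pool creation (pool and disk names are placeholders):

  zpool create tank raidz2 da0 da1 da2 da3 da4 da5

That 6-disk RAIDZ2 vdev survives any two simultaneous disk failures; RAIDZ3 is the same idea with three parity disks.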

... JG

We are now using ZFS RAIDZ and the question I ask myself is, why
wasn't I using ZFS years ago?

because it is not production on linux, which i have to use because
freebsd does not have kvm/ganeti. want zfs very very badly. snif.

randy

We are now using ZFS RAIDZ and the question I ask myself is, why
wasn't I using ZFS years ago?

because it is not production on linux,

Well, it depends on what you mean by
"production". Certainly the ZFS on Linux
group has said in some forums that it is
"production ready", although I would say
that their definition is not exactly the
same as what I mean by the term.

which i have to use because
freebsd does not have kvm/ganeti.

There is bhyve, and virt-manager can
support bhyve in later versions (but is
disabled by default as I recall). Not
exactly the same, of course.

want zfs very very badly. snif.

Anyone who really cares about their data
wants ZFS. Some just do not yet know
that they (should) want it.

There is always Illumos/OmniOS/SmartOS
to consider (depending on your particular
requirements) which can do ZFS and KVM.

zfs and ganeti

Gary Buhrmaster <gary.buhrmaster@gmail.com> writes:

There is always Illumos/OmniOS/SmartOS
to consider (depending on your particular
requirements) which can do ZFS and KVM.

2.5-year SmartOS user here. Generally speaking pretty good though I
have my list of gripes like everything else I touch.

-r

Are you running ZFS and RAIDZ on Linux or BSD?

ZFS on BSD or a Solaris-like OS

+1 on both. Mostly SmartOS, some FreeNAS (which is FreeBSD underneath).

-r

Ryan Brooks <ryan@hack.net> writes:

We are now using ZFS RAIDZ and the question I ask myself is, why
wasn't I using ZFS years ago?

because it is not production on linux, which i have to use because
freebsd does not have kvm/ganeti. want zfs very very badly. snif.

I keep reading zfs vs btrfs articles and...inconclusive.

My problem with both is that I need quotas, both file and "inode", and both
are weaker than ext4 on that; ZFS is very weak here, you can only
sort of simulate them.

Barry Shein <bzs@world.std.com> writes:

From: Randy Bush <randy@psg.com>

We are now using ZFS RAIDZ and the question I ask myself is, why
wasn't I using ZFS years ago?

because it is not production on linux, which i have to use because
freebsd does not have kvm/ganeti. want zfs very very badly. snif.

I keep reading zfs vs btrfs articles and...inconclusive.

My problem with both is that I need quotas, both file and "inode", and both
are weaker than ext4 on that; ZFS is very weak here, you can only
sort of simulate them.

By file, you mean "disk space used"? By whom and where? Quotas and
reservations on a per-dataset basis are pretty darned well supported
in ZFS. As for inodes, well, since there isn't really such a thing as
an inode in ZFS... what exactly are you trying to do here?
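
For reference, the per-dataset knobs look roughly like this (dataset names made up):

  zfs set quota=50G tank/home/alice        # cap on space used by the dataset and its descendants
  zfs set reservation=10G tank/home/alice  # minimum space guaranteed to it
  zfs set refquota=40G tank/home/alice     # hard limit that does not count descendants or snapshots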

-r

As for conversion between RAID levels: usually dump and restore is
your best bet.
Even if your controller/HBA supports a RAID level migration, for a
small array hosted in a server, dump and restore is your least risky
path to successful execution. You really need to dump anyway; even on
a controller that supports clever RAID level migrations (the ServeRAID
does not fall into this category), there is the possibility that the
operation fails, leading to data loss, so back up first.

symack wrote on 9-12-2014 22:03:

[snip]

RAID 10 is the only valid RAID format these days. With disks as big as
they are now, silent corruption is a real possibility.

No! Mistake. It depends.

RAID 6, RAID 60, RAID-DP, RAIDZ3, and a few others are perfectly valid
RAID formats, with sufficient sparing. You get fewer extra average
random write IOPS per spindle, but better survivability, particularly
in the event of simultaneous double, triple, or even quadruple failures
(with appropriate RAID group sizing), which are not necessarily as rare
as one might intuitively expect.

And silent corruption can be addressed partially via surface scanning
and the built-in ECC on the hard drives. In addition (for non-SATA
SAS/FC drives), decent array subsystems low-level format the disks with
a larger sector size at initialization time and slip additional error
correction data into each chunk's metadata, so silent corruption or
bit-flipping isn't necessarily so silent on a decent piece of storage
equipment.
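
The software-RAID analogue of that surface scanning is exposed directly; on a Linux md array, for instance, a periodic check pass looks roughly like this (device name hypothetical):

  echo check > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/mismatch_cnt

Debian-style systems ship a checkarray cron job that triggers this on a schedule.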

If you need a configuration of fewer than 12 disk drives, where you
require good performance for many small random reads and writes, and
only cheap controllers are an option, then yeah, you probably need
RAID 10, but not always.

If you have a storage chassis with 16 disk drives, an integrated RAID
controller, a solid 1 to 2 GB NVRAM cache, and a few gigabytes of read
cache, then RAID 6 or RAID 60, or (maybe) even RAID 50, could be a
solid option for a wide range of use cases.

You really just need to calculate an upper bound on the right number
of spindles, spread over the right number of host ports, for the
workload, adjusted for the RAID level you pick and backed by sufficient
cache (taking into account the caching policy and including a large
enough safety factor to cover the inherent uncertainties in spindle
performance and the variability of your specific overall workload).
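
A back-of-the-envelope version of that calculation, with made-up workload numbers and ignoring cache hits: say the hosts need 2,000 IOPS at 70% reads. Using the usual write penalties (2 for RAID 10, 6 for RAID 6), the back-end load is 1,400 + 600 x 2 = 2,600 IOPS for RAID 10 versus 1,400 + 600 x 6 = 5,000 IOPS for RAID 6; at roughly 150-180 IOPS per 10k spindle that is on the order of 15-18 spindles versus 28-34, before the NVRAM cache and the safety factor are applied.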

Disk space by uid (by group is a plus but not critical), like BSD and
EXTn. And the reason I put "inode" in quotes was to indicate that they
may not (certainly not) be called inodes, but rather an upper limit to
the total number of files and directories, typically to stop a runaway
script or certain malicious or grossly irresponsible behavior.

From my reading, the closest you can get to disk space quotas in ZFS is
by limiting on a per-directory (dataset, mount) basis, which is similar
but different.

[snip]

From my reading, the closest you can get to disk space quotas in ZFS is
by limiting on a per-directory (dataset, mount) basis, which is similar
but different.

This is the normal type of quota within ZFS. It is applied to a
dataset and limits the size of that dataset, such as
home/username.
You can have as many datasets ("filesystems") as you like (within
practical limits), which is probably the way to go with regard to home
directories.
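
A minimal sketch of that per-user-dataset approach (names made up):

  zfs create -o quota=20G tank/home/username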

But another option is

zfs set groupquota@groupname=100GB example1/blah
zfs set userquota@user1=200MB example1/blah

This would be available on the Solaris implementation.

I am not 100% certain that this is available under the BSD implementations,
even if QUOTA is enabled in your kernel config.

In the past.... the BSD implementation of ZFS never seemed to be as
stable, functional, or performant as the OpenSolaris/Illumos version.