HPE SAS Solid State Drives - Critical Firmware Upgrade Required

I do not normally post about firmware bugs, but I have this nightmare scenario running through my head of someone with a couple of mirrored HPE SSD arrays and all the drives going POOF! simultaneously. Even with an off-site backup, that could be disastrous. So if you have HPE SSDs, check this announcement.

https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us
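
For those who don't want to click through: the advisory describes drives failing at 32,768 power-on hours, which is exactly 2**15 and smells like a signed 16-bit hour counter wrapping (my inference, not HPE's wording). A quick sanity check on how long that actually is:

    /* 32,768 power-on hours, the failure point named in the advisory,
     * converted into calendar time. */
    #include <stdio.h>

    int main(void) {
        double hours = 32768.0;                 /* 2**15 */
        double days = hours / 24.0;
        printf("%.0f hours = ~%.1f days = ~%.2f years\n",
               hours, days, days / 365.0);      /* ~1365 days, ~3.74 years */
        return 0;
    }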

Since this is an SSD manufacturer problem, does it impact other servers that might have SSDs from the same manufacturer?

HP hasn't said who the manufacturer is.

Geoff

Looking at a handful of images and listings online, it appears at least some (?) of the affected drives are Samsung; for example, HP 816562-B21 is just a rebadged Samsung MZ-ILS4800.

It's unknown whether this affects only the HPE digitally signed firmware or all firmware for these drives, though.

Hey Patrick,

A couple of years back, a lot of folks had this problem with many
vendors' optics. There was one particular vendor whose microcontroller
was commonly used across many vendors' optics, and it had a bug: after
2**31 hundredths of a second of uptime, it started writing the uptime
counter into the memory location that held the temperature reading.
Many systems, including Cisco and Juniper, didn't react well to optic
temperatures reaching the maximum possible values.

So say you did a large network-wide upgrade 2**31 hundredths of a
second ago (roughly 248 days), with enough time between individual
upgrades to ensure that everything worked before continuing on to the
redundant parts. Then you'd suddenly lose, like a stack of cards, all
legs from all devices, no matter how much redundancy was built in.
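
To put a number on that window, here is a minimal sketch of the
arithmetic; the counter width and tick rate come from the description
above, everything else is illustrative:

    /* A 32-bit counter ticking in hundredths of a second reaches
     * 2**31 ticks after roughly 248 days of uptime. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint64_t ticks = UINT64_C(1) << 31;        /* 2**31 ticks */
        double seconds = (double)ticks / 100.0;    /* 100 ticks per second */
        printf("2**31 centiseconds = %.0f s = ~%.1f days\n",
               seconds, seconds / 86400.0);        /* ~248.6 days */
        return 0;
    }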

Just goes to show that a focus on MTBF is usually not a great
investment. It's hard to predict what brings you down, and we tend to
bias toward thinking it's some physical problem, solved by redundant
HW design, when it probably is not; it's probably something related to
software or the operator, and hard to predict or prepare for. An MTTR
focus will have a much more predictable ROI.
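
One way to see the point: steady-state availability is
MTBF / (MTBF + MTTR), so improving either term helps, but MTTR
improvements are usually the ones you can actually engineer on demand.
A minimal sketch with made-up numbers (nothing below comes from the
thread):

    /* Steady-state availability = MTBF / (MTBF + MTTR).
     * All figures are illustrative. */
    #include <stdio.h>

    static double availability(double mtbf_hours, double mttr_hours) {
        return mtbf_hours / (mtbf_hours + mttr_hours);
    }

    int main(void) {
        printf("baseline    : %.6f\n", availability(10000.0, 4.0));
        printf("double MTBF : %.6f\n", availability(20000.0, 4.0));
        printf("halve MTTR  : %.6f\n", availability(10000.0, 2.0));
        return 0;
    }

Doubling MTBF and halving MTTR land on exactly the same availability;
the difference is that shaving hours off recovery is a far more
tractable engineering problem than doubling time between failures.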

I can't really point a finger at HP here; these are common bugs and an
easy thing for a human to miss. Perhaps static analysis, or more
complexity in the compiler and compile-time guarantees, should have
covered this.
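
For what it's worth, today's toolchains can already flag the simplest
version of this class of bug at runtime. A minimal sketch (the counter
name is made up; the -fsanitize=signed-integer-overflow flag is real
in both gcc and clang):

    /* A hypothetical uptime counter about to wrap. Built with
     *   cc -fsanitize=signed-integer-overflow overflow.c && ./a.out
     * the increment below is reported as a runtime error instead of
     * silently wrapping to a negative value. */
    #include <limits.h>
    #include <stdio.h>

    int main(void) {
        int uptime_ticks = INT_MAX;   /* 2**31 - 1: the last valid value */
        uptime_ticks += 1;            /* signed overflow: undefined behavior */
        printf("%d\n", uptime_ticks);
        return 0;
    }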

I’ve been bitten by these sorts of issues before, so I tend to swap one OEM drive in every RAID-1 pair with a retail drive from (if possible) a different vendor. When I re-purpose servers, I try to use drives from two different vendors in each array. That way, if a drive barfs for any intrinsic reason, things keep working.

This can impact performance, but is cheap insurance.

  paul

I think the problem here is that it adds complexity and cost, may
impact support and thus contracts, and it's not clear whether it has
ever saved you an outage. It becomes belief engineering. I think it
would be more useful to figure out how you can replace that device,
with its data, in the shortest possible window, to cover every
unexpected failure mode with the smallest possible outage.