Are all my WD (Caviar Green) drives bad????

Hi everyone,

I’m new to the forums, mainly because I’ve been a happy Drobo user for the last 18 months. But over the last few days something strange has happened, and I’m looking for help from some of the experts here.

My setup is as follows:

  • Drobo 2nd gen (4 slots, USB + FW)
  • Firmware Version: 1.3.7
  • Dashboard Version: 1.7.3
  • My Platform: Mac OS 10.6.6
  • Connection: FireWire 800
  • Disk pack (until a few days ago): 2x WD20EARS-00MVWB0 + 2x Seagate ST3500630AS-3.AKK
  • File System: HFS+

This Drobo had been working OK for the last 18 months (the original disk pack had 2x Maxtor6V300F0 that were replaced by the 2x WD20EARS-00MVWB0 about 3 months ago).

A few days ago, within a 24-hour window, my Drobo reported:

  1. Bad drive on one of the WD20EARS-00MVWB0 (blinking red): I pulled it out, following the recommended procedure.
  2. The same for the second WD20EARS-00MVWB0 (blinking red): at that point the Drobo Dashboard obviously indicated “Data at risk” and “Critically low on capacity”.
  3. I pulled that one out as well and replaced it with a spare WDCWD10EAVS-00D7B0, which was also reported as “bad” within a few minutes of starting re-layout.

These 3 WD drives work OK in other external enclosures (non-RAID, non-Drobo), and I’ve tested them extensively (SMART, disk tests, etc.) with Hard Disk Sentinel Pro. They’re reported as 100% healthy, no errors.

I’ve opened a ticket with Technical Support, but so far it hasn’t been very helpful (the first recommendation I was given was to RMA the 3 drives; now they’re asking me about the Drobo power supply brick???).

I’ve searched the forums extensively and I’ve read about potential issues (or not :-) with the WD Caviar Green drives in Drobo (or in RAID configurations, to be more precise), and about Drobo detecting disk failures in advance, beyond what regular SMART does, but I can’t find a way out.

It seems highly unlikely to me that these 3 drives could fail within a 24-hour window (and I’m afraid I can’t buy the “bad batch” possibility suggested by DRI technical support either, since they were purchased separately, with several months between orders, which IMO reduces the probability of such an event to nil).
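Just to put rough numbers on it, here’s a back-of-the-envelope sketch. The 5% annualized failure rate is a made-up assumption for illustration, not a spec for these WD models, and it treats the failures as independent (which is exactly what a common cause, like the enclosure, would break):

```python
# Probability of three independent drives all failing within the
# same 24-hour window, given an assumed annualized failure rate.
AFR = 0.05                    # hypothetical annualized failure rate per drive
p_day = AFR / 365             # rough chance a given drive fails on a given day
p_three_same_day = p_day ** 3 # all three on the same day, if independent

print(f"{p_three_same_day:.2e}")  # on the order of 1e-12
```

So if the drives really were failing independently, three in one day is essentially impossible, which points at a shared cause (the Drobo itself, its power supply, or a compatibility quirk) rather than three dead drives.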

Anyway, I’d appreciate any help you can provide. I can share the diagnostic logs (decrypted) generated by Drobo Dashboard after each alert, as I already did with DRI technical support.

Best regards

Antonio

It’s quite possible that the Drobo is starting to die.

If you have a “good” disk pack, I would remove it and set it to one side.

Try using a different disk in the Drobo (which will be wiped), make a new disk pack, and then test that; see if it produces the same kind of errors. You can always put back your original disk pack.

ALWAYS change the WHOLE disk pack at once, with the Drobo off, while testing.

Dochris,

thanks for your reply.

I’ll try to get hold of two drives (non-WD) and test them:

  • as a disk pack
  • together with the 2x Seagate that seem to work OK so far according to my Drobo

and see what I get

However, do you think it would be useful if I post in this thread whatever I find relevant in the diagnostic logs, so some of you experts can figure out what’s going on with my Drobo?

regards

Please post.

m.

Here it goes (I’ve pasted only what I think is relevant; if you think I should post something else, just tell me what to look for, since the log file contains several sections). From what I can see, all three drives gave FULL TIMEOUT errors.

From diagnostic log after 1st HD failure:

[…]
2011-03-13,20:09:10: ELM: Sun Mar 13 20:09:10 2011: DPM: Refusing EXPUNGED slot 0 [FULL TIMEOUTS] WWN=>WD-WMAZA3285189 WDCWD20EARS-00MVWB0 51.0AB51
2011-03-13,20:09:10: SPMM::DiskFail SlotStatus[0] = 0x2
2011-03-13,20:09:10: SPMM::DiskFail setting slot 0 bad!
[…]
Sun Mar 13 20:09:10 2011: DPM: Refusing EXPUNGED slot 0 [FULL TIMEOUTS] WWN=>WD-WMAZA3285189 WDCWD20EARS-00MVWB0 51.0AB51
Sun Mar 13 20:11:56 2011: DPM::hotPlug: SlotNumber = 0, op = REMOVE
Sun Mar 13 20:11:56 2011: DPM::hotplug: disk REMOVED from slot 0 has no associated LD, probably expunged
Sun Mar 13 20:12:31 2011: DPM::hotPlug: SlotNumber = 0, op = ADD
Sun Mar 13 20:12:32 2011: DPM::hotPlug: Refusing EXPUNGED slot 0 [FULL TIMEOUTS] WWN=>WD-WMAZA3285189 WDCWD20EARS-00MVWB0 51.0AB51
Sun Mar 13 20:12:32 2011: EPM::tad: EXPUNGE COMPLETE ON SLOT 0 CLEARING PERISTENT FLAG
[…]

From diagnostic log after 2nd HD failure:

[…]
Mon Mar 14 02:37:52 2011: J2::hotPlug: REMOVAL of disk from slot 1 with DIRTY journalette
Mon Mar 14 02:37:52 2011: DPM::hotPlug: SlotNumber = 1, op = REMOVE
Mon Mar 14 02:37:52 2011: DPM::hotplug: disk REMOVED from slot 1 has no associated LD, probably expunged
Mon Mar 14 02:37:57 2011: LM: 2 Zones were repaired using fast resync
Mon Mar 14 02:37:57 2011: DPM::deleteLogicalDisk: LD #3 REMOVED from disk pack
Mon Mar 14 02:37:57 2011: DPM::updateDis: writing 0x8a3256bc88413039/41 to LDs #0 1
Mon Mar 14 02:37:57 2011: LM: Degraded relayout finished
Mon Mar 14 02:38:11 2011: J2::hotPlug: added disk was DIRTY when removed, replaying from slot 1 to disk 1
Mon Mar 14 02:38:11 2011: DPM::hotPlug: SlotNumber = 1, op = ADD
Mon Mar 14 02:38:13 2011: DPM::hotPlug: Refusing EXPUNGED slot 1 [FULL TIMEOUTS] WWN=>WD-WMAZA3346693 WDCWD20EARS-00MVWB0 51.0AB51
Mon Mar 14 02:38:13 2011: EPM::tad: EXPUNGE COMPLETE ON SLOT 1 CLEARING PERISTENT FLAG
[…]

From diagnostic log after 3rd HD failure:

[…]
Mon Mar 14 10:32:08 2011: DPM::hotPlug: SlotNumber = 1, op = ADD
Mon Mar 14 10:32:09 2011: DPM::addLogicalDisk: Assigning LD #3 to (size: 931.51GiB) slot 1 WWN=>WD-WCAU41004736 WDCWD10EAVS-00D7B0 01.01A01
Mon Mar 14 10:32:09 2011: DPM::updateDis: writing 0x8a3256bc88413039/43 to LDs #0 1 3
Mon Mar 14 10:32:09 2011: SM: add physical disk 1 (logical disk 3)
Mon Mar 14 10:32:09 2011: Capacity: Free=75.69GiB(8.22%), Used=846.08GiB(91.78%), Total=921.77GiB, Unprotected=232.94GiB.
Mon Mar 14 10:32:09 2011: LM: rebuild initiated as logical disk 3 added
Mon Mar 14 10:32:09 2011: LM: Moved tertiary copy of the ZIS to : 3
Mon Mar 14 10:32:10 2011: LM: Relayout initiated as Zone 862 uses a missing disk
Mon Mar 14 10:46:26 2011: EPM::trackAndDispatchError: power cycling slot 1
Mon Mar 14 10:46:26 2011: DPM::hotPlug: SlotNumber = 1, op = POWER_CYCLE
Mon Mar 14 10:46:26 2011: DPM::hotplug: LD #3 matches disk in slot 1 WWN=>WD-WCAU41004736 WDCWD10EAVS-00D7B0 01.01A01
Mon Mar 14 10:46:26 2011: ZM: Marking region 5(3:114) unusable in zone 1 (offset 0x1f438000 copy #0
Mon Mar 14 10:46:27 2011: ZM: Marking region 4(3:113) unusable in zone 1 (offset 0x19f28000 copy #0
Mon Mar 14 10:46:42 2011: DPM::updateDis: writing 0x8a3256bc88413039/44 to LDs #0 1
Mon Mar 14 10:46:43 2011: J2::hotPlug: REMOVAL of disk from slot 1 with DIRTY journalette
Mon Mar 14 10:46:43 2011: SM: remove physical disk 1 (logical disk 3)
Mon Mar 14 10:46:43 2011: DPM::hotPlug: SlotNumber = 1, op = REMOVE
Mon Mar 14 10:46:43 2011: Capacity: Free=0B(0.00%), Used=846.08GiB(100.00%), Total=846.08GiB, Unprotected=0B.
Mon Mar 14 10:46:43 2011: DPM::hotplug: disk REMOVED from slot 1 has no associated LD, probably expunged
Mon Mar 14 10:46:48 2011: DPM::deleteLogicalDisk: LD #4 REMOVED from disk pack
[…]
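In case it helps anyone grepping through their own logs, here’s a quick sketch I put together to pull out the expunge events. The regex just matches the format of the “Refusing EXPUNGED slot” lines quoted above; the pattern and function name are mine, nothing official from Drobo:

```python
# Scan a decrypted Drobo diagnostic log for "Refusing EXPUNGED slot" lines
# and extract the slot number, failure reason, drive serial, and model.
import re

PATTERN = re.compile(
    r"Refusing EXPUNGED slot (\d+) \[([A-Z ]+)\] WWN=>(\S+) (\S+)"
)

def expunge_events(log_text):
    """Return (slot, reason, serial, model) tuples for each expunge line."""
    return [m.groups() for m in PATTERN.finditer(log_text)]

sample = (
    "Sun Mar 13 20:09:10 2011: DPM: Refusing EXPUNGED slot 0 "
    "[FULL TIMEOUTS] WWN=>WD-WMAZA3285189 WDCWD20EARS-00MVWB0 51.0AB51"
)
print(expunge_events(sample))
# [('0', 'FULL TIMEOUTS', 'WD-WMAZA3285189', 'WDCWD20EARS-00MVWB0')]
```

Running it over all three logs shows the same reason, FULL TIMEOUTS, on three different serials, which is what makes me suspect the enclosure rather than the drives.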

Uh?

Now, without any further action on my side (since I’m still looking for drives that I can use to replace/test the WDs), Drobo has removed the “bad drive” alert on the WDCWD10EAVS (I hadn’t removed it yet) and has started re-layout (36 hours to go :-)

What’s going on here?

Shortly after I bought my Drobo, it began to lock up and stop responding to the host PC. Sometimes one of the disks was marked as “failed,” but it would later be brought back online.
Pulling the suspect disk didn’t solve the lockup problem. Tech support looked through my logs and found that a different disk was causing errors and was likely triggering both the lockups and the intermittent errors on the other disk.
I replaced only the disk identified by tech support and all my problems went away.

I don’t pretend to understand the mechanism, but perhaps you too are being confounded by a Disk Gone Wild. Talk with tech support and/or check your logs for clues.