Why does Drobo disable the built-in “SMART” error logging and drive monitoring on any drive that is installed in it?
I noticed this recently when I installed a new hard drive. I had some issues with it: after ~7 hours of rebuilding, the drive would red-light or Drobo would reset. I took the drive out and placed it in my Linux box to analyse it. I checked the SMART data to see if any bad blocks had been logged… SMART was disabled. So I enabled it and placed the drive back into my Drobo. Again the drive red-lighted after a few hours. I took it out again, and Drobo had once again disabled SMART logging, so I couldn’t see what had caused the red light. I enabled SMART and ran “badblocks” on the drive, and sure enough it found bad sectors, which were then logged in the SMART data. Needless to say, I then sent the drive back to the shop and got a replacement.
But why is SMART disabled by Drobo? It makes it really hard to see what has gone wrong with a drive, and also to keep track of its performance and health during normal use. It would have been a two-minute job to see the bad sectors if they had been logged in the SMART data!
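For anyone wanting to do that “two-minute” check once SMART is back on, here is a rough Python sketch that pulls the sector-related counters out of the text printed by `smartctl -A`. The attribute names are the usual ATA ones, but not every drive reports all of them, and SAMPLE_OUTPUT is made-up illustrative text, not output from the drive discussed in this thread.

```python
# Sketch: extract sector-health counters from `smartctl -A` text output.
# SAMPLE_OUTPUT is invented for illustration; it is NOT real data from this thread.

# Attributes that usually reflect bad or remapped sectors (assumed set; drives vary).
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""


def sector_counters(smartctl_text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    counters = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = fields[9]  # RAW_VALUE column
            if raw.isdigit():
                counters[fields[1]] = int(raw)
    return counters


def looks_suspect(counters):
    """True if any watched counter is non-zero."""
    return any(value > 0 for value in counters.values())


if __name__ == "__main__":
    found = sector_counters(SAMPLE_OUTPUT)
    print(found, "suspect" if looks_suspect(found) else "clean")
```

A non-zero pending or reallocated count is only a hint, not proof of failure, but it is the sort of thing a shop will usually accept as evidence.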
In a RAID-type situation, either the drives can be left to self-diagnose and recover, or the controller can handle diagnosis and recovery.
If the drives handle things, RAID performance can be impacted while a drive “takes a time out” to fix itself. That is not good when the RAID controller is sitting there saying “Well… are you done?” and the drive doesn’t respond, because the controller can end up thinking there’s a bigger (non-correctable) problem on the drive.
If the controller handles the diagnosis and repair, then it’s aware of everything that’s going on and won’t think there’s a non-correctable problem when the drive simply needs a sector remap. The problem now is that the drive isn’t individually aware of the error - but in a RAID environment this usually isn’t an issue because the drive won’t be used outside of its set.
So in the Drobo situation, if Drobo marks a drive bad and you need “proof” of it being bad (to get warranty support, etc), you’ll need to pull the drive, re-enable its error detection/recovery, and “exercise” it a while to ensure it catches the badness.
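The pull, re-enable and exercise routine can be sketched as a small Python helper, assuming a Linux box with smartmontools and e2fsprogs installed. `/dev/sdX` is a placeholder for wherever the pulled drive shows up, and `diagnostic_commands` and `run_all` are names made up for this sketch. The commands need root, and `badblocks` without `-w` or `-n` only reads, so it should not touch the data.

```python
import subprocess


def diagnostic_commands(device):
    """Build the command sequence for checking a pulled drive on a Linux box."""
    return [
        ["smartctl", "-s", "on", device],     # re-enable SMART on the drive
        ["badblocks", "-sv", device],         # read-only surface scan with progress
        ["smartctl", "-A", device],           # attribute table after the scan
        ["smartctl", "-l", "error", device],  # the drive's own error log
    ]


def run_all(device, dry_run=True):
    """Print the commands, or actually run them when dry_run is False."""
    for argv in diagnostic_commands(device):
        if dry_run:
            print(" ".join(argv))
        else:
            # smartctl uses non-zero exit codes as status bits, so don't raise on them.
            subprocess.run(argv, check=False)


if __name__ == "__main__":
    run_all("/dev/sdX")  # placeholder device; dry run only prints the commands
```

A full read scan of a large drive takes hours, so the dry run is there to check the device name before committing to it.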
There’s also the point floating around on another thread that perhaps Drobo is ultra-sensitive even to correctable errors, but this might be a conflict between the on-drive detection/repair and Drobo’s detection/repair, as mentioned at the beginning of this post.
I understand the issue now and I suppose it makes sense.
I guess it now raises the question: if Drobo was handling error detection/recovery, then why didn’t it simply mark the sectors as bad in its firmware and avoid them in future, rather than spitting the drive out as bad? Does this mean all my drives will red-light on the appearance of only one bad sector? (Although to be honest I doubt I would continue to use a drive with bad sectors anyway, so it is perhaps not an issue.)
The more worrying thing was the way Drobo would occasionally just reboot when it detected the error, not something I’d want happening if the drive was in use and a bad sector was encountered on one of the drives.
But anyway, it’s sufficient as a secondary backup solution.
Drives that aren’t “RAID-rated” do not have TLER (Time-Limited Error Recovery) enabled, which means a drive will “take a time out” to fix an error if it encounters one.
When the time required to remap the bad sector surpasses Drobo’s “uh-oh, the drive might be dying” timeout, then Drobo says “Well, this drive is going bad. Red-light it.”
So it’s a bit of tug-of-war between the drive and Drobo.
IF this is the core problem, then enabling TLER on the drives will prevent Drobo from red-lighting drives that are not encountering unrecoverable errors.
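To make the tug-of-war concrete, here is a toy Python model of that hypothesis. Every number in it is an assumption: Drobo’s real timeout is not known here, 7 seconds is just the commonly cited TLER limit, and the function names are made up for the sketch.

```python
def drive_response_time(recovery_needed_s, tler_limit_s=None):
    """Seconds the drive stays silent while dealing with a bad sector.

    Without TLER the drive keeps retrying for as long as it needs.
    With TLER it gives up after tler_limit_s and reports the error.
    """
    if tler_limit_s is None:
        return recovery_needed_s
    return min(recovery_needed_s, tler_limit_s)


def controller_verdict(recovery_needed_s, controller_timeout_s, tler_limit_s=None):
    """'red-light' if the drive is silent longer than the controller will wait."""
    silent_for = drive_response_time(recovery_needed_s, tler_limit_s)
    return "red-light" if silent_for > controller_timeout_s else "keep"


if __name__ == "__main__":
    # Hypothetical figures: 30 s of internal retries, 10 s controller patience.
    print(controller_verdict(30, 10))                  # no TLER
    print(controller_verdict(30, 10, tler_limit_s=7))  # TLER capped at 7 s
```

In this model the same bad sector gets the drive kicked out without TLER and tolerated with it, which is the whole argument in miniature. It says nothing about whether Drobo really works this way.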
The reboot thing seems to be a more serious issue - not sure where that comes from, maybe some sort of failsafe of “well, the drive really isn’t responding, so let’s try a power cycle” is happening - I have no clue.
If I am correct about the “conflict” between Drobo timing out waiting for the drive and the drive remapping sectors, then I’d place Drobo in the ultra-paranoid category. If the drive is not entirely perfect then Drobo won’t like it.
Somehow Drobo and the drive’s internal diagnostic/correction need to work together, either by Drobo allowing a longer time for the drive’s correction to take place, or by the drive not doing correction on its own.