Drobo

7 blue lights... from the right?

I was deleting a few gigabytes of data from my Drobo this morning, when it threw some errors about referencing deleted inodes.

Dec 20 11:59:19 aeus kernel: [79778.109068] EXT3-fs error (device sdb1): ext3_free_inode: bit already cleared for inode 124793
Dec 20 12:00:11 aeus kernel: [79830.367345] EXT3-fs error (device sdb1): ext3_lookup: deleted inode referenced: 175678
Dec 20 12:00:11 aeus kernel: [79830.467250] EXT3-fs error (device sdb1): ext3_lookup: deleted inode referenced: 320233
Dec 20 12:00:28 aeus kernel: [79847.008332] EXT3-fs error (device sdb1): ext3_lookup: deleted inode referenced: 175678
Dec 20 12:00:28 aeus kernel: [79847.052785] EXT3-fs error (device sdb1): ext3_lookup: deleted inode referenced: 320233
Dec 20 12:00:30 aeus kernel: [79849.727485] EXT3-fs error (device sdb1): ext3_lookup: deleted inode referenced: 175678
Dec 20 12:00:31 aeus kernel: [79849.880842] EXT3-fs error (device sdb1): ext3_lookup: deleted inode referenced: 175678
Dec 20 12:00:48 aeus kernel: [79866.953850] EXT3-fs error (device sdb1): ext3_lookup: deleted inode referenced: 175678
Dec 20 12:00:50 aeus kernel: [79868.886855] EXT3-fs error (device sdb1): ext3_lookup: deleted inode referenced: 175678
Dec 20 12:02:41 aeus kernel: [79980.126974] EXT3-fs error (device sdb1): ext3_check_descriptors: Block bitmap for group 1152 not in group (block 0)!
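When a log fills up with repeats like the ones above, a quick triage step is to count which inodes keep appearing. This is only a sketch for summarizing syslog lines in the format shown (the regex and function are my own, not any Drobo tool):

```python
import re
from collections import Counter

# Matches EXT3-fs error lines like the ones above, capturing the device,
# the reporting function, and (when present) the trailing inode number.
ERROR_RE = re.compile(
    r"EXT3-fs error \(device (?P<dev>\w+)\): (?P<func>\w+): .*?(?P<inode>\d+)?$"
)

def summarize_ext3_errors(log_lines):
    """Count EXT3-fs errors per (function, inode) so you can see which
    inodes keep coming up. The inode group is None for messages that do
    not end in a number (e.g. ext3_check_descriptors complaints)."""
    counts = Counter()
    for line in log_lines:
        m = ERROR_RE.search(line)
        if m:
            counts[(m.group("func"), m.group("inode"))] += 1
    return counts
```

Running that over the excerpt above would show inode 175678 referenced repeatedly by ext3_lookup, which is usually a sign of directory-entry corruption rather than isolated bad sectors.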

The Drobo was serving up a mounted fs, so I unmounted it, sync’d the fs, shut the Drobo down (via drobom shutdown), then halted the physical server it was attached to.

When I restarted the server and plugged the Drobo back in, it did its startup dance, and the blue lights started moving from left to right… and then stopped.

It’s been sitting there with 7 blue LEDs lit, from right to left, with the leftmost 3 LEDs unlit. The underlying device (sdb) isn’t seen by the OS at all.

I’m hoping that it’s just running its own internal fsck on the device, but I can’t be sure.

What is it doing, and why isn’t it coming up?

UPDATE

It’s been almost 12 hours and it’s still sitting there with lights 4-10 lit in blue.

Should I power cycle it?

Should I let it run over night?

Is there no way to get ANYTHING useful out of it? No console? No diagnostics?

The OS itself doesn’t recognize it as a valid device, because the Drobo hasn’t fully “booted” yet.

p.s. This has been running fine, 24x7 for at least 7 months. I upgraded to the 1.3.5 firmware last night, and now it’s misbehaving.

HELP!!

24 hours and still not a single change in the number of LEDs lit.

What should I do here? I have a LOT of data on that Drobo that I can NOT lose!

Does anyone even READ these forums at all anymore?

Sure, people read the forum… but this is not an official support mechanism. While DRI employees may be present and help at times, it’s primarily for customers; as the first line at the bottom says, “This is a moderated community forum…”

Moreover, the DRI employees (Jennifer mainly) don’t work over the weekend.

Regardless, since your data seems to be of utmost importance to you (and it would be for me as well), I HIGHLY recommend NOT unplugging anything (leave things be) and calling support before you do anything else. If your house was on fire, you’d call the fire department, not post on a forum, right? :slight_smile:

I fear you might end up unintentionally making the problem worse by trying to fix it. I know I’ve fallen into that trap numerous times in my tech career…

Good luck, and please keep us updated on your status.

Yes I do; please remember we do not work on weekends. Today is Monday, and I read the forums in order.

Honestly, it sounds like your drobo has hung in the boot process.

I would suggest shutting down your computer. Unplug the data cable, then the power.
Turn your computer back on. Plug the data cable back in then the power.

Support was useless. I was told by support to back up the data and reformat the Drobo.

How can I back it up if the Drobo won’t complete booting to allow me to mount it and get to the data?

Then I was told that it must be an issue with the disk pack, because an empty Drobo boots fine. That’s a silly assumption: the Drobo trashed the filesystem on the disk pack, so of course it can boot empty yet fail to boot with (or repair) the disk pack whose filesystem it trashed.

The bits and bytes are still on the disk platters, but since I can’t see any output at boot time from the Drobo itself, I don’t know where it is stuck, or if it is stuck, or if there’s something else happening that 7 lit, blue LEDs aren’t telling me.

After searching the forums, I’m strongly leaning towards this being a firmware bug with 1.3.5, and I should never have upgraded to that version. 24 hours after I upgraded, the Drobo started having trouble.

Is it an easy process to downgrade back to 1.3.1 (the version it was running for 7 months, 24x7)?

i don’t know about downgrading to that specific firmware. i know some of them wouldn’t recognise disk packs which had been used with later firmwares (upgrading the firmware also led to the pack being updated).

Personally, here is what i would try doing (this is not entirely without risk, so continue at your peril!):

Boot drobo with NO disks - and it will finish booting (so you say)

now add a single disk from your pack; it should recognise it, highlight it as red, and say that too many drives have been removed and to please re-add the removed drives… and then you can continue adding the drives one at a time.

if it is something about the pack stopping your drobo from booting - then let it boot first, THEN add the pack.

if it mounts the pack successfully - then yay! if not then at least this way you will be able to pull a diagnostic file off it and find out what is going on.

if you can find the PHP script on here which translates drobo diags to plain text, i think the boot sequence logging is fairly verbose.

if that doesn’t work you could see if you can downgrade the firmware (with NO drives in), then see if it will boot with that disk pack - but it may not recognise a pack from a later firmware, or it could do something horrible to it.

Do let us know how you get on.

Really? So what if the upgrade of that “pack” fails, which appears to be the case here? What then? Writing the firmware to the disk pack is a BAD idea.

Won’t adding a 2nd-4th drive cause the Drobo to try to resilver drives 1 and 2, and then adding drive 3 will further resilver that spread, thus trashing any data on the platters themselves?

That can be done? Non-destructively? I can boot an empty Drobo, and load in an existing, full-of-data drive pack, one disk at a time? Without it trying to rebuild the disk pack?

I can’t do much, since the Drobo never completes the boot process. I suspect it’s sitting there at the underlying VxWorks OS level waiting at a prompt for someone to answer a question, like “Do you REALLY want to repair these broken inodes? [Y/n]”, but since there is no console, I obviously can’t acknowledge any input it is asking for.

Since it’s also a complete black box, I can’t tell if it’s crashed, waiting for input, or just busy repairing the filesystem it trashed on me.

They encrypt the logs now, that PHP script no longer works. Been there, tried that.

If downgrading the firmware trashes the data on the drive, Drobo has done something HORRIBLY wrong with their design.

At this point, I’m curious how I’m supposed to recover the data… I can’t send the disk pack out for data recovery, because no company out there, that I’m aware of, understands the proprietary Drobo resilvering format.

Ideas?

sorry, i didn’t phrase it correctly. i think when a pack is booted in a drobo with a newer firmware, the firmware makes a note of it in the pack, and the pack can’t be used with a lower firmware after that. i know they did this once or twice with the lower drobos (v1) a couple of years back, but that may be because it tweaked the way it fundamentally operated; they may not do this anymore for relatively minor firmware bumps, and your pack may still work perfectly with the 1.3.1 firmware.

No - drobo will only wipe disks to add them to a pack, and your disks are all from the same pack. If you think about the message - it says too many drives have been removed, please re-add them - so obviously it wants/needs the data on those drives; it wouldn’t wipe them to try and add them to a broken/unusable pack.

i THINK so, but i will not guarantee it - do so at your own risk. if you put in one drive, it will give you the warning asking you to add the rest of that pack: “too many drives have been removed - please replace a drive”.

personally i can tell you that i have done this and it did work for me (it did not wipe the drives as i added them). but it does only tend to be done as a troubleshooting measure!

apparently it’s just a bigger XOR pad - one of the guys in the other thread knows what they use now - maybe PM him?
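For what it’s worth, if the obfuscation really is just a repeating XOR pad (as claimed above - I can’t verify that, and the actual pad bytes are unknown), reversing it is trivial once you know or recover the pad. A sketch, with invented pad bytes purely for illustration:

```python
from itertools import cycle

def xor_decode(data: bytes, pad: bytes) -> bytes:
    """XOR each byte of `data` against a repeating pad.
    XOR is its own inverse, so the same call encodes and decodes."""
    return bytes(b ^ p for b, p in zip(data, cycle(pad)))

def recover_pad(ciphertext: bytes, known_plaintext: bytes) -> bytes:
    """If you know what the first bytes of the log should say (e.g. a
    timestamp prefix like b'2009-12-20,'), XORing them against the
    ciphertext yields the corresponding pad bytes."""
    return bytes(c ^ k for c, k in zip(ciphertext, known_plaintext))
```

The known-plaintext trick works here because Drobo diagnostics apparently start every line with a predictable timestamp, which gives you plenty of known bytes to recover a short repeating pad.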

if you did downgrade the firmware (with an empty drobo), then power down, insert your pack, and boot, i would be AMAZED if it did trash the pack. i would expect it, at worst, to spit out an error.

think about it - let’s say that your drobo (the physical thing) dies. the whole idea is you can pop the pack out and use it in another replacement drobo - what if that shipped with a lower firmware? it’s got to be able to handle it gracefully.

i think that

  1. inserting disks one at a time (and paying attention to what the error messages say) will not be destructive

  2. downgrading firmware with an empty drobo (if that is possible), and then booting with your pack in will not be destructive.

however it is your data and nothing is wholly without risk, so this is just my advice and i’m not taking any responsibility.

personally, i think you should get someone competent on the phone at support - ask them about the two options i have suggested, and then go through them one at a time - with the support guy still on the phone.

good luck!

PS>

if all else fails - DRI do have a special firmware that they used to give out - a READ ONLY firmware, for when things went horribly wrong. it would load a disk pack (pretty much regardless of whether drives are flagged bad or not) in read-only mode (to not mess things up further), to allow you to get your stuff off.

it may be worth enquiring about that too

All stop.

After downgrading the firmware:

“The set of disks you have inserted are incompatible with the Firmware running on your Drobo. There may be an update that allows you to use these disks. Would you like to check for updates?”

Ugh.

well at least it didn’t trash it? :slight_smile:

i’d go for putting the new firmware on. let it boot. then insert your disk pack one at a time.

it will be safe to try inserting the first disk - it will recognise it as part of a broken pack, and should prompt you to “re-add” the next one :slight_smile:

and at least hopefully you can get a diagnostic with the disk pack in place, to send to DRI

After a series of “Too many hard drives have been removed. Please re-insert the removed hard drives.” messages (even after all drives were inserted), and Drobo Dashboard disconnecting and reconnecting dozens of times, I now have a Drobo with all red drive LEDs lit, all 10 blue LEDs lit, and the solid green LED lit.

What does THIS mean?

all red lights means it isn’t a mountable disk pack.

it almost sounds like two drives have been flagged as bad at once.

i think your best bet may be to see if that “mount drives no matter what in read only mode” Firmware is available.

im guessing since it’s running, and the pack is in there, you can take a diagnostic too and send that to DRI - hopefully they can tell you which drives are causing the problems, and why?

I know it’s not a huge step forwards, but it’s booted with a pack in there - that’s minor progress. now just getting the data off it is the next challenge!

Is that something DRI provides? Or some hacked firmware here in the forums?

It’s not “running”, since DD can’t query it and show me the pie graph. All I get is a repeating dialog that too many hard drives were removed, and it goes away then displays again.

I’m not any closer than I was before. The physical drives are fine, they are not “bad” in any way. The Drobo was fine until I upgraded to 1.3.5, and now after rebooting it and trying to delete some data from it, the Drobo trashed the underlying ext3 filesystem on the disk pack, and it can’t read or repair it.

What I need is some sort of output, so I can see what is actually happening here and diagnose it. If this were a Linux machine, I’d have dozens of tools at my disposal to diagnose and fix the problem.

This black-box mentality will never succeed because of situations precisely like this, where the hardware destroys the configuration, but there’s no way to see what it’s doing under the hood.

I ABSOLUTELY CAN NOT lose the data on these drives. That is non-negotiable.

i wonder… and this is almost just thinking out loud/making stuff up

we all know drobo flags a drive as “bad” at some point when IT determines that the drive isn’t usable - this could be because of timeouts or whatever other black magic drobo uses to establish these things. very often the drive will continue to function perfectly normally in a desktop computer.

The fact this coincides with a firmware update makes me wonder if they have changed the tolerances. e.g. previously two of your drives had 8 timeouts each (say), and drobo had a limit of 10 before flagging a drive as bad. what if in the new firmware they changed that limit to 5 timeouts - so on first boot drobo sees two drives with what it now judges to be “failure” conditions and simultaneously fails them both?

that sounds plausible to me, but it is total conjecture.

if it is something “trivial” like that, then the read only firmware should be able to allow you to get your data off there ok.
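The timeout-tolerance conjecture above is easy to picture concretely. This is pure illustration - the threshold mechanism, the limits, and the counts are all invented for the example:

```python
def drives_flagged_bad(timeout_counts, limit):
    """Return the indices of drives whose accumulated timeout count
    meets or exceeds a (hypothetical) firmware failure limit."""
    return [i for i, n in enumerate(timeout_counts) if n >= limit]

# Four drives with accumulated timeout counts; drives 0 and 2 have 8 each.
counts = [8, 3, 8, 1]
```

Under an old limit of 10, `drives_flagged_bad(counts, 10)` flags nothing; drop the limit to 5 and drives 0 and 2 are failed simultaneously on the first boot after the upgrade - which would break the pack exactly the way described.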

I’d be interested to know SuiteB or Bhiga’s opinions on this, but i think it’s a bit early for them to be on here.

The read-only firmware was a troubleshooting tool DRI support used to send out. I think it was developed primarily because seagate drives had an issue where they were hanging for up to 30 seconds, so drobos were regularly failing them and breaking disk packs - but of course the drives and the data were fine, you just needed to be able to get at them, hence the read-only, mount-bad-disks firmware. of course all of this was relatively well documented in the OLD forums!

i’d get on the phone and see if it is available.

Powering the Drobo (with all lights lit) off and powering it back on, results in the same exact thing that I saw when I did this on my Linux machine: 7 blue LEDs lit, green LED blinks for a few minutes, then goes solid… and then nothing.

I’ve pulled the diagnostics file, and am trying to find the tools to parse it. No luck so far.

and can’t you take a diagnostic file from it in its current state? does it have to be mounted to do that? (i’ve never tried)

haha - we’re posting at cross purposes…

I just spent quite a bit of time on the phone with Support talking about the issues, running over what I’ve done, what may have caused the problem, and sent them the diagnostic file.

Apparently there is a (quote) “…lot of conflicting information here in the diagnostic, so we’ll have to escalate this…”

I’m not exactly sure what that means, but it doesn’t sound good.

I did ask about the “read-only” firmware, and the tech indicated that it’s only used as a last resort, because it’s an irreversible operation: once you use it, there is no going back.

I did find the php script that decodes the log, and what I see… isn’t good. These messages go on for miles:

2009-12-20,17:24:19: the beginning of the group is 473989120
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14466:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587268096
2009-12-20,17:24:19: the beginning of the group is 474021888
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14467:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587300864
2009-12-20,17:24:19: the beginning of the group is 474054656
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14468:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587333632
2009-12-20,17:24:19: the beginning of the group is 474087424
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14469:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587366400
2009-12-20,17:24:19: the beginning of the group is 474120192
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14470:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587399168
2009-12-20,17:24:19: the beginning of the group is 474152960
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14471:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587431936
2009-12-20,17:24:19: the beginning of the group is 474185728
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14472:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587464704
2009-12-20,17:24:19: the beginning of the group is 474218496
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14473:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587497472
2009-12-20,17:24:19: the beginning of the group is 474251264
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14474:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587530240
2009-12-20,17:24:19: the beginning of the group is 474284032
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14475:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587563008
2009-12-20,17:24:19: the beginning of the group is 474316800
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14476:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587595776
2009-12-20,17:24:19: the beginning of the group is 474349568
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14477:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587628544
2009-12-20,17:24:19: the beginning of the group is 474382336
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14478:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587661312
2009-12-20,17:24:19: the beginning of the group is 474415104
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14479:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587694080
2009-12-20,17:24:19: the beginning of the group is 474447872
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162
2009-12-20,17:24:19: Reconfigure bitmap not at an expected location for group 14480:
2009-12-20,17:24:19: the gdr shows the bitmap at alloc block 587726848
2009-12-20,17:24:19: the beginning of the group is 474480640
2009-12-20,17:24:19: and the overhead for backups is 513
2009-12-20,17:24:19: LBA is 1162

I guess I’m in a holding pattern until they get back to me.

Well hopefully they get back to you soon

And it’s good to know the emergency firmware is still available!

So I was rereading your post and I wanted to clarify some things.

You upgraded to firmware 1.3.5 first, I’m assuming Saturday night?

Then the next morning you were deleting data and started to receive the error messages, and the issues with the Drobo started when you rebooted?

So the firmware update was successful the night before and the issues started the next day?

The first post shows some errors deleting data at 11:59 on 12/20 (Sunday morning), so yes, that would probably have been after the firmware was upgraded (only decoding the log would have the relevant details there).

The firmware was upgraded, but the Drobo was not rebooted at that point. I did a full fsck of the underlying filesystem on the Drobo on the 18th (Friday), which took about 18 hours to complete.

Once that was done, I rsync’d some data to the array from my remote server, and verified it (md5sum).

At this point, my Drobo had a complete copy of valid, verified data on a clean, checked filesystem.

I upgraded the firmware, at the suggestion of bhiga here in the forums. I did not immediately reboot. Once the server itself had quiesced, I did the following:

  1. shut down the Drobo
  2. halted the server
  3. unplugged power and data to the Drobo
  4. brought the server back up
  5. powered up the Drobo
  6. plugged in the data cable to the Drobo

And that’s where it stayed for a day or two more, with the 7 blue LEDs lit, not going past that.

The filesystem on the Drobo was checked before the upgrade, and before the final data was copied to it and data deleted from it.

I upgraded the firmware, brought the Drobo down and back up, and the underlying filesystem was trashed, and would not let me repair it after bringing the Drobo down to try to unmount it and check the fs.

Whatever trashed the filesystem, or rendered it unreadable, it happened after a clean filesystem check (18-hour operation) and after I upgraded from 1.3.1 to 1.3.5.

Up until the firmware upgrade, those drives were clean, intact, checked and not failing or causing problems in any way.