Drobo

Large copy literally disabling my shares

I’m trying to use a program that automatically renames some of my large media files and then copies them to the Drobo FS. The files originate on my Win 7 x64 machine and are copied to the Drobo FS by a Win32 program called Couch Potato. From the logs and the results I see the program successfully renames/copies the first file in my queue, but then when it comes to the second copy it reports an I/O error. At this point all the drives drop offline in every computer connected to the Drobo FS and the Dashboard reports no Drobo connected. The lights on the FS look normal with all greens for the drives and lower-left power indicator.

I can ssh into the Drobo, but when I do the /mnt directory is completely empty. No Shares, nothing. I have to execute /sbin/reboot or toggle the power button to make the FS visible to the Dashboard and my network again.

I have already contacted support for the Couch Potato program, but it seems odd to me that a simple copy operation is able to completely disable the Drobo FS. Is there any way to prevent 3rd party programs on my network from disabling the shares like that? Is there a particular log that might help troubleshoot this issue?

Finally, I hope to avoid getting sidetracked with a discussion of what files I’m copying and why. For the sake of clarity I’ll say they are all legally-obtained media files that are not properly named or embedded with metadata. They generally range from 4-11 GB in size.

hi domedtx, is it always the same 2nd file that has i/o problems?
and can you copy and paste / move that file through normal windows?

I’ve never used Couch Potato (have probably been one from time to time though :slight_smile:), but if that file is problematic then maybe there’s another kind of issue with it, or it’s being locked?

(you could try Process Explorer from Sysinternals, and try a DLL search and handle search for that filename to see if anything’s locking it)

So far it has occurred with several different files. They copy fine with Explorer and WinSCP. The whole point was to automate the process so they renamed and moved automatically, but I’ll have to go back to manual for now. Clearly the problem is with Couch Potato, but I’m curious what about it would be able to make the shares completely vanish. I guess I’ll have to wait for the next release of CP to see if they can fix it.

hmmm, maybe it’s the “renaming” part which is causing problems?

(e.g. I’ve often clicked on a large file, say 300 MB, and when I right-click, my other tools, antivirus etc. kick in and cause a bit of delay on the file, and that stops me from cutting and pasting it somewhere unless I close the Explorer window first, or wait, or change folder views before pasting)

idea :slight_smile: can you split your jobs into 2 parts?
job 1 = renaming stuff
job 2 = moving the renamed stuff

(or something similar, maybe with ample time between the jobs)?

Not that I am asking to keep doing this to your Drobo, but next time this happens, could you please ssh into your Drobo and paste here the output from these commands:
[list]
[*]dmesg | tail -n 20
[*]tail -n 20 /var/log/messages
[*]tail -n 20 /var/log/nasd.log
[/list]
Maybe there is something in there that can help us figure out why the mount is dropped on the FS.
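
The three commands above can be wrapped in a small script so everything is captured in one go the next time the shares vanish. This is only a sketch: the function name `collect_drobo_logs` and the default output path under /tmp are assumptions, and it presumes the shell on the FS provides the usual `dmesg`, `tail` and `date` tools. The output deliberately goes somewhere outside /mnt, since that is the mount that disappears.

```shell
#!/bin/sh
# Sketch: gather the three requested outputs into a single file.
# The output path is an assumption; keep it outside /mnt.
collect_drobo_logs() {
    out="$1"
    {
        echo "=== collected at $(date) ==="
        echo "--- dmesg | tail -n 20 ---"
        # dmesg may be missing or restricted; do not abort if it fails.
        dmesg 2>&1 | tail -n 20 || true
        for f in /var/log/messages /var/log/nasd.log; do
            echo "--- tail -n 20 $f ---"
            if [ -r "$f" ]; then
                tail -n 20 "$f"
            else
                echo "(not readable: $f)"
            fi
        done
    } > "$out"
}

# Default destination is hypothetical; pass another path as the first argument.
collect_drobo_logs "${1:-/tmp/drobo-logs.txt}"
```

Copy the resulting file off the device (or paste its contents here) before rebooting, in case /tmp does not survive a restart.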

I’m resurrecting my old thread because the problem, which I initially thought was restricted to CouchPotato, keeps coming back.

Over the summer I had some time on my hands while deployed, so I ripped hundreds of my DVDs to mkv files. I just hooked up the external drive to my computer and tried to copy a large amount (~200GB) to the Drobo. After about 20 minutes the copy process failed because the target directories couldn’t be created. Sure enough the Drobo was gone from my Windows, Mac and Linux computers.

I was able to ssh into the drive, where the /mnt directory was completely empty. I issued a reboot command, but after rebooting, the device didn’t show up in Windows and was no longer accessible by ssh. I’ve since cold-started it twice and rebooted the Windows machine, but I still don’t see the device. All the lights are lit fine and the switch shows me it is connected to the network.

This is what will make me get rid of the Drobo. I can’t deal with it mysteriously going offline and then being so much trouble to get back.

I sent a summary of the problem to support before my year ran out, and their response was they don’t support ssh. No information about why the drive keeps dropping offline, just the mention of ssh triggered the “we don’t support ssh” canned reply.

I’ve restarted the device twice since starting this post, once completely unplugging and re-plugging it, but it still doesn’t show up even though the front lights look normal. I’m hopeful I’ll get it back soon and be able to post the requested logs.[hr]
The device came up right after I hit send. Here are the logs requested, although they seem to be full of startup info and nothing from before I restarted (especially since I re-started 5 times). Next time I will try to catch it while the drives are disconnected and ssh still works.

# dmesg | tail -n 20
sd 0:0:0:0: [sda] 34359738368 512-byte hardware sectors (17592186 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 03 00 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and FUA
sda:<6>about to lock the queue
about to look for elements
about to unlock the queue
about to wait for completion
sda1 sda2
sd 0:0:0:0: [sda] Attached SCSI disk
process_get_msg:NR trying to free elt: 93013488, cache: 93cd06e0
sd 0:0:0:0: Attached scsi generic sg0 type 0
handle_intercore_msg: Got Time Valid message
process_get_msg:NR trying to free elt: 93013448, cache: 93cd06e0
kjournald starting. Commit interval 5 seconds
EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
EXT3 FS on sda1, internal journal
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
Adding 262136k swap on /mnt/DroboFS/swapfile. Priority:-1 extents:67 across:266792k

tail -n 20 /var/log/messages
tail: can't open '/var/log/messages': No such file or directory
tail: no files

[code]# tail -n 20 /var/log/nasd.log
Starting droboadmin.
Starting dependancy: apache
Starting apache.

Tue Oct 18 17:22:35 2011: ==========================================================
Tue Oct 18 17:22:35 2011: DNASConfigInitThread: LoadConfigAndStart Finished correctly
Tue Oct 18 17:22:35 2011: Trying command: /sbin/after_fsck /mnt/DroboFS/
ls: /mnt/DroboFS//lost+found/: No such file or directory
mv: cannot rename ‘/mnt/DroboFS//lost+found’: No such file or directory
forkProcessAndWaitForResult: /sbin/after_fsck exited with a 256 value
Tue Oct 18 17:22:35 2011: LinuxSupportExecuteCommand: /sbin/after_fsck command failed with return value 256
Tue Oct 18 17:22:35 2011: DNASConfigInitThread: Sharing data in lost+found for /mnt/DroboFS failed with error 256
Tue Oct 18 17:22:54 2011: SledCommandThread::~SledCommandThread: 0x1a68b8
Tue Oct 18 17:22:54 2011: Done with SledCommandThread::~SledCommandThread: 0x1a68b8
Tue Oct 18 17:23:05 2011: SledCommandThread::HandleTimeoutEvent: GetNASSharesConfig: Received command
Tue Oct 18 17:23:05 2011: SledCommandThread::HandleTimeoutEvent: GetNASAdminConfig: Received command
Tue Oct 18 17:23:06 2011: SledCommandThread::HandleTimeoutEvent: GetNASAdminConfig: Received command
Tue Oct 18 17:23:39 2011: SledCommandThread::HandleTimeoutEvent: GetNASSharesConfig: Received command
Tue Oct 18 17:23:39 2011: SledCommandThread::HandleTimeoutEvent: GetNASAdminConfig: Received command
Tue Oct 18 17:23:40 2011: SledCommandThread::HandleTimeoutEvent: GetNASAdminConfig: Received command
[/code][hr]
Not trying to flood the forum, but now as I’m trying to copy 1 large file the copy process keeps completely freezing then restarting. I’m getting an average copy speed of 4.6 MB/s where I usually get about 30. These two lines showed up in the dmesg log:

dri_dnas_abort: called to abort cmnd: 8d70c3c0, 8d0758c0 dri_dnas_abort: called to abort cmnd: 8d70cde0, 8d075560

I don’t want to scare you, but restarting the Drobo over and over might have made things worse.

A bit of background first: The FS uses EXT3 as the filesystem for the protected storage area. The thing about EXT3 is that every once in a while the filesystem driver will schedule a filesystem check “just to be sure”.

In the normal case, this shouldn’t take too long since there will be no problems. In your case, however, there were problems in your filesystem and the reason why your FS would not finish booting was that the filesystem check may take quite a while to finish on very large datasets.
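
For anyone wanting to spot these hints in their own output, the EXT3 lines in the dmesg paste earlier in the thread ("maximal mount count reached" and "recovery complete") can be filtered out with something like the sketch below. The function name `ext3_hints` and the exact pattern are my own choices, not anything Drobo ships, and it assumes the grep on the FS supports `-E`.

```shell
#!/bin/sh
# Sketch: read kernel messages on stdin and keep only EXT3 warnings,
# errors, and journal-recovery notices.
ext3_hints() {
    grep -E 'EXT3-fs (warning|error)|EXT3-fs: recovery complete'
}

# Typical use; grep exits non-zero when nothing matches, so ignore that.
dmesg 2>/dev/null | ext3_hints || true
```

A "recovery complete" line means the journal had to be replayed at mount time, which usually points to an unclean shutdown rather than a routine scheduled check.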

The reason why you did not get any SSH is that DroboApps are started only after the protected storage area is mounted. Until the filesystem checking is completed, the DroboApps will not start, although the underlying IP stack might already have been started.

It is weird that you could SSH in while /mnt was empty, but I guess it is possible that the OS unmounted the protected storage because some major problem was found, without killing the SSH process.

In other words, I can only say that the next time this happens, just let it run and don’t try to stop it.

Finally, don’t blame the FS for the filesystem inconsistencies. This whole “just-in-case” checking of the filesystem is something that is not specific to the FS, and happens to all Linux-based devices. A quick googling will show you how often filesystem corruption happens.

[quote=“ricardo, post:7, topic:2530”]
I don’t want to scare you, but restarting the Drobo over and over might have made things worse…In other words, I can only say that the next time this happens, just let it run and don’t try to stop it.[/quote]

So you’re saying it’s normal for the Drobo to completely drop offline every few days during large file copy operations? I’ve tried letting it wait all night without restarting, and it never came back. However, after a restart, it generally comes back right away.

I’m skeptical there are so many errors (in the system that reports all drives are healthy) that it has to drop offline every few days. I’ve been running Linux with ext3 drives for years and have never had a drive just disappear while I was using it, not once, let alone 3 times a week.

I don’t feel like I’m blaming the FS for anything except for not being available for extended periods, thereby making it useless to me. There’s got to be something going on if I can’t get 5 brand new drives to stay online while I copy data to them.

For curiosity’s sake, what part of my logs tell you I had problems with my filesystem? I see a few errors but nothing that tells me I have major filesystem problems that would cause the device to go offline. I’d appreciate help figuring out how to troubleshoot this further.

Thanks for the reply.

Nope. What I’m saying is that if you run into a problem during a large file copy, then let the next restart take as long as it has to. Don’t interrupt the restart by restarting again.

Yeah, the disappearing is a bit puzzling, but could be explained by the filesystem driver being dumped from memory due to low RAM, for instance. Don’t let the dashboard fool you, though. Healthy drives just means that the drives are healthy, not that the filesystem on top of them is ok.

I agree. My point is that you could probably google for similar issues on Linux systems. Whatever was used to fix those would most likely be applicable to the FS.

These messages are particularly troubling, if your Drobo is back to “normal”:

ls: /mnt/DroboFS//lost+found/: No such file or directory
mv: cannot rename '/mnt/DroboFS//lost+found': No such file or directory
forkProcessAndWaitForResult: /sbin/after_fsck exited with a 256 value
Tue Oct 18 17:22:35 2011: LinuxSupportExecuteCommand: /sbin/after_fsck command failed with return value 256
Tue Oct 18 17:22:35 2011: DNASConfigInitThread: Sharing data in lost+found for /mnt/DroboFS failed with error 256

My suggestion to you is to keep a few SSH windows open watching the logs (dmesg and /var/log/nasd.log) and start your copy again. If the problem happens again, then you should have a better view of what happened when the array went down.
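
If you would rather not babysit several terminals, a rough alternative is a small watcher that polls the mount table and saves the recent log lines the moment /mnt/DroboFS disappears. Treat this as a sketch under assumptions: the function names, the 10-second interval and the /tmp output path are all made up for illustration, and it presumes `awk` and /proc/mounts are available on the FS.

```shell
#!/bin/sh
# Returns 0 if mount point $1 is listed in the mounts table $2
# (defaults to /proc/mounts, where field 2 is the mount point).
mount_present() {
    mp="$1"
    table="${2:-/proc/mounts}"
    awk -v mp="$mp" '$2 == mp { found = 1 } END { exit found ? 0 : 1 }' "$table"
}

# Poll until the mount vanishes, then append recent kernel and nasd
# messages to the output file.
watch_mount() {
    mp="$1"; interval="$2"; out="$3"
    while mount_present "$mp"; do
        sleep "$interval"
    done
    {
        echo "=== $mp missing at $(date) ==="
        dmesg 2>&1 | tail -n 50 || true
        tail -n 50 /var/log/nasd.log 2>&1 || true
    } >> "$out"
}

# Only start watching when explicitly asked, e.g.: sh watch.sh watch
if [ "${1:-}" = "watch" ]; then
    watch_mount /mnt/DroboFS 10 /tmp/drobo-drop.log
fi
```

Start it in one SSH session before kicking off the copy. The snapshot is taken before any reboot, which is exactly the window the earlier logs missed.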

One last idea: are you sure the disk pack is not overheating?

I appreciate the detailed responses. Obviously I’m a little stressed about the issue because the Drobo FS is our media server and the “WAF” (Wife Approval Factor) goes to zero when she can’t watch her movies or TV. :slight_smile:

Your last comment caught my attention: how do I know if the disk pack is overheating? I’m not seeing any indication of a problem, but that’s most likely because I’m not looking in the right place.

I’m going to try to force an fsck tonight to see if it helps. I’ll also read through the forum about overheating to see if I can determine whether that is occurring.

Thank you again.

As far as I can tell there is no indication of temperature. You kinda have to improvise there. That being said, if your Drobo has good airflow overheating shouldn’t be a problem.

In case anyone’s watching, my DroboFS went down. Again. It also went down yesterday. Here’s what I see in the logs:

Tue Nov 22 17:56:29 2011: SledDiscoveryAgent::run: failed to get name for interface 1, error = 9

This is repeated over and over. Any ideas?

I am having this EXACT same problem… DEAR LORD I am very very annoyed… Support hasn’t been any help at all, just keeps saying that my logs seem fine and there’s no hardware issues.

It’ll be working fine until I’m doing a backup with CrashPlan on Windows 7 or torrenting to the Drobo or something… If a transfer takes a while, it will disappear even though all the lights are green and everything seems fine. Then I have to restart my Drobo just to get the shares back.

Apparently I can SSH in still… and when I run the commands from earlier, I get the following…

dmesg | tail -n 20

sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device

tail -n 20 /var/log/messages

tail: can’t open ‘/var/log/messages’: No such file or directory
tail: no files

tail -n 20 /var/log/nasd.log

Mon Dec 12 06:50:05 2011: SledDiscoveryAgent::run: failed to get name for interface 1, error = 9
Mon Dec 12 06:50:05 2011: SledDiscoveryAgent::run: Failed to get localhost address
Mon Dec 12 06:50:06 2011: SledDiscoveryAgent::run: failed to get name for interface 1, error = 9
Mon Dec 12 06:50:06 2011: SledDiscoveryAgent::run: Failed to get localhost address
Mon Dec 12 06:50:07 2011: SledDiscoveryAgent::run: failed to get name for interface 1, error = 9
Mon Dec 12 06:50:07 2011: SledDiscoveryAgent::run: Failed to get localhost address
Mon Dec 12 06:50:08 2011: SledDiscoveryAgent::run: failed to get name for interface 1, error = 9
Mon Dec 12 06:50:08 2011: SledDiscoveryAgent::run: Failed to get localhost address
Mon Dec 12 06:50:09 2011: SledDiscoveryAgent::run: failed to get name for interface 1, error = 9
Mon Dec 12 06:50:09 2011: SledDiscoveryAgent::run: Failed to get localhost address
Mon Dec 12 06:50:10 2011: SledDiscoveryAgent::run: failed to get name for interface 1, error = 9
Mon Dec 12 06:50:10 2011: SledDiscoveryAgent::run: Failed to get localhost address
Mon Dec 12 06:50:11 2011: SledDiscoveryAgent::run: failed to get name for interface 1, error = 9
Mon Dec 12 06:50:11 2011: SledDiscoveryAgent::run: Failed to get localhost address
Mon Dec 12 06:50:12 2011: SledDiscoveryAgent::run: failed to get name for interface 1, error = 9
Mon Dec 12 06:50:12 2011: SledDiscoveryAgent::run: Failed to get localhost address
Mon Dec 12 06:50:13 2011: SledDiscoveryAgent::run: failed to get name for interface 1, error = 9
Mon Dec 12 06:50:13 2011: SledDiscoveryAgent::run: Failed to get localhost address
Mon Dec 12 06:50:14 2011: SledDiscoveryAgent::run: failed to get name for interface 1, error = 9
Mon Dec 12 06:50:14 2011: SledDiscoveryAgent::run: Failed to get localhost address

I just want to bring this thread back to life… if anyone has any insight on this… Drobo support can’t figure it out & they’ve even given me a new Drobo and it’s better now… I can last a good 2-3 days before it just disappears on me again… And again, I can SSH in just fine but when I go into /mnt/DroboFS/ it’s an empty directory… I have to restart my FS every couple days to prevent this from happening, this is absolute nonsense.

I wound up putting my drives in an old computer in an attempt to use FreeNAS. FreeNAS immediately reported 2 of the 5 drives had serious errors and needed to be replaced. While waiting for the warranty replacement drives I decided to purchase a Thecus 5-bay NAS.

I think it’s likely the bad drives were causing me problems with the Drobo, although I can’t say for sure they are what caused the Drobo to go offline. I can say for sure that I expected the Drobo FS to warn me of impending hard drive failure, and I’m disappointed that it didn’t warn me of two failed drives.

Gotcha. Yeah the tech support person notified me of one drive that was going bad so I replaced it… and perhaps that is when everything got better… but then they said one of my other drives is running slower and I didn’t even want to believe it so I haven’t tried taking that one out. You would think that the Drobo would… tell you if a drive was going bad.