[Gluster-users] Failed rebalance - lost files, inaccessible files, permission issues

Sat Nov 9 09:39:11 UTC 2013

On 11/9/2013 1:47 AM, Anand Avati wrote:
> Thanks for the detailed info. I have not yet looked into your logs, but
> will do so soon. There have been patches on rebalance which do fix
> issues related to ownership. But I am not (yet) sure about bugs which
> caused data loss. One question I have is -
> 
> [2013-10-29 23:13:49.611069] I [dht-rebalance.c:647:dht_migrate_file]
> 0-mdfs-dht: /REDACTED/mdfs/KPA/__kpacontentminepix/docs/008/__058:
> attempting to move from mdfs-replicate-1 to mdfs-replicate-6
> [2013-10-29 23:13:49.611582] I [dht-rebalance.c:647:dht_migrate_file]
> 0-mdfs-dht: /REDACTED/mdfs/KPA/__kpacontentminepix/docs/008/__058:
> attempting to move from mdfs-replicate-1 to mdfs-replicate-6
> 
> Are these two lines from the same log file or separate log files? If
> they are from the same log, then you might
> need http://review.gluster.org/4300 (available in 3.4)

They are from the same log file - the one that I put on my Dropbox
account and linked in the original message.  They are consecutive log
entries.
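
For anyone who wants to look for the same pattern in their own rebalance
log, here is a minimal sketch. It assumes each log entry sits on a single
line in the file (the entries quoted above are wrapped by the mail client)
and that the message format matches the two entries shown. The function
name find_repeated_moves and the regular expression are illustrative, not
part of Gluster.

```python
import re

# Pattern modelled on the two quoted entries; an assumption about the format.
MOVE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] I \[dht-rebalance\.c:\d+:\w+\]\s+"
    r"\S+: (?P<path>.+): attempting to move from "
    r"(?P<src>\S+) to (?P<dst>\S+)\s*$"
)

def find_repeated_moves(lines):
    """Return (path, first_ts, second_ts) for back-to-back duplicate moves.

    Two entries count as a repeat only when they are adjacent lines and
    name the same path, source subvolume and destination subvolume.
    """
    repeats = []
    prev = None  # (path, src, dst, ts) of the immediately preceding line
    for line in lines:
        m = MOVE_RE.match(line.rstrip("\n"))
        if not m:
            prev = None  # any other entry breaks the run
            continue
        key = (m.group("path"), m.group("src"), m.group("dst"))
        if prev is not None and prev[:3] == key:
            repeats.append((key[0], prev[3], m.group("ts")))
        prev = key + (m.group("ts"),)
    return repeats
```

Feeding it an open log file (for example "with open(logfile) as fh:
find_repeated_moves(fh)") would list every file for which the migration
was attempted twice in a row.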

There are three visible problems caused by the failed rebalance, which
failed after moving 1.5TB of data.  One is a relatively small number of
lost files - 91 that I know about.  They are completely gone; I can't
find them even on the bricks.  I even looked through the .glusterfs
directory on all the bricks for files with only one link.  There weren't
any.
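
A minimal sketch of that kind of single-link check, for reference. It
assumes the brick keeps a hard link for every regular file under its
hidden GFID directory (.glusterfs), so a regular file there with a link
count of 1 has lost its named path. The function name
find_single_link_files and the example brick path are hypothetical, and
this is not necessarily the exact procedure used here.

```python
import os
import stat

def find_single_link_files(gfid_dir):
    """Yield regular files under gfid_dir whose hard-link count is 1."""
    for dirpath, _dirnames, filenames in os.walk(gfid_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if stat.S_ISREG(st.st_mode) and st.st_nlink == 1:
                yield path

# Hypothetical usage on one brick:
# for path in find_single_link_files("/bricks/b1/.glusterfs"):
#     print(path)
```

Some of Gluster's own housekeeping files in that directory may also show
up with one link, so any hits would need a second look before being
treated as orphaned data.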

The second problem is 32362 files that show up in the fuse mount with
---------T permissions, but return a read error when I try to access
them.  A few of those files have since become readable (as root) with no
changes on my part, but most of them are still unreadable.  I have
located all of those files via the bricks and saved a copy elsewhere so
I can replace the unreadable ones.

The third problem is over 800000 files with 000 permissions.
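
Both of the last two groups can be picked out by mode alone.  A rough
sketch, assuming "---------T" means only the sticky bit is set (mode
1000) and "000" means no permission bits at all; the function name
classify_by_mode is made up for illustration:

```python
import os
import stat

def classify_by_mode(root):
    """Return (sticky_only, no_perms) lists of regular files under root."""
    sticky_only, no_perms = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, devices, sockets, etc.
            mode = stat.S_IMODE(st.st_mode)  # permission and special bits
            if mode == stat.S_ISVTX:
                sticky_only.append(path)  # shown by ls as ---------T
            elif mode == 0:
                no_perms.append(path)     # shown by ls as ----------
    return sticky_only, no_perms
```

Run against the mount point (or a subdirectory of it), this gives two
lists whose lengths can be compared with the counts above.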

The only files I've really looked at in depth are the ones that were
mentioned in the rebalance log as failed to migrate.  There are far too
many files on the volume to do much else.  Spot-checking hasn't turned
up any other problems, though.

Here's the df output showing the bricks on the first server along with
the fuse-mounted volumes, followed by the gluster volume info:

http://fpaste.org/52886/13839892/

Let me know if there's any other info I can provide.

Thanks,
Shawn