[Gluster-users] Is rebalance completely broken on 3.5.3 ?
Olav Peeters
opeeters at gmail.com
Fri Mar 20 17:56:19 UTC 2015
Hi Alessandro,
what you describe here reminds me of this issue:
http://www.spinics.net/lists/gluster-users/msg20144.html
And now that you mention it, the mess on our cluster could indeed have
been triggered by an aborted rebalance.
This is a very important clue, since apparently developers were never
able to reproduce the issue in the lab. I also tried to reproduce the
issue on a test cluster, but never succeeded.
The example you describe below seems relatively easy to fix. A rebalance
fix-layout should eventually get rid of the sticky-bit link files
(---------T) on your bricks 5 and 6, and you could manually remove the
files created on 10/03 as long as you also remove the corresponding link
file in the .glusterfs dir on that brick (see the sketch below).
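Roughly like this, purely as a sketch (I am using your volume name "home"
and the gfid shown for the brick2 copy; double-check paths and gfids on
your own bricks and keep a backup before removing anything):

    # let DHT rebuild a consistent layout; this should also clean up the
    # ---------T link files on bricks 5 and 6
    gluster volume rebalance home fix-layout start

    # on bricks 3 and 4: remove the duplicate created on 10/03 together
    # with its hardlink under .glusterfs (the path follows from
    # trusted.gfid 0x14a1c10e... ->
    # .glusterfs/14/a1/14a1c10e-b147-4ef2-bf72-f4c6c64a90ce)
    BRICK=/data/glusterfs/home/brick2
    ls -li $BRICK/seviri/.forward \
        $BRICK/.glusterfs/14/a1/14a1c10e-b147-4ef2-bf72-f4c6c64a90ce
    # only if both entries show the same inode number:
    rm $BRICK/seviri/.forward \
        $BRICK/.glusterfs/14/a1/14a1c10e-b147-4ef2-bf72-f4c6c64a90ce

Afterwards, stat the file once through a client mount so it gets looked
up again (and self-healed if necessary) with a clean layout.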
I wholeheartedly agree with you that this needs the urgent attention of
the developers before they start working on new features. A mess like
this in a distributed file system makes the file system unusable for
production. This should never happen, never! And if it does, a rebalance
should be able to detect and fix it... fast and efficiently. I also agree
that the status of a rebalance should be more telling, giving a clear
idea of how long it would still take to complete. On large clusters a
rebalance often takes ages and makes the entire cluster extremely
vulnerable. (Another scary operation is remove-brick, but that is another
story.)
What I did in our case, and maybe this could help you too as a quick fix
for the most critical directories, is to rsync them to different storage
(via a mount point). rsync copies a duplicated file only once, and you
could separately copy a good version of the problem files (in the case
below, e.g. -rw-r--r-- 2 seviri users 68 May 26 2014
/data/glusterfs/home/brick1/seviri/.forward). But probably, as soon as
you remove the files created on 10/03 (incl. the gluster link file in
.glusterfs), the listing via your NFS mount will be restored. Try this
out with a couple of files you have backed up to be sure; a sketch
follows below.
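Something along these lines (the mount point, the target path and the
"server3" host are only placeholders for your own setup):

    # from a client: copy the critical directories off the gluster mount;
    # a name that is listed twice is still only copied once
    rsync -av /mnt/home/seviri/ /some/other/storage/seviri/

    # for files that are still broken on the mount, take the good copy
    # straight from a healthy brick (brick1 on brick servers 3/4 here)
    rsync -av server3:/data/glusterfs/home/brick1/seviri/.forward \
        /some/other/storage/seviri/.forward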
Hope this helps!
Cheers,
Olav
On 20/03/15 12:22, Alessandro Ipe wrote:
>
> Hi,
>
> After launching a "rebalance" on an idle gluster system one week ago,
> its status told me it has scanned more than 23 million files on each of
> my 6 bricks. However, without knowing at least the total number of
> files to be scanned, this status is USELESS from an end-user
> perspective, because it does not allow you to know WHEN the rebalance
> could eventually complete (one day, one week, one year or never). From
> my point of view, the total files per brick could be obtained and
> maintained when activating quota, since the whole filesystem has to be
> crawled...
>
> After one week of being offline and still no clue when the rebalance
> would complete, I decided to stop it...
>
> Enormous mistake... It seems that rebalance cannot avoid screwing up
> some files. For example, on the only client mounting the gluster
> system, "ls -la /home/seviri" returns
>
> ls: cannot access /home/seviri/.forward: Stale NFS file handle
> ls: cannot access /home/seviri/.forward: Stale NFS file handle
> -????????? ? ? ? ? ? .forward
> -????????? ? ? ? ? ? .forward
>
> while this file could be accessed perfectly well before (being
> rebalanced) and has not been modified for at least 3 years.
>
> Getting the extended attributes on the various bricks 3, 4, 5, 6
> (3-4 replicate, 5-6 replicate):
>
> Brick 3:
>
> ls -l /data/glusterfs/home/brick?/seviri/.forward
> -rw-r--r-- 2 seviri users 68 May 26 2014 /data/glusterfs/home/brick1/seviri/.forward
> -rw-r--r-- 2 seviri users 68 Mar 10 10:22 /data/glusterfs/home/brick2/seviri/.forward
>
> getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
> # file: data/glusterfs/home/brick1/seviri/.forward
> trusted.afr.home-client-8=0x000000000000000000000000
> trusted.afr.home-client-9=0x000000000000000000000000
> trusted.gfid=0xc1d268beb17443a39d914de917de123a
>
> # file: data/glusterfs/home/brick2/seviri/.forward
> trusted.afr.home-client-10=0x000000000000000000000000
> trusted.afr.home-client-11=0x000000000000000000000000
> trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
> trusted.glusterfs.quota.4138a9fa-a453-4b8e-905a-e02cce07d717.contri=0x0000000000000200
> trusted.pgfid.4138a9fa-a453-4b8e-905a-e02cce07d717=0x00000001
>
> Brick 4:
>
> ls -l /data/glusterfs/home/brick?/seviri/.forward
> -rw-r--r-- 2 seviri users 68 May 26 2014 /data/glusterfs/home/brick1/seviri/.forward
> -rw-r--r-- 2 seviri users 68 Mar 10 10:22 /data/glusterfs/home/brick2/seviri/.forward
>
> getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
> # file: data/glusterfs/home/brick1/seviri/.forward
> trusted.afr.home-client-8=0x000000000000000000000000
> trusted.afr.home-client-9=0x000000000000000000000000
> trusted.gfid=0xc1d268beb17443a39d914de917de123a
>
> # file: data/glusterfs/home/brick2/seviri/.forward
> trusted.afr.home-client-10=0x000000000000000000000000
> trusted.afr.home-client-11=0x000000000000000000000000
> trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
> trusted.glusterfs.quota.4138a9fa-a453-4b8e-905a-e02cce07d717.contri=0x0000000000000200
> trusted.pgfid.4138a9fa-a453-4b8e-905a-e02cce07d717=0x00000001
>
> Brick 5:
>
> ls -l /data/glusterfs/home/brick?/seviri/.forward
> ---------T 2 root root 0 Mar 18 08:19 /data/glusterfs/home/brick2/seviri/.forward
>
> getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
> # file: data/glusterfs/home/brick2/seviri/.forward
> trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
> trusted.glusterfs.dht.linkto=0x686f6d652d7265706c69636174652d3400
>
> Brick 6:
>
> ls -l /data/glusterfs/home/brick?/seviri/.forward
> ---------T 2 root root 0 Mar 18 08:19 /data/glusterfs/home/brick2/seviri/.forward
>
> getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
> # file: data/glusterfs/home/brick2/seviri/.forward
> trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
> trusted.glusterfs.dht.linkto=0x686f6d652d7265706c69636174652d3400
>
> Looking at the results from bricks 3 & 4 shows something weird. The
> file exists in two sub-brick storage directories, while it should only
> be found once on each brick server. Or does the issue lie in the
> results of bricks 5 & 6? How can I fix this, please? By the way, the
> split-brain tutorial only covers BASIC split-brain conditions and not
> complex (real-life) cases like this one. It would definitely benefit
> from being enriched with this one.
>
> More generally, I think the concept of gluster is promising, but if
> basic commands (rebalance, absolutely needed after adding more storage)
> from its own CLI can put the system into an unstable state, I am really
> starting to question its ability to be used in a production
> environment. And from an end-user perspective, I do not care about new
> features being added, no matter how appealing they may be, if the basic
> ones are not almost totally reliable. Finally, testing gluster under
> high load on the brick servers (real-world conditions) would certainly
> give the developers insight into what is failing and what therefore
> needs to be fixed to mitigate this and improve gluster reliability.
>
> Forgive my harsh words/criticisms, but having to struggle with gluster
> issues for two weeks now is getting on my nerves, since my colleagues
> cannot use the data stored on it and I cannot foresee when it will be
> back online.
>
> Regards,
>
> Alessandro.
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users