[Gluster-users] Is rebalance completely broken on 3.5.3 ?
Olav Peeters
opeeters at gmail.com
Fri Mar 20 17:56:19 UTC 2015
Hi Alessandro,
what you describe here reminds me of this issue:
http://www.spinics.net/lists/gluster-users/msg20144.html
And now that you mention it, the mess on our cluster could indeed have
been triggered by an aborted rebalance.
This is a very important clue, since apparently developers were never
able to reproduce the issue in the lab. I also tried to reproduce the
issue on a test cluster, but never succeeded.
The example you describe below seems relatively easy to fix. A rebalance
fix-layout should eventually get rid of the sticky-bit link files
(---------T) on your bricks 5 and 6, and you could manually remove the
files created on 10/03 as long as you also remove the corresponding link
file in the .glusterfs dir on that brick (see the sketch below).
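Roughly like this, purely as a sketch (I am using your volume name "home"
and the gfid shown for the brick2 copy; double-check paths and gfids on
your own bricks and keep a backup before removing anything):

    # let DHT rebuild a consistent layout; this should also clean up the
    # ---------T link files on bricks 5 and 6
    gluster volume rebalance home fix-layout start

    # on bricks 3 and 4: remove the duplicate created on 10/03 together
    # with its hardlink under .glusterfs (the path follows from
    # trusted.gfid 0x14a1c10e... ->
    # .glusterfs/14/a1/14a1c10e-b147-4ef2-bf72-f4c6c64a90ce)
    BRICK=/data/glusterfs/home/brick2
    ls -li $BRICK/seviri/.forward \
        $BRICK/.glusterfs/14/a1/14a1c10e-b147-4ef2-bf72-f4c6c64a90ce
    # only if both entries show the same inode number:
    rm $BRICK/seviri/.forward \
        $BRICK/.glusterfs/14/a1/14a1c10e-b147-4ef2-bf72-f4c6c64a90ce

Afterwards, stat the file once through a client mount so it gets looked
up again (and self-healed if necessary) with a clean layout.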
I wholeheartedly agree with you that this needs the urgent attention of
the developers before they start working on new features. A mess like
this in a distributed file system makes the file system unusable for
production. This should never happen, never! And if it does, a rebalance
should be able to detect and fix it... fast and efficiently. I also agree
that the status of a rebalance should be more telling, giving a clear
idea of how long it would still take to complete. On large clusters a
rebalance often takes ages and makes the entire cluster extremely
vulnerable. (Another scary operation is remove-brick, but that is another
story.)
What I did in our case, and maybe this could help you too as a quick fix
for the most critical directories, is to rsync them to different storage
(via a mount point). rsync copies a duplicated file only once, and you
could separately copy a good version of the problem files (in the case
below, e.g. -rw-r--r-- 2 seviri users 68 May 26 2014
/data/glusterfs/home/brick1/seviri/.forward). But probably, as soon as
you remove the files created on 10/03 (incl. the gluster link file in
.glusterfs), the listing via your NFS mount will be restored. Try this
out with a couple of files you have backed up to be sure; a sketch
follows below.
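Something along these lines (the mount point, the target path and the
"server3" host are only placeholders for your own setup):

    # from a client: copy the critical directories off the gluster mount;
    # a name that is listed twice is still only copied once
    rsync -av /mnt/home/seviri/ /some/other/storage/seviri/

    # for files that are still broken on the mount, take the good copy
    # straight from a healthy brick (brick1 on brick servers 3/4 here)
    rsync -av server3:/data/glusterfs/home/brick1/seviri/.forward \
        /some/other/storage/seviri/.forward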
Hope this helps!
Cheers,
Olav
On 20/03/15 12:22, Alessandro Ipe wrote:
>
> Hi,
>
> After launching a "rebalance" on an idle gluster system one week ago,
> its status told me it has scanned more than 23 million files on each of
> my 6 bricks. However, without knowing at least the total number of
> files to be scanned, this status is USELESS from an end-user
> perspective, because it does not allow you to know WHEN the rebalance
> could eventually complete (one day, one week, one year or never). From
> my point of view, the total files per brick could be obtained and
> maintained when activating quota, since the whole filesystem has to be
> crawled...
>
> After one week of being offline and still no clue when the rebalance
> would complete, I decided to stop it...
>
> Enormous mistake... It seems that rebalance cannot avoid screwing up
> some files. For example, on the only client mounting the gluster
> system, "ls -la /home/seviri" returns
>
> ls: cannot access /home/seviri/.forward: Stale NFS file handle
> ls: cannot access /home/seviri/.forward: Stale NFS file handle
> -????????? ? ? ? ? ? .forward
> -????????? ? ? ? ? ? .forward
>
> while this file could be accessed perfectly well before (being
> rebalanced) and has not been modified for at least 3 years.
>
> Getting the extended attributes on the various bricks 3, 4, 5, 6
> (3-4 replicate, 5-6 replicate):
>
> Brick 3:
>
> ls -l /data/glusterfs/home/brick?/seviri/.forward
> -rw-r--r-- 2 seviri users 68 May 26 2014 /data/glusterfs/home/brick1/seviri/.forward
> -rw-r--r-- 2 seviri users 68 Mar 10 10:22 /data/glusterfs/home/brick2/seviri/.forward
>
> getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
> # file: data/glusterfs/home/brick1/seviri/.forward
> trusted.afr.home-client-8=0x000000000000000000000000
> trusted.afr.home-client-9=0x000000000000000000000000
> trusted.gfid=0xc1d268beb17443a39d914de917de123a
>
> # file: data/glusterfs/home/brick2/seviri/.forward
> trusted.afr.home-client-10=0x000000000000000000000000
> trusted.afr.home-client-11=0x000000000000000000000000
> trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
> trusted.glusterfs.quota.4138a9fa-a453-4b8e-905a-e02cce07d717.contri=0x0000000000000200
> trusted.pgfid.4138a9fa-a453-4b8e-905a-e02cce07d717=0x00000001
>
> Brick 4:
>
> ls -l /data/glusterfs/home/brick?/seviri/.forward
> -rw-r--r-- 2 seviri users 68 May 26 2014 /data/glusterfs/home/brick1/seviri/.forward
> -rw-r--r-- 2 seviri users 68 Mar 10 10:22 /data/glusterfs/home/brick2/seviri/.forward
>
> getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
> # file: data/glusterfs/home/brick1/seviri/.forward
> trusted.afr.home-client-8=0x000000000000000000000000
> trusted.afr.home-client-9=0x000000000000000000000000
> trusted.gfid=0xc1d268beb17443a39d914de917de123a
>
> # file: data/glusterfs/home/brick2/seviri/.forward
> trusted.afr.home-client-10=0x000000000000000000000000
> trusted.afr.home-client-11=0x000000000000000000000000
> trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
> trusted.glusterfs.quota.4138a9fa-a453-4b8e-905a-e02cce07d717.contri=0x0000000000000200
> trusted.pgfid.4138a9fa-a453-4b8e-905a-e02cce07d717=0x00000001
>
> Brick 5:
>
> ls -l /data/glusterfs/home/brick?/seviri/.forward
> ---------T 2 root root 0 Mar 18 08:19 /data/glusterfs/home/brick2/seviri/.forward
>
> getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
> # file: data/glusterfs/home/brick2/seviri/.forward
> trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
> trusted.glusterfs.dht.linkto=0x686f6d652d7265706c69636174652d3400
>
> Brick 6:
>
> ls -l /data/glusterfs/home/brick?/seviri/.forward
> ---------T 2 root root 0 Mar 18 08:19 /data/glusterfs/home/brick2/seviri/.forward
>
> getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
> # file: data/glusterfs/home/brick2/seviri/.forward
> trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
> trusted.glusterfs.dht.linkto=0x686f6d652d7265706c69636174652d3400
>
> Looking at the results from bricks 3 & 4 shows something weird. The
> file exists in two sub-brick storage directories, while it should only
> be found once on each brick server. Or does the issue lie in the
> results of bricks 5 & 6? How can I fix this, please? By the way, the
> split-brain tutorial only covers BASIC split-brain conditions and not
> complex (real-life) cases like this one. It would definitely benefit
> from being enriched with this one.
>
> More generally, I think the concept of gluster is promising, but if
> basic commands (rebalance, absolutely needed after adding more storage)
> from its own CLI can put the system into an unstable state, I am really
> starting to question its ability to be used in a production
> environment. And from an end-user perspective, I do not care about new
> features being added, no matter how appealing they may be, if the basic
> ones are not almost totally reliable. Finally, testing gluster under
> high load on the brick servers (real-world conditions) would certainly
> give the developers insight into what is failing and what therefore
> needs to be fixed to mitigate this and improve gluster reliability.
>
> Forgive my harsh words/criticisms, but having to struggle with gluster
> issues for two weeks now is getting on my nerves, since my colleagues
> cannot use the data stored on it and I cannot foresee when it will be
> back online.
>
> Regards,
>
> Alessandro.
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users