[Gluster-users] Is rebalance completely broken on 3.5.3 ?

Thu Apr 2 16:41:58 UTC 2015

Hi Nithya,

Sorry that it took so long to respond...

1. Indeed, couple of weeks ago, I added 2 bricks (in replicate mode) with add-brick and since, I was never able to complete the required rebalance (however a rebalance fix-layout completed).
2. *home-rebalance.log*
[2015-03-13 21:32:58.066242] E [dht-rebalance.c:1328:gf_defrag_migrate_data] 0-home-dht: /seviri/.forward lookup failed
and the same "lookup failed" log for a lot of other files
[2015-03-13 21:32:58.245795] E [dht-linkfile.c:278:dht_linkfile_setattr_cbk] 0-home-dht: setattr of uid/gid on /seviri/.forward :<gfid:00000000-0000-0000-0000-000000000000> failed (Stale NFS file handle)
[2015-03-13 21:32:58.286201] E [dht-common.c:2465:dht_vgetxattr_cbk] 0-home-dht: Subvolume home-replicate-4 returned -1 (Stale NFS file handle)
[2015-03-13 21:32:58.286258] E [dht-rebalance.c:1336:gf_defrag_migrate_data] 0-home-dht: Failed to get node-uuid for /seviri/.forward
and after initiating a stop command on the cli
[2015-03-19 10:34:38.484381] E [dht-rebalance.c:1622:gf_defrag_fix_layout] 0-home-dht: Fix layout failed for /seviri/MSG/2007/MSG1_20070106/HRIT_200701060115
[2015-03-19 10:34:38.487426] E [dht-rebalance.c:1622:gf_defrag_fix_layout] 0-home-dht: Fix layout failed for /seviri/MSG/2007/MSG1_20070106
[2015-03-19 10:34:38.487943] E [dht-rebalance.c:1622:gf_defrag_fix_layout] 0-home-dht: Fix layout failed for /seviri/MSG/2007
[2015-03-19 10:34:38.488361] E [dht-rebalance.c:1622:gf_defrag_fix_layout] 0-home-dht: Fix layout failed for /seviri/MSG
[2015-03-19 10:34:38.488801] E [dht-rebalance.c:1622:gf_defrag_fix_layout] 0-home-dht: Fix layout failed for /seviri
3. We are exclusively accessing the servers through the native gluster fuse client, so no NFs mount.
4. The attributes of this specific file are given in my initial post at
 http://www.gluster.org/pipermail/gluster-users/2015-March/021175.html

Meanwhile, I launched a full heal and that specific file could be accessed normally, after a couple of days of healing... However, I now get in the client log (gluster.log) the following messages:
[2015-04-01 15:20:36.218425] E [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 2-home-replicate-7:  metadata self heal  failed,   on /seviri
[2015-04-01 15:20:36.218555] E [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 2-home-replicate-5:  metadata self heal  failed,   on /seviri
[2015-04-01 15:20:36.218630] E [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 2-home-replicate-2:  metadata self heal  failed,   on /seviri
[2015-04-01 15:20:36.218770] E [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 2-home-replicate-4:  metadata self heal  failed,   on /seviri
[2015-04-01 15:20:36.218840] E [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 2-home-replicate-6:  metadata self heal  failed,   on /seviri
[2015-04-01 15:20:36.218915] E [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 2-home-replicate-9:  metadata self heal  failed,   on /seviri
[2015-04-01 15:20:36.218976] E [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 2-home-replicate-10:  metadata self heal  failed,   on /seviri
[2015-04-01 15:20:36.219230] E [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 2-home-replicate-1:  metadata self heal  failed,   on /seviri
[2015-04-01 15:20:36.220062] E [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 2-home-replicate-8:  metadata self heal  failed,   on /seviri
[2015-04-01 15:20:36.236306] E [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 2-home-replicate-11:  metadata self heal  failed,   on /seviri
and various other top level directories from / on the volume.
Is there a way to fix this ?

Regards,

A.

On Thursday 26 March 2015 11:36:19 Nithya Balachandran wrote:
> Hi Alessandro,
> 
> Thanks for the information. A few more questions:
> 
> 1. Did you do an add-brick or remove-brick before doing the rebalance? If
> yes, how many bricks did you add/remove? 2. Can you send us the rebalance,
> client(NFS log if you are using an NFS client only)  and brick logs? 3. It
> looks like you are using an NFS client. Can you please confirm? 4. Is
> /home/seviri/.forward the only file on which you are seeing the stale file
> handle errors? Can you please provide the following information for this
> file on all bricks - the xattrs for the parent directory (/home/seviri/) as
> well as the file on each brick - Brick1 to Brick24 - with details of which
> node it is on so we can get a clearer picture. - the ls -li output on the
> bricks for the file on each node.
> 
> 
> As far as I know, there have not been any major changes to rebalance between
> 3.5.3 and 3.6.3 but I will confirm.
> 
> Regards,
> Nithya
> 
> ----- Original Message -----
> From: "Alessandro Ipe" <Alessandro.Ipe at meteo.be>
> To: "Nithya Balachandran" <nbalacha at redhat.com>
> Cc: gluster-users at gluster.org
> Sent: Wednesday, 25 March, 2015 5:42:02 PM
> Subject: Re: [Gluster-users] Is rebalance completely broken on 3.5.3 ?
> 
> Hi Nithya,
> 
> 
> Thanks for your reply. I am glad that improving the rebalance status will be
> addressed in the (near) future. For my perspective, if the status is giving
> the total files to be scanned together with the files already scanned, it
> is sufficient information. Indeed, the user would see when it would
> complete (by doing several "gluster volume rebalance status" and computing
> differences according to elapsed time between them).
> 
> Please find below the answers to your questions:
> 1. Server and client are version 3.5.3
> 2. Indeed, I stopped the rebalance through the associated commdn from CLI,
> i.e. gluster <volume> rebalance stop
> 3. Very limited file operations were carried out through a single client
> mount (servers were almost idle)
> 4.gluster volume info :
> Volume Name: home
> Type: Distributed-Replicate
> Volume ID: 501741ed-4146-4022-af0b-41f5b1297766
> Status: Started
> Number of Bricks: 12 x 2 = 24
> Transport-type: tcp
> Bricks:
> Brick1: tsunami1:/data/glusterfs/home/brick1
> Brick2: tsunami2:/data/glusterfs/home/brick1
> Brick3: tsunami1:/data/glusterfs/home/brick2
> Brick4: tsunami2:/data/glusterfs/home/brick2
> Brick5: tsunami1:/data/glusterfs/home/brick3
> Brick6: tsunami2:/data/glusterfs/home/brick3
> Brick7: tsunami1:/data/glusterfs/home/brick4
> Brick8: tsunami2:/data/glusterfs/home/brick4
> Brick9: tsunami3:/data/glusterfs/home/brick1
> Brick10: tsunami4:/data/glusterfs/home/brick1
> Brick11: tsunami3:/data/glusterfs/home/brick2
> Brick12: tsunami4:/data/glusterfs/home/brick2
> Brick13: tsunami3:/data/glusterfs/home/brick3
> Brick14: tsunami4:/data/glusterfs/home/brick3
> Brick15: tsunami3:/data/glusterfs/home/brick4
> Brick16: tsunami4:/data/glusterfs/home/brick4
> Brick17: tsunami5:/data/glusterfs/home/brick1
> Brick18: tsunami6:/data/glusterfs/home/brick1
> Brick19: tsunami5:/data/glusterfs/home/brick2
> Brick20: tsunami6:/data/glusterfs/home/brick2
> Brick21: tsunami5:/data/glusterfs/home/brick3
> Brick22: tsunami6:/data/glusterfs/home/brick3
> Brick23: tsunami5:/data/glusterfs/home/brick4
> Brick24: tsunami6:/data/glusterfs/home/brick4
> Options Reconfigured:
> performance.cache-size: 512MB
> performance.io-thread-count: 64
> performance.flush-behind: off
> performance.write-behind-window-size: 4MB
> performance.write-behind: on
> nfs.disable: on
> features.quota: off
> cluster.read-hash-mode: 2
> diagnostics.brick-log-level: CRITICAL
> cluster.lookup-unhashed: on
> server.allow-insecure: on
> cluster.ensure-durability: on
> 
> For the logs, it will be more difficult because it happened several days
> ago, and they were rotated. But I can dig... By the way, do you need a
> specific logfile, because gluster produces a lot of them...
> 
> I read in some discussion on the gluster-users mailing list that rebalance
> on version 3.5.x could leave the system with errors when stopped (or even
> when ran up to its completion ?) and that rebalance had gone a complete
> rewrite in 3.6.x.  The issue is that I will put back online gluster next
> week, so my colleagues will definitively put it under high load and I was
> planning to run again the rebalance in the background. However, is it
> advisable ? Or should I wait after upgrading to 3.6.3 ?
> 
> I also noticed (currently undergoing a full heal on the volume) that
> accessing to some files on the client returned a "Transport endoint is not
> connected" the first time, but any new access was OK (probably due to
> self-healing). However, it is possible to setup a client or a volume
> parameter to just wait (and make the calling process wait) for the
> self-healing to complete and deliver the file the first time without
> issuing an error (extremely usefull in batch/operational processing) ?
> 
> 
> Regards,
> 
> 
> Alessandro.
> 
> On Wednesday 25 March 2015 05:09:38 Nithya Balachandran wrote:
> > Hi Alessandro,
> > 
> > 
> > I am sorry to hear that you are facing problems with rebalance.
> > 
> > Currently rebalance does not have the information as to how many files
> > exist on the volume and so cannot calculate/estimate the time it will
> > take to complete. Improving the rebalance status output to provide that
> > info is on our to-do list already and we will be working on that.
> > 
> > I have a few questions :
> > 
> > 1. Which version of Glusterfs are you using?
> > 2. How did you stop the rebalance ? I assume you ran "gluster <volume>
> > rebalance stop" but just wanted confirmation. 3. What file operations were
> > being performed during the rebalance? 4. Can you send the "gluster volume
> > info" output as well as the gluster log files?
> > 
> > Regards,
> > Nithya
> > 
> > ----- Original Message -----
> > From: "Alessandro Ipe" <Alessandro.Ipe at meteo.be>
> > To: gluster-users at gluster.org
> > Sent: Friday, March 20, 2015 4:52:35 PM
> > Subject: [Gluster-users] Is rebalance completely broken on 3.5.3 ?
> > 
> > 
> > 
> > Hi,
> > 
> > 
> > 
> > 
> > 
> > After lauching a "rebalance" on an idle gluster system one week ago, its
> > status told me it has scanned
> > 
> > more than 23 millions files on each of my 6 bricks. However, without
> > knowing at least the total files to
> > 
> > be scanned, this status is USELESS from an end-user perspective, because
> > it
> > does not allow you to
> > 
> > know WHEN the rebalance could eventually complete (one day, one week, one
> > year or never). From
> > 
> > my point of view, the total files per bricks could be obtained and
> > maintained when activating quota,
> > 
> > since the whole filesystem has to be crawled...
> > 
> > 
> > 
> > After one week being offline and still no clue when the rebalance would
> > complete, I decided to stop it...
> > 
> > Enormous mistake... It seems that rebalance cannot manage to not screw
> > some
> > files. Example, on
> > 
> > the only client mounting the gluster system, "ls -la /home/seviri" returns
> > 
> > ls: cannot access /home/seviri/.forward: Stale NFS file handle
> > 
> > ls: cannot access /home/seviri/.forward: Stale NFS file handle
> > 
> > -????????? ? ? ? ? ? .forward
> > 
> > -????????? ? ? ? ? ? .forward
> > 
> > while this file could perfectly be accessed before (being rebalanced) and
> > has not been modifed for at
> > 
> > least 3 years.
> > 
> > 
> > 
> > Getting the extended attributes on the various bricks 3, 4, 5, 6 (3-4
> > replicate, 5-6 replicate)
> > 
> > Brick 3:
> > 
> > ls -l /data/glusterfs/home/brick?/seviri/.forward
> > 
> > -rw-r--r-- 2 seviri users 68 May 26 2014
> > /data/glusterfs/home/brick1/seviri/.forward
> > 
> > -rw-r--r-- 2 seviri users 68 Mar 10 10:22
> > /data/glusterfs/home/brick2/seviri/.forward
> > 
> > 
> > 
> > getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
> > 
> > # file: data/glusterfs/home/brick1/seviri/.forward
> > 
> > trusted.afr.home-client-8=0x000000000000000000000000
> > 
> > trusted.afr.home-client-9=0x000000000000000000000000
> > 
> > trusted.gfid=0xc1d268beb17443a39d914de917de123a
> > 
> > 
> > 
> > # file: data/glusterfs/home/brick2/seviri/.forward
> > 
> > trusted.afr.home-client-10=0x000000000000000000000000
> > 
> > trusted.afr.home-client-11=0x000000000000000000000000
> > 
> > trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
> > 
> > trusted.glusterfs.quota.4138a9fa-a453-4b8e-905a-e02cce07d717.contri=0x0000
> > 00 0000000200
> > 
> > trusted.pgfid.4138a9fa-a453-4b8e-905a-e02cce07d717=0x00000001
> > 
> > 
> > 
> > Brick 4:
> > 
> > ls -l /data/glusterfs/home/brick?/seviri/.forward
> > 
> > -rw-r--r-- 2 seviri users 68 May 26 2014
> > /data/glusterfs/home/brick1/seviri/.forward
> > 
> > -rw-r--r-- 2 seviri users 68 Mar 10 10:22
> > /data/glusterfs/home/brick2/seviri/.forward
> > 
> > 
> > 
> > getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
> > 
> > # file: data/glusterfs/home/brick1/seviri/.forward
> > 
> > trusted.afr.home-client-8=0x000000000000000000000000
> > 
> > trusted.afr.home-client-9=0x000000000000000000000000
> > 
> > trusted.gfid=0xc1d268beb17443a39d914de917de123a
> > 
> > 
> > 
> > # file: data/glusterfs/home/brick2/seviri/.forward
> > 
> > trusted.afr.home-client-10=0x000000000000000000000000
> > 
> > trusted.afr.home-client-11=0x000000000000000000000000
> > 
> > trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
> > 
> > trusted.glusterfs.quota.4138a9fa-a453-4b8e-905a-e02cce07d717.contri=0x0000
> > 00 0000000200
> > 
> > trusted.pgfid.4138a9fa-a453-4b8e-905a-e02cce07d717=0x00000001
> > 
> > 
> > 
> > Brick 5:
> > 
> > ls -l /data/glusterfs/home/brick?/seviri/.forward
> > 
> > ---------T 2 root root 0 Mar 18 08:19
> > /data/glusterfs/home/brick2/seviri/.forward
> > 
> > 
> > 
> > getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
> > 
> > # file: data/glusterfs/home/brick2/seviri/.forward
> > 
> > trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
> > 
> > trusted.glusterfs.dht.linkto=0x686f6d652d7265706c69636174652d3400
> > 
> > 
> > 
> > Brick 6:
> > 
> > ls -l /data/glusterfs/home/brick?/seviri/.forward
> > 
> > ---------T 2 root root 0 Mar 18 08:19
> > /data/glusterfs/home/brick2/seviri/.forward
> > 
> > 
> > 
> > getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
> > 
> > # file: data/glusterfs/home/brick2/seviri/.forward
> > 
> > trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
> > 
> > trusted.glusterfs.dht.linkto=0x686f6d652d7265706c69636174652d3400
> > 
> > 
> > 
> > Looking at the results from bricks 3 & 4 shows something weird. The file
> > exists on 2 sub-bricks
> > 
> > storage directories, while it should only be found once on each brick
> > server. Or is the issue lying in the
> > 
> > results of bricks 5 & 6 ? How can I fix this, please ? By the way, the
> > split-brain tutorial only covers
> > 
> > BASIC split-brain conditions and not complex (real life) cases like this
> > one. It would definitely benefit if
> > 
> > enriched by this one.
> > 
> > 
> > 
> > More generally, I think the concept of gluster is promising, but if basic
> > commands (rebalance,
> > 
> > absolutely needed after adding more storage) from its own cli allows to
> > put
> > the system into an
> > 
> > unstable state, I am really starting to question its ability to be used in
> > a production environment. And
> > 
> > from an end-user perspective, I do not care about new features added, no
> > matter how appealing they
> > 
> > could be, if the basic ones are not almost totally reliable. Finally,
> > testing gluster under high load on the
> > 
> > brick servers (real world conditions) would certainly gives insight to the
> > developpers on what it failing
> > 
> > and what needs therefore to be fixed to mitigate this and improve gluster
> > reliability.
> > 
> > 
> > 
> > Forgive my harsh words/criticisms, but having to struggle with gluster
> > issues for two weeks now is
> > 
> > getting on my nerves since my colleagues can not use the data stored on it
> > and I do not see any time
> > 
> > from now when it will be back online.
> > 
> > 
> > 
> > 
> > 
> > Regards,
> > 
> > 
> > 
> > 
> > 
> > Alessandro.
> > 
> > 
> > 
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-users

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20150402/ec3950bf/attachment.html>