[Gluster-users] self-heal trouble after changing arbiter brick

Karthik Subrahmanya ksubrahm at redhat.com
Fri Feb 9 06:19:41 UTC 2018


On Fri, Feb 9, 2018 at 11:46 AM, Karthik Subrahmanya <ksubrahm at redhat.com>
wrote:

> Hey,
>
> Did the heal complete, and do you still have some entries pending heal?
> If yes, can you provide the following information to debug the issue:
> 1. Which version of Gluster you are running
> 2. Output of gluster volume heal <volname> info summary or gluster volume
> heal <volname> info
> 3. getfattr -d -e hex -m . <filepath-on-brick> output, taken on all the
> bricks, for any one file that is pending heal (see the example
> invocation below)
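>
> A minimal sketch of these commands, assuming the volume name "myvol" and a
> hypothetical file path on one of the bricks (substitute your own volume
> name, brick and file paths):
>
>   # Gluster version
>   gluster --version
>
>   # per-brick summary of entries pending heal
>   gluster volume heal myvol info summary
>
>   # AFR changelog xattrs of one pending file, run on every brick of that
>   # replica set
>   getfattr -d -e hex -m . /data/glusterfs/path/to/pending-file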
>
> Regards,
> Karthik
>
> On Thu, Feb 8, 2018 at 12:48 PM, Seva Gluschenko <gvs at webkontrol.ru>
> wrote:
>
>> Hi folks,
>>
>> I'm having trouble after moving an arbiter brick to another server because
>> of I/O load issues. My setup is as follows:
>>
>> # gluster volume info
>>
>> Volume Name: myvol
>> Type: Distributed-Replicate
>> Volume ID: 43ba517a-ac09-461e-99da-a197759a7dc8
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 3 x (2 + 1) = 9
>> Transport-type: tcp
>> Bricks:
>> Brick1: gv0:/data/glusterfs
>> Brick2: gv1:/data/glusterfs
>> Brick3: gv4:/data/gv01-arbiter (arbiter)
>> Brick4: gv2:/data/glusterfs
>> Brick5: gv3:/data/glusterfs
>> Brick6: gv1:/data/gv23-arbiter (arbiter)
>> Brick7: gv4:/data/glusterfs
>> Brick8: gv5:/data/glusterfs
>> Brick9: pluto:/var/gv45-arbiter (arbiter)
>> Options Reconfigured:
>> nfs.disable: on
>> transport.address-family: inet
>> storage.owner-gid: 1000
>> storage.owner-uid: 1000
>> cluster.self-heal-daemon: enable
>>
>> The gv23-arbiter is the brick that was recently moved from another server
>> (chronos) using the following command:
>>
>> # gluster volume replace-brick myvol chronos:/mnt/gv23-arbiter
>> gv1:/data/gv23-arbiter commit force
>> volume replace-brick: success: replace-brick commit force operation
>> successful
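>>
>> For reference, a sketch of how healing toward the new arbiter can be
>> checked (and, if needed, re-triggered) after a replace-brick, using the
>> volume name from above:
>>
>>   # list entries still pending heal
>>   gluster volume heal myvol info
>>
>>   # optionally trigger a full heal crawl
>>   gluster volume heal myvol full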
>>
>> It's not the first time I've moved an arbiter brick, and the heal-count
>> was zero for all the bricks before the change, so I didn't expect much
>> trouble. What probably went wrong is that I then forced chronos out of the
>> cluster with the gluster peer detach command. Ever since then, over the
>> course of the last 3 days, I have been seeing this:
>>
>> # gluster volume heal myvol statistics heal-count
>> Gathering count of entries to be healed on volume myvol has been
>> successful
>>
>> Brick gv0:/data/glusterfs
>> Number of entries: 0
>>
>> Brick gv1:/data/glusterfs
>> Number of entries: 0
>>
>> Brick gv4:/data/gv01-arbiter
>> Number of entries: 0
>>
>> Brick gv2:/data/glusterfs
>> Number of entries: 64999
>>
>> Brick gv3:/data/glusterfs
>> Number of entries: 64999
>>
>> Brick gv1:/data/gv23-arbiter
>> Number of entries: 0
>>
>> Brick gv4:/data/glusterfs
>> Number of entries: 0
>>
>> Brick gv5:/data/glusterfs
>> Number of entries: 0
>>
>> Brick pluto:/var/gv45-arbiter
>> Number of entries: 0
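>>
>> As a cross-check of the heal-count reported above, the pending-heal index
>> on a brick can be counted directly; a sketch, assuming the default brick
>> layout (the .glusterfs/indices/xattrop directory) and the gv2 brick path
>> from above:
>>
>>   # rough count of entries queued for self-heal on this brick
>>   ls /data/glusterfs/.glusterfs/indices/xattrop | wc -l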
>>
>> According to /var/log/glusterfs/glustershd.log, self-healing is in
>> progress, so it might be worth just sitting and waiting, but I'm wondering
>> why this heal-count of 64999 persists (a limitation of the counter? In
>> fact, the gv2 and gv3 bricks contain roughly 30 million files), and I'm
>> bothered by the following output:
>>
>> # gluster volume heal myvol info heal-failed
>> Gathering list of heal failed entries on volume myvol has been
>> unsuccessful on bricks that are down. Please check if all brick processes
>> are running.
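>>
>> Since the message suggests a brick process may be down, a quick way to
>> verify (a sketch, using the volume name from above) would be:
>>
>>   # every brick should show Online "Y", along with the self-heal daemons
>>   gluster volume status myvol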
>>
>> I attached the chronos server back to the cluster, with no noticeable
>> effect. Any comments and suggestions would be much appreciated.
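>>
>> For reference, the detach and re-attach were along these lines (a sketch,
>> not the exact shell history; hostname chronos as above):
>>
>>   # remove the peer that hosted the old arbiter brick
>>   gluster peer detach chronos
>>
>>   # bring it back into the trusted pool and verify
>>   gluster peer probe chronos
>>   gluster peer status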
>>
>> --
>> Best Regards,
>>
>> Seva Gluschenko
>> CTO @ http://webkontrol.ru
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>
>
>