[Gluster-users] Self healing does not see files to heal

Ravishankar N ravishankar at redhat.com
Wed Aug 17 10:38:26 UTC 2016


On 08/17/2016 03:48 PM, Дмитрий Глушенок wrote:
> Unfortunately not:
>
> Remount the FS, then access the test file from the second client:
>
> [root at srv02 ~]# umount /mnt
> [root at srv02 ~]# mount -t glusterfs srv01:/test01 /mnt
> [root at srv02 ~]# ls -l /mnt/passwd
> -rw-r--r--. 1 root root 1505 Aug 16 19:59 /mnt/passwd
> [root at srv02 ~]# ls -l /R1/test01/
> total 4
> -rw-r--r--. 2 root root 1505 Aug 16 19:59 passwd
> [root at srv02 ~]#
>
> Then remount the FS and check whether accessing the file from the second 
> node triggered self-heal on the first node:
>
> [root at srv01 ~]# umount /mnt
> [root at srv01 ~]# mount -t glusterfs srv01:/test01 /mnt
> [root at srv01 ~]# ls -l /mnt

Can you try `stat /mnt/passwd` from this node after remounting? You need 
to explicitly look up the file; `ls -l /mnt` only triggers a readdir on 
the parent directory.
If that doesn't work, is this mount connected to both bricks? I.e. if 
you create a new file from here, does it get replicated to both bricks?
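
For example, something along these lines (a minimal sketch using the 
hostnames and brick path from this thread; the test file name and the 
ssh step are just illustrative):

  # explicit lookup on the file, from a freshly mounted client on srv01
  umount /mnt
  mount -t glusterfs srv01:/test01 /mnt
  stat /mnt/passwd

  # connectivity check: a newly created file should land on both bricks
  touch /mnt/replication-check
  ls -l /R1/test01/replication-check            # brick on srv01
  ssh srv02 ls -l /R1/test01/replication-check  # brick on srv02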

-Ravi

> total 0
> [root at srv01 ~]# ls -l /R1/test01/
> total 0
> [root at srv01 ~]#
>
> Nothing appeared.
>
> [root at srv01 ~]# gluster volume info test01
> Volume Name: test01
> Type: Replicate
> Volume ID: 2c227085-0b06-4804-805c-ea9c1bb11d8b
> Status: Started
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: srv01:/R1/test01
> Brick2: srv02:/R1/test01
> Options Reconfigured:
> features.scrub-freq: hourly
> features.scrub: Active
> features.bitrot: on
> transport.address-family: inet
> performance.readdir-ahead: on
> nfs.disable: on
> [root at srv01 ~]#
>
> [root at srv01 ~]# gluster volume get test01 all | grep heal
> cluster.background-self-heal-count      8
> cluster.metadata-self-heal              on
> cluster.data-self-heal                  on
> cluster.entry-self-heal                 on
> cluster.self-heal-daemon                on
> cluster.heal-timeout                    600
> cluster.self-heal-window-size           1
> cluster.data-self-heal-algorithm        (null)
> cluster.self-heal-readdir-size          1KB
> cluster.heal-wait-queue-length          128
> features.lock-heal                      off
> features.lock-heal                      off
> storage.health-check-interval           30
> features.ctr_lookupheal_link_timeout    300
> features.ctr_lookupheal_inode_timeout   300
> cluster.disperse-self-heal-daemon       enable
> disperse.background-heals               8
> disperse.heal-wait-qlength              128
> cluster.heal-timeout                    600
> cluster.granular-entry-heal             no
> [root at srv01 ~]#
>
> --
> Dmitry Glushenok
> Jet Infosystems
>
>> On 17 Aug 2016, at 11:30, Ravishankar N <ravishankar at redhat.com> wrote:
>>
>> On 08/17/2016 01:48 PM, Дмитрий Глушенок wrote:
>>> Hello Ravi,
>>>
>>> Thank you for the reply. Found the bug number (for those who will 
>>> google this email): https://bugzilla.redhat.com/show_bug.cgi?id=1112158
>>>
>>> Accessing the removed file from the mount-point does not always work, 
>>> because we have to find a special client whose DHT points to the 
>>> brick with the removed file. Otherwise the file is accessed from the 
>>> good brick and self-healing does not happen (just verified). Or by 
>>> accessing did you mean something like touch?
>>
>> Sorry, I should have been more explicit. I meant triggering a lookup on 
>> that file with `stat filename`. I don't think you need a special 
>> client: DHT sends the lookup to AFR, which in turn sends it to all its 
>> children. When one of them returns ENOENT (because you removed the file 
>> from the brick), AFR will automatically trigger a heal. I'm guessing it 
>> is not always working in your case because of caching at various 
>> levels, so the lookup does not reach AFR. If you do it from a fresh 
>> mount, it should always work.
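>>
>> For example, a rough sketch with the names from this thread (the second 
>> mount point is just an arbitrary fresh directory):
>>
>>   mkdir -p /mnt-fresh
>>   mount -t glusterfs srv01:/test01 /mnt-fresh   # fresh mount, no stale caches
>>   stat /mnt-fresh/passwd                        # lookup reaches AFR and should trigger the heal
>>   gluster volume heal test01 info               # entries still pending heal, per brick
>>   ls -l /R1/test01/passwd                       # the file should reappear on the bad brick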
>> -Ravi
>>
>>> Dmitry Glushenok
>>> Jet Infosystems
>>>
>>>> On 17 Aug 2016, at 4:24, Ravishankar N <ravishankar at redhat.com> wrote:
>>>>
>>>> On 08/16/2016 10:44 PM, Дмитрий Глушенок wrote:
>>>>> Hello,
>>>>>
>>>>> While testing healing after a bitrot error, it was found that 
>>>>> self-healing cannot heal files which were manually deleted from a 
>>>>> brick. Gluster 3.8.1:
>>>>>
>>>>> - Create the volume, mount it locally and copy a test file to it
>>>>> [root at srv01 ~]# gluster volume create test01 replica 2 
>>>>>  srv01:/R1/test01 srv02:/R1/test01
>>>>> volume create: test01: success: please start the volume to access data
>>>>> [root at srv01 ~]# gluster volume start test01
>>>>> volume start: test01: success
>>>>> [root at srv01 ~]# mount -t glusterfs srv01:/test01 /mnt
>>>>> [root at srv01 ~]# cp /etc/passwd /mnt
>>>>> [root at srv01 ~]# ls -l /mnt
>>>>> total 2
>>>>> -rw-r--r--. 1 root root 1505 Aug 16 19:59 passwd
>>>>>
>>>>> - Then remove the test file from the first brick, as we would have 
>>>>> to do in case of a bitrot error in the file
>>>>
>>>> You also need to remove all hard-links to the corrupted file from 
>>>> the brick, including the one in the .glusterfs folder.
>>>> There is a bug in heal-full that prevents it from crawling all 
>>>> bricks of the replica. For now, the right way to heal corrupted 
>>>> files is to remove the hard-links and then access the files from 
>>>> the mount-point, as you did. The list of corrupted files can be 
>>>> obtained with the scrub status command.
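>>>>
>>>> A rough sketch of that workflow, assuming the brick path /R1/test01 
>>>> from this thread (the gfid-based path below is a placeholder):
>>>>
>>>>   # find the gfid of the corrupted file on the brick
>>>>   getfattr -n trusted.gfid -e hex /R1/test01/passwd
>>>>   # remove the file and its hard-link under .glusterfs
>>>>   rm /R1/test01/passwd
>>>>   rm /R1/test01/.glusterfs/<aa>/<bb>/<full-gfid>   # <aa>/<bb> = first two byte-pairs of the gfid
>>>>   # list of files the scrubber has flagged as corrupted
>>>>   gluster volume bitrot test01 scrub status
>>>>   # then trigger the heal with a lookup from a client mount
>>>>   stat /mnt/passwd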
>>>>
>>>> Hope this helps,
>>>> Ravi
>>>>
>>>>> [root at srv01 ~]# rm /R1/test01/passwd
>>>>> [root at srv01 ~]# ls -l /mnt
>>>>> total 0
>>>>> [root at srv01 ~]#
>>>>>
>>>>> - Issue full self heal
>>>>> [root at srv01 ~]# gluster volume heal test01 full
>>>>> Launching heal operation to perform full self heal on volume 
>>>>> test01 has been successful
>>>>> Use heal info commands to check status
>>>>> [root at srv01 ~]# tail -2 /var/log/glusterfs/glustershd.log
>>>>> [2016-08-16 16:59:56.483767] I [MSGID: 108026] 
>>>>> [afr-self-heald.c:611:afr_shd_full_healer] 0-test01-replicate-0: 
>>>>> starting full sweep on subvol test01-client-0
>>>>> [2016-08-16 16:59:56.486560] I [MSGID: 108026] 
>>>>> [afr-self-heald.c:621:afr_shd_full_healer] 0-test01-replicate-0: 
>>>>> finished full sweep on subvol test01-client-0
>>>>>
>>>>> - Now we still see no files in the mount point (it became empty 
>>>>> right after removing the file from the brick)
>>>>> [root at srv01 ~]# ls -l /mnt
>>>>> total 0
>>>>> [root at srv01 ~]#
>>>>>
>>>>> - Then try to access the file by its full name (lookup-optimize and 
>>>>> readdir-optimize are turned off by default; see the check below). 
>>>>> Now glusterfs shows the file!
>>>>> [root at srv01 ~]# ls -l /mnt/passwd
>>>>> -rw-r--r--. 1 root root 1505 Aug 16 19:59 /mnt/passwd
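>>>>>
>>>>> (A quick way to confirm those defaults, using the volume name from 
>>>>> this thread:
>>>>>
>>>>>   gluster volume get test01 cluster.lookup-optimize
>>>>>   gluster volume get test01 cluster.readdir-optimize
>>>>> )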
>>>>>
>>>>> - And it reappeared in the brick
>>>>> [root at srv01 ~]# ls -l /R1/test01/
>>>>> total 4
>>>>> -rw-r--r--. 2 root root 1505 Aug 16 19:59 passwd
>>>>> [root at srv01 ~]#
>>>>>
>>>>> Is it a bug, or can we tell self heal to scan all files on all 
>>>>> bricks in the volume?
>>>>>
>>>>> --
>>>>> Dmitry Glushenok
>>>>> Jet Infosystems
>>>>>
>>>
>>
>
