[Gluster-users] Question about "Possibly undergoing heal" on a file being reported.

Richard Klein (RSI) rklein at rsitex.com
Fri May 6 18:50:45 UTC 2016


I have performed the state dump as you requested on both replica pair hosts.  The hosts are called n1c1cl1 and n1c2cl1.  The file is on /data/brick0/gv0cl1 on both hosts.  The attached zip file contains the state dump results of brick0 for both hosts.  I identified the process ID by finding the glusterfsd process that had /data/brick0/gv0cl1 on its command line.
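
For reference, the commands I ran on each host were along these lines (the
brick path is from our setup; the actual PID differs per host):

    # find the glusterfsd process serving this brick
    ps -ef | grep '[g]lusterfsd' | grep /data/brick0/gv0cl1

    # trigger the state dump for that process
    kill -USR1 <pid>

    # the dumps are written to the directory reported by:
    gluster --print-statedumpdir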

Let me know if you need further information.

P.S. The "Possibly undergoing heal" status is still showing and the "trusted.afr.dirty" value is still changing, even though the VM has been off since yesterday.
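
(In case it helps, I am checking the flag with something like the following on
each brick host; the 30-second interval is arbitrary:

    watch -n 30 getfattr -n trusted.afr.dirty -e hex \
        /data/brick0/gv0cl1/97f52c71-80bd-4c2b-8e47-3c8c77712687

and the hex value keeps changing even with the VM powered off.)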

Richard Klein
RSI

> -----Original Message-----
> From: Ravishankar N [mailto:ravishankar at redhat.com]
> Sent: Friday, May 06, 2016 12:05 AM
> To: Richard Klein (RSI); gluster-users at gluster.org
> Subject: Re: [Gluster-users] Question about "Possibly undergoing heal" on a file
> being reported.
> 
> Thanks for the response. The heal info output shows 'Possibly undergoing heal'
> only when the self-heal daemon is performing a heal, not when there is I/O from
> the mount. Could you provide the state dumps of the 2 bricks (and of the mount
> too, if you know from which mount this VM image is being accessed)?
> 
> The command is `kill -USR1 <pid>`, where pid is the process ID of the brick or
> fuse mount process. The statedump will be saved in the directory shown by
> `gluster --print-statedumpdir`. I wanted to check if there are any stale locks
> being held on the bricks.
> 
> Thanks,
> Ravi
> 
> On 05/06/2016 01:22 AM, Richard Klein (RSI) wrote:
> > I agree there is activity but it's very low I/O, like updating log files.  It
> > shouldn't be high enough I/O to keep it permanently in the "Possibly
> > undergoing healing" state for days.  But just to make sure, I powered off the
> > VM and there is no activity now at all, and the "trusted.afr.dirty" is still
> > changing.  I will leave the VM in a powered off state until tomorrow.  I
> > agree with you that it shouldn't, but that is my dilemma.
> >
> > Thanks for the insight,
> >
> > Richard Klein
> > RSI
> >
> >> -----Original Message-----
> >> From: gluster-users-bounces at gluster.org
> >> [mailto:gluster-users-bounces at gluster.org] On Behalf Of Joe Julian
> >> Sent: Thursday, May 05, 2016 1:44 PM
> >> To: gluster-users at gluster.org
> >> Subject: Re: [Gluster-users] Question about "Possibly undergoing
> >> heal" on a file being reported.
> >>
> >> FYI, that's not "no activity". The file is clearly changing. The
> >> dirty state flipping back and forth between 1 and 0 is a byproduct of
> >> writes occurring. The clients set the flag, do the write, then clear the flag.
> >> My guess is that's why it's only "possibly" undergoing self-heal. The
> >> write may have still been pending at the moment of the check.
> >>
> >> On 05/05/2016 10:22 AM, Richard Klein (RSI) wrote:
> >>> There are 2 hosts involved and we have a replica value of 2.  The hosts
> >>> are called n1c1cl1 and n1c2cl1.  Below is the info you requested.  The
> >>> file name in gluster is "/97f52c71-80bd-4c2b-8e47-3c8c77712687".
> >>> -- From the n1c1cl1 brick --
> >>>
> >>> [root at n1c1cl1 ~]# ll -h
> >>> /data/brick0/gv0cl1/97f52c71-80bd-4c2b-8e47-3c8c77712687
> >>> -rwxr--r--. 2 root root 3.7G May  5 12:10
> >>> /data/brick0/gv0cl1/97f52c71-80bd-4c2b-8e47-3c8c77712687
> >>>
> >>> [root at n1c1cl1 ~]# getfattr -d -m . -e hex
> >>> /data/brick0/gv0cl1/97f52c71-80bd-4c2b-8e47-3c8c77712687
> >>> getfattr: Removing leading '/' from absolute path names
> >>> # file: data/brick0/gv0cl1/97f52c71-80bd-4c2b-8e47-3c8c77712687
> >>> security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
> >>> trusted.afr.dirty=0xe68000000000000000000000
> >>> trusted.bit-rot.version=0x020000000000000057196a8d000e1606
> >>> trusted.gfid=0xb1a49bd1ea01479f9a8277992461e85f
> >>>
> >>> -- From the n1c2cl1 brick --
> >>>
> >>> [root at n1c2cl1 ~]# ll -h
> >>> /data/brick0/gv0cl1/97f52c71-80bd-4c2b-8e47-3c8c77712687
> >>> -rwxr--r--. 2 root root 3.7G May  5 12:16
> >>> /data/brick0/gv0cl1/97f52c71-80bd-4c2b-8e47-3c8c77712687
> >>>
> >>> [root at n1c2cl1 ~]# getfattr -d -m . -e hex
> >>> /data/brick0/gv0cl1/97f52c71-80bd-4c2b-8e47-3c8c77712687
> >>> getfattr: Removing leading '/' from absolute path names
> >>> # file: data/brick0/gv0cl1/97f52c71-80bd-4c2b-8e47-3c8c77712687
> >>> security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
> >>> trusted.afr.dirty=0xd38000000000000000000000
> >>> trusted.bit-rot.version=0x020000000000000057196a8d000e20ae
> >>> trusted.gfid=0xb1a49bd1ea01479f9a8277992461e85f
> >>>
> >>> --
> >>>
> >>> The "trusted.afr.dirty" is changing about 2 or 3 times a minute on both
> >>> files.  Let me know if you need further info and thanks.
> >>> Richard Klein
> >>> RSI
> >>>
> >>>
> >>>
> >>> From: Ravishankar N [mailto:ravishankar at redhat.com]
> >>> Sent: Wednesday, May 04, 2016 8:52 PM
> >>> To: Richard Klein (RSI); gluster-users at gluster.org
> >>> Subject: Re: [Gluster-users] Question about "Possibly undergoing heal"
> >>> on a file being reported.
> >>>
> >>>> On 05/05/2016 01:50 AM, Richard Klein (RSI) wrote:
> >>>> First time e-mailer to the group, greetings all.  We are using Gluster
> >>>> 3.7.6 in Cloudstack on CentOS7 with KVM.  Gluster is our primary
> >>>> storage.  All is going well but we have a test VM QCOW2 volume that
> >>>> gets stuck in the "Possibly undergoing healing".  By stuck I mean it
> >>>> stays in that state for over 24 hrs.  This is a test VM with no
> >>>> activity on it and we have removed the swap file on the guest as well,
> >>>> thinking that may be causing high I/O.  All the tools show that the VM
> >>>> is basically idle with low I/O.  The only way I can clear it up is to
> >>>> power the VM off, move the QCOW2 volume from the Gluster mount then
> >>>> back (basically remove and recreate it), then power the VM back on.
> >>>> Once I do this process all is well again, but then it happened again on
> >>>> the same volume/file.
> >>>>
> >>>> One additional note, I have even powered off the VM completely and the
> >>>> QCOW2 file still stays in this state.
> >>>
> >>> When this happens, can you share the output of the extended attributes
> >>> of the file in question from all the bricks of the replica in which the
> >>> file resides?
> >>> `getfattr -d -m . -e hex /path/to/bricks/file-name`
> >>>
> >>> Also what is the size of this VM image file?
> >>>
> >>> Thanks,
> >>> Ravi
> >>>
> >>>
> >>>
> >>>> Is there a way to stop/abort or force the heal to finish?  Any help
> >>>> with a direction would be appreciated.
> >>>> Thanks,
> >>>>
> >>>> Richard Klein
> >>>> RSI
> >>>
> >>>
> >>>
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: GlusterStateDump.zip
Type: application/x-zip-compressed
Size: 16065 bytes
Desc: GlusterStateDump.zip
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160506/0f44aad7/attachment.bin>

