[Gluster-users] [Gluster-devel] heal hanging

Fri Jan 22 06:05:17 UTC 2016

Hi, 

As mentioned glusterfs.get_real_filename getxattr is called when we need to 
check if a file(case insensitive) exists in a directory. 

You could run the following command to get to know the perf details. 
I guess you need to have debuginfo installed for this to work. 

Record perf: 
$perf record -a -g -p <PID of process to instrument> -o perf-glusterfsd.data 

Generate the stat: 
perf report -g -i perf-glusterfsd.data 

Attaching perf might slow down the process further, hence recommended to use it in test setup. 

Regards, 
Poornima 

----- Original Message -----

> From: "Raghavendra Talur" <rtalur at redhat.com>
> To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Patrick Glomski"
> <patrick.glomski at corvidtec.com>, gluster-users at gluster.org, "David Robinson"
> <drobinson at corvidtec.com>, "Poornima Gurusiddaiah" <pgurusid at redhat.com>
> Sent: Friday, January 22, 2016 8:37:50 AM
> Subject: Re: [Gluster-devel] [Gluster-users] heal hanging

> On Jan 22, 2016 7:27 AM, "Pranith Kumar Karampuri" < pkarampu at redhat.com >
> wrote:
> >
> >
> >
> > On 01/22/2016 07:19 AM, Pranith Kumar Karampuri wrote:
> >>
> >>
> >>
> >> On 01/22/2016 07:13 AM, Glomski, Patrick wrote:
> >>>
> >>> We use the samba glusterfs virtual filesystem (the current version
> >>> provided on download.gluster.org ), but no windows clients connecting
> >>> directly.
> >>
> >>
> >> Hmm.. Is there a way to disable using this and check if the CPU% still
> >> increases? What getxattr of "glusterfs.get_real_filename <filanme>" does
> >> is to scan the entire directory looking for strcasecmp(<filname>,
> >> <scanned-filename>). If anything matches then it will return the
> >> <scanned-filename>. But the problem is the scan is costly. So I wonder if
> >> this is the reason for the CPU spikes.
> >
> > +Raghavendra Talur, +Poornima
> >
> > Raghavendra, Poornima,
> > When are these getxattrs triggered? Did you guys see any brick CPU spikes
> > before? I initially thought it could be because of big directory heals.
> > But this is happening even when no self-heals are required. So I had to
> > move away from that theory.

> These getxattrs are triggered when a SMB client performs a path based
> operation. It is necessary then that some client was connected.

> The last fix to go in that code for 3.6 was
> http://review.gluster.org/#/c/10403/ .

> I am not able to determine which release of 3.6 it made into. Will update.

> Also we would need version of Samba installed. Including the vfs plugin
> package.

> There is a for loop of strcmp involved here which does take a lot of CPU. It
> should be for short bursts though and is expected and harmless.

> >
> > Pranith
> >
> >>
> >> Pranith
> >>>
> >>>
> >>> On Thu, Jan 21, 2016 at 8:37 PM, Pranith Kumar Karampuri <
> >>> pkarampu at redhat.com > wrote:
> >>>>
> >>>> Do you have any windows clients? I see a lot of getxattr calls for
> >>>> "glusterfs.get_real_filename" which lead to full readdirs of the
> >>>> directories on the brick.
> >>>>
> >>>> Pranith
> >>>>
> >>>> On 01/22/2016 12:51 AM, Glomski, Patrick wrote:
> >>>>>
> >>>>> Pranith, could this kind of behavior be self-inflicted by us deleting
> >>>>> files directly from the bricks? We have done that in the past to clean
> >>>>> up an issues where gluster wouldn't allow us to delete from the mount.
> >>>>>
> >>>>> If so, is it feasible to clean them up by running a search on the
> >>>>> .glusterfs directories directly and removing files with a reference
> >>>>> count of 1 that are non-zero size (or directly checking the xattrs to
> >>>>> be sure that it's not a DHT link).
> >>>>>
> >>>>> find /data/brick01a/homegfs/.glusterfs -type f -not -empty -links -2
> >>>>> -exec rm -f "{}" \;
> >>>>>
> >>>>> Is there anything I'm inherently missing with that approach that will
> >>>>> further corrupt the system?
> >>>>>
> >>>>>
> >>>>> On Thu, Jan 21, 2016 at 1:02 PM, Glomski, Patrick <
> >>>>> patrick.glomski at corvidtec.com > wrote:
> >>>>>>
> >>>>>> Load spiked again: ~1200%cpu on gfs02a for glusterfsd. Crawl has been
> >>>>>> running on one of the bricks on gfs02b for 25 min or so and users
> >>>>>> cannot access the volume.
> >>>>>>
> >>>>>> I re-listed the xattrop directories as well as a 'top' entry and heal
> >>>>>> statistics. Then I restarted the gluster services on gfs02a.
> >>>>>>
> >>>>>> =================== top ===================
> >>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> >>>>>> 8969 root 20 0 2815m 204m 3588 S 1181.0 0.6 591:06.93 glusterfsd
> >>>>>>
> >>>>>> =================== xattrop ===================
> >>>>>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
> >>>>>> xattrop-41f19453-91e4-437c-afa9-3b25614de210
> >>>>>> xattrop-9b815879-2f4d-402b-867c-a6d65087788c
> >>>>>>
> >>>>>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
> >>>>>> xattrop-70131855-3cfb-49af-abce-9d23f57fb393
> >>>>>> xattrop-dfb77848-a39d-4417-a725-9beca75d78c6
> >>>>>>
> >>>>>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
> >>>>>> e6e47ed9-309b-42a7-8c44-28c29b9a20f8
> >>>>>> xattrop-5c797a64-bde7-4eac-b4fc-0befc632e125
> >>>>>> xattrop-38ec65a1-00b5-4544-8a6c-bf0f531a1934
> >>>>>> xattrop-ef0980ad-f074-4163-979f-16d5ef85b0a0
> >>>>>>
> >>>>>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
> >>>>>> xattrop-7402438d-0ee7-4fcf-b9bb-b561236f99bc
> >>>>>> xattrop-8ffbf5f7-ace3-497d-944e-93ac85241413
> >>>>>>
> >>>>>> /data/brick01a/homegfs/.glusterfs/indices/xattrop:
> >>>>>> xattrop-0115acd0-caae-4dfd-b3b4-7cc42a0ff531
> >>>>>>
> >>>>>> /data/brick02a/homegfs/.glusterfs/indices/xattrop:
> >>>>>> xattrop-7e20fdb1-5224-4b9a-be06-568708526d70
> >>>>>>
> >>>>>> /data/brick01b/homegfs/.glusterfs/indices/xattrop:
> >>>>>> 8034bc06-92cd-4fa5-8aaf-09039e79d2c8
> >>>>>> c9ce22ed-6d8b-471b-a111-b39e57f0b512
> >>>>>> 94fa1d60-45ad-4341-b69c-315936b51e8d
> >>>>>> xattrop-9c04623a-64ce-4f66-8b23-dbaba49119c7
> >>>>>>
> >>>>>> /data/brick02b/homegfs/.glusterfs/indices/xattrop:
> >>>>>> xattrop-b8c8f024-d038-49a2-9a53-c54ead09111d
> >>>>>>
> >>>>>>
> >>>>>> =================== heal stats ===================
> >>>>>>
> >>>>>> homegfs [b0-gfsib01a] : Starting time of crawl : Thu Jan 21 12:36:45
> >>>>>> 2016
> >>>>>> homegfs [b0-gfsib01a] : Ending time of crawl : Thu Jan 21 12:36:45
> >>>>>> 2016
> >>>>>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
> >>>>>> homegfs [b0-gfsib01a] : No. of entries healed : 0
> >>>>>> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b0-gfsib01a] : No. of heal failed entries : 0
> >>>>>>
> >>>>>> homegfs [b1-gfsib01b] : Starting time of crawl : Thu Jan 21 12:36:19
> >>>>>> 2016
> >>>>>> homegfs [b1-gfsib01b] : Ending time of crawl : Thu Jan 21 12:36:19
> >>>>>> 2016
> >>>>>> homegfs [b1-gfsib01b] : Type of crawl: INDEX
> >>>>>> homegfs [b1-gfsib01b] : No. of entries healed : 0
> >>>>>> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b1-gfsib01b] : No. of heal failed entries : 1
> >>>>>>
> >>>>>> homegfs [b2-gfsib01a] : Starting time of crawl : Thu Jan 21 12:36:48
> >>>>>> 2016
> >>>>>> homegfs [b2-gfsib01a] : Ending time of crawl : Thu Jan 21 12:36:48
> >>>>>> 2016
> >>>>>> homegfs [b2-gfsib01a] : Type of crawl: INDEX
> >>>>>> homegfs [b2-gfsib01a] : No. of entries healed : 0
> >>>>>> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b2-gfsib01a] : No. of heal failed entries : 0
> >>>>>>
> >>>>>> homegfs [b3-gfsib01b] : Starting time of crawl : Thu Jan 21 12:36:47
> >>>>>> 2016
> >>>>>> homegfs [b3-gfsib01b] : Ending time of crawl : Thu Jan 21 12:36:47
> >>>>>> 2016
> >>>>>> homegfs [b3-gfsib01b] : Type of crawl: INDEX
> >>>>>> homegfs [b3-gfsib01b] : No. of entries healed : 0
> >>>>>> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b3-gfsib01b] : No. of heal failed entries : 0
> >>>>>>
> >>>>>> homegfs [b4-gfsib02a] : Starting time of crawl : Thu Jan 21 12:36:06
> >>>>>> 2016
> >>>>>> homegfs [b4-gfsib02a] : Ending time of crawl : Thu Jan 21 12:36:06
> >>>>>> 2016
> >>>>>> homegfs [b4-gfsib02a] : Type of crawl: INDEX
> >>>>>> homegfs [b4-gfsib02a] : No. of entries healed : 0
> >>>>>> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b4-gfsib02a] : No. of heal failed entries : 0
> >>>>>>
> >>>>>> homegfs [b5-gfsib02b] : Starting time of crawl : Thu Jan 21 12:13:40
> >>>>>> 2016
> >>>>>> homegfs [b5-gfsib02b] : *** Crawl is in progress ***
> >>>>>> homegfs [b5-gfsib02b] : Type of crawl: INDEX
> >>>>>> homegfs [b5-gfsib02b] : No. of entries healed : 0
> >>>>>> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b5-gfsib02b] : No. of heal failed entries : 0
> >>>>>>
> >>>>>> homegfs [b6-gfsib02a] : Starting time of crawl : Thu Jan 21 12:36:58
> >>>>>> 2016
> >>>>>> homegfs [b6-gfsib02a] : Ending time of crawl : Thu Jan 21 12:36:58
> >>>>>> 2016
> >>>>>> homegfs [b6-gfsib02a] : Type of crawl: INDEX
> >>>>>> homegfs [b6-gfsib02a] : No. of entries healed : 0
> >>>>>> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b6-gfsib02a] : No. of heal failed entries : 0
> >>>>>>
> >>>>>> homegfs [b7-gfsib02b] : Starting time of crawl : Thu Jan 21 12:36:50
> >>>>>> 2016
> >>>>>> homegfs [b7-gfsib02b] : Ending time of crawl : Thu Jan 21 12:36:50
> >>>>>> 2016
> >>>>>> homegfs [b7-gfsib02b] : Type of crawl: INDEX
> >>>>>> homegfs [b7-gfsib02b] : No. of entries healed : 0
> >>>>>> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b7-gfsib02b] : No. of heal failed entries : 0
> >>>>>>
> >>>>>>
> >>>>>> ========================================================================================
> >>>>>> I waited a few minutes for the heals to finish and ran the heal
> >>>>>> statistics and info again. one file is in split-brain. Aside from the
> >>>>>> split-brain, the load on all systems is down now and they are
> >>>>>> behaving normally. glustershd.log is attached. What is going on???
> >>>>>>
> >>>>>> Thu Jan 21 12:53:50 EST 2016
> >>>>>>
> >>>>>> =================== homegfs ===================
> >>>>>>
> >>>>>> homegfs [b0-gfsib01a] : Starting time of crawl : Thu Jan 21 12:53:02
> >>>>>> 2016
> >>>>>> homegfs [b0-gfsib01a] : Ending time of crawl : Thu Jan 21 12:53:02
> >>>>>> 2016
> >>>>>> homegfs [b0-gfsib01a] : Type of crawl: INDEX
> >>>>>> homegfs [b0-gfsib01a] : No. of entries healed : 0
> >>>>>> homegfs [b0-gfsib01a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b0-gfsib01a] : No. of heal failed entries : 0
> >>>>>>
> >>>>>> homegfs [b1-gfsib01b] : Starting time of crawl : Thu Jan 21 12:53:38
> >>>>>> 2016
> >>>>>> homegfs [b1-gfsib01b] : Ending time of crawl : Thu Jan 21 12:53:38
> >>>>>> 2016
> >>>>>> homegfs [b1-gfsib01b] : Type of crawl: INDEX
> >>>>>> homegfs [b1-gfsib01b] : No. of entries healed : 0
> >>>>>> homegfs [b1-gfsib01b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b1-gfsib01b] : No. of heal failed entries : 1
> >>>>>>
> >>>>>> homegfs [b2-gfsib01a] : Starting time of crawl : Thu Jan 21 12:53:04
> >>>>>> 2016
> >>>>>> homegfs [b2-gfsib01a] : Ending time of crawl : Thu Jan 21 12:53:04
> >>>>>> 2016
> >>>>>> homegfs [b2-gfsib01a] : Type of crawl: INDEX
> >>>>>> homegfs [b2-gfsib01a] : No. of entries healed : 0
> >>>>>> homegfs [b2-gfsib01a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b2-gfsib01a] : No. of heal failed entries : 0
> >>>>>>
> >>>>>> homegfs [b3-gfsib01b] : Starting time of crawl : Thu Jan 21 12:53:04
> >>>>>> 2016
> >>>>>> homegfs [b3-gfsib01b] : Ending time of crawl : Thu Jan 21 12:53:04
> >>>>>> 2016
> >>>>>> homegfs [b3-gfsib01b] : Type of crawl: INDEX
> >>>>>> homegfs [b3-gfsib01b] : No. of entries healed : 0
> >>>>>> homegfs [b3-gfsib01b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b3-gfsib01b] : No. of heal failed entries : 0
> >>>>>>
> >>>>>> homegfs [b4-gfsib02a] : Starting time of crawl : Thu Jan 21 12:53:33
> >>>>>> 2016
> >>>>>> homegfs [b4-gfsib02a] : Ending time of crawl : Thu Jan 21 12:53:33
> >>>>>> 2016
> >>>>>> homegfs [b4-gfsib02a] : Type of crawl: INDEX
> >>>>>> homegfs [b4-gfsib02a] : No. of entries healed : 0
> >>>>>> homegfs [b4-gfsib02a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b4-gfsib02a] : No. of heal failed entries : 1
> >>>>>>
> >>>>>> homegfs [b5-gfsib02b] : Starting time of crawl : Thu Jan 21 12:53:14
> >>>>>> 2016
> >>>>>> homegfs [b5-gfsib02b] : Ending time of crawl : Thu Jan 21 12:53:15
> >>>>>> 2016
> >>>>>> homegfs [b5-gfsib02b] : Type of crawl: INDEX
> >>>>>> homegfs [b5-gfsib02b] : No. of entries healed : 0
> >>>>>> homegfs [b5-gfsib02b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b5-gfsib02b] : No. of heal failed entries : 3
> >>>>>>
> >>>>>> homegfs [b6-gfsib02a] : Starting time of crawl : Thu Jan 21 12:53:04
> >>>>>> 2016
> >>>>>> homegfs [b6-gfsib02a] : Ending time of crawl : Thu Jan 21 12:53:04
> >>>>>> 2016
> >>>>>> homegfs [b6-gfsib02a] : Type of crawl: INDEX
> >>>>>> homegfs [b6-gfsib02a] : No. of entries healed : 0
> >>>>>> homegfs [b6-gfsib02a] : No. of entries in split-brain: 0
> >>>>>> homegfs [b6-gfsib02a] : No. of heal failed entries : 0
> >>>>>>
> >>>>>> homegfs [b7-gfsib02b] : Starting time of crawl : Thu Jan 21 12:53:09
> >>>>>> 2016
> >>>>>> homegfs [b7-gfsib02b] : Ending time of crawl : Thu Jan 21 12:53:09
> >>>>>> 2016
> >>>>>> homegfs [b7-gfsib02b] : Type of crawl: INDEX
> >>>>>> homegfs [b7-gfsib02b] : No. of entries healed : 0
> >>>>>> homegfs [b7-gfsib02b] : No. of entries in split-brain: 0
> >>>>>> homegfs [b7-gfsib02b] : No. of heal failed entries : 0
> >>>>>>
> >>>>>> *** gluster bug in 'gluster volume heal homegfs statistics' ***
> >>>>>> *** Use 'gluster volume heal homegfs info' until bug is fixed ***
> >>>>>>
> >>>>>> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
> >>>>>> Number of entries: 0
> >>>>>>
> >>>>>> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
> >>>>>> Number of entries: 0
> >>>>>>
> >>>>>> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
> >>>>>> Number of entries: 0
> >>>>>>
> >>>>>> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
> >>>>>> Number of entries: 0
> >>>>>>
> >>>>>> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
> >>>>>> /users/bangell/.gconfd - Is in split-brain
> >>>>>>
> >>>>>> Number of entries: 1
> >>>>>>
> >>>>>> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
> >>>>>> /users/bangell/.gconfd - Is in split-brain
> >>>>>>
> >>>>>> /users/bangell/.gconfd/saved_state
> >>>>>> Number of entries: 2
> >>>>>>
> >>>>>> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
> >>>>>> Number of entries: 0
> >>>>>>
> >>>>>> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
> >>>>>> Number of entries: 0
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Jan 21, 2016 at 11:10 AM, Pranith Kumar Karampuri <
> >>>>>> pkarampu at redhat.com > wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 01/21/2016 09:26 PM, Glomski, Patrick wrote:
> >>>>>>>>
> >>>>>>>> I should mention that the problem is not currently occurring and
> >>>>>>>> there are no heals (output appended). By restarting the gluster
> >>>>>>>> services, we can stop the crawl, which lowers the load for a while.
> >>>>>>>> Subsequent crawls seem to finish properly. For what it's worth,
> >>>>>>>> files/folders that show up in the 'volume info' output during a
> >>>>>>>> hung crawl don't seem to be anything out of the ordinary.
> >>>>>>>>
> >>>>>>>> Over the past four days, the typical time before the problem recurs
> >>>>>>>> after suppressing it in this manner is an hour. Last night when we
> >>>>>>>> reached out to you was the last time it happened and the load has
> >>>>>>>> been low since (a relief). David believes that recursively listing
> >>>>>>>> the files (ls -alR or similar) from a client mount can force the
> >>>>>>>> issue to happen, but obviously I'd rather not unless we have some
> >>>>>>>> precise thing we're looking for. Let me know if you'd like me to
> >>>>>>>> attempt to drive the system unstable like that and what I should
> >>>>>>>> look for. As it's a production system, I'd rather not leave it in
> >>>>>>>> this state for long.
> >>>>>>>
> >>>>>>>
> >>>>>>> Will it be possible to send glustershd, mount logs of the past 4
> >>>>>>> days? I would like to see if this is because of directory self-heal
> >>>>>>> going wild (Ravi is working on throttling feature for 3.8, which
> >>>>>>> will allow to put breaks on self-heal traffic)
> >>>>>>>
> >>>>>>> Pranith
> >>>>>>>
> >>>>>>>>
> >>>>>>>> [root at gfs01a xattrop]# gluster volume heal homegfs info
> >>>>>>>> Brick gfs01a.corvidtec.com:/data/brick01a/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs01b.corvidtec.com:/data/brick01b/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs01a.corvidtec.com:/data/brick02a/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs01b.corvidtec.com:/data/brick02b/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs02a.corvidtec.com:/data/brick01a/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs02b.corvidtec.com:/data/brick01b/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs02a.corvidtec.com:/data/brick02a/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>> Brick gfs02b.corvidtec.com:/data/brick02b/homegfs/
> >>>>>>>> Number of entries: 0
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Jan 21, 2016 at 10:40 AM, Pranith Kumar Karampuri <
> >>>>>>>> pkarampu at redhat.com > wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 01/21/2016 08:25 PM, Glomski, Patrick wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hello, Pranith. The typical behavior is that the %cpu on a
> >>>>>>>>>> glusterfsd process jumps to number of processor cores available
> >>>>>>>>>> (800% or 1200%, depending on the pair of nodes involved) and the
> >>>>>>>>>> load average on the machine goes very high (~20). The volume's
> >>>>>>>>>> heal statistics output shows that it is crawling one of the
> >>>>>>>>>> bricks and trying to heal, but this crawl hangs and never seems
> >>>>>>>>>> to finish.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The number of files in the xattrop directory varies over time, so
> >>>>>>>>>> I ran a wc -l as you requested periodically for some time and
> >>>>>>>>>> then started including a datestamped list of the files that were
> >>>>>>>>>> in the xattrops directory on each brick to see which were
> >>>>>>>>>> persistent. All bricks had files in the xattrop folder, so all
> >>>>>>>>>> results are attached.
> >>>>>>>>>
> >>>>>>>>> Thanks this info is helpful. I don't see a lot of files. Could you
> >>>>>>>>> give output of "gluster volume heal <volname> info"? Is there any
> >>>>>>>>> directory in there which is LARGE?
> >>>>>>>>>
> >>>>>>>>> Pranith
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Please let me know if there is anything else I can provide.
> >>>>>>>>>>
> >>>>>>>>>> Patrick
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Jan 21, 2016 at 12:01 AM, Pranith Kumar Karampuri <
> >>>>>>>>>> pkarampu at redhat.com > wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> hey,
> >>>>>>>>>>> Which process is consuming so much cpu? I went through the logs
> >>>>>>>>>>> you gave me. I see that the following files are in gfid mismatch
> >>>>>>>>>>> state:
> >>>>>>>>>>>
> >>>>>>>>>>> <066e4525-8f8b-43aa-b7a1-86bbcecc68b9/safebrowsing-backup>,
> >>>>>>>>>>> <1d48754b-b38c-403d-94e2-0f5c41d5f885/recovery.bak>,
> >>>>>>>>>>> <ddc92637-303a-4059-9c56-ab23b1bb6ae9/patch0008.cnvrg>,
> >>>>>>>>>>>
> >>>>>>>>>>> Could you give me the output of "ls <brick-path>/indices/xattrop
> >>>>>>>>>>> | wc -l" output on all the bricks which are acting this way?
> >>>>>>>>>>> This will tell us the number of pending self-heals on the
> >>>>>>>>>>> system.
> >>>>>>>>>>>
> >>>>>>>>>>> Pranith
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 01/20/2016 09:26 PM, David Robinson wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> resending with parsed logs...
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I am having issues with 3.6.6 where the load will spike up to
> >>>>>>>>>>>>>> 800% for one of the glusterfsd processes and the users can no
> >>>>>>>>>>>>>> longer access the system. If I reboot the node, the heal will
> >>>>>>>>>>>>>> finish normally after a few minutes and the system will be
> >>>>>>>>>>>>>> responsive, but a few hours later the issue will start again.
> >>>>>>>>>>>>>> It look like it is hanging in a heal and spinning up the load
> >>>>>>>>>>>>>> on one of the bricks. The heal gets stuck and says it is
> >>>>>>>>>>>>>> crawling and never returns. After a few minutes of the heal
> >>>>>>>>>>>>>> saying it is crawling, the load spikes up and the mounts
> >>>>>>>>>>>>>> become unresponsive.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Any suggestions on how to fix this? It has us stopped cold as
> >>>>>>>>>>>>>> the user can no longer access the systems when the load
> >>>>>>>>>>>>>> spikes... Logs attached.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> System setup info is:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [root at gfs01a ~]# gluster volume info homegfs
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Volume Name: homegfs
> >>>>>>>>>>>>>> Type: Distributed-Replicate
> >>>>>>>>>>>>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> >>>>>>>>>>>>>> Status: Started
> >>>>>>>>>>>>>> Number of Bricks: 4 x 2 = 8
> >>>>>>>>>>>>>> Transport-type: tcp
> >>>>>>>>>>>>>> Bricks:
> >>>>>>>>>>>>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> >>>>>>>>>>>>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> >>>>>>>>>>>>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> >>>>>>>>>>>>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> >>>>>>>>>>>>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> >>>>>>>>>>>>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> >>>>>>>>>>>>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> >>>>>>>>>>>>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> >>>>>>>>>>>>>> Options Reconfigured:
> >>>>>>>>>>>>>> performance.io-thread-count: 32
> >>>>>>>>>>>>>> performance.cache-size: 128MB
> >>>>>>>>>>>>>> performance.write-behind-window-size: 128MB
> >>>>>>>>>>>>>> server.allow-insecure: on
> >>>>>>>>>>>>>> network.ping-timeout: 42
> >>>>>>>>>>>>>> storage.owner-gid: 100
> >>>>>>>>>>>>>> geo-replication.indexing: off
> >>>>>>>>>>>>>> geo-replication.ignore-pid-check: on
> >>>>>>>>>>>>>> changelog.changelog: off
> >>>>>>>>>>>>>> changelog.fsync-interval: 3
> >>>>>>>>>>>>>> changelog.rollover-time: 15
> >>>>>>>>>>>>>> server.manage-gids: on
> >>>>>>>>>>>>>> diagnostics.client-log-level: WARNING
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [root at gfs01a ~]# rpm -qa | grep gluster
> >>>>>>>>>>>>>> gluster-nagios-common-0.1.1-0.el6.noarch
> >>>>>>>>>>>>>> glusterfs-fuse-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-debuginfo-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-libs-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-geo-replication-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-api-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-devel-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-api-devel-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-cli-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-rdma-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> samba-vfs-glusterfs-4.1.11-2.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-server-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>> glusterfs-extra-xlators-3.6.6-1.el6.x86_64
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>> Gluster-devel mailing list
> >>>>>>>>>>>> Gluster-devel at gluster.org
> >>>>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> _______________________________________________
> >>>>>>>>>>> Gluster-users mailing list
> >>>>>>>>>>> Gluster-users at gluster.org
> >>>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >>
> >> _______________________________________________
> >> Gluster-devel mailing list
> >> Gluster-devel at gluster.org
> >> http://www.gluster.org/mailman/listinfo/gluster-devel
> >
> >
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160122/40f836e5/attachment.html>